* [v7 00/16] mm: support device-private THP
@ 2025-10-01 6:56 Balbir Singh
2025-10-01 6:56 ` [v7 01/16] mm/zone_device: support large zone device private folios Balbir Singh
` (16 more replies)
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
This patch series introduces support for Transparent Huge Page
(THP) migration in zone device-private memory. The implementation enables
efficient migration of large folios between system memory and
device-private memory.
Background
The current zone device-private memory implementation supports only
PAGE_SIZE granularity, leading to:
- Increased TLB pressure
- Inefficient migration between CPU and device memory
This series extends the existing zone device-private infrastructure to
support THP, leading to:
- Reduced page table overhead
- Improved memory bandwidth utilization
- Seamless fallback to base pages when needed
In my local testing (using lib/test_hmm) and a throughput test, the
series shows a 350% improvement in data transfer throughput and an
80% improvement in latency.
These patches build on the earlier posts by Ralph Campbell [1].
Two new flags are added to the migrate_vma interface to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as an argument.
The series also adds zone device awareness to (m)THP pages, along
with fault handling of large zone device private pages. The page vma
walk and rmap code are also made zone device aware. Support has also
been added for folios that might need to be split in the middle of
migration (when the src and dst do not agree on MIGRATE_PFN_COMPOUND),
which occurs when the src side of the migration can migrate large
pages but the destination has not been able to allocate large pages.
The code supports and uses folio_split() when migrating THP pages;
this path is also taken when MIGRATE_VMA_SELECT_COMPOUND is not passed
as an argument to migrate_vma_setup().
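As a rough illustration only (not taken from this series;
drv_alloc_device_thp_pfn() and the device-side copy are hypothetical
placeholders), a driver requesting a PMD-sized migration with the new
flags might look roughly like this:

static int drv_migrate_to_device(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end,
				 void *pgmap_owner)
{
	/* In real code these arrays would be allocated, not on the stack. */
	unsigned long src_pfns[HPAGE_PMD_NR] = { 0 };
	unsigned long dst_pfns[HPAGE_PMD_NR] = { 0 };
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};
	int ret;

	ret = migrate_vma_setup(&args);
	if (ret)
		return ret;

	/*
	 * If the source was collected as a compound page, allocate a single
	 * large destination page and mark it MIGRATE_PFN_COMPOUND; otherwise
	 * fall back to base page allocations (not shown).
	 */
	if (args.src[0] & MIGRATE_PFN_COMPOUND)
		args.dst[0] = migrate_pfn(drv_alloc_device_thp_pfn()) |
			      MIGRATE_PFN_COMPOUND;

	/* drv_copy_to_device(&args); -- device-side copy, driver specific */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}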
The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and for the folio split path. A new
throughput test has been added as well.
The nouveau dmem code has been enhanced to use the new THP migration
capability.
mTHP support:
The patches hard-code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various order sizes. With additional
refactoring, support for different order sizes should be possible.
The future plan is to post enhancements to support mTHP with a rough
design as follows (a rough sketch of the order-selection step appears
below):
1. Add the notion of allowable THP orders to the HMM-based test driver
2. For non-PMD-based THP paths in migrate_device.c, check whether a
suitable order is found and supported by the driver
3. Iterate across orders to find the highest supported order for migration
4. Migrate and finalize
The mTHP patches can be built on top of this series; the key design
elements that need to be worked out are infrastructure and driver support
for multiple ordered pages and their migration.
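A purely hypothetical sketch of the order-selection step, for illustration
only (supported_orders is an assumed per-driver bitmask of allowed orders):

static unsigned int drv_pick_migration_order(unsigned long supported_orders)
{
	unsigned int order;

	/* Walk down from PMD order to the largest order the driver supports. */
	for (order = HPAGE_PMD_ORDER; order; order--)
		if (supported_orders & BIT(order))
			return order;

	return 0;	/* fall back to base pages */
}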
Patches adding HMM support for large folios are already posted and are
in mm-unstable.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
[3] https://lore.kernel.org/lkml/20250703233511.2028395-1-balbirs@nvidia.com/
[4] https://lkml.kernel.org/r/20250902130713.1644661-1-francois.dugast@intel.com
[5] https://lore.kernel.org/lkml/20250730092139.3890844-1-balbirs@nvidia.com/
[6] https://lore.kernel.org/lkml/20250812024036.690064-1-balbirs@nvidia.com/
[7] https://lore.kernel.org/lkml/20250903011900.3657435-1-balbirs@nvidia.com/
[8] https://lore.kernel.org/all/20250908000448.180088-1-balbirs@nvidia.com/
[9] https://lore.kernel.org/lkml/20250916122128.2098535-1-balbirs@nvidia.com/
These patches are built on top of mm/mm-new
Changelog v7 [9]:
- Rebased against mm/mm-new again
- Addressed more review comments from Zi Yan and David Hildenbrand
- Code flow reorganization of split_huge_pmd_locked
- page_free callback is now changed to folio_free (posted as patch 2
in the series)
- zone_device_page_init() takes an order parameter
- migrate_vma_split_pages() is now called
migrate_vma_split_unmapped_folio()
- More cleanups and fixes
- The partially unmapped folio case in patch 6 has been split into two
parts; some of the content has been moved to the actual device-private
split handling code
- Fault handling for device-private pages now uses folio routines
instead of page_get/trylock/put routines.
Changelog v6 [8]:
- Rebased against mm/mm-new after fixing the following
- Two issues reported by kernel test robot
- m68k requires an lvalue for pmd_present()
- BUILD_BUG_ON() issues when THP is disabled
- kernel doc warnings reported on linux-next
- Thanks Stephen Rothwell!
- smatch fixes and issues reported
- Fix issue with potential NULL page
- Report about young being uninitialized for device-private pages in
__split_huge_pmd_locked()
- Several Review comments from David
- Indentation changes and style improvements
- Removal of some unwanted extra lines
- Introduction of new helper function is_pmd_non_present_folio_entry()
to represent migration and device-private PMDs
- Code flow refactoring into migration and device private paths
- More consistent use of helper function is_pmd_device_private()
- Review comments from Mika
- folio_get() is not required for huge_pmd prior to split
Changelog v5 [7]:
- Rebased against mm/mm-new (resolved conflict caused by
MIGRATEPAGE_SUCCESS removal)
- Fixed a kernel-doc warning reported by kernel test robot
Changelog v4 [6]:
- Addressed review comments
- Split patch 2 into a smaller set of patches
- PVMW_THP_DEVICE_PRIVATE flag is no longer present
- damon/page_idle and other page_vma_mapped_walk paths are aware of
device-private folios
- No more flush for non-present entries in set_pmd_migration_entry
- Implemented a helper function for migrate_vma_split_folio() which
splits large folios if seen during a pte walk
- Removed the controversial change for folio_ref_freeze using
folio_expected_ref_count()
- Removed functions invoked from within VM_WARN_ON
- New test cases and fixes from Matthew Brost
- Fixed bugs reported by kernel test robot (Thanks!)
- Several fixes for THP support in nouveau driver
Changelog v3 [5]:
- Addressed review comments
- No more split_device_private_folio() helper
- Device private large folios do not end up on deferred scan lists
- Removed THP size order checks when initializing zone device folio
- Fixed bugs reported by kernel test robot (Thanks!)
Changelog v2 [3]:
- Several review comments from David Hildenbrand were addressed; Mika,
Zi and Matthew also provided helpful review comments
- In paths where it makes sense a new helper
is_pmd_device_private_entry() is used
- anon_exclusive handling of zone device private pages in
split_huge_pmd_locked() has been fixed
- Patches that introduced helpers have been folded into where they
are used
- Zone device handling in mm/huge_memory.c has benefited from the code
and testing of Matthew Brost; he helped find bugs related to
copy_huge_pmd() and partial unmapping of folios.
- Zone device THP PMD support via page_vma_mapped_walk() is restricted
to try_to_migrate_one()
- There is a new dedicated helper to split large zone device folios
Changelog v1 [2]:
- Support for handling fault_folio and using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback when the entire split is complete/done. The
callback is used to know when the head order needs to be reset.
Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM
Balbir Singh (15):
mm/zone_device: support large zone device private folios
mm/zone_device: Rename page_free callback to folio_free
mm/huge_memory: add device-private THP support to PMD operations
mm/rmap: extend rmap and migration support device-private entries
mm/huge_memory: implement device-private THP splitting
mm/migrate_device: handle partially mapped folios during collection
mm/migrate_device: implement THP migration of zone device pages
mm/memory/fault: add THP fault handling for zone device private pages
lib/test_hmm: add zone device private THP test infrastructure
mm/memremap: add driver callback support for folio splitting
mm/migrate_device: add THP splitting during migration
lib/test_hmm: add large page allocation failure testing
selftests/mm/hmm-tests: new tests for zone device THP migration
selftests/mm/hmm-tests: new throughput tests including THP
gpu/drm/nouveau: enable THP support for GPU memory migration
Matthew Brost (1):
selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests
Documentation/mm/memory-model.rst | 2 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 7 +-
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 7 +-
drivers/gpu/drm/drm_pagemap.c | 12 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 308 ++++++--
drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
drivers/pci/p2pdma.c | 5 +-
include/linux/huge_mm.h | 18 +-
include/linux/memremap.h | 57 +-
include/linux/migrate.h | 2 +
include/linux/swapops.h | 32 +
lib/test_hmm.c | 448 +++++++++--
lib/test_hmm_uapi.h | 3 +
mm/damon/ops-common.c | 20 +-
mm/huge_memory.c | 243 ++++--
mm/memory.c | 5 +-
mm/memremap.c | 40 +-
mm/migrate.c | 1 +
mm/migrate_device.c | 609 +++++++++++++--
mm/page_idle.c | 7 +-
mm/page_vma_mapped.c | 7 +
mm/pgtable-generic.c | 2 +-
mm/rmap.c | 30 +-
tools/testing/selftests/mm/hmm-tests.c | 919 +++++++++++++++++++++--
25 files changed, 2399 insertions(+), 394 deletions(-)
--
2.51.0
* [v7 01/16] mm/zone_device: support large zone device private folios
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
@ 2025-10-01 6:56 ` Balbir Singh
2025-10-12 6:10 ` Lance Yang
2025-10-01 6:56 ` [v7 02/16] mm/zone_device: Rename page_free callback to folio_free Balbir Singh
` (15 subsequent siblings)
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast, Madhavan Srinivasan, Christophe Leroy,
Felix Kuehling, Alex Deucher, Christian König
Add routines to support allocation of large-order zone device folios,
along with helper functions for zone device folios: to check whether a
folio is device private and to set zone device data.
When large folios are used, the existing page_free() callback in
pgmap is called when the folio is freed; this is true for both
PAGE_SIZE and higher-order pages.
Zone device private large folios do not support deferred split and
scan like normal THP folios.
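For reference, a driver could then initialise a PMD-order device-private
folio roughly as follows (a sketch only; drv_alloc_device_pfn() and the
zone_device_data value are hypothetical placeholders):

static struct folio *drv_alloc_device_folio(struct dev_pagemap *pgmap,
					    void *zone_device_data)
{
	/* drv_alloc_device_pfn() stands in for the driver's own allocator. */
	unsigned long pfn = drv_alloc_device_pfn(pgmap, HPAGE_PMD_ORDER);
	struct folio *folio;

	if (!pfn)
		return NULL;

	folio = page_folio(pfn_to_page(pfn));
	/* Takes 1 << order references on the pgmap and preps the compound page. */
	zone_device_folio_init(folio, HPAGE_PMD_ORDER);
	folio->page.zone_device_data = zone_device_data;

	return folio;
}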
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
drivers/gpu/drm/drm_pagemap.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
include/linux/memremap.h | 10 ++++++++-
lib/test_hmm.c | 2 +-
mm/memremap.c | 26 ++++++++++++++----------
mm/rmap.c | 6 +++++-
8 files changed, 34 insertions(+), 18 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 03f8c34fa0a2..91f763410673 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -723,7 +723,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- zone_device_page_init(dpage);
+ zone_device_page_init(dpage, 0);
return dpage;
out_clear:
spin_lock(&kvmppc_uvmem_bitmap_lock);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 79251f22b702..d0e2cae33035 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -217,7 +217,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
- zone_device_page_init(page);
+ zone_device_page_init(page, 0);
}
static void
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 1da55322af12..31c53f724e25 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -196,7 +196,7 @@ static void drm_pagemap_get_devmem_page(struct page *page,
struct drm_pagemap_zdd *zdd)
{
page->zone_device_data = drm_pagemap_zdd_get(zdd);
- zone_device_page_init(page);
+ zone_device_page_init(page, 0);
}
/**
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index ca4932a150e3..53cc1926b9da 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -318,7 +318,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
- zone_device_page_init(page);
+ zone_device_page_init(page, 0);
return page;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5951ba12a28..d2487a19cba2 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
}
#ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void zone_device_page_init(struct page *page, unsigned int order);
void *memremap_pages(struct dev_pagemap *pgmap, int nid);
void memunmap_pages(struct dev_pagemap *pgmap);
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
unsigned long memremap_compat_align(void);
+
+static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
+{
+ zone_device_page_init(&folio->page, order);
+ if (order)
+ folio_set_large_rmappable(folio);
+}
+
#else
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 83e3d8208a54..24d82121cde8 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -627,7 +627,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
goto error;
}
- zone_device_page_init(dpage);
+ zone_device_page_init(dpage, 0);
dpage->zone_device_data = rpage;
return dpage;
diff --git a/mm/memremap.c b/mm/memremap.c
index 46cb1b0b6f72..e45dfb568710 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
struct dev_pagemap *pgmap = folio->pgmap;
+ unsigned long nr = folio_nr_pages(folio);
+ int i;
if (WARN_ON_ONCE(!pgmap))
return;
mem_cgroup_uncharge(folio);
- /*
- * Note: we don't expect anonymous compound pages yet. Once supported
- * and we could PTE-map them similar to THP, we'd have to clear
- * PG_anon_exclusive on all tail pages.
- */
if (folio_test_anon(folio)) {
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
- __ClearPageAnonExclusive(folio_page(folio, 0));
+ for (i = 0; i < nr; i++)
+ __ClearPageAnonExclusive(folio_page(folio, i));
+ } else {
+ VM_WARN_ON_ONCE(folio_test_large(folio));
}
/*
@@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
case MEMORY_DEVICE_COHERENT:
if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
break;
- pgmap->ops->page_free(folio_page(folio, 0));
- put_dev_pagemap(pgmap);
+ pgmap->ops->page_free(&folio->page);
+ percpu_ref_put_many(&folio->pgmap->ref, nr);
break;
case MEMORY_DEVICE_GENERIC:
@@ -480,14 +479,19 @@ void free_zone_device_folio(struct folio *folio)
}
}
-void zone_device_page_init(struct page *page)
+void zone_device_page_init(struct page *page, unsigned int order)
{
+ VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+
/*
* Drivers shouldn't be allocating pages after calling
* memunmap_pages().
*/
- WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
+ WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
set_page_count(page, 1);
lock_page(page);
+
+ if (order)
+ prep_compound_page(page, order);
}
EXPORT_SYMBOL_GPL(zone_device_page_init);
diff --git a/mm/rmap.c b/mm/rmap.c
index ac4f783d6ec2..9bab13429975 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1757,9 +1757,13 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
* the folio is unmapped and at least one page is still mapped.
*
* Check partially_mapped first to ensure it is a large folio.
+ *
+ * Device private folios do not support deferred splitting and
+ * shrinker based scanning of the folios to free.
*/
if (partially_mapped && folio_test_anon(folio) &&
- !folio_test_partially_mapped(folio))
+ !folio_test_partially_mapped(folio) &&
+ !folio_is_device_private(folio))
deferred_split_folio(folio, true);
__folio_mod_stat(folio, -nr, -nr_pmdmapped);
--
2.51.0
* [v7 02/16] mm/zone_device: Rename page_free callback to folio_free
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
2025-10-01 6:56 ` [v7 01/16] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-10-01 6:56 ` Balbir Singh
2025-10-01 6:56 ` [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations Balbir Singh
` (14 subsequent siblings)
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast, Madhavan Srinivasan, Christophe Leroy,
Felix Kuehling, Alex Deucher, Christian König
Change the page_free callback to folio_free to make folio support for
zone device-private memory more consistent. The PCI P2PDMA callback
has also been updated and changed to folio_free() as a result.
For drivers that do not support folios (yet), the folio is
converted back into a page via &folio->page and the page is used
as-is in the current callback implementation.
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
Documentation/mm/memory-model.rst | 2 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 5 +++--
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 +++--
drivers/gpu/drm/drm_pagemap.c | 10 +++++-----
drivers/gpu/drm/nouveau/nouveau_dmem.c | 5 +++--
drivers/pci/p2pdma.c | 5 +++--
include/linux/memremap.h | 6 +++---
lib/test_hmm.c | 5 +++--
mm/memremap.c | 16 ++++++++--------
9 files changed, 32 insertions(+), 27 deletions(-)
diff --git a/Documentation/mm/memory-model.rst b/Documentation/mm/memory-model.rst
index 5f3eafbbc520..7957122039e8 100644
--- a/Documentation/mm/memory-model.rst
+++ b/Documentation/mm/memory-model.rst
@@ -165,7 +165,7 @@ The users of `ZONE_DEVICE` are:
* pmem: Map platform persistent memory to be used as a direct-I/O target
via DAX mappings.
-* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
+* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->folio_free()`
event callbacks to allow a device-driver to coordinate memory management
events related to device-memory, typically GPU memory. See
Documentation/mm/hmm.rst.
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 91f763410673..e5000bef90f2 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -1014,8 +1014,9 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
* to a normal PFN during H_SVM_PAGE_OUT.
* Gets called with kvm->arch.uvmem_lock held.
*/
-static void kvmppc_uvmem_page_free(struct page *page)
+static void kvmppc_uvmem_folio_free(struct folio *folio)
{
+ struct page *page = &folio->page;
unsigned long pfn = page_to_pfn(page) -
(kvmppc_uvmem_pgmap.range.start >> PAGE_SHIFT);
struct kvmppc_uvmem_page_pvt *pvt;
@@ -1034,7 +1035,7 @@ static void kvmppc_uvmem_page_free(struct page *page)
}
static const struct dev_pagemap_ops kvmppc_uvmem_ops = {
- .page_free = kvmppc_uvmem_page_free,
+ .folio_free = kvmppc_uvmem_folio_free,
.migrate_to_ram = kvmppc_uvmem_migrate_to_ram,
};
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index d0e2cae33035..e5203764287b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -567,8 +567,9 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
return r < 0 ? r : 0;
}
-static void svm_migrate_page_free(struct page *page)
+static void svm_migrate_folio_free(struct folio *folio)
{
+ struct page *page = &folio->page;
struct svm_range_bo *svm_bo = page->zone_device_data;
if (svm_bo) {
@@ -1008,7 +1009,7 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
}
static const struct dev_pagemap_ops svm_migrate_pgmap_ops = {
- .page_free = svm_migrate_page_free,
+ .folio_free = svm_migrate_folio_free,
.migrate_to_ram = svm_migrate_to_ram,
};
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 31c53f724e25..1bd949df2fe8 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -708,15 +708,15 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
}
/**
- * drm_pagemap_page_free() - Put GPU SVM zone device data associated with a page
- * @page: Pointer to the page
+ * drm_pagemap_folio_free() - Put GPU SVM zone device data associated with a folio
+ * @folio: Pointer to the folio
*
* This function is a callback used to put the GPU SVM zone device data
* associated with a page when it is being released.
*/
-static void drm_pagemap_page_free(struct page *page)
+static void drm_pagemap_folio_free(struct folio *folio)
{
- drm_pagemap_zdd_put(page->zone_device_data);
+ drm_pagemap_zdd_put(folio->page.zone_device_data);
}
/**
@@ -744,7 +744,7 @@ static vm_fault_t drm_pagemap_migrate_to_ram(struct vm_fault *vmf)
}
static const struct dev_pagemap_ops drm_pagemap_pagemap_ops = {
- .page_free = drm_pagemap_page_free,
+ .folio_free = drm_pagemap_folio_free,
.migrate_to_ram = drm_pagemap_migrate_to_ram,
};
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 53cc1926b9da..d34288ebe7d2 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -108,8 +108,9 @@ unsigned long nouveau_dmem_page_addr(struct page *page)
return chunk->bo->offset + off;
}
-static void nouveau_dmem_page_free(struct page *page)
+static void nouveau_dmem_folio_free(struct folio *folio)
{
+ struct page *page = &folio->page;
struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
struct nouveau_dmem *dmem = chunk->drm->dmem;
@@ -220,7 +221,7 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
}
static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
- .page_free = nouveau_dmem_page_free,
+ .folio_free = nouveau_dmem_folio_free,
.migrate_to_ram = nouveau_dmem_migrate_to_ram,
};
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index da5657a02007..8515b3bfdfdf 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -200,8 +200,9 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
};
-static void p2pdma_page_free(struct page *page)
+static void p2pdma_folio_free(struct folio *folio)
{
+ struct page *page = &folio->page;
struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
@@ -214,7 +215,7 @@ static void p2pdma_page_free(struct page *page)
}
static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
- .page_free = p2pdma_page_free,
+ .folio_free = p2pdma_folio_free,
};
static void pci_p2pdma_release(void *data)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index d2487a19cba2..cd28d1666801 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -77,11 +77,11 @@ enum memory_type {
struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 0. The reference count will be
+ * Called once the folio refcount reaches 0. The reference count will be
* reset to one by the core code after the method is called to prepare
- * for handing out the page again.
+ * for handing out the folio again.
*/
- void (*page_free)(struct page *page);
+ void (*folio_free)(struct folio *folio);
/*
* Used for private (un-addressable) device memory only. Must migrate
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 24d82121cde8..9dbf265d1036 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1374,8 +1374,9 @@ static const struct file_operations dmirror_fops = {
.owner = THIS_MODULE,
};
-static void dmirror_devmem_free(struct page *page)
+static void dmirror_devmem_free(struct folio *folio)
{
+ struct page *page = &folio->page;
struct page *rpage = BACKING_PAGE(page);
struct dmirror_device *mdevice;
@@ -1438,7 +1439,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
}
static const struct dev_pagemap_ops dmirror_devmem_ops = {
- .page_free = dmirror_devmem_free,
+ .folio_free = dmirror_devmem_free,
.migrate_to_ram = dmirror_devmem_fault,
};
diff --git a/mm/memremap.c b/mm/memremap.c
index e45dfb568710..4c2e0d68eb27 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -289,8 +289,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
WARN(1, "Missing migrate_to_ram method\n");
return ERR_PTR(-EINVAL);
}
- if (!pgmap->ops->page_free) {
- WARN(1, "Missing page_free method\n");
+ if (!pgmap->ops->folio_free) {
+ WARN(1, "Missing folio_free method\n");
return ERR_PTR(-EINVAL);
}
if (!pgmap->owner) {
@@ -299,8 +299,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_COHERENT:
- if (!pgmap->ops->page_free) {
- WARN(1, "Missing page_free method\n");
+ if (!pgmap->ops->folio_free) {
+ WARN(1, "Missing folio_free method\n");
return ERR_PTR(-EINVAL);
}
if (!pgmap->owner) {
@@ -453,9 +453,9 @@ void free_zone_device_folio(struct folio *folio)
switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
- if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
+ if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
break;
- pgmap->ops->page_free(&folio->page);
+ pgmap->ops->folio_free(folio);
percpu_ref_put_many(&folio->pgmap->ref, nr);
break;
@@ -472,9 +472,9 @@ void free_zone_device_folio(struct folio *folio)
break;
case MEMORY_DEVICE_PCI_P2PDMA:
- if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
+ if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
break;
- pgmap->ops->page_free(folio_page(folio, 0));
+ pgmap->ops->folio_free(folio);
break;
}
}
--
2.51.0
* [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
2025-10-01 6:56 ` [v7 01/16] mm/zone_device: support large zone device private folios Balbir Singh
2025-10-01 6:56 ` [v7 02/16] mm/zone_device: Rename page_free callback to folio_free Balbir Singh
@ 2025-10-01 6:56 ` Balbir Singh
2025-10-12 15:46 ` Lance Yang
2025-10-17 14:49 ` linux-next: KVM/s390x regression (was: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations) Christian Borntraeger
2025-10-01 6:56 ` [v7 04/16] mm/rmap: extend rmap and migration support device-private entries Balbir Singh
` (13 subsequent siblings)
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Extend core huge page management functions to handle device-private THP
entries. This enables proper handling of large device-private folios in
fundamental MM operations.
The following functions have been updated:
- copy_huge_pmd(): Handle device-private entries during fork/clone
- zap_huge_pmd(): Properly free device-private THP during munmap
- change_huge_pmd(): Support protection changes on device-private THP
- __pte_offset_map(): Add device-private entry awareness
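As a usage illustration of the new helpers (the walk_*() callees below are
hypothetical), a PMD walker can now distinguish the huge PMD cases handled
in this patch like so:

static void drv_classify_huge_pmd(pmd_t pmd)
{
	if (pmd_trans_huge(pmd)) {
		/* present, normally mapped THP */
		walk_present_thp(pmd);
	} else if (is_pmd_device_private_entry(pmd)) {
		/* device-private THP, stored as a non-present swap entry */
		walk_device_private_thp(pmd_to_swp_entry(pmd));
	} else if (is_pmd_migration_entry(pmd)) {
		/* THP currently under migration */
		walk_migration_thp(pmd_to_swp_entry(pmd));
	}
	/*
	 * is_pmd_non_present_folio_entry() covers the last two cases
	 * together, as used in zap_huge_pmd() and change_huge_pmd().
	 */
}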
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/swapops.h | 32 +++++++++++++++++++++++
mm/huge_memory.c | 56 ++++++++++++++++++++++++++++++++++-------
mm/pgtable-generic.c | 2 +-
3 files changed, 80 insertions(+), 10 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 64ea151a7ae3..2687928a8146 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -594,10 +594,42 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
}
#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
+
+/**
+ * is_pmd_device_private_entry() - Check if PMD contains a device private swap entry
+ * @pmd: The PMD to check
+ *
+ * Returns true if the PMD contains a swap entry that represents a device private
+ * page mapping. This is used for zone device private pages that have been
+ * swapped out but still need special handling during various memory management
+ * operations.
+ *
+ * Return: 1 if PMD contains device private entry, 0 otherwise
+ */
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+ return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
+}
+
+#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+ return 0;
+}
+
+#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;
}
+static inline int is_pmd_non_present_folio_entry(pmd_t pmd)
+{
+ return is_pmd_migration_entry(pmd) || is_pmd_device_private_entry(pmd);
+}
+
#endif /* CONFIG_MMU */
#endif /* _LINUX_SWAPOPS_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..8e0a1747762d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1703,17 +1703,45 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(is_swap_pmd(pmd))) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
- VM_BUG_ON(!is_pmd_migration_entry(pmd));
- if (!is_readable_migration_entry(entry)) {
- entry = make_readable_migration_entry(
- swp_offset(entry));
+ VM_WARN_ON(!is_pmd_non_present_folio_entry(pmd));
+
+ if (is_writable_migration_entry(entry) ||
+ is_readable_exclusive_migration_entry(entry)) {
+ entry = make_readable_migration_entry(swp_offset(entry));
pmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
if (pmd_swp_uffd_wp(*src_pmd))
pmd = pmd_swp_mkuffd_wp(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
+ } else if (is_device_private_entry(entry)) {
+ /*
+ * For device private entries, since there are no
+ * read exclusive entries, writable = !readable
+ */
+ if (is_writable_device_private_entry(entry)) {
+ entry = make_readable_device_private_entry(swp_offset(entry));
+ pmd = swp_entry_to_pmd(entry);
+
+ if (pmd_swp_soft_dirty(*src_pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+ if (pmd_swp_uffd_wp(*src_pmd))
+ pmd = pmd_swp_mkuffd_wp(pmd);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
+
+ src_folio = pfn_swap_entry_folio(entry);
+ VM_WARN_ON(!folio_test_large(src_folio));
+
+ folio_get(src_folio);
+ /*
+ * folio_try_dup_anon_rmap_pmd does not fail for
+ * device private entries.
+ */
+ folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
+ dst_vma, src_vma);
}
+
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
@@ -2211,15 +2239,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
folio_remove_rmap_pmd(folio, page, vma);
WARN_ON_ONCE(folio_mapcount(folio) < 0);
VM_BUG_ON_PAGE(!PageHead(page), page);
- } else if (thp_migration_supported()) {
+ } else if (is_pmd_non_present_folio_entry(orig_pmd)) {
swp_entry_t entry;
- VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
entry = pmd_to_swp_entry(orig_pmd);
folio = pfn_swap_entry_folio(entry);
flush_needed = 0;
- } else
- WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+ if (!thp_migration_supported())
+ WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+ }
if (folio_test_anon(folio)) {
zap_deposited_table(tlb->mm, pmd);
@@ -2239,6 +2268,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
folio_mark_accessed(folio);
}
+ if (folio_is_device_private(folio)) {
+ folio_remove_rmap_pmd(folio, &folio->page, vma);
+ WARN_ON_ONCE(folio_mapcount(folio) < 0);
+ folio_put(folio);
+ }
+
spin_unlock(ptl);
if (flush_needed)
tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
@@ -2367,7 +2402,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct folio *folio = pfn_swap_entry_folio(entry);
pmd_t newpmd;
- VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+ VM_WARN_ON(!is_pmd_non_present_folio_entry(*pmd));
if (is_writable_migration_entry(entry)) {
/*
* A protection check is difficult so
@@ -2380,6 +2415,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
newpmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
+ } else if (is_writable_device_private_entry(entry)) {
+ entry = make_readable_device_private_entry(swp_offset(entry));
+ newpmd = swp_entry_to_pmd(entry);
} else {
newpmd = *pmd;
}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..0c847cdf4fd3 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
if (pmdvalp)
*pmdvalp = pmdval;
- if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
+ if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
goto nomap;
if (unlikely(pmd_trans_huge(pmdval)))
goto nomap;
--
2.51.0
* [v7 04/16] mm/rmap: extend rmap and migration support device-private entries
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (2 preceding siblings ...)
2025-10-01 6:56 ` [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations Balbir Singh
@ 2025-10-01 6:56 ` Balbir Singh
2025-10-22 11:54 ` Lance Yang
2025-10-01 6:56 ` [v7 05/16] mm/huge_memory: implement device-private THP splitting Balbir Singh
` (12 subsequent siblings)
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast, SeongJae Park
Add device-private THP support to reverse mapping infrastructure, enabling
proper handling during migration and walk operations.
The key changes are:
- add_migration_pmd()/remove_migration_pmd(): Handle device-private
entries during folio migration and splitting
- page_vma_mapped_walk(): Recognize device-private THP entries during
VMA traversal operations
This change supports folio splitting and migration operations on
device-private entries.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
mm/damon/ops-common.c | 20 +++++++++++++++++---
mm/huge_memory.c | 16 +++++++++++++++-
mm/page_idle.c | 7 +++++--
mm/page_vma_mapped.c | 7 +++++++
mm/rmap.c | 24 ++++++++++++++++++++----
5 files changed, 64 insertions(+), 10 deletions(-)
diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index 998c5180a603..ac54bf5b2623 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -75,12 +75,24 @@ void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr
void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- struct folio *folio = damon_get_folio(pmd_pfn(pmdp_get(pmd)));
+ pmd_t pmdval = pmdp_get(pmd);
+ struct folio *folio;
+ bool young = false;
+ unsigned long pfn;
+
+ if (likely(pmd_present(pmdval)))
+ pfn = pmd_pfn(pmdval);
+ else
+ pfn = swp_offset_pfn(pmd_to_swp_entry(pmdval));
+ folio = damon_get_folio(pfn);
if (!folio)
return;
- if (pmdp_clear_young_notify(vma, addr, pmd))
+ if (likely(pmd_present(pmdval)))
+ young |= pmdp_clear_young_notify(vma, addr, pmd);
+ young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + HPAGE_PMD_SIZE);
+ if (young)
folio_set_young(folio);
folio_set_idle(folio);
@@ -203,7 +215,9 @@ static bool damon_folio_young_one(struct folio *folio,
mmu_notifier_test_young(vma->vm_mm, addr);
} else {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- *accessed = pmd_young(pmdp_get(pvmw.pmd)) ||
+ pmd_t pmd = pmdp_get(pvmw.pmd);
+
+ *accessed = (pmd_present(pmd) && pmd_young(pmd)) ||
!folio_test_idle(folio) ||
mmu_notifier_test_young(vma->vm_mm, addr);
#else
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e0a1747762d..483b8341ce22 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4628,7 +4628,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
return 0;
flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
- pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+ if (unlikely(!pmd_present(*pvmw->pmd)))
+ pmdval = pmdp_huge_get_and_clear(vma->vm_mm, address, pvmw->pmd);
+ else
+ pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
@@ -4678,6 +4681,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
entry = pmd_to_swp_entry(*pvmw->pmd);
folio_get(folio);
pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
+
+ if (folio_is_device_private(folio)) {
+ if (pmd_write(pmde))
+ entry = make_writable_device_private_entry(
+ page_to_pfn(new));
+ else
+ entry = make_readable_device_private_entry(
+ page_to_pfn(new));
+ pmde = swp_entry_to_pmd(entry);
+ }
+
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
if (is_writable_migration_entry(entry))
diff --git a/mm/page_idle.c b/mm/page_idle.c
index a82b340dc204..d4299de81031 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -71,8 +71,11 @@ static bool page_idle_clear_pte_refs_one(struct folio *folio,
referenced |= ptep_test_and_clear_young(vma, addr, pvmw.pte);
referenced |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
- if (pmdp_clear_young_notify(vma, addr, pvmw.pmd))
- referenced = true;
+ pmd_t pmdval = pmdp_get(pvmw.pmd);
+
+ if (likely(pmd_present(pmdval)))
+ referenced |= pmdp_clear_young_notify(vma, addr, pvmw.pmd);
+ referenced |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE);
} else {
/* unexpected pmd-mapped page? */
WARN_ON_ONCE(1);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index c498a91b6706..137ce27ff68c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -277,6 +277,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
+ swp_entry_t entry = pmd_to_swp_entry(pmde);
+
+ if (is_device_private_entry(entry)) {
+ pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+ return true;
+ }
+
if ((pvmw->flags & PVMW_SYNC) &&
thp_vma_suitable_order(vma, pvmw->address,
PMD_ORDER) &&
diff --git a/mm/rmap.c b/mm/rmap.c
index 9bab13429975..c3fc30cf3636 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1046,9 +1046,16 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
} else {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
pmd_t *pmd = pvmw->pmd;
- pmd_t entry;
+ pmd_t entry = pmdp_get(pmd);
- if (!pmd_dirty(*pmd) && !pmd_write(*pmd))
+ /*
+ * Please see the comment above (!pte_present).
+ * A non present PMD is not writable from a CPU
+ * perspective.
+ */
+ if (!pmd_present(entry))
+ continue;
+ if (!pmd_dirty(entry) && !pmd_write(entry))
continue;
flush_cache_range(vma, address,
@@ -2343,6 +2350,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
while (page_vma_mapped_walk(&pvmw)) {
/* PMD-mapped THP migration entry */
if (!pvmw.pte) {
+ __maybe_unused unsigned long pfn;
+ __maybe_unused pmd_t pmdval;
+
if (flags & TTU_SPLIT_HUGE_PMD) {
split_huge_pmd_locked(vma, pvmw.address,
pvmw.pmd, true);
@@ -2351,8 +2361,14 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
- subpage = folio_page(folio,
- pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
+ pmdval = pmdp_get(pvmw.pmd);
+ if (likely(pmd_present(pmdval)))
+ pfn = pmd_pfn(pmdval);
+ else
+ pfn = swp_offset_pfn(pmd_to_swp_entry(pmdval));
+
+ subpage = folio_page(folio, pfn - folio_pfn(folio));
+
VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
!folio_test_pmd_mappable(folio), folio);
--
2.51.0
* [v7 05/16] mm/huge_memory: implement device-private THP splitting
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (3 preceding siblings ...)
2025-10-01 6:56 ` [v7 04/16] mm/rmap: extend rmap and migration support device-private entries Balbir Singh
@ 2025-10-01 6:56 ` Balbir Singh
2025-10-01 6:56 ` [v7 06/16] mm/migrate_device: handle partially mapped folios during collection Balbir Singh
` (11 subsequent siblings)
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Add support for splitting device-private THP folios, enabling fallback
to smaller page sizes when large page allocation or migration fails.
Key changes:
- split_huge_pmd(): Handle device-private PMD entries during splitting
- Preserve RMAP_EXCLUSIVE semantics for anonymous exclusive folios
- Skip RMP_USE_SHARED_ZEROPAGE for device-private entries as they
don't support shared zero page semantics
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
mm/huge_memory.c | 87 +++++++++++++++++++++++++++++++++++++++++-------
mm/migrate.c | 1 +
2 files changed, 76 insertions(+), 12 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 483b8341ce22..05c68f5b5fe3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2872,16 +2872,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
- bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
+ bool soft_dirty, uffd_wp = false, young = false, write = false;
bool anon_exclusive = false, dirty = false;
unsigned long addr;
pte_t *pte;
int i;
+ swp_entry_t entry;
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
+
+ VM_WARN_ON(!is_pmd_non_present_folio_entry(*pmd) && !pmd_trans_huge(*pmd));
count_vm_event(THP_SPLIT_PMD);
@@ -2929,20 +2931,51 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
return __split_huge_zero_page_pmd(vma, haddr, pmd);
}
- pmd_migration = is_pmd_migration_entry(*pmd);
- if (unlikely(pmd_migration)) {
- swp_entry_t entry;
+ if (is_pmd_migration_entry(*pmd)) {
old_pmd = *pmd;
entry = pmd_to_swp_entry(old_pmd);
page = pfn_swap_entry_to_page(entry);
+ folio = page_folio(page);
+
+ soft_dirty = pmd_swp_soft_dirty(old_pmd);
+ uffd_wp = pmd_swp_uffd_wp(old_pmd);
+
write = is_writable_migration_entry(entry);
if (PageAnon(page))
anon_exclusive = is_readable_exclusive_migration_entry(entry);
young = is_migration_entry_young(entry);
dirty = is_migration_entry_dirty(entry);
+ } else if (is_pmd_device_private_entry(*pmd)) {
+ old_pmd = *pmd;
+ entry = pmd_to_swp_entry(old_pmd);
+ page = pfn_swap_entry_to_page(entry);
+ folio = page_folio(page);
+
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
+
+ write = is_writable_device_private_entry(entry);
+ anon_exclusive = PageAnonExclusive(page);
+
+ /*
+ * Device private THP should be treated the same as regular
+ * folios w.r.t anon exclusive handling. See the comments for
+ * folio handling and anon_exclusive below.
+ */
+ if (freeze && anon_exclusive &&
+ folio_try_share_anon_rmap_pmd(folio, page))
+ freeze = false;
+ if (!freeze) {
+ rmap_t rmap_flags = RMAP_NONE;
+
+ folio_ref_add(folio, HPAGE_PMD_NR - 1);
+ if (anon_exclusive)
+ rmap_flags |= RMAP_EXCLUSIVE;
+
+ folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+ vma, haddr, rmap_flags);
+ }
} else {
/*
* Up to this point the pmd is present and huge and userland has
@@ -3026,11 +3059,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* Note that NUMA hinting access restrictions are not transferred to
* avoid any possibility of altering permissions across VMAs.
*/
- if (freeze || pmd_migration) {
- for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
- pte_t entry;
- swp_entry_t swp_entry;
+ if (freeze || is_pmd_migration_entry(old_pmd)) {
+ pte_t entry;
+ swp_entry_t swp_entry;
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
if (write)
swp_entry = make_writable_migration_entry(
page_to_pfn(page + i));
@@ -3049,7 +3082,33 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_swp_mksoft_dirty(entry);
if (uffd_wp)
entry = pte_swp_mkuffd_wp(entry);
+ VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+ set_pte_at(mm, addr, pte + i, entry);
+ }
+ } else if (is_pmd_device_private_entry(old_pmd)) {
+ pte_t entry;
+ swp_entry_t swp_entry;
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+ /*
+ * anon_exclusive was already propagated to the relevant
+ * pages corresponding to the pte entries when freeze
+ * is false.
+ */
+ if (write)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page + i));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page + i));
+ /*
+ * Young and dirty bits are not propagated via swp_entry
+ */
+ entry = swp_entry_to_pte(swp_entry);
+ if (soft_dirty)
+ entry = pte_swp_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_swp_mkuffd_wp(entry);
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
@@ -3076,7 +3135,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
pte_unmap(pte);
- if (!pmd_migration)
+ if (!is_pmd_migration_entry(*pmd))
folio_remove_rmap_pmd(folio, page, vma);
if (freeze)
put_page(page);
@@ -3089,7 +3148,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, bool freeze)
{
VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
- if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
+ if (pmd_trans_huge(*pmd) || is_pmd_non_present_folio_entry(*pmd))
__split_huge_pmd_locked(vma, pmd, address, freeze);
}
@@ -3268,6 +3327,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
lockdep_assert_held(&lruvec->lru_lock);
+ if (folio_is_device_private(folio))
+ return;
+
if (list) {
/* page reclaim is reclaiming a huge page */
VM_WARN_ON(folio_test_lru(folio));
@@ -3885,8 +3947,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
if (nr_shmem_dropped)
shmem_uncharge(mapping->host, nr_shmem_dropped);
- if (!ret && is_anon)
+ if (!ret && is_anon && !folio_is_device_private(folio))
remap_flags = RMP_USE_SHARED_ZEROPAGE;
+
remap_page(folio, 1 << order, remap_flags);
/*
diff --git a/mm/migrate.c b/mm/migrate.c
index ce83c2c3c287..11fbfe905e3c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -307,6 +307,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
+ VM_WARN_ON_ONCE_FOLIO(folio_is_device_private(folio), folio);
if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) ||
mm_forbids_zeropage(pvmw->vma->vm_mm))
--
2.51.0
* [v7 06/16] mm/migrate_device: handle partially mapped folios during collection
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (4 preceding siblings ...)
2025-10-01 6:56 ` [v7 05/16] mm/huge_memory: implement device-private THP splitting Balbir Singh
@ 2025-10-01 6:56 ` Balbir Singh
2025-10-01 6:56 ` [v7 07/16] mm/migrate_device: implement THP migration of zone device pages Balbir Singh
` (10 subsequent siblings)
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Extend migrate_vma_collect_pmd() to handle partially mapped large folios
that require splitting before migration can proceed.
During the PTE walk in the collection phase, if a large folio is only
partially mapped within the migration range, it must be split so that the
pages in the range can be migrated correctly.
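For illustration only (this sketch is not part of the patch, and
demo_migrate_partial_thp() is a hypothetical caller), the split is
triggered when a driver migrates a range that covers just part of a
PMD-mapped THP; migrate_vma_setup() then finds the folio partially mapped
and splits it during collection:

#include <linux/huge_mm.h>
#include <linux/migrate.h>

/* Hypothetical caller: migrate only the second half of a PMD-sized THP. */
static int demo_migrate_partial_thp(struct vm_area_struct *vma,
				    unsigned long thp_start, /* HPAGE_PMD_SIZE aligned */
				    unsigned long *src, unsigned long *dst)
{
	struct migrate_vma args = {
		.vma	= vma,
		.start	= thp_start + HPAGE_PMD_SIZE / 2,
		.end	= thp_start + HPAGE_PMD_SIZE,
		.src	= src,	/* sized for the second half of the THP */
		.dst	= dst,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	ret = migrate_vma_setup(&args);	/* collection splits the partially mapped THP */
	if (ret)
		return ret;
	/* ... allocate destination pages and fill args.dst here ... */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}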
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
mm/migrate_device.c | 63 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index abd9f6850db6..1c70d937ba44 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -54,6 +54,53 @@ static int migrate_vma_collect_hole(unsigned long start,
return 0;
}
+/**
+ * migrate_vma_split_folio() - Helper function to split a THP folio
+ * @folio: the folio to split
+ * @fault_page: struct page associated with the fault if any
+ *
+ * Returns 0 on success
+ */
+static int migrate_vma_split_folio(struct folio *folio,
+ struct page *fault_page)
+{
+ int ret;
+ struct folio *fault_folio = fault_page ? page_folio(fault_page) : NULL;
+ struct folio *new_fault_folio = NULL;
+
+ if (folio != fault_folio) {
+ folio_get(folio);
+ folio_lock(folio);
+ }
+
+ ret = split_folio(folio);
+ if (ret) {
+ if (folio != fault_folio) {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+ return ret;
+ }
+
+ new_fault_folio = fault_page ? page_folio(fault_page) : NULL;
+
+ /*
+ * Ensure the lock is held on the correct
+ * folio after the split
+ */
+ if (!new_fault_folio) {
+ folio_unlock(folio);
+ folio_put(folio);
+ } else if (folio != new_fault_folio) {
+ folio_get(new_fault_folio);
+ folio_lock(new_fault_folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
+ return 0;
+}
+
static int migrate_vma_collect_pmd(pmd_t *pmdp,
unsigned long start,
unsigned long end,
@@ -171,6 +218,22 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pgmap->owner != migrate->pgmap_owner)
goto next;
}
+ folio = page ? page_folio(page) : NULL;
+ if (folio && folio_test_large(folio)) {
+ int ret;
+
+ pte_unmap_unlock(ptep, ptl);
+ ret = migrate_vma_split_folio(folio,
+ migrate->fault_page);
+
+ if (ret) {
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ goto next;
+ }
+
+ addr = start;
+ goto again;
+ }
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
--
2.51.0
* [v7 07/16] mm/migrate_device: implement THP migration of zone device pages
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (5 preceding siblings ...)
2025-10-01 6:56 ` [v7 06/16] mm/migrate_device: handle partially mapped folios during collection Balbir Singh
@ 2025-10-01 6:56 ` Balbir Singh
2025-10-01 6:56 ` [v7 08/16] mm/memory/fault: add THP fault handling for zone device private pages Balbir Singh
` (9 subsequent siblings)
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
MIGRATE_VMA_SELECT_COMPOUND is used to select THP pages during
migrate_vma_setup(), and MIGRATE_PFN_COMPOUND marks device pages that are
migrated as compound pages during device pfn migration.
migrate_device code paths go through the collect, setup and finalize
phases of migration.
The entries in the src and dst arrays passed to these functions remain at
a PAGE_SIZE granularity. When a compound page is passed, the first entry
carries the PFN along with MIGRATE_PFN_COMPOUND and the other flags set
(MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), while the remaining
(HPAGE_PMD_NR - 1) entries are filled with 0s. This representation allows
the compound page to be split into smaller page sizes.
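For illustration only (not part of this patch), a destination driver might
walk the src array under this layout roughly as follows;
dev_alloc_huge_page() and dev_alloc_page() are placeholders for the
driver's own allocator:

static void demo_fill_dst(struct migrate_vma *args)
{
	unsigned long i;

	for (i = 0; i < args->npages; ) {
		unsigned long nr = 1;
		struct page *dpage;

		if (!(args->src[i] & MIGRATE_PFN_MIGRATE))
			goto next;

		if (args->src[i] & MIGRATE_PFN_COMPOUND) {
			nr = HPAGE_PMD_NR;
			dpage = dev_alloc_huge_page();	/* placeholder */
			if (dpage)
				args->dst[i] = migrate_pfn(page_to_pfn(dpage)) |
					       MIGRATE_PFN_COMPOUND;
			/*
			 * On failure dst[i] stays 0 and the core code skips
			 * the folio (splitting it instead is added later in
			 * this series).
			 */
			goto next;
		}

		dpage = dev_alloc_page();		/* placeholder */
		if (dpage)
			args->dst[i] = migrate_pfn(page_to_pfn(dpage));
next:
		i += nr;
	}
}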
migrate_vma_collect_hole() and migrate_vma_collect_pmd() are now THP
aware. Two new helper functions, migrate_vma_collect_huge_pmd() and
migrate_vma_insert_huge_pmd_page(), have been added.
migrate_vma_collect_huge_pmd() can collect THP pages, but if collection
fails for some reason, it falls back to splitting the folio and migrating
the base pages.
migrate_vma_insert_huge_pmd_page() closely follows the logic of
migrate_vma_insert_page().
Support for splitting pages as needed for migration will follow in later
patches in this series.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/migrate.h | 2 +
mm/migrate_device.c | 469 ++++++++++++++++++++++++++++++++++------
2 files changed, 408 insertions(+), 63 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 1f0ac122c3bf..41b4cc05a450 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -125,6 +125,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
#define MIGRATE_PFN_VALID (1UL << 0)
#define MIGRATE_PFN_MIGRATE (1UL << 1)
#define MIGRATE_PFN_WRITE (1UL << 3)
+#define MIGRATE_PFN_COMPOUND (1UL << 4)
#define MIGRATE_PFN_SHIFT 6
static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -143,6 +144,7 @@ enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
+ MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
};
struct migrate_vma {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 1c70d937ba44..4156fd6190d2 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -14,6 +14,7 @@
#include <linux/pagewalk.h>
#include <linux/rmap.h>
#include <linux/swapops.h>
+#include <linux/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
if (!vma_is_anonymous(walk->vma))
return migrate_vma_collect_skip(start, end, walk);
+ if (thp_migration_supported() &&
+ (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+ (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+ IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+ migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
+ MIGRATE_PFN_COMPOUND;
+ migrate->dst[migrate->npages] = 0;
+ migrate->npages++;
+ migrate->cpages++;
+
+ /*
+ * Collect the remaining entries as holes, in case we
+ * need to split later
+ */
+ return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+ }
+
for (addr = start; addr < end; addr += PAGE_SIZE) {
migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
migrate->dst[migrate->npages] = 0;
@@ -101,57 +119,151 @@ static int migrate_vma_split_folio(struct folio *folio,
return 0;
}
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
- unsigned long start,
- unsigned long end,
- struct mm_walk *walk)
+/** migrate_vma_collect_huge_pmd - collect THP pages without splitting the
+ * folio for device private pages.
+ * @pmdp: pointer to pmd entry
+ * @start: start address of the range for migration
+ * @end: end address of the range for migration
+ * @walk: mm_walk callback structure
+ * @fault_folio: folio associated with the fault if any
+ *
+ * Collect the huge pmd entry at @pmdp for migration and set the
+ * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
+ * migration will occur at HPAGE_PMD granularity
+ */
+static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
+ unsigned long end, struct mm_walk *walk,
+ struct folio *fault_folio)
{
+ struct mm_struct *mm = walk->mm;
+ struct folio *folio;
struct migrate_vma *migrate = walk->private;
- struct folio *fault_folio = migrate->fault_page ?
- page_folio(migrate->fault_page) : NULL;
- struct vm_area_struct *vma = walk->vma;
- struct mm_struct *mm = vma->vm_mm;
- unsigned long addr = start, unmapped = 0;
spinlock_t *ptl;
- pte_t *ptep;
+ swp_entry_t entry;
+ int ret;
+ unsigned long write = 0;
-again:
- if (pmd_none(*pmdp))
+ ptl = pmd_lock(mm, pmdp);
+ if (pmd_none(*pmdp)) {
+ spin_unlock(ptl);
return migrate_vma_collect_hole(start, end, -1, walk);
+ }
if (pmd_trans_huge(*pmdp)) {
- struct folio *folio;
-
- ptl = pmd_lock(mm, pmdp);
- if (unlikely(!pmd_trans_huge(*pmdp))) {
+ if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
spin_unlock(ptl);
- goto again;
+ return migrate_vma_collect_skip(start, end, walk);
}
folio = pmd_folio(*pmdp);
if (is_huge_zero_folio(folio)) {
spin_unlock(ptl);
- split_huge_pmd(vma, pmdp, addr);
- } else {
- int ret;
+ return migrate_vma_collect_hole(start, end, -1, walk);
+ }
+ if (pmd_write(*pmdp))
+ write = MIGRATE_PFN_WRITE;
+ } else if (!pmd_present(*pmdp)) {
+ entry = pmd_to_swp_entry(*pmdp);
+ folio = pfn_swap_entry_folio(entry);
+
+ if (!is_device_private_entry(entry) ||
+ !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
+ (folio->pgmap->owner != migrate->pgmap_owner)) {
+ spin_unlock(ptl);
+ return migrate_vma_collect_skip(start, end, walk);
+ }
- folio_get(folio);
+ if (is_migration_entry(entry)) {
+ migration_entry_wait_on_locked(entry, ptl);
spin_unlock(ptl);
- /* FIXME: we don't expect THP for fault_folio */
- if (WARN_ON_ONCE(fault_folio == folio))
- return migrate_vma_collect_skip(start, end,
- walk);
- if (unlikely(!folio_trylock(folio)))
- return migrate_vma_collect_skip(start, end,
- walk);
- ret = split_folio(folio);
- if (fault_folio != folio)
- folio_unlock(folio);
- folio_put(folio);
- if (ret)
- return migrate_vma_collect_skip(start, end,
- walk);
+ return -EAGAIN;
+ }
+
+ if (is_writable_device_private_entry(entry))
+ write = MIGRATE_PFN_WRITE;
+ } else {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+
+ folio_get(folio);
+ if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
+ spin_unlock(ptl);
+ folio_put(folio);
+ return migrate_vma_collect_skip(start, end, walk);
+ }
+
+ if (thp_migration_supported() &&
+ (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+ (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+ IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+
+ struct page_vma_mapped_walk pvmw = {
+ .ptl = ptl,
+ .address = start,
+ .pmd = pmdp,
+ .vma = walk->vma,
+ };
+
+ unsigned long pfn = page_to_pfn(folio_page(folio, 0));
+
+ migrate->src[migrate->npages] = migrate_pfn(pfn) | write
+ | MIGRATE_PFN_MIGRATE
+ | MIGRATE_PFN_COMPOUND;
+ migrate->dst[migrate->npages++] = 0;
+ migrate->cpages++;
+ ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
+ if (ret) {
+ migrate->npages--;
+ migrate->cpages--;
+ migrate->src[migrate->npages] = 0;
+ migrate->dst[migrate->npages] = 0;
+ goto fallback;
}
+ migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+ spin_unlock(ptl);
+ return 0;
+ }
+
+fallback:
+ spin_unlock(ptl);
+ if (!folio_test_large(folio))
+ goto done;
+ ret = split_folio(folio);
+ if (fault_folio != folio)
+ folio_unlock(folio);
+ folio_put(folio);
+ if (ret)
+ return migrate_vma_collect_skip(start, end, walk);
+ if (pmd_none(pmdp_get_lockless(pmdp)))
+ return migrate_vma_collect_hole(start, end, -1, walk);
+
+done:
+ return -ENOENT;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct migrate_vma *migrate = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr = start, unmapped = 0;
+ spinlock_t *ptl;
+ struct folio *fault_folio = migrate->fault_page ?
+ page_folio(migrate->fault_page) : NULL;
+ pte_t *ptep;
+
+again:
+ if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
+ int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
+
+ if (ret == -EAGAIN)
+ goto again;
+ if (ret == 0)
+ return 0;
}
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
@@ -238,8 +350,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
- /* FIXME support THP */
- if (!page || !page->mapping || PageTransCompound(page)) {
+ if (!page || !page->mapping) {
mpfn = 0;
goto next;
}
@@ -410,14 +521,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
*/
int extra = 1 + (page == fault_page);
- /*
- * FIXME support THP (transparent huge page), it is bit more complex to
- * check them than regular pages, because they can be mapped with a pmd
- * or with a pte (split pte mapping).
- */
- if (folio_test_large(folio))
- return false;
-
/* Page from ZONE_DEVICE have one extra reference */
if (folio_is_zone_device(folio))
extra++;
@@ -448,17 +551,24 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
lru_add_drain();
- for (i = 0; i < npages; i++) {
+ for (i = 0; i < npages; ) {
struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
+ unsigned int nr = 1;
if (!page) {
if (src_pfns[i] & MIGRATE_PFN_MIGRATE)
unmapped++;
- continue;
+ goto next;
}
folio = page_folio(page);
+ nr = folio_nr_pages(folio);
+
+ if (nr > 1)
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+
+
/* ZONE_DEVICE folios are not on LRU */
if (!folio_is_zone_device(folio)) {
if (!folio_test_lru(folio) && allow_drain) {
@@ -470,7 +580,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
if (!folio_isolate_lru(folio)) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
- continue;
+ goto next;
}
/* Drop the reference we took in collect */
@@ -489,10 +599,12 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
- continue;
+ goto next;
}
unmapped++;
+next:
+ i += nr;
}
for (i = 0; i < npages && restore; i++) {
@@ -638,6 +750,160 @@ int migrate_vma_setup(struct migrate_vma *args)
}
EXPORT_SYMBOL(migrate_vma_setup);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+/**
+ * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
+ * at @addr. folio is already allocated as a part of the migration process with
+ * large page.
+ *
+ * @page needs to be initialized and setup after it's allocated. The code bits
+ * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
+ * not support THP zero pages.
+ *
+ * @migrate: migrate_vma arguments
+ * @addr: address where the folio will be inserted
+ * @page: page to be inserted at @addr
+ * @src: src pfn which is being migrated
+ * @pmdp: pointer to the pmd
+ */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+ unsigned long addr,
+ struct page *page,
+ unsigned long *src,
+ pmd_t *pmdp)
+{
+ struct vm_area_struct *vma = migrate->vma;
+ gfp_t gfp = vma_thp_gfp_mask(vma);
+ struct folio *folio = page_folio(page);
+ int ret;
+ vm_fault_t csa_ret;
+ spinlock_t *ptl;
+ pgtable_t pgtable;
+ pmd_t entry;
+ bool flush = false;
+ unsigned long i;
+
+ VM_WARN_ON_FOLIO(!folio, folio);
+ VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
+
+ if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
+ return -EINVAL;
+
+ ret = anon_vma_prepare(vma);
+ if (ret)
+ return ret;
+
+ folio_set_order(folio, HPAGE_PMD_ORDER);
+ folio_set_large_rmappable(folio);
+
+ if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+ ret = -ENOMEM;
+ goto abort;
+ }
+
+ __folio_mark_uptodate(folio);
+
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (unlikely(!pgtable))
+ goto abort;
+
+ if (folio_is_device_private(folio)) {
+ swp_entry_t swp_entry;
+
+ if (vma->vm_flags & VM_WRITE)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page));
+ entry = swp_entry_to_pmd(swp_entry);
+ } else {
+ if (folio_is_zone_device(folio) &&
+ !folio_is_device_coherent(folio)) {
+ goto abort;
+ }
+ entry = folio_mk_pmd(folio, vma->vm_page_prot);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
+ }
+
+ ptl = pmd_lock(vma->vm_mm, pmdp);
+ csa_ret = check_stable_address_space(vma->vm_mm);
+ if (csa_ret)
+ goto abort;
+
+ /*
+ * Check for userfaultfd but do not deliver the fault. Instead,
+ * just back off.
+ */
+ if (userfaultfd_missing(vma))
+ goto unlock_abort;
+
+ if (!pmd_none(*pmdp)) {
+ if (!is_huge_zero_pmd(*pmdp))
+ goto unlock_abort;
+ flush = true;
+ } else if (!pmd_none(*pmdp))
+ goto unlock_abort;
+
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
+ if (!folio_is_zone_device(folio))
+ folio_add_lru_vma(folio, vma);
+ folio_get(folio);
+
+ if (flush) {
+ pte_free(vma->vm_mm, pgtable);
+ flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
+ pmdp_invalidate(vma, addr, pmdp);
+ } else {
+ pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+ mm_inc_nr_ptes(vma->vm_mm);
+ }
+ set_pmd_at(vma->vm_mm, addr, pmdp, entry);
+ update_mmu_cache_pmd(vma, addr, pmdp);
+
+ spin_unlock(ptl);
+
+ count_vm_event(THP_FAULT_ALLOC);
+ count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
+ count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+ return 0;
+
+unlock_abort:
+ spin_unlock(ptl);
+abort:
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ src[i] &= ~MIGRATE_PFN_MIGRATE;
+ return 0;
+}
+#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+ unsigned long addr,
+ struct page *page,
+ unsigned long *src,
+ pmd_t *pmdp)
+{
+ return 0;
+}
+#endif
+
+static unsigned long migrate_vma_nr_pages(unsigned long *src)
+{
+ unsigned long nr = 1;
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if (*src & MIGRATE_PFN_COMPOUND)
+ nr = HPAGE_PMD_NR;
+#else
+ if (*src & MIGRATE_PFN_COMPOUND)
+ VM_WARN_ON_ONCE(true);
+#endif
+ return nr;
+}
+
/*
* This code closely matches the code in:
* __handle_mm_fault()
@@ -648,9 +914,10 @@ EXPORT_SYMBOL(migrate_vma_setup);
*/
static void migrate_vma_insert_page(struct migrate_vma *migrate,
unsigned long addr,
- struct page *page,
+ unsigned long *dst,
unsigned long *src)
{
+ struct page *page = migrate_pfn_to_page(*dst);
struct folio *folio = page_folio(page);
struct vm_area_struct *vma = migrate->vma;
struct mm_struct *mm = vma->vm_mm;
@@ -678,8 +945,24 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
- if (pmd_trans_huge(*pmdp))
- goto abort;
+
+ if (thp_migration_supported() && (*dst & MIGRATE_PFN_COMPOUND)) {
+ int ret = migrate_vma_insert_huge_pmd_page(migrate, addr, page,
+ src, pmdp);
+ if (ret)
+ goto abort;
+ return;
+ }
+
+ if (!pmd_none(*pmdp)) {
+ if (pmd_trans_huge(*pmdp)) {
+ if (!is_huge_zero_pmd(*pmdp))
+ goto abort;
+ split_huge_pmd(vma, pmdp, addr);
+ } else if (pmd_leaf(*pmdp))
+ goto abort;
+ }
+
if (pte_alloc(mm, pmdp))
goto abort;
if (unlikely(anon_vma_prepare(vma)))
@@ -770,23 +1053,24 @@ static void __migrate_device_pages(unsigned long *src_pfns,
unsigned long i;
bool notified = false;
- for (i = 0; i < npages; i++) {
+ for (i = 0; i < npages; ) {
struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct address_space *mapping;
struct folio *newfolio, *folio;
int r, extra_cnt = 0;
+ unsigned long nr = 1;
if (!newpage) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
if (!page) {
unsigned long addr;
if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
- continue;
+ goto next;
/*
* The only time there is no vma is when called from
@@ -804,15 +1088,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
migrate->pgmap_owner);
mmu_notifier_invalidate_range_start(&range);
}
- migrate_vma_insert_page(migrate, addr, newpage,
+
+ if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
+ nr = migrate_vma_nr_pages(&src_pfns[i]);
+ src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ goto next;
+ }
+
+ migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
&src_pfns[i]);
- continue;
+ goto next;
}
newfolio = page_folio(newpage);
folio = page_folio(page);
mapping = folio_mapping(folio);
+ /*
+ * If THP migration is enabled, check if both src and dst
+ * can migrate large pages
+ */
+ if (thp_migration_supported()) {
+ if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+ (src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ !(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+
+ if (!migrate) {
+ src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+ MIGRATE_PFN_COMPOUND);
+ goto next;
+ }
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+ (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ }
+ }
+
+
if (folio_is_device_private(newfolio) ||
folio_is_device_coherent(newfolio)) {
if (mapping) {
@@ -825,7 +1141,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
if (!folio_test_anon(folio) ||
!folio_free_swap(folio)) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
}
} else if (folio_is_zone_device(newfolio)) {
@@ -833,7 +1149,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
* Other types of ZONE_DEVICE page are not supported.
*/
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
BUG_ON(folio_test_writeback(folio));
@@ -845,6 +1161,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
else
folio_migrate_flags(newfolio, folio);
+next:
+ i += nr;
}
if (notified)
@@ -1006,10 +1324,23 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
int migrate_device_range(unsigned long *src_pfns, unsigned long start,
unsigned long npages)
{
- unsigned long i, pfn;
+ unsigned long i, j, pfn;
+
+ for (pfn = start, i = 0; i < npages; pfn++, i++) {
+ struct page *page = pfn_to_page(pfn);
+ struct folio *folio = page_folio(page);
+ unsigned int nr = 1;
- for (pfn = start, i = 0; i < npages; pfn++, i++)
src_pfns[i] = migrate_device_pfn_lock(pfn);
+ nr = folio_nr_pages(folio);
+ if (nr > 1) {
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ for (j = 1; j < nr; j++)
+ src_pfns[i+j] = 0;
+ i += j - 1;
+ pfn += j - 1;
+ }
+ }
migrate_device_unmap(src_pfns, npages, NULL);
@@ -1027,10 +1358,22 @@ EXPORT_SYMBOL(migrate_device_range);
*/
int migrate_device_pfns(unsigned long *src_pfns, unsigned long npages)
{
- unsigned long i;
+ unsigned long i, j;
+
+ for (i = 0; i < npages; i++) {
+ struct page *page = pfn_to_page(src_pfns[i]);
+ struct folio *folio = page_folio(page);
+ unsigned int nr = 1;
- for (i = 0; i < npages; i++)
src_pfns[i] = migrate_device_pfn_lock(src_pfns[i]);
+ nr = folio_nr_pages(folio);
+ if (nr > 1) {
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ for (j = 1; j < nr; j++)
+ src_pfns[i+j] = 0;
+ i += j - 1;
+ }
+ }
migrate_device_unmap(src_pfns, npages, NULL);
--
2.51.0
* [v7 08/16] mm/memory/fault: add THP fault handling for zone device private pages
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (6 preceding siblings ...)
2025-10-01 6:56 ` [v7 07/16] mm/migrate_device: implement THP migration of zone device pages Balbir Singh
@ 2025-10-01 6:56 ` Balbir Singh
2025-10-01 6:57 ` [v7 09/16] lib/test_hmm: add zone device private THP test infrastructure Balbir Singh
` (8 subsequent siblings)
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:56 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Implement CPU fault handling for zone device THP entries through
do_huge_pmd_device_private(), enabling transparent migration of
device-private large pages back to system memory on CPU access.
When the CPU accesses a zone device THP entry, the fault handler calls the
device driver's migrate_to_ram() callback to migrate the entire large page
back to system memory.
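For reference, a hedged sketch of what a driver's migrate_to_ram()
callback might look like for a THP fault, loosely modeled on the
lib/test_hmm changes later in this series (demo_migrate_to_ram() is a
hypothetical driver function, and destination allocation/copying is
elided):

static vm_fault_t demo_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned int order = folio_order(page_folio(vmf->page));
	unsigned long npages = 1UL << order;
	struct migrate_vma args = {
		.vma		= vmf->vma,
		.start		= ALIGN_DOWN(vmf->address, PAGE_SIZE << order),
		.fault_page	= vmf->page,
		.pgmap_owner	= page_pgmap(vmf->page)->owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};
	vm_fault_t ret = 0;

	args.end = args.start + (PAGE_SIZE << order);
	args.src = kcalloc(npages, sizeof(*args.src), GFP_KERNEL);
	args.dst = kcalloc(npages, sizeof(*args.dst), GFP_KERNEL);
	if (!args.src || !args.dst) {
		ret = VM_FAULT_OOM;
		goto out;
	}
	if (order)
		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;

	if (migrate_vma_setup(&args)) {
		ret = VM_FAULT_SIGBUS;
		goto out;
	}
	/* ... allocate system pages, fill args.dst and copy the data ... */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
out:
	kfree(args.src);
	kfree(args.dst);
	return ret;
}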
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/huge_mm.h | 7 +++++++
mm/huge_memory.c | 38 ++++++++++++++++++++++++++++++++++++++
mm/memory.c | 5 +++--
3 files changed, 48 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f327d62fc985..2d669be7f1c8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -496,6 +496,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
@@ -671,6 +673,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
+static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+ return 0;
+}
+
static inline bool is_huge_zero_folio(const struct folio *folio)
{
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 05c68f5b5fe3..8c95a658b3ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1287,6 +1287,44 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
}
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ vm_fault_t ret = 0;
+ spinlock_t *ptl;
+ swp_entry_t swp_entry;
+ struct page *page;
+ struct folio *folio;
+
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ vma_end_read(vma);
+ return VM_FAULT_RETRY;
+ }
+
+ ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) {
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ swp_entry = pmd_to_swp_entry(vmf->orig_pmd);
+ page = pfn_swap_entry_to_page(swp_entry);
+ folio = page_folio(page);
+ vmf->page = page;
+ vmf->pte = NULL;
+ if (folio_trylock(folio)) {
+ folio_get(folio);
+ spin_unlock(ptl);
+ ret = page_pgmap(page)->ops->migrate_to_ram(vmf);
+ folio_unlock(folio);
+ folio_put(folio);
+ } else {
+ spin_unlock(ptl);
+ }
+
+ return ret;
+}
+
/*
* always: directly stall for all thp allocations
* defer: wake kswapd and fail if not immediately available
diff --git a/mm/memory.c b/mm/memory.c
index 7e32eb79ba99..59985b88a9b1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6337,8 +6337,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
- VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(vmf.orig_pmd));
+ if (is_pmd_device_private_entry(vmf.orig_pmd))
+ return do_huge_pmd_device_private(&vmf);
+
if (is_pmd_migration_entry(vmf.orig_pmd))
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
--
2.51.0
* [v7 09/16] lib/test_hmm: add zone device private THP test infrastructure
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (7 preceding siblings ...)
2025-10-01 6:56 ` [v7 08/16] mm/memory/fault: add THP fault handling for zone device private pages Balbir Singh
@ 2025-10-01 6:57 ` Balbir Singh
2025-10-01 6:57 ` [v7 10/16] mm/memremap: add driver callback support for folio splitting Balbir Singh
` (7 subsequent siblings)
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:57 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Enhance the hmm test driver (lib/test_hmm) with support for THP pages.
A new free_folios pool has been added to the dmirror device; a folio is
taken from this pool when a THP zone device private page is requested.
Add compound page awareness to the allocation function for both normal
migration and fault-based migration. These routines also copy
folio_nr_pages() worth of data when moving data between system memory and
device memory.
The args.src and args.dst arrays used to hold migration entries are now
dynamically allocated, as they need to hold HPAGE_PMD_NR entries or more.
Split and migrate support will be added in future patches in this series.
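The free_folios pool follows the same pattern as the existing free_pages
list: folios are chained through the zone_device_data field of their head
page. An abridged sketch of the push/pop operations (demo_* names are not
in the patch, and locking via mdevice->lock is omitted for brevity):

/* Push a free device THP onto the pool (head page links the list). */
static void demo_free_folio_push(struct dmirror_device *mdevice,
				 struct folio *folio)
{
	folio_page(folio, 0)->zone_device_data = mdevice->free_folios;
	mdevice->free_folios = folio;
}

/* Pop one free device THP, or NULL if the pool is empty. */
static struct folio *demo_free_folio_pop(struct dmirror_device *mdevice)
{
	struct folio *folio = mdevice->free_folios;

	if (folio)
		mdevice->free_folios = folio_page(folio, 0)->zone_device_data;
	return folio;
}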
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/memremap.h | 12 ++
lib/test_hmm.c | 368 +++++++++++++++++++++++++++++++--------
2 files changed, 304 insertions(+), 76 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index cd28d1666801..7df4dd037b69 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -177,6 +177,18 @@ static inline bool folio_is_pci_p2pdma(const struct folio *folio)
folio->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
}
+static inline void *folio_zone_device_data(const struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(!folio_is_device_private(folio), folio);
+ return folio->page.zone_device_data;
+}
+
+static inline void folio_set_zone_device_data(struct folio *folio, void *data)
+{
+ VM_WARN_ON_FOLIO(!folio_is_device_private(folio), folio);
+ folio->page.zone_device_data = data;
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 9dbf265d1036..32d402e80bcc 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -119,6 +119,7 @@ struct dmirror_device {
unsigned long calloc;
unsigned long cfree;
struct page *free_pages;
+ struct folio *free_folios;
spinlock_t lock; /* protects the above */
};
@@ -492,7 +493,7 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
}
static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
- struct page **ppage)
+ struct page **ppage, bool is_large)
{
struct dmirror_chunk *devmem;
struct resource *res = NULL;
@@ -572,20 +573,45 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
pfn_first, pfn_last);
spin_lock(&mdevice->lock);
- for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+ for (pfn = pfn_first; pfn < pfn_last; ) {
struct page *page = pfn_to_page(pfn);
+ if (is_large && IS_ALIGNED(pfn, HPAGE_PMD_NR)
+ && (pfn + HPAGE_PMD_NR <= pfn_last)) {
+ page->zone_device_data = mdevice->free_folios;
+ mdevice->free_folios = page_folio(page);
+ pfn += HPAGE_PMD_NR;
+ continue;
+ }
+
page->zone_device_data = mdevice->free_pages;
mdevice->free_pages = page;
+ pfn++;
}
+
+ ret = 0;
if (ppage) {
- *ppage = mdevice->free_pages;
- mdevice->free_pages = (*ppage)->zone_device_data;
- mdevice->calloc++;
+ if (is_large) {
+ if (!mdevice->free_folios) {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+ *ppage = folio_page(mdevice->free_folios, 0);
+ mdevice->free_folios = (*ppage)->zone_device_data;
+ mdevice->calloc += HPAGE_PMD_NR;
+ } else if (mdevice->free_pages) {
+ *ppage = mdevice->free_pages;
+ mdevice->free_pages = (*ppage)->zone_device_data;
+ mdevice->calloc++;
+ } else {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
}
+err_unlock:
spin_unlock(&mdevice->lock);
- return 0;
+ return ret;
err_release:
mutex_unlock(&mdevice->devmem_lock);
@@ -598,10 +624,13 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
return ret;
}
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
+ bool is_large)
{
struct page *dpage = NULL;
struct page *rpage = NULL;
+ unsigned int order = is_large ? HPAGE_PMD_ORDER : 0;
+ struct dmirror_device *mdevice = dmirror->mdevice;
/*
* For ZONE_DEVICE private type, this is a fake device so we allocate
@@ -610,49 +639,55 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
* data and ignore rpage.
*/
if (dmirror_is_private_zone(mdevice)) {
- rpage = alloc_page(GFP_HIGHUSER);
+ rpage = folio_page(folio_alloc(GFP_HIGHUSER, order), 0);
if (!rpage)
return NULL;
}
spin_lock(&mdevice->lock);
- if (mdevice->free_pages) {
+ if (is_large && mdevice->free_folios) {
+ dpage = folio_page(mdevice->free_folios, 0);
+ mdevice->free_folios = dpage->zone_device_data;
+ mdevice->calloc += 1 << order;
+ spin_unlock(&mdevice->lock);
+ } else if (!is_large && mdevice->free_pages) {
dpage = mdevice->free_pages;
mdevice->free_pages = dpage->zone_device_data;
mdevice->calloc++;
spin_unlock(&mdevice->lock);
} else {
spin_unlock(&mdevice->lock);
- if (dmirror_allocate_chunk(mdevice, &dpage))
+ if (dmirror_allocate_chunk(mdevice, &dpage, is_large))
goto error;
}
- zone_device_page_init(dpage, 0);
+ zone_device_folio_init(page_folio(dpage), order);
dpage->zone_device_data = rpage;
return dpage;
error:
if (rpage)
- __free_page(rpage);
+ __free_pages(rpage, order);
return NULL;
}
static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
struct dmirror *dmirror)
{
- struct dmirror_device *mdevice = dmirror->mdevice;
const unsigned long *src = args->src;
unsigned long *dst = args->dst;
unsigned long addr;
- for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
- src++, dst++) {
+ for (addr = args->start; addr < args->end; ) {
struct page *spage;
struct page *dpage;
struct page *rpage;
+ bool is_large = *src & MIGRATE_PFN_COMPOUND;
+ int write = (*src & MIGRATE_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
+ unsigned long nr = 1;
if (!(*src & MIGRATE_PFN_MIGRATE))
- continue;
+ goto next;
/*
* Note that spage might be NULL which is OK since it is an
@@ -662,17 +697,45 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
if (WARN(spage && is_zone_device_page(spage),
"page already in device spage pfn: 0x%lx\n",
page_to_pfn(spage)))
+ goto next;
+
+ dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+ if (!dpage) {
+ struct folio *folio;
+ unsigned long i;
+ unsigned long spfn = *src >> MIGRATE_PFN_SHIFT;
+ struct page *src_page;
+
+ if (!is_large)
+ goto next;
+
+ if (!spage && is_large) {
+ nr = HPAGE_PMD_NR;
+ } else {
+ folio = page_folio(spage);
+ nr = folio_nr_pages(folio);
+ }
+
+ for (i = 0; i < nr && addr < args->end; i++) {
+ dpage = dmirror_devmem_alloc_page(dmirror, false);
+ rpage = BACKING_PAGE(dpage);
+ rpage->zone_device_data = dmirror;
+
+ *dst = migrate_pfn(page_to_pfn(dpage)) | write;
+ src_page = pfn_to_page(spfn + i);
+
+ if (spage)
+ copy_highpage(rpage, src_page);
+ else
+ clear_highpage(rpage);
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
+ }
continue;
-
- dpage = dmirror_devmem_alloc_page(mdevice);
- if (!dpage)
- continue;
+ }
rpage = BACKING_PAGE(dpage);
- if (spage)
- copy_highpage(rpage, spage);
- else
- clear_highpage(rpage);
/*
* Normally, a device would use the page->zone_device_data to
@@ -684,10 +747,42 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
page_to_pfn(spage), page_to_pfn(dpage));
- *dst = migrate_pfn(page_to_pfn(dpage));
- if ((*src & MIGRATE_PFN_WRITE) ||
- (!spage && args->vma->vm_flags & VM_WRITE))
- *dst |= MIGRATE_PFN_WRITE;
+
+ *dst = migrate_pfn(page_to_pfn(dpage)) | write;
+
+ if (is_large) {
+ int i;
+ struct folio *folio = page_folio(dpage);
+ *dst |= MIGRATE_PFN_COMPOUND;
+
+ if (folio_test_large(folio)) {
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ struct page *dst_page =
+ pfn_to_page(page_to_pfn(rpage) + i);
+ struct page *src_page =
+ pfn_to_page(page_to_pfn(spage) + i);
+
+ if (spage)
+ copy_highpage(dst_page, src_page);
+ else
+ clear_highpage(dst_page);
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
+ }
+ continue;
+ }
+ }
+
+ if (spage)
+ copy_highpage(rpage, spage);
+ else
+ clear_highpage(rpage);
+
+next:
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
}
}
@@ -734,14 +829,17 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
const unsigned long *src = args->src;
const unsigned long *dst = args->dst;
unsigned long pfn;
+ const unsigned long start_pfn = start >> PAGE_SHIFT;
+ const unsigned long end_pfn = end >> PAGE_SHIFT;
/* Map the migrated pages into the device's page tables. */
mutex_lock(&dmirror->mutex);
- for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
- src++, dst++) {
+ for (pfn = start_pfn; pfn < end_pfn; pfn++, src++, dst++) {
struct page *dpage;
void *entry;
+ int nr, i;
+ struct page *rpage;
if (!(*src & MIGRATE_PFN_MIGRATE))
continue;
@@ -750,13 +848,25 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
if (!dpage)
continue;
- entry = BACKING_PAGE(dpage);
- if (*dst & MIGRATE_PFN_WRITE)
- entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
- entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
- if (xa_is_err(entry)) {
- mutex_unlock(&dmirror->mutex);
- return xa_err(entry);
+ if (*dst & MIGRATE_PFN_COMPOUND)
+ nr = folio_nr_pages(page_folio(dpage));
+ else
+ nr = 1;
+
+ WARN_ON_ONCE(end_pfn < start_pfn + nr);
+
+ rpage = BACKING_PAGE(dpage);
+ VM_WARN_ON(folio_nr_pages(page_folio(rpage)) != nr);
+
+ for (i = 0; i < nr; i++) {
+ entry = folio_page(page_folio(rpage), i);
+ if (*dst & MIGRATE_PFN_WRITE)
+ entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+ entry = xa_store(&dmirror->pt, pfn + i, entry, GFP_ATOMIC);
+ if (xa_is_err(entry)) {
+ mutex_unlock(&dmirror->mutex);
+ return xa_err(entry);
+ }
}
}
@@ -829,31 +939,66 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
unsigned long start = args->start;
unsigned long end = args->end;
unsigned long addr;
+ unsigned int order = 0;
+ int i;
- for (addr = start; addr < end; addr += PAGE_SIZE,
- src++, dst++) {
+ for (addr = start; addr < end; ) {
struct page *dpage, *spage;
spage = migrate_pfn_to_page(*src);
- if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
- continue;
+ if (!spage || !(*src & MIGRATE_PFN_MIGRATE)) {
+ addr += PAGE_SIZE;
+ goto next;
+ }
if (WARN_ON(!is_device_private_page(spage) &&
- !is_device_coherent_page(spage)))
- continue;
+ !is_device_coherent_page(spage))) {
+ addr += PAGE_SIZE;
+ goto next;
+ }
+
spage = BACKING_PAGE(spage);
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- continue;
- pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
- page_to_pfn(spage), page_to_pfn(dpage));
+ order = folio_order(page_folio(spage));
+ if (order)
+ dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
+ order, args->vma, addr), 0);
+ else
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+
+ /* Try with smaller pages if large allocation fails */
+ if (!dpage && order) {
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ if (!dpage)
+ return VM_FAULT_OOM;
+ order = 0;
+ }
+
+ pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
lock_page(dpage);
xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
copy_highpage(dpage, spage);
*dst = migrate_pfn(page_to_pfn(dpage));
if (*src & MIGRATE_PFN_WRITE)
*dst |= MIGRATE_PFN_WRITE;
+ if (order)
+ *dst |= MIGRATE_PFN_COMPOUND;
+
+ for (i = 0; i < (1 << order); i++) {
+ struct page *src_page;
+ struct page *dst_page;
+
+ src_page = pfn_to_page(page_to_pfn(spage) + i);
+ dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+
+ xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ copy_highpage(dst_page, src_page);
+ }
+next:
+ addr += PAGE_SIZE << order;
+ src += 1 << order;
+ dst += 1 << order;
}
return 0;
}
@@ -879,11 +1024,14 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
unsigned long size = cmd->npages << PAGE_SHIFT;
struct mm_struct *mm = dmirror->notifier.mm;
struct vm_area_struct *vma;
- unsigned long src_pfns[32] = { 0 };
- unsigned long dst_pfns[32] = { 0 };
struct migrate_vma args = { 0 };
unsigned long next;
int ret;
+ unsigned long *src_pfns;
+ unsigned long *dst_pfns;
+
+ src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
+ dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
start = cmd->addr;
end = start + size;
@@ -902,7 +1050,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
ret = -EINVAL;
goto out;
}
- next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
if (next > vma->vm_end)
next = vma->vm_end;
@@ -912,7 +1060,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
args.start = addr;
args.end = next;
args.pgmap_owner = dmirror->mdevice;
- args.flags = dmirror_select_device(dmirror);
+ args.flags = dmirror_select_device(dmirror) | MIGRATE_VMA_SELECT_COMPOUND;
ret = migrate_vma_setup(&args);
if (ret)
@@ -928,6 +1076,8 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
out:
mmap_read_unlock(mm);
mmput(mm);
+ kvfree(src_pfns);
+ kvfree(dst_pfns);
return ret;
}
@@ -939,12 +1089,12 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
unsigned long size = cmd->npages << PAGE_SHIFT;
struct mm_struct *mm = dmirror->notifier.mm;
struct vm_area_struct *vma;
- unsigned long src_pfns[32] = { 0 };
- unsigned long dst_pfns[32] = { 0 };
struct dmirror_bounce bounce;
struct migrate_vma args = { 0 };
unsigned long next;
int ret;
+ unsigned long *src_pfns = NULL;
+ unsigned long *dst_pfns = NULL;
start = cmd->addr;
end = start + size;
@@ -955,6 +1105,18 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
if (!mmget_not_zero(mm))
return -EINVAL;
+ ret = -ENOMEM;
+ src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns),
+ GFP_KERNEL | __GFP_NOFAIL);
+ if (!src_pfns)
+ goto free_mem;
+
+ dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns),
+ GFP_KERNEL | __GFP_NOFAIL);
+ if (!dst_pfns)
+ goto free_mem;
+
+ ret = 0;
mmap_read_lock(mm);
for (addr = start; addr < end; addr = next) {
vma = vma_lookup(mm, addr);
@@ -962,7 +1124,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
ret = -EINVAL;
goto out;
}
- next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
if (next > vma->vm_end)
next = vma->vm_end;
@@ -972,7 +1134,8 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
args.start = addr;
args.end = next;
args.pgmap_owner = dmirror->mdevice;
- args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+ args.flags = MIGRATE_VMA_SELECT_SYSTEM |
+ MIGRATE_VMA_SELECT_COMPOUND;
ret = migrate_vma_setup(&args);
if (ret)
goto out;
@@ -992,7 +1155,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
*/
ret = dmirror_bounce_init(&bounce, start, size);
if (ret)
- return ret;
+ goto free_mem;
mutex_lock(&dmirror->mutex);
ret = dmirror_do_read(dmirror, start, end, &bounce);
mutex_unlock(&dmirror->mutex);
@@ -1003,11 +1166,14 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
}
cmd->cpages = bounce.cpages;
dmirror_bounce_fini(&bounce);
- return ret;
+ goto free_mem;
out:
mmap_read_unlock(mm);
mmput(mm);
+free_mem:
+ kfree(src_pfns);
+ kfree(dst_pfns);
return ret;
}
@@ -1200,6 +1366,7 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
unsigned long i;
unsigned long *src_pfns;
unsigned long *dst_pfns;
+ unsigned int order = 0;
src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
@@ -1215,13 +1382,25 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
if (WARN_ON(!is_device_private_page(spage) &&
!is_device_coherent_page(spage)))
continue;
+
+ order = folio_order(page_folio(spage));
spage = BACKING_PAGE(spage);
- dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+ if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+ dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE,
+ order), 0);
+ } else {
+ dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+ order = 0;
+ }
+
+ /* TODO Support splitting here */
lock_page(dpage);
- copy_highpage(dpage, spage);
dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
if (src_pfns[i] & MIGRATE_PFN_WRITE)
dst_pfns[i] |= MIGRATE_PFN_WRITE;
+ if (order)
+ dst_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ folio_copy(page_folio(dpage), page_folio(spage));
}
migrate_device_pages(src_pfns, dst_pfns, npages);
migrate_device_finalize(src_pfns, dst_pfns, npages);
@@ -1234,7 +1413,12 @@ static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
{
struct dmirror_device *mdevice = devmem->mdevice;
struct page *page;
+ struct folio *folio;
+
+ for (folio = mdevice->free_folios; folio; folio = folio_zone_device_data(folio))
+ if (dmirror_page_to_chunk(folio_page(folio, 0)) == devmem)
+ mdevice->free_folios = folio_zone_device_data(folio);
for (page = mdevice->free_pages; page; page = page->zone_device_data)
if (dmirror_page_to_chunk(page) == devmem)
mdevice->free_pages = page->zone_device_data;
@@ -1265,6 +1449,7 @@ static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
mdevice->devmem_count = 0;
mdevice->devmem_capacity = 0;
mdevice->free_pages = NULL;
+ mdevice->free_folios = NULL;
kfree(mdevice->devmem_chunks);
mdevice->devmem_chunks = NULL;
}
@@ -1379,18 +1564,30 @@ static void dmirror_devmem_free(struct folio *folio)
struct page *page = &folio->page;
struct page *rpage = BACKING_PAGE(page);
struct dmirror_device *mdevice;
+ struct folio *rfolio = page_folio(rpage);
+ unsigned int order = folio_order(rfolio);
- if (rpage != page)
- __free_page(rpage);
+ if (rpage != page) {
+ if (order)
+ __free_pages(rpage, order);
+ else
+ __free_page(rpage);
+ rpage = NULL;
+ }
mdevice = dmirror_page_to_device(page);
spin_lock(&mdevice->lock);
/* Return page to our allocator if not freeing the chunk */
if (!dmirror_page_to_chunk(page)->remove) {
- mdevice->cfree++;
- page->zone_device_data = mdevice->free_pages;
- mdevice->free_pages = page;
+ mdevice->cfree += 1 << order;
+ if (order) {
+ page->zone_device_data = mdevice->free_folios;
+ mdevice->free_folios = page_folio(page);
+ } else {
+ page->zone_device_data = mdevice->free_pages;
+ mdevice->free_pages = page;
+ }
}
spin_unlock(&mdevice->lock);
}
@@ -1398,36 +1595,52 @@ static void dmirror_devmem_free(struct folio *folio)
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args = { 0 };
- unsigned long src_pfns = 0;
- unsigned long dst_pfns = 0;
struct page *rpage;
struct dmirror *dmirror;
- vm_fault_t ret;
+ vm_fault_t ret = 0;
+ unsigned int order, nr;
/*
* Normally, a device would use the page->zone_device_data to point to
* the mirror but here we use it to hold the page for the simulated
* device memory and that page holds the pointer to the mirror.
*/
- rpage = vmf->page->zone_device_data;
+ rpage = folio_zone_device_data(page_folio(vmf->page));
dmirror = rpage->zone_device_data;
/* FIXME demonstrate how we can adjust migrate range */
+ order = folio_order(page_folio(vmf->page));
+ nr = 1 << order;
+
+ /*
+ * Consider a per-cpu cache of src and dst pfns, but with
+ * large number of cpus that might not scale well.
+ */
+ args.start = ALIGN_DOWN(vmf->address, (PAGE_SIZE << order));
args.vma = vmf->vma;
- args.start = vmf->address;
- args.end = args.start + PAGE_SIZE;
- args.src = &src_pfns;
- args.dst = &dst_pfns;
+ args.end = args.start + (PAGE_SIZE << order);
+
+ nr = (args.end - args.start) >> PAGE_SHIFT;
+ args.src = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
+ args.dst = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
args.pgmap_owner = dmirror->mdevice;
args.flags = dmirror_select_device(dmirror);
args.fault_page = vmf->page;
+ if (!args.src || !args.dst) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
+
+ if (order)
+ args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
if (migrate_vma_setup(&args))
return VM_FAULT_SIGBUS;
ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
if (ret)
- return ret;
+ goto err;
migrate_vma_pages(&args);
/*
* No device finalize step is needed since
@@ -1435,7 +1648,10 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
* invalidated the device page table.
*/
migrate_vma_finalize(&args);
- return 0;
+err:
+ kfree(args.src);
+ kfree(args.dst);
+ return ret;
}
static const struct dev_pagemap_ops dmirror_devmem_ops = {
@@ -1466,7 +1682,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
return ret;
/* Build a list of free ZONE_DEVICE struct pages */
- return dmirror_allocate_chunk(mdevice, NULL);
+ return dmirror_allocate_chunk(mdevice, NULL, false);
}
static void dmirror_device_remove(struct dmirror_device *mdevice)
--
2.51.0
* [v7 10/16] mm/memremap: add driver callback support for folio splitting
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (8 preceding siblings ...)
2025-10-01 6:57 ` [v7 09/16] lib/test_hmm: add zone device private THP test infrastructure Balbir Singh
@ 2025-10-01 6:57 ` Balbir Singh
2025-10-01 6:57 ` [v7 11/16] mm/migrate_device: add THP splitting during migration Balbir Singh
` (6 subsequent siblings)
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:57 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
When a zone device folio is split (via a huge pmd folio split), the
folio_split driver callback is invoked to let the device driver know that
the folio has been split into a smaller order.
Provide a default implementation, for drivers that do not supply this
callback, which copies the pgmap and mapping fields to the split folios.
Update the HMM test driver to handle the split.
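As an illustration (not part of this patch), a driver that keeps per-folio
private data could implement the callback roughly as below. Propagating
zone_device_data is the hypothetical driver-specific part; the pgmap and
mapping assignments mirror what the default implementation does:

/*
 * Hypothetical minimal folio_split callback: propagate pgmap, mapping and
 * the driver's per-folio data to the new (tail) folio. A NULL tail means
 * only the original folio remains after the split.
 */
static void demo_folio_split(struct folio *head, struct folio *tail)
{
	if (!tail)
		return;

	tail->pgmap = head->pgmap;
	tail->page.mapping = head->page.mapping;
	/* carry over driver-private state, assuming it applies per page */
	tail->page.zone_device_data = head->page.zone_device_data;
}

static const struct dev_pagemap_ops demo_devmem_ops = {
	/* .folio_free and .migrate_to_ram omitted from this sketch */
	.folio_split	= demo_folio_split,
};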
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/memremap.h | 29 +++++++++++++++++++++++++++++
lib/test_hmm.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7df4dd037b69..aca2b16d6889 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -100,6 +100,13 @@ struct dev_pagemap_ops {
*/
int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
unsigned long nr_pages, int mf_flags);
+
+ /*
+ * Used for private (un-addressable) device memory only.
+ * This callback is used when a folio is split into
+ * a smaller folio
+ */
+ void (*folio_split)(struct folio *head, struct folio *tail);
};
#define PGMAP_ALTMAP_VALID (1 << 0)
@@ -235,6 +242,23 @@ static inline void zone_device_folio_init(struct folio *folio, unsigned int orde
folio_set_large_rmappable(folio);
}
+static inline void zone_device_private_split_cb(struct folio *original_folio,
+ struct folio *new_folio)
+{
+ if (folio_is_device_private(original_folio)) {
+ if (!original_folio->pgmap->ops->folio_split) {
+ if (new_folio) {
+ new_folio->pgmap = original_folio->pgmap;
+ new_folio->page.mapping =
+ original_folio->page.mapping;
+ }
+ } else {
+ original_folio->pgmap->ops->folio_split(original_folio,
+ new_folio);
+ }
+ }
+}
+
#else
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -268,6 +292,11 @@ static inline unsigned long memremap_compat_align(void)
{
return PAGE_SIZE;
}
+
+static inline void zone_device_private_split_cb(struct folio *original_folio,
+ struct folio *new_folio)
+{
+}
#endif /* CONFIG_ZONE_DEVICE */
static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 32d402e80bcc..46fa9e200db8 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1654,9 +1654,44 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
return ret;
}
+static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
+{
+ struct page *rpage = BACKING_PAGE(folio_page(head, 0));
+ struct page *rpage_tail;
+ struct folio *rfolio;
+ unsigned long offset = 0;
+
+ if (!rpage) {
+ tail->page.zone_device_data = NULL;
+ return;
+ }
+
+ rfolio = page_folio(rpage);
+
+ if (tail == NULL) {
+ folio_reset_order(rfolio);
+ rfolio->mapping = NULL;
+ folio_set_count(rfolio, 1);
+ return;
+ }
+
+ offset = folio_pfn(tail) - folio_pfn(head);
+
+ rpage_tail = folio_page(rfolio, offset);
+ tail->page.zone_device_data = rpage_tail;
+ rpage_tail->zone_device_data = rpage->zone_device_data;
+ clear_compound_head(rpage_tail);
+ rpage_tail->mapping = NULL;
+
+ folio_page(tail, 0)->mapping = folio_page(head, 0)->mapping;
+ tail->pgmap = head->pgmap;
+ folio_set_count(page_folio(rpage_tail), 1);
+}
+
static const struct dev_pagemap_ops dmirror_devmem_ops = {
.folio_free = dmirror_devmem_free,
.migrate_to_ram = dmirror_devmem_fault,
+ .folio_split = dmirror_devmem_folio_split,
};
static int dmirror_device_init(struct dmirror_device *mdevice, int id)
--
2.51.0
* [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (9 preceding siblings ...)
2025-10-01 6:57 ` [v7 10/16] mm/memremap: add driver callback support for folio splitting Balbir Singh
@ 2025-10-01 6:57 ` Balbir Singh
2025-10-13 21:17 ` Zi Yan
2025-10-19 8:19 ` Wei Yang
2025-10-01 6:57 ` [v7 12/16] lib/test_hmm: add large page allocation failure testing Balbir Singh
` (5 subsequent siblings)
16 siblings, 2 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:57 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Implement migrate_vma_split_pages() to handle THP splitting during the
migration process when the destination cannot allocate compound pages.
This addresses the common scenario where migrate_vma_setup() succeeds with
MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
large pages during the migration phase.
Key changes:
- migrate_vma_split_pages(): Split already-isolated pages during migration
- Enhanced folio_split() and __split_unmapped_folio() with an 'unmapped'
parameter to avoid redundant unmap/remap operations
This provides a fallback mechanism to ensure migration succeeds even when
large page allocation fails at the destination.
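From the driver's point of view the fallback is transparent: if allocating
a 2MB destination folio fails, the driver fills order-0 destination pfns
without MIGRATE_PFN_COMPOUND and the core code splits the already-unmapped
source folio. A rough sketch of such an allocation path, assuming
hypothetical helpers mydrv_alloc_thp()/mydrv_alloc_page() that are not part
of this patch:

  static void mydrv_alloc_dst(struct migrate_vma *args, unsigned long i)
  {
          struct page *dpage = mydrv_alloc_thp();  /* hypothetical 2MB alloc */
          unsigned long j;

          if (dpage) {
                  args->dst[i] = migrate_pfn(page_to_pfn(dpage)) |
                                 MIGRATE_PFN_COMPOUND;
                  return;
          }

          /* No large page: hand back base pages; the core splits the source */
          for (j = 0; j < HPAGE_PMD_NR; j++) {
                  struct page *p = mydrv_alloc_page();  /* hypothetical */

                  args->dst[i + j] = p ? migrate_pfn(page_to_pfn(p)) : 0;
          }
  }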
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/huge_mm.h | 11 +++++-
lib/test_hmm.c | 9 +++++
mm/huge_memory.c | 46 ++++++++++++----------
mm/migrate_device.c | 85 +++++++++++++++++++++++++++++++++++------
4 files changed, 117 insertions(+), 34 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2d669be7f1c8..a166be872628 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
vm_flags_t vm_flags);
bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order);
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order, bool unmapped);
int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
bool warns);
int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
struct list_head *list);
+
+static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order)
+{
+ return __split_huge_page_to_list_to_order(page, list, new_order, false);
+}
+
/*
* try_folio_split - try to split a @folio at @page using non uniform split.
* @folio: folio to be split
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 46fa9e200db8..df429670633e 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
order = folio_order(page_folio(vmf->page));
nr = 1 << order;
+ /*
+ * When folios are partially mapped, we can't rely on the folio
+ * order of vmf->page as the folio might not be fully split yet
+ */
+ if (vmf->pte) {
+ order = 0;
+ nr = 1;
+ }
+
/*
* Consider a per-cpu cache of src and dst pfns, but with
* large number of cpus that might not scale well.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8c95a658b3ec..022b0729f826 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
new_folio->mapping = folio->mapping;
new_folio->index = folio->index + i;
- /*
- * page->private should not be set in tail pages. Fix up and warn once
- * if private is unexpectedly set.
- */
- if (unlikely(new_folio->private)) {
- VM_WARN_ON_ONCE_PAGE(true, new_head);
- new_folio->private = NULL;
- }
-
if (folio_test_swapcache(folio))
new_folio->swap.val = folio->swap.val + i;
@@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
* @lock_at: a page within @folio to be left locked to caller
* @list: after-split folios will be put on it if non NULL
* @uniform_split: perform uniform split or not (non-uniform split)
+ * @unmapped: The pages are already unmapped; they are migration entries.
*
* It calls __split_unmapped_folio() to perform uniform and non-uniform split.
* It is in charge of checking whether the split is supported or not and
@@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
*/
static int __folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct page *lock_at,
- struct list_head *list, bool uniform_split)
+ struct list_head *list, bool uniform_split, bool unmapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* is taken to serialise against parallel split or collapse
* operations.
*/
- anon_vma = folio_get_anon_vma(folio);
- if (!anon_vma) {
- ret = -EBUSY;
- goto out;
+ if (!unmapped) {
+ anon_vma = folio_get_anon_vma(folio);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
+ }
+ anon_vma_lock_write(anon_vma);
}
mapping = NULL;
- anon_vma_lock_write(anon_vma);
} else {
unsigned int min_order;
gfp_t gfp;
@@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
goto out_unlock;
}
- unmap_folio(folio);
+ if (!unmapped)
+ unmap_folio(folio);
/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
@@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
next = folio_next(new_folio);
+ zone_device_private_split_cb(folio, new_folio);
+
expected_refs = folio_expected_ref_count(new_folio) + 1;
folio_ref_unfreeze(new_folio, expected_refs);
- lru_add_split_folio(folio, new_folio, lruvec, list);
+ if (!unmapped)
+ lru_add_split_folio(folio, new_folio, lruvec, list);
/*
* Anonymous folio with swap cache.
@@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
__filemap_remove_folio(new_folio, NULL);
folio_put_refs(new_folio, nr_pages);
}
+
+ zone_device_private_split_cb(folio, NULL);
/*
* Unfreeze @folio only after all page cache entries, which
* used to point to it, have been updated with new folios.
@@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
local_irq_enable();
+ if (unmapped)
+ return ret;
+
if (nr_shmem_dropped)
shmem_uncharge(mapping->host, nr_shmem_dropped);
@@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* Returns -EINVAL when trying to split to an order that is incompatible
* with the folio. Splitting to order 0 is compatible with all folios.
*/
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order)
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order, bool unmapped)
{
struct folio *folio = page_folio(page);
- return __folio_split(folio, new_order, &folio->page, page, list, true);
+ return __folio_split(folio, new_order, &folio->page, page, list, true,
+ unmapped);
}
/*
@@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct list_head *list)
{
return __folio_split(folio, new_order, split_at, &folio->page, list,
- false);
+ false, false);
}
int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 4156fd6190d2..fa42d2ebd024 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pgmap->owner != migrate->pgmap_owner)
goto next;
+ folio = page_folio(page);
+ if (folio_test_large(folio)) {
+ int ret;
+
+ pte_unmap_unlock(ptep, ptl);
+ ret = migrate_vma_split_folio(folio,
+ migrate->fault_page);
+
+ if (ret) {
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ goto next;
+ }
+
+ addr = start;
+ goto again;
+ }
+
mpfn = migrate_pfn(page_to_pfn(page)) |
MIGRATE_PFN_MIGRATE;
if (is_writable_device_private_entry(entry))
@@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
src[i] &= ~MIGRATE_PFN_MIGRATE;
return 0;
}
+
+static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
+ unsigned long idx, unsigned long addr,
+ struct folio *folio)
+{
+ unsigned long i;
+ unsigned long pfn;
+ unsigned long flags;
+ int ret = 0;
+
+ folio_get(folio);
+ split_huge_pmd_address(migrate->vma, addr, true);
+ ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
+ 0, true);
+ if (ret)
+ return ret;
+ migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
+ flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
+ pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
+ for (i = 1; i < HPAGE_PMD_NR; i++)
+ migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
+ return ret;
+}
#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
unsigned long addr,
@@ -889,6 +929,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
{
return 0;
}
+
+static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
+ unsigned long idx, unsigned long addr,
+ struct folio *folio)
+{
+ return 0;
+}
#endif
static unsigned long migrate_vma_nr_pages(unsigned long *src)
@@ -1050,8 +1097,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
struct migrate_vma *migrate)
{
struct mmu_notifier_range range;
- unsigned long i;
+ unsigned long i, j;
bool notified = false;
+ unsigned long addr;
for (i = 0; i < npages; ) {
struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
@@ -1093,12 +1141,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
nr = migrate_vma_nr_pages(&src_pfns[i]);
src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- goto next;
+ } else {
+ nr = 1;
}
- migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
- &src_pfns[i]);
+ for (j = 0; j < nr && i + j < npages; j++) {
+ src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
+ migrate_vma_insert_page(migrate,
+ addr + j * PAGE_SIZE,
+ &dst_pfns[i+j], &src_pfns[i+j]);
+ }
goto next;
}
@@ -1120,7 +1172,13 @@ static void __migrate_device_pages(unsigned long *src_pfns,
MIGRATE_PFN_COMPOUND);
goto next;
}
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ nr = 1 << folio_order(folio);
+ addr = migrate->start + i * PAGE_SIZE;
+ if (migrate_vma_split_unmapped_folio(migrate, i, addr, folio)) {
+ src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+ MIGRATE_PFN_COMPOUND);
+ goto next;
+ }
} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
@@ -1156,11 +1214,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
if (migrate && migrate->fault_page == page)
extra_cnt = 1;
- r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
- if (r)
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- else
- folio_migrate_flags(newfolio, folio);
+ for (j = 0; j < nr && i + j < npages; j++) {
+ folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
+ newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
+
+ r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
+ if (r)
+ src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
+ else
+ folio_migrate_flags(newfolio, folio);
+ }
next:
i += nr;
}
--
2.51.0
* [v7 12/16] lib/test_hmm: add large page allocation failure testing
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (10 preceding siblings ...)
2025-10-01 6:57 ` [v7 11/16] mm/migrate_device: add THP splitting during migration Balbir Singh
@ 2025-10-01 6:57 ` Balbir Singh
2025-10-01 6:57 ` [v7 13/16] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
` (4 subsequent siblings)
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:57 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Add HMM_DMIRROR_FLAG_FAIL_ALLOC flag to simulate large page allocation
failures, enabling testing of split migration code paths.
This test flag allows validation of the fallback behavior when the
destination device cannot allocate compound pages, which is the path that
exercises the split migration functionality.
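The selftests added later in this series arm the flag through the new
HMM_DMIRROR_FLAGS ioctl; the flag is consumed by the first allocation it
fails, so it affects exactly one large page allocation. Condensed from
hmm-tests.c (hmm_dmirror_cmd() and hmm_migrate_sys_to_dev() are existing
helpers in that file):

  /* Force the next device-side THP allocation to fail, then migrate:
   * the kernel should fall back to splitting and migrating base pages.
   */
  ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
                        HMM_DMIRROR_FLAG_FAIL_ALLOC);
  ASSERT_EQ(ret, 0);
  ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
  ASSERT_EQ(ret, 0);
  ASSERT_EQ(buffer->cpages, npages);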
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
lib/test_hmm.c | 61 ++++++++++++++++++++++++++++++---------------
lib/test_hmm_uapi.h | 3 +++
2 files changed, 44 insertions(+), 20 deletions(-)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index df429670633e..72a8b2f38d8a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -92,6 +92,7 @@ struct dmirror {
struct xarray pt;
struct mmu_interval_notifier notifier;
struct mutex mutex;
+ __u64 flags;
};
/*
@@ -699,7 +700,12 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
page_to_pfn(spage)))
goto next;
- dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+ if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+ dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+ dpage = NULL;
+ } else
+ dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+
if (!dpage) {
struct folio *folio;
unsigned long i;
@@ -959,44 +965,55 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
spage = BACKING_PAGE(spage);
order = folio_order(page_folio(spage));
-
if (order)
+ *dst = MIGRATE_PFN_COMPOUND;
+ if (*src & MIGRATE_PFN_WRITE)
+ *dst |= MIGRATE_PFN_WRITE;
+
+ if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+ dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+ *dst &= ~MIGRATE_PFN_COMPOUND;
+ dpage = NULL;
+ } else if (order) {
dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
order, args->vma, addr), 0);
- else
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-
- /* Try with smaller pages if large allocation fails */
- if (!dpage && order) {
+ } else {
dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- return VM_FAULT_OOM;
- order = 0;
}
+ if (!dpage && !order)
+ return VM_FAULT_OOM;
+
pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
page_to_pfn(spage), page_to_pfn(dpage));
- lock_page(dpage);
- xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
- copy_highpage(dpage, spage);
- *dst = migrate_pfn(page_to_pfn(dpage));
- if (*src & MIGRATE_PFN_WRITE)
- *dst |= MIGRATE_PFN_WRITE;
- if (order)
- *dst |= MIGRATE_PFN_COMPOUND;
+
+ if (dpage) {
+ lock_page(dpage);
+ *dst |= migrate_pfn(page_to_pfn(dpage));
+ }
for (i = 0; i < (1 << order); i++) {
struct page *src_page;
struct page *dst_page;
+ /* Try with smaller pages if large allocation fails */
+ if (!dpage && order) {
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ lock_page(dpage);
+ dst[i] = migrate_pfn(page_to_pfn(dpage));
+ dst_page = pfn_to_page(page_to_pfn(dpage));
+ dpage = NULL; /* For the next iteration */
+ } else {
+ dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+ }
+
src_page = pfn_to_page(page_to_pfn(spage) + i);
- dst_page = pfn_to_page(page_to_pfn(dpage) + i);
xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ addr += PAGE_SIZE;
copy_highpage(dst_page, src_page);
}
next:
- addr += PAGE_SIZE << order;
src += 1 << order;
dst += 1 << order;
}
@@ -1514,6 +1531,10 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
dmirror_device_remove_chunks(dmirror->mdevice);
ret = 0;
break;
+ case HMM_DMIRROR_FLAGS:
+ dmirror->flags = cmd.npages;
+ ret = 0;
+ break;
default:
return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 8c818a2cf4f6..f94c6d457338 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -37,6 +37,9 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_RELEASE _IOWR('H', 0x07, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_FLAGS _IOWR('H', 0x08, struct hmm_dmirror_cmd)
+
+#define HMM_DMIRROR_FLAG_FAIL_ALLOC (1ULL << 0)
/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
--
2.51.0
* [v7 13/16] selftests/mm/hmm-tests: new tests for zone device THP migration
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (11 preceding siblings ...)
2025-10-01 6:57 ` [v7 12/16] lib/test_hmm: add large page allocation failure testing Balbir Singh
@ 2025-10-01 6:57 ` Balbir Singh
2025-10-01 6:57 ` [v7 14/16] selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests Balbir Singh
` (3 subsequent siblings)
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:57 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Add new tests for migrating anon THP pages, including anon_huge,
anon_huge_zero and error cases involving forced splitting of pages during
migration.
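All of these tests rely on the same setup pattern: over-allocate twice the
THP size, align a window inside the mapping to a 2MB boundary and request
THP with madvise(), so that the first touch populates a single PMD-mapped
folio. Condensed from the tests below (TWOMEG, ALIGN() and the hmm_buffer
fields come from hmm-tests.c):

  size = TWOMEG;
  buffer->ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
  ret = madvise(map, size, MADV_HUGEPAGE);
  old_ptr = buffer->ptr;      /* restored before hmm_buffer_free() */
  buffer->ptr = map;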
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
tools/testing/selftests/mm/hmm-tests.c | 410 +++++++++++++++++++++++++
1 file changed, 410 insertions(+)
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index 15aadaf24a66..339a90183930 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -2055,4 +2055,414 @@ TEST_F(hmm, hmm_cow_in_device)
hmm_buffer_free(buffer);
}
+
+/*
+ * Migrate private anonymous huge empty page.
+ */
+TEST_F(hmm, migrate_anon_huge_empty)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page.
+ */
+TEST_F(hmm, migrate_anon_huge_zero)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+ int val;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize a read-only zero huge page. */
+ val = *(int *)buffer->ptr;
+ ASSERT_EQ(val, 0);
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) {
+ ASSERT_EQ(ptr[i], 0);
+ /* If it asserts once, it probably will 500,000 times */
+ if (ptr[i] != 0)
+ break;
+ }
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and free.
+ */
+TEST_F(hmm, migrate_anon_huge_free)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Try freeing it. */
+ ret = madvise(map, size, MADV_FREE);
+ ASSERT_EQ(ret, 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and fault back to sysmem.
+ */
+TEST_F(hmm, migrate_anon_huge_fault)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_err)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(2 * size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, 2 * size);
+
+ old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) {
+ ASSERT_EQ(ptr[i], i);
+ if (ptr[i] != i)
+ break;
+ }
+
+ /* Try faulting back a single (PAGE_SIZE) page. */
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 2048);
+
+ /* unmap and remap the region to reset things. */
+ ret = munmap(old_ptr, 2 * size);
+ ASSERT_EQ(ret, 0);
+ old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate THP to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /*
+ * Force an allocation error when faulting back a THP resident in the
+ * device.
+ */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 2048);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_zero_err)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(2 * size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, 2 * size);
+
+ old_ptr = mmap(NULL, 2 * size, PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ /* Try faulting back a single (PAGE_SIZE) page. */
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 0);
+
+ /* unmap and remap the region to reset things. */
+ ret = munmap(old_ptr, 2 * size);
+ ASSERT_EQ(ret, 0);
+ old_ptr = mmap(NULL, 2 * size, PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory (zero THP page). */
+ ret = ptr[0];
+ ASSERT_EQ(ret, 0);
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Fault the device memory back and check it. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
TEST_HARNESS_MAIN
--
2.51.0
* [v7 14/16] selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (12 preceding siblings ...)
2025-10-01 6:57 ` [v7 13/16] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
@ 2025-10-01 6:57 ` Balbir Singh
2025-10-01 6:57 ` [v7 15/16] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
` (2 subsequent siblings)
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:57 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Matthew Brost, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Francois Dugast, Balbir Singh
From: Matthew Brost <matthew.brost@intel.com>
Add a partial unmap test case which munmaps memory while it is resident
in the device.
Add tests exercising mremap on faulted-in memory (CPU and GPU) at
various offsets and verify correctness.
Update anon_write_child to read device memory after fork(), verifying
that this flow works in the kernel.
Both THP and non-THP cases are updated.
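The partial unmap case is the interesting one for device-private THP: the
2MB folio is resident on the device when a 1MB hole is punched into the
VMA, so faulting the remainder back exercises the split path. Condensed
from migrate_partial_unmap_fault() below:

  /* Migrate a 2MB THP to the device, unmap 1MB of it, then fault back */
  ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
  ASSERT_EQ(ret, 0);

  munmap(buffer->ptr + offsets[j], ONEMEG);

  /* only addresses outside the hole are faulted back and checked */
  for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
          if (i * sizeof(int) < offsets[j] ||
              i * sizeof(int) >= offsets[j] + ONEMEG)
                  ASSERT_EQ(ptr[i], i);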
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
tools/testing/selftests/mm/hmm-tests.c | 312 ++++++++++++++++++++-----
1 file changed, 252 insertions(+), 60 deletions(-)
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index 339a90183930..dedc1049bd4d 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -50,6 +50,8 @@ enum {
HMM_COHERENCE_DEVICE_TWO,
};
+#define ONEKB (1 << 10)
+#define ONEMEG (1 << 20)
#define TWOMEG (1 << 21)
#define HMM_BUFFER_SIZE (1024 << 12)
#define HMM_PATH_MAX 64
@@ -525,6 +527,8 @@ TEST_F(hmm, anon_write_prot)
/*
* Check that a device writing an anonymous private mapping
* will copy-on-write if a child process inherits the mapping.
+ *
+ * Also verifies that device memory can be read by the child after fork().
*/
TEST_F(hmm, anon_write_child)
{
@@ -532,72 +536,101 @@ TEST_F(hmm, anon_write_child)
unsigned long npages;
unsigned long size;
unsigned long i;
+ void *old_ptr;
+ void *map;
int *ptr;
pid_t pid;
int child_fd;
- int ret;
-
- npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
- ASSERT_NE(npages, 0);
- size = npages << self->page_shift;
-
- buffer = malloc(sizeof(*buffer));
- ASSERT_NE(buffer, NULL);
-
- buffer->fd = -1;
- buffer->size = size;
- buffer->mirror = malloc(size);
- ASSERT_NE(buffer->mirror, NULL);
-
- buffer->ptr = mmap(NULL, size,
- PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS,
- buffer->fd, 0);
- ASSERT_NE(buffer->ptr, MAP_FAILED);
-
- /* Initialize buffer->ptr so we can tell if it is written. */
- for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
- ptr[i] = i;
+ int ret, use_thp, migrate;
+
+ for (migrate = 0; migrate < 2; ++migrate) {
+ for (use_thp = 0; use_thp < 2; ++use_thp) {
+ npages = ALIGN(use_thp ? TWOMEG : HMM_BUFFER_SIZE,
+ self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size * 2;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size * 2,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ old_ptr = buffer->ptr;
+ if (use_thp) {
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+ }
+
+ /* Initialize buffer->ptr so we can tell if it is written. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Initialize data that the device will write to buffer->ptr. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ptr[i] = -i;
+
+ if (migrate) {
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ }
+
+ pid = fork();
+ if (pid == -1)
+ ASSERT_EQ(pid, 0);
+ if (pid != 0) {
+ waitpid(pid, &ret, 0);
+ ASSERT_EQ(WIFEXITED(ret), 1);
+
+ /* Check that the parent's buffer did not change. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+ continue;
+ }
+
+ /* Check that we see the parent's values. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+ if (!migrate) {
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], -i);
+ }
+
+ /* The child process needs its own mirror to its own mm. */
+ child_fd = hmm_open(0);
+ ASSERT_GE(child_fd, 0);
+
+ /* Simulate a device writing system memory. */
+ ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ ASSERT_EQ(buffer->faults, 1);
- /* Initialize data that the device will write to buffer->ptr. */
- for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
- ptr[i] = -i;
+ /* Check what the device wrote. */
+ if (!migrate) {
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], -i);
+ }
- pid = fork();
- if (pid == -1)
- ASSERT_EQ(pid, 0);
- if (pid != 0) {
- waitpid(pid, &ret, 0);
- ASSERT_EQ(WIFEXITED(ret), 1);
-
- /* Check that the parent's buffer did not change. */
- for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
- ASSERT_EQ(ptr[i], i);
- return;
+ close(child_fd);
+ exit(0);
+ }
}
-
- /* Check that we see the parent's values. */
- for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
- ASSERT_EQ(ptr[i], i);
- for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
- ASSERT_EQ(ptr[i], -i);
-
- /* The child process needs its own mirror to its own mm. */
- child_fd = hmm_open(0);
- ASSERT_GE(child_fd, 0);
-
- /* Simulate a device writing system memory. */
- ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
- ASSERT_EQ(ret, 0);
- ASSERT_EQ(buffer->cpages, npages);
- ASSERT_EQ(buffer->faults, 1);
-
- /* Check what the device wrote. */
- for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
- ASSERT_EQ(ptr[i], -i);
-
- close(child_fd);
- exit(0);
}
/*
@@ -2289,6 +2322,165 @@ TEST_F(hmm, migrate_anon_huge_fault)
hmm_buffer_free(buffer);
}
+/*
+ * Migrate memory and fault back to sysmem after partially unmapping.
+ */
+TEST_F(hmm, migrate_partial_unmap_fault)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size = TWOMEG;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret, j, use_thp;
+ int offsets[] = { 0, 512 * ONEKB, ONEMEG };
+
+ for (use_thp = 0; use_thp < 2; ++use_thp) {
+ for (j = 0; j < ARRAY_SIZE(offsets); ++j) {
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ if (use_thp)
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ else
+ ret = madvise(map, size, MADV_NOHUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ munmap(buffer->ptr + offsets[j], ONEMEG);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ if (i * sizeof(int) < offsets[j] ||
+ i * sizeof(int) >= offsets[j] + ONEMEG)
+ ASSERT_EQ(ptr[i], i);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+ }
+ }
+}
+
+TEST_F(hmm, migrate_remap_fault)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size = TWOMEG;
+ unsigned long i;
+ void *old_ptr, *new_ptr = NULL;
+ void *map;
+ int *ptr;
+ int ret, j, use_thp, dont_unmap, before;
+ int offsets[] = { 0, 512 * ONEKB, ONEMEG };
+
+ for (before = 0; before < 2; ++before) {
+ for (dont_unmap = 0; dont_unmap < 2; ++dont_unmap) {
+ for (use_thp = 0; use_thp < 2; ++use_thp) {
+ for (j = 0; j < ARRAY_SIZE(offsets); ++j) {
+ int flags = MREMAP_MAYMOVE | MREMAP_FIXED;
+
+ if (dont_unmap)
+ flags |= MREMAP_DONTUNMAP;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 8 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, buffer->size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ if (use_thp)
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ else
+ ret = madvise(map, size, MADV_NOHUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ munmap(map + size, size * 2);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr;
+ i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ if (before) {
+ new_ptr = mremap((void *)map, size, size, flags,
+ map + size + offsets[j]);
+ ASSERT_NE(new_ptr, MAP_FAILED);
+ buffer->ptr = new_ptr;
+ }
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror;
+ i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ if (!before) {
+ new_ptr = mremap((void *)map, size, size, flags,
+ map + size + offsets[j]);
+ ASSERT_NE(new_ptr, MAP_FAILED);
+ buffer->ptr = new_ptr;
+ }
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr;
+ i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ munmap(new_ptr, size);
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+ }
+ }
+ }
+ }
+}
+
/*
* Migrate private anonymous huge page with allocation errors.
*/
--
2.51.0
* [v7 15/16] selftests/mm/hmm-tests: new throughput tests including THP
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (13 preceding siblings ...)
2025-10-01 6:57 ` [v7 14/16] selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests Balbir Singh
@ 2025-10-01 6:57 ` Balbir Singh
2025-10-01 6:57 ` [v7 16/16] gpu/drm/nouveau: enable THP support for GPU memory migration Balbir Singh
2025-10-09 3:17 ` [v7 00/16] mm: support device-private THP Andrew Morton
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:57 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Add new benchmark-style tests that measure transfer bandwidth for zone
device memory operations.
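The reported throughput is derived from the averaged wall-clock time as
GB/s = (buffer_size / 2^30) / (elapsed_ms / 1000). As a purely illustrative
example (not a measured result), a 2MB buffer migrated in 0.5 ms works out
to (2 / 1024) / 0.0005, roughly 3.9 GB/s.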
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
tools/testing/selftests/mm/hmm-tests.c | 197 ++++++++++++++++++++++++-
1 file changed, 196 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index dedc1049bd4d..5a1525f72daa 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -25,6 +25,7 @@
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
+#include <sys/time.h>
/*
@@ -209,8 +210,10 @@ static void hmm_buffer_free(struct hmm_buffer *buffer)
if (buffer == NULL)
return;
- if (buffer->ptr)
+ if (buffer->ptr) {
munmap(buffer->ptr, buffer->size);
+ buffer->ptr = NULL;
+ }
free(buffer->mirror);
free(buffer);
}
@@ -2657,4 +2660,196 @@ TEST_F(hmm, migrate_anon_huge_zero_err)
buffer->ptr = old_ptr;
hmm_buffer_free(buffer);
}
+
+struct benchmark_results {
+ double sys_to_dev_time;
+ double dev_to_sys_time;
+ double throughput_s2d;
+ double throughput_d2s;
+};
+
+static double get_time_ms(void)
+{
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ return (tv.tv_sec * 1000.0) + (tv.tv_usec / 1000.0);
+}
+
+static inline struct hmm_buffer *hmm_buffer_alloc(unsigned long size)
+{
+ struct hmm_buffer *buffer;
+
+ buffer = malloc(sizeof(*buffer));
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ memset(buffer->mirror, 0xFF, size);
+ return buffer;
+}
+
+static void print_benchmark_results(const char *test_name, size_t buffer_size,
+ struct benchmark_results *thp,
+ struct benchmark_results *regular)
+{
+ double s2d_improvement = ((regular->sys_to_dev_time - thp->sys_to_dev_time) /
+ regular->sys_to_dev_time) * 100.0;
+ double d2s_improvement = ((regular->dev_to_sys_time - thp->dev_to_sys_time) /
+ regular->dev_to_sys_time) * 100.0;
+ double throughput_s2d_improvement = ((thp->throughput_s2d - regular->throughput_s2d) /
+ regular->throughput_s2d) * 100.0;
+ double throughput_d2s_improvement = ((thp->throughput_d2s - regular->throughput_d2s) /
+ regular->throughput_d2s) * 100.0;
+
+ printf("\n=== %s (%.1f MB) ===\n", test_name, buffer_size / (1024.0 * 1024.0));
+ printf(" | With THP | Without THP | Improvement\n");
+ printf("---------------------------------------------------------------------\n");
+ printf("Sys->Dev Migration | %.3f ms | %.3f ms | %.1f%%\n",
+ thp->sys_to_dev_time, regular->sys_to_dev_time, s2d_improvement);
+ printf("Dev->Sys Migration | %.3f ms | %.3f ms | %.1f%%\n",
+ thp->dev_to_sys_time, regular->dev_to_sys_time, d2s_improvement);
+ printf("S->D Throughput | %.2f GB/s | %.2f GB/s | %.1f%%\n",
+ thp->throughput_s2d, regular->throughput_s2d, throughput_s2d_improvement);
+ printf("D->S Throughput | %.2f GB/s | %.2f GB/s | %.1f%%\n",
+ thp->throughput_d2s, regular->throughput_d2s, throughput_d2s_improvement);
+}
+
+/*
+ * Run a single migration benchmark
+ * fd: file descriptor for hmm device
+ * use_thp: whether to use THP
+ * buffer_size: size of buffer to allocate
+ * iterations: number of iterations
+ * results: where to store results
+ */
+static inline int run_migration_benchmark(int fd, int use_thp, size_t buffer_size,
+ int iterations, struct benchmark_results *results)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages = buffer_size / sysconf(_SC_PAGESIZE);
+ double start, end;
+ double s2d_total = 0, d2s_total = 0;
+ int ret, i;
+ int *ptr;
+
+ buffer = hmm_buffer_alloc(buffer_size);
+
+ /* Map memory */
+ buffer->ptr = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+ if (!buffer->ptr)
+ return -1;
+
+ /* Apply THP hint if requested */
+ if (use_thp)
+ ret = madvise(buffer->ptr, buffer_size, MADV_HUGEPAGE);
+ else
+ ret = madvise(buffer->ptr, buffer_size, MADV_NOHUGEPAGE);
+
+ if (ret)
+ return ret;
+
+ /* Initialize memory to make sure pages are allocated */
+ ptr = (int *)buffer->ptr;
+ for (i = 0; i < buffer_size / sizeof(int); i++)
+ ptr[i] = i & 0xFF;
+
+ /* Warmup iteration */
+ ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ /* Benchmark iterations */
+ for (i = 0; i < iterations; i++) {
+ /* System to device migration */
+ start = get_time_ms();
+
+ ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ end = get_time_ms();
+ s2d_total += (end - start);
+
+ /* Device to system migration */
+ start = get_time_ms();
+
+ ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ end = get_time_ms();
+ d2s_total += (end - start);
+ }
+
+ /* Calculate average times and throughput */
+ results->sys_to_dev_time = s2d_total / iterations;
+ results->dev_to_sys_time = d2s_total / iterations;
+ results->throughput_s2d = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+ (results->sys_to_dev_time / 1000.0);
+ results->throughput_d2s = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+ (results->dev_to_sys_time / 1000.0);
+
+ /* Cleanup */
+ hmm_buffer_free(buffer);
+ return 0;
+}
+
+/*
+ * Benchmark THP migration with different buffer sizes
+ */
+TEST_F_TIMEOUT(hmm, benchmark_thp_migration, 120)
+{
+ struct benchmark_results thp_results, regular_results;
+ size_t thp_size = 2 * 1024 * 1024; /* 2MB - typical THP size */
+ int iterations = 5;
+
+ printf("\nHMM THP Migration Benchmark\n");
+ printf("---------------------------\n");
+ printf("System page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
+
+ /* Test different buffer sizes */
+ size_t test_sizes[] = {
+ thp_size / 4, /* 512KB - smaller than THP */
+ thp_size / 2, /* 1MB - half THP */
+ thp_size, /* 2MB - single THP */
+ thp_size * 2, /* 4MB - two THPs */
+ thp_size * 4, /* 8MB - four THPs */
+ thp_size * 8, /* 16MB - eight THPs */
+ thp_size * 128, /* 256MB - one hundred twenty-eight THPs */
+ };
+
+ static const char *const test_names[] = {
+ "Small Buffer (512KB)",
+ "Half THP Size (1MB)",
+ "Single THP Size (2MB)",
+ "Two THP Size (4MB)",
+ "Four THP Size (8MB)",
+ "Eight THP Size (16MB)",
+ "One twenty eight THP Size (256MB)"
+ };
+
+ int num_tests = ARRAY_SIZE(test_sizes);
+
+ /* Run all tests */
+ for (int i = 0; i < num_tests; i++) {
+ /* Test with THP */
+ ASSERT_EQ(run_migration_benchmark(self->fd, 1, test_sizes[i],
+ iterations, &thp_results), 0);
+
+ /* Test without THP */
+ ASSERT_EQ(run_migration_benchmark(self->fd, 0, test_sizes[i],
+ iterations, &regular_results), 0);
+
+ /* Print results */
+ print_benchmark_results(test_names[i], test_sizes[i],
+ &thp_results, &regular_results);
+ }
+}
TEST_HARNESS_MAIN
--
2.51.0
* [v7 16/16] gpu/drm/nouveau: enable THP support for GPU memory migration
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (14 preceding siblings ...)
2025-10-01 6:57 ` [v7 15/16] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
@ 2025-10-01 6:57 ` Balbir Singh
2025-10-09 3:17 ` [v7 00/16] mm: support device-private THP Andrew Morton
16 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-01 6:57 UTC (permalink / raw)
To: linux-kernel, dri-devel, linux-mm
Cc: akpm, Balbir Singh, David Hildenbrand, Zi Yan, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Enable MIGRATE_VMA_SELECT_COMPOUND support in the nouveau driver to take
advantage of THP zone device migration capabilities.
Update migration and eviction code paths to handle compound page sizes
appropriately, improving memory bandwidth utilization and reducing
migration overhead for large GPU memory allocations.
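The main driver-side win is that a whole folio is now DMA-mapped and copied
with a single copy_func call instead of page by page. Condensed from
nouveau_dmem_copy_folio() in the diff below (error unwinding of the DMA
mapping is omitted here for brevity):

  dma_info->dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
                                    DMA_BIDIRECTIONAL);
  dma_info->size = page_size(dpage);
  if (dma_mapping_error(dev, dma_info->dma_addr))
          return -EIO;

  if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(sfolio),
                                   NOUVEAU_APER_HOST, dma_info->dma_addr,
                                   NOUVEAU_APER_VRAM,
                                   nouveau_dmem_page_addr(spage)))
          return -EIO;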
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 303 ++++++++++++++++++-------
drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
3 files changed, 229 insertions(+), 83 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index d34288ebe7d2..244812e7dd69 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -50,6 +50,7 @@
*/
#define DMEM_CHUNK_SIZE (2UL << 20)
#define DMEM_CHUNK_NPAGES (DMEM_CHUNK_SIZE >> PAGE_SHIFT)
+#define NR_CHUNKS (128)
enum nouveau_aper {
NOUVEAU_APER_VIRT,
@@ -83,9 +84,15 @@ struct nouveau_dmem {
struct list_head chunks;
struct mutex mutex;
struct page *free_pages;
+ struct folio *free_folios;
spinlock_t lock;
};
+struct nouveau_dmem_dma_info {
+ dma_addr_t dma_addr;
+ size_t size;
+};
+
static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
{
return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
@@ -115,8 +122,13 @@ static void nouveau_dmem_folio_free(struct folio *folio)
struct nouveau_dmem *dmem = chunk->drm->dmem;
spin_lock(&dmem->lock);
- page->zone_device_data = dmem->free_pages;
- dmem->free_pages = page;
+ if (folio_order(folio)) {
+ page->zone_device_data = dmem->free_folios;
+ dmem->free_folios = folio;
+ } else {
+ page->zone_device_data = dmem->free_pages;
+ dmem->free_pages = page;
+ }
WARN_ON(!chunk->callocated);
chunk->callocated--;
@@ -140,20 +152,28 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)
}
}
-static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
- struct page *dpage, dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_folio(struct nouveau_drm *drm,
+ struct folio *sfolio, struct folio *dfolio,
+ struct nouveau_dmem_dma_info *dma_info)
{
struct device *dev = drm->dev->dev;
+ struct page *dpage = folio_page(dfolio, 0);
+ struct page *spage = folio_page(sfolio, 0);
- lock_page(dpage);
+ folio_lock(dfolio);
- *dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
- if (dma_mapping_error(dev, *dma_addr))
+ dma_info->dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+ DMA_BIDIRECTIONAL);
+ dma_info->size = page_size(dpage);
+ if (dma_mapping_error(dev, dma_info->dma_addr))
return -EIO;
- if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
- NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage))) {
- dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(sfolio),
+ NOUVEAU_APER_HOST, dma_info->dma_addr,
+ NOUVEAU_APER_VRAM,
+ nouveau_dmem_page_addr(spage))) {
+ dma_unmap_page(dev, dma_info->dma_addr, page_size(dpage),
+ DMA_BIDIRECTIONAL);
return -EIO;
}
@@ -166,21 +186,47 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
struct nouveau_dmem *dmem = drm->dmem;
struct nouveau_fence *fence;
struct nouveau_svmm *svmm;
- struct page *spage, *dpage;
- unsigned long src = 0, dst = 0;
- dma_addr_t dma_addr = 0;
+ struct page *dpage;
vm_fault_t ret = 0;
struct migrate_vma args = {
.vma = vmf->vma,
- .start = vmf->address,
- .end = vmf->address + PAGE_SIZE,
- .src = &src,
- .dst = &dst,
.pgmap_owner = drm->dev,
.fault_page = vmf->page,
- .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+ .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+ MIGRATE_VMA_SELECT_COMPOUND,
+ .src = NULL,
+ .dst = NULL,
};
+ unsigned int order, nr;
+ struct folio *sfolio, *dfolio;
+ struct nouveau_dmem_dma_info dma_info;
+
+ sfolio = page_folio(vmf->page);
+ order = folio_order(sfolio);
+ nr = 1 << order;
+
+ /*
+ * Handle partial unmap faults, where the folio is large, but
+ * the pmd is split.
+ */
+ if (vmf->pte) {
+ order = 0;
+ nr = 1;
+ }
+
+ if (order)
+ args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+ args.start = ALIGN_DOWN(vmf->address, (PAGE_SIZE << order));
+ args.vma = vmf->vma;
+ args.end = args.start + (PAGE_SIZE << order);
+ args.src = kcalloc(nr, sizeof(*args.src), GFP_KERNEL);
+ args.dst = kcalloc(nr, sizeof(*args.dst), GFP_KERNEL);
+
+ if (!args.src || !args.dst) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
/*
* FIXME what we really want is to find some heuristic to migrate more
* than just one page on CPU fault. When such fault happens it is very
@@ -191,20 +237,26 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
if (!args.cpages)
return 0;
- spage = migrate_pfn_to_page(src);
- if (!spage || !(src & MIGRATE_PFN_MIGRATE))
- goto done;
-
- dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma, vmf->address);
- if (!dpage)
+ if (order)
+ dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER | __GFP_ZERO,
+ order, vmf->vma, vmf->address), 0);
+ else
+ dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma,
+ vmf->address);
+ if (!dpage) {
+ ret = VM_FAULT_OOM;
goto done;
+ }
- dst = migrate_pfn(page_to_pfn(dpage));
+ args.dst[0] = migrate_pfn(page_to_pfn(dpage));
+ if (order)
+ args.dst[0] |= MIGRATE_PFN_COMPOUND;
+ dfolio = page_folio(dpage);
- svmm = spage->zone_device_data;
+ svmm = folio_zone_device_data(sfolio);
mutex_lock(&svmm->mutex);
nouveau_svmm_invalidate(svmm, args.start, args.end);
- ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+ ret = nouveau_dmem_copy_folio(drm, sfolio, dfolio, &dma_info);
mutex_unlock(&svmm->mutex);
if (ret) {
ret = VM_FAULT_SIGBUS;
@@ -214,25 +266,40 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
nouveau_fence_new(&fence, dmem->migrate.chan);
migrate_vma_pages(&args);
nouveau_dmem_fence_done(&fence);
- dma_unmap_page(drm->dev->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(drm->dev->dev, dma_info.dma_addr, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
done:
migrate_vma_finalize(&args);
+err:
+ kfree(args.src);
+ kfree(args.dst);
return ret;
}
+static void nouveau_dmem_folio_split(struct folio *head, struct folio *tail)
+{
+ if (tail == NULL)
+ return;
+ tail->pgmap = head->pgmap;
+ tail->mapping = head->mapping;
+ folio_set_zone_device_data(tail, folio_zone_device_data(head));
+}
+
static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
.folio_free = nouveau_dmem_folio_free,
.migrate_to_ram = nouveau_dmem_migrate_to_ram,
+ .folio_split = nouveau_dmem_folio_split,
};
static int
-nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
+nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage,
+ bool is_large)
{
struct nouveau_dmem_chunk *chunk;
struct resource *res;
struct page *page;
void *ptr;
- unsigned long i, pfn_first;
+ unsigned long i, pfn_first, pfn;
int ret;
chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
@@ -242,7 +309,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
}
/* Allocate unused physical address space for device private pages. */
- res = request_free_mem_region(&iomem_resource, DMEM_CHUNK_SIZE,
+ res = request_free_mem_region(&iomem_resource, DMEM_CHUNK_SIZE * NR_CHUNKS,
"nouveau_dmem");
if (IS_ERR(res)) {
ret = PTR_ERR(res);
@@ -275,16 +342,40 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
page = pfn_to_page(pfn_first);
spin_lock(&drm->dmem->lock);
- for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
- page->zone_device_data = drm->dmem->free_pages;
- drm->dmem->free_pages = page;
+
+ pfn = pfn_first;
+ for (i = 0; i < NR_CHUNKS; i++) {
+ int j;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || !is_large) {
+ for (j = 0; j < DMEM_CHUNK_NPAGES - 1; j++, pfn++) {
+ page = pfn_to_page(pfn);
+ page->zone_device_data = drm->dmem->free_pages;
+ drm->dmem->free_pages = page;
+ }
+ } else {
+ page = pfn_to_page(pfn);
+ page->zone_device_data = drm->dmem->free_folios;
+ drm->dmem->free_folios = page_folio(page);
+ pfn += DMEM_CHUNK_NPAGES;
+ }
}
- *ppage = page;
+
+ /* Pop the first free page/folio off the list and return it */
+ if (is_large) {
+ *ppage = &drm->dmem->free_folios->page;
+ drm->dmem->free_folios = (*ppage)->zone_device_data;
+ } else {
+ *ppage = drm->dmem->free_pages;
+ drm->dmem->free_pages = (*ppage)->zone_device_data;
+ }
+
chunk->callocated++;
spin_unlock(&drm->dmem->lock);
- NV_INFO(drm, "DMEM: registered %ldMB of device memory\n",
- DMEM_CHUNK_SIZE >> 20);
+ NV_INFO(drm, "DMEM: registered %ldMB of %sdevice memory %lx %lx\n",
+ NR_CHUNKS * DMEM_CHUNK_SIZE >> 20, is_large ? "THP " : "", pfn_first,
+ nouveau_dmem_page_addr(page));
return 0;
@@ -299,27 +390,41 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
}
static struct page *
-nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
+nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
{
struct nouveau_dmem_chunk *chunk;
struct page *page = NULL;
+ struct folio *folio = NULL;
int ret;
+ unsigned int order = 0;
spin_lock(&drm->dmem->lock);
- if (drm->dmem->free_pages) {
+ if (is_large && drm->dmem->free_folios) {
+ folio = drm->dmem->free_folios;
+ page = &folio->page;
+ drm->dmem->free_folios = page->zone_device_data;
+ chunk = nouveau_page_to_chunk(&folio->page);
+ chunk->callocated++;
+ spin_unlock(&drm->dmem->lock);
+ order = ilog2(DMEM_CHUNK_NPAGES);
+ } else if (!is_large && drm->dmem->free_pages) {
page = drm->dmem->free_pages;
drm->dmem->free_pages = page->zone_device_data;
chunk = nouveau_page_to_chunk(page);
chunk->callocated++;
spin_unlock(&drm->dmem->lock);
+ folio = page_folio(page);
} else {
spin_unlock(&drm->dmem->lock);
- ret = nouveau_dmem_chunk_alloc(drm, &page);
+ ret = nouveau_dmem_chunk_alloc(drm, &page, is_large);
if (ret)
return NULL;
+ folio = page_folio(page);
+ if (is_large)
+ order = ilog2(DMEM_CHUNK_NPAGES);
}
- zone_device_page_init(page, 0);
+ zone_device_folio_init(folio, order);
return page;
}
@@ -370,12 +475,12 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
{
unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
unsigned long *src_pfns, *dst_pfns;
- dma_addr_t *dma_addrs;
+ struct nouveau_dmem_dma_info *dma_info;
struct nouveau_fence *fence;
src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
- dma_addrs = kvcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL | __GFP_NOFAIL);
+ dma_info = kvcalloc(npages, sizeof(*dma_info), GFP_KERNEL | __GFP_NOFAIL);
migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
npages);
@@ -383,17 +488,28 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
for (i = 0; i < npages; i++) {
if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
struct page *dpage;
+ struct folio *folio = page_folio(
+ migrate_pfn_to_page(src_pfns[i]));
+ unsigned int order = folio_order(folio);
+
+ if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+ dpage = folio_page(
+ folio_alloc(
+ GFP_HIGHUSER_MOVABLE, order), 0);
+ } else {
+ /*
+ * _GFP_NOFAIL because the GPU is going away and there
+ * is nothing sensible we can do if we can't copy the
+ * data back.
+ */
+ dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+ }
- /*
- * _GFP_NOFAIL because the GPU is going away and there
- * is nothing sensible we can do if we can't copy the
- * data back.
- */
- dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
- nouveau_dmem_copy_one(chunk->drm,
- migrate_pfn_to_page(src_pfns[i]), dpage,
- &dma_addrs[i]);
+ nouveau_dmem_copy_folio(chunk->drm,
+ page_folio(migrate_pfn_to_page(src_pfns[i])),
+ page_folio(dpage),
+ &dma_info[i]);
}
}
@@ -404,8 +520,9 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
kvfree(src_pfns);
kvfree(dst_pfns);
for (i = 0; i < npages; i++)
- dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
- kvfree(dma_addrs);
+ dma_unmap_page(chunk->drm->dev->dev, dma_info[i].dma_addr,
+ dma_info[i].size, DMA_BIDIRECTIONAL);
+ kvfree(dma_info);
}
void
@@ -608,31 +725,36 @@ nouveau_dmem_init(struct nouveau_drm *drm)
static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
struct nouveau_svmm *svmm, unsigned long src,
- dma_addr_t *dma_addr, u64 *pfn)
+ struct nouveau_dmem_dma_info *dma_info, u64 *pfn)
{
struct device *dev = drm->dev->dev;
struct page *dpage, *spage;
unsigned long paddr;
+ bool is_large = false;
+ unsigned long mpfn;
spage = migrate_pfn_to_page(src);
if (!(src & MIGRATE_PFN_MIGRATE))
goto out;
- dpage = nouveau_dmem_page_alloc_locked(drm);
+ is_large = src & MIGRATE_PFN_COMPOUND;
+ dpage = nouveau_dmem_page_alloc_locked(drm, is_large);
if (!dpage)
goto out;
paddr = nouveau_dmem_page_addr(dpage);
if (spage) {
- *dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
+ dma_info->dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
DMA_BIDIRECTIONAL);
- if (dma_mapping_error(dev, *dma_addr))
+ dma_info->size = page_size(spage);
+ if (dma_mapping_error(dev, dma_info->dma_addr))
goto out_free_page;
- if (drm->dmem->migrate.copy_func(drm, 1,
- NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
+ if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(page_folio(spage)),
+ NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST,
+ dma_info->dma_addr))
goto out_dma_unmap;
} else {
- *dma_addr = DMA_MAPPING_ERROR;
+ dma_info->dma_addr = DMA_MAPPING_ERROR;
if (drm->dmem->migrate.clear_func(drm, page_size(dpage),
NOUVEAU_APER_VRAM, paddr))
goto out_free_page;
@@ -643,10 +765,13 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
if (src & MIGRATE_PFN_WRITE)
*pfn |= NVIF_VMM_PFNMAP_V0_W;
- return migrate_pfn(page_to_pfn(dpage));
+ mpfn = migrate_pfn(page_to_pfn(dpage));
+ if (folio_order(page_folio(dpage)))
+ mpfn |= MIGRATE_PFN_COMPOUND;
+ return mpfn;
out_dma_unmap:
- dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(dev, dma_info->dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
out_free_page:
nouveau_dmem_page_free_locked(drm, dpage);
out:
@@ -656,27 +781,38 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
struct nouveau_svmm *svmm, struct migrate_vma *args,
- dma_addr_t *dma_addrs, u64 *pfns)
+ struct nouveau_dmem_dma_info *dma_info, u64 *pfns)
{
struct nouveau_fence *fence;
unsigned long addr = args->start, nr_dma = 0, i;
+ unsigned long order = 0;
+
+ for (i = 0; addr < args->end; ) {
+ struct folio *folio;
- for (i = 0; addr < args->end; i++) {
args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
- args->src[i], dma_addrs + nr_dma, pfns + i);
- if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
+ args->src[i], dma_info + nr_dma, pfns + i);
+ if (!args->dst[i]) {
+ i++;
+ addr += PAGE_SIZE;
+ continue;
+ }
+ if (!dma_mapping_error(drm->dev->dev, dma_info[nr_dma].dma_addr))
nr_dma++;
- addr += PAGE_SIZE;
+ folio = page_folio(migrate_pfn_to_page(args->dst[i]));
+ order = folio_order(folio);
+ i += 1 << order;
+ addr += (1 << order) * PAGE_SIZE;
}
nouveau_fence_new(&fence, drm->dmem->migrate.chan);
migrate_vma_pages(args);
nouveau_dmem_fence_done(&fence);
- nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+ nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i, order);
while (nr_dma--) {
- dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
- DMA_BIDIRECTIONAL);
+ dma_unmap_page(drm->dev->dev, dma_info[nr_dma].dma_addr,
+ dma_info[nr_dma].size, DMA_BIDIRECTIONAL);
}
migrate_vma_finalize(args);
}
@@ -689,20 +825,27 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
unsigned long end)
{
unsigned long npages = (end - start) >> PAGE_SHIFT;
- unsigned long max = min(SG_MAX_SINGLE_ALLOC, npages);
- dma_addr_t *dma_addrs;
+ unsigned long max = npages;
struct migrate_vma args = {
.vma = vma,
.start = start,
.pgmap_owner = drm->dev,
- .flags = MIGRATE_VMA_SELECT_SYSTEM,
+ .flags = MIGRATE_VMA_SELECT_SYSTEM
+ | MIGRATE_VMA_SELECT_COMPOUND,
};
unsigned long i;
u64 *pfns;
int ret = -ENOMEM;
+ struct nouveau_dmem_dma_info *dma_info;
- if (drm->dmem == NULL)
- return -ENODEV;
+ if (drm->dmem == NULL) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ if (max > (unsigned long)HPAGE_PMD_NR)
+ max = (unsigned long)HPAGE_PMD_NR;
args.src = kcalloc(max, sizeof(*args.src), GFP_KERNEL);
if (!args.src)
@@ -711,8 +854,8 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
if (!args.dst)
goto out_free_src;
- dma_addrs = kmalloc_array(max, sizeof(*dma_addrs), GFP_KERNEL);
- if (!dma_addrs)
+ dma_info = kmalloc_array(max, sizeof(*dma_info), GFP_KERNEL);
+ if (!dma_info)
goto out_free_dst;
pfns = nouveau_pfns_alloc(max);
@@ -730,7 +873,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
goto out_free_pfns;
if (args.cpages)
- nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_addrs,
+ nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_info,
pfns);
args.start = args.end;
}
@@ -739,7 +882,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
out_free_pfns:
nouveau_pfns_free(pfns);
out_free_dma:
- kfree(dma_addrs);
+ kfree(dma_info);
out_free_dst:
kfree(args.dst);
out_free_src:
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 6fa387da0637..b8a3378154d5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -921,12 +921,14 @@ nouveau_pfns_free(u64 *pfns)
void
nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
- unsigned long addr, u64 *pfns, unsigned long npages)
+ unsigned long addr, u64 *pfns, unsigned long npages,
+ unsigned int page_shift)
{
struct nouveau_pfnmap_args *args = nouveau_pfns_to_args(pfns);
args->p.addr = addr;
- args->p.size = npages << PAGE_SHIFT;
+ args->p.size = npages << page_shift;
+ args->p.page = page_shift;
mutex_lock(&svmm->mutex);
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.h b/drivers/gpu/drm/nouveau/nouveau_svm.h
index e7d63d7f0c2d..3fd78662f17e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.h
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.h
@@ -33,7 +33,8 @@ void nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit);
u64 *nouveau_pfns_alloc(unsigned long npages);
void nouveau_pfns_free(u64 *pfns);
void nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
- unsigned long addr, u64 *pfns, unsigned long npages);
+ unsigned long addr, u64 *pfns, unsigned long npages,
+ unsigned int page_shift);
#else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */
static inline void nouveau_svm_init(struct nouveau_drm *drm) {}
static inline void nouveau_svm_fini(struct nouveau_drm *drm) {}
--
2.51.0
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
` (15 preceding siblings ...)
2025-10-01 6:57 ` [v7 16/16] gpu/drm/nouveau: enable THP support for GPU memory migration Balbir Singh
@ 2025-10-09 3:17 ` Andrew Morton
2025-10-09 3:26 ` Balbir Singh
16 siblings, 1 reply; 75+ messages in thread
From: Andrew Morton @ 2025-10-09 3:17 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, dri-devel, linux-mm, David Hildenbrand, Zi Yan,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On Wed, 1 Oct 2025 16:56:51 +1000 Balbir Singh <balbirs@nvidia.com> wrote:
> This patch series introduces support for Transparent Huge Page
> (THP) migration in zone device-private memory. The implementation enables
> efficient migration of large folios between system memory and
> device-private memory
Lots of chatter for the v6 series, but none for v7. I hope that's a
good sign.
>
> HMM support for large folios, patches are already posted and in
> mm-unstable.
Not any more. Which series was this?
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-10-09 3:17 ` [v7 00/16] mm: support device-private THP Andrew Morton
@ 2025-10-09 3:26 ` Balbir Singh
2025-10-09 10:33 ` Matthew Brost
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-10-09 3:26 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, dri-devel, linux-mm, David Hildenbrand, Zi Yan,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 10/9/25 14:17, Andrew Morton wrote:
> On Wed, 1 Oct 2025 16:56:51 +1000 Balbir Singh <balbirs@nvidia.com> wrote:
>
>> This patch series introduces support for Transparent Huge Page
>> (THP) migration in zone device-private memory. The implementation enables
>> efficient migration of large folios between system memory and
>> device-private memory
>
> Lots of chatter for the v6 series, but none for v7. I hope that's a
> good sign.
>
I hope so too; I've tried to address the comments on v6.
>>
>> HMM support for large folios, patches are already posted and in
>> mm-unstable.
>
> Not any more. Which series was this?
Not a series, but a patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?id=10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1
Thanks,
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-10-09 3:26 ` Balbir Singh
@ 2025-10-09 10:33 ` Matthew Brost
2025-10-13 22:51 ` Balbir Singh
2025-11-11 23:43 ` Andrew Morton
0 siblings, 2 replies; 75+ messages in thread
From: Matthew Brost @ 2025-10-09 10:33 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On Thu, Oct 09, 2025 at 02:26:30PM +1100, Balbir Singh wrote:
> On 10/9/25 14:17, Andrew Morton wrote:
> > On Wed, 1 Oct 2025 16:56:51 +1000 Balbir Singh <balbirs@nvidia.com> wrote:
> >
> >> This patch series introduces support for Transparent Huge Page
> >> (THP) migration in zone device-private memory. The implementation enables
> >> efficient migration of large folios between system memory and
> >> device-private memory
> >
> > Lots of chatter for the v6 series, but none for v7. I hope that's a
> > good sign.
> >
>
> I hope so too, I've tried to address the comments in v6.
>
Circling back to this series, we will integrate and test this version.
> >>
> >> HMM support for large folios, patches are already posted and in
> >> mm-unstable.
> >
> > Not any more. Which series was this?
>
> Not a series, but a patch
>
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?id=10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1
I think this [1] means this patch is in Linus's tree?
Matt
[1] https://github.com/torvalds/linux/commit/10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1
>
> Thanks,
> Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 01/16] mm/zone_device: support large zone device private folios
2025-10-01 6:56 ` [v7 01/16] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-10-12 6:10 ` Lance Yang
2025-10-12 22:54 ` Balbir Singh
0 siblings, 1 reply; 75+ messages in thread
From: Lance Yang @ 2025-10-12 6:10 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast, Madhavan Srinivasan, Christophe Leroy,
Felix Kuehling, Alex Deucher, Christian König
Hi Balbir,
Just one nit below :)
On Wed, Oct 1, 2025 at 3:43 PM Balbir Singh <balbirs@nvidia.com> wrote:
>
> Add routines to support allocation of large order zone device folios
> and helper functions for zone device folios, to check if a folio is
> device private and helpers for setting zone device data.
>
> When large folios are used, the existing page_free() callback in
> pgmap is called when the folio is freed, this is true for both
> PAGE_SIZE and higher order pages.
>
> Zone device private large folios do not support deferred split and
> scan like normal THP folios.
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> Cc: Rakie Kim <rakie.kim@sk.com>
> Cc: Byungchul Park <byungchul@sk.com>
> Cc: Gregory Price <gourry@gourry.net>
> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
> drivers/gpu/drm/drm_pagemap.c | 2 +-
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> include/linux/memremap.h | 10 ++++++++-
> lib/test_hmm.c | 2 +-
> mm/memremap.c | 26 ++++++++++++++----------
> mm/rmap.c | 6 +++++-
> 8 files changed, 34 insertions(+), 18 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 03f8c34fa0a2..91f763410673 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -723,7 +723,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
>
> dpage = pfn_to_page(uvmem_pfn);
> dpage->zone_device_data = pvt;
> - zone_device_page_init(dpage);
> + zone_device_page_init(dpage, 0);
> return dpage;
> out_clear:
> spin_lock(&kvmppc_uvmem_bitmap_lock);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index 79251f22b702..d0e2cae33035 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -217,7 +217,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
> page = pfn_to_page(pfn);
> svm_range_bo_ref(prange->svm_bo);
> page->zone_device_data = prange->svm_bo;
> - zone_device_page_init(page);
> + zone_device_page_init(page, 0);
> }
>
> static void
> diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
> index 1da55322af12..31c53f724e25 100644
> --- a/drivers/gpu/drm/drm_pagemap.c
> +++ b/drivers/gpu/drm/drm_pagemap.c
> @@ -196,7 +196,7 @@ static void drm_pagemap_get_devmem_page(struct page *page,
> struct drm_pagemap_zdd *zdd)
> {
> page->zone_device_data = drm_pagemap_zdd_get(zdd);
> - zone_device_page_init(page);
> + zone_device_page_init(page, 0);
> }
>
> /**
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index ca4932a150e3..53cc1926b9da 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -318,7 +318,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
> return NULL;
> }
>
> - zone_device_page_init(page);
> + zone_device_page_init(page, 0);
> return page;
> }
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index e5951ba12a28..d2487a19cba2 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
> }
>
> #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
> +void zone_device_page_init(struct page *page, unsigned int order);
> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> void memunmap_pages(struct dev_pagemap *pgmap);
> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>
> unsigned long memremap_compat_align(void);
> +
> +static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
> +{
> + zone_device_page_init(&folio->page, order);
> + if (order)
> + folio_set_large_rmappable(folio);
> +}
> +
> #else
> static inline void *devm_memremap_pages(struct device *dev,
> struct dev_pagemap *pgmap)
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 83e3d8208a54..24d82121cde8 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -627,7 +627,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
> goto error;
> }
>
> - zone_device_page_init(dpage);
> + zone_device_page_init(dpage, 0);
> dpage->zone_device_data = rpage;
> return dpage;
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 46cb1b0b6f72..e45dfb568710 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
> void free_zone_device_folio(struct folio *folio)
> {
> struct dev_pagemap *pgmap = folio->pgmap;
> + unsigned long nr = folio_nr_pages(folio);
> + int i;
>
> if (WARN_ON_ONCE(!pgmap))
> return;
>
> mem_cgroup_uncharge(folio);
>
> - /*
> - * Note: we don't expect anonymous compound pages yet. Once supported
> - * and we could PTE-map them similar to THP, we'd have to clear
> - * PG_anon_exclusive on all tail pages.
> - */
> if (folio_test_anon(folio)) {
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> - __ClearPageAnonExclusive(folio_page(folio, 0));
> + for (i = 0; i < nr; i++)
> + __ClearPageAnonExclusive(folio_page(folio, i));
> + } else {
> + VM_WARN_ON_ONCE(folio_test_large(folio));
> }
>
> /*
> @@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
> case MEMORY_DEVICE_COHERENT:
> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
> break;
> - pgmap->ops->page_free(folio_page(folio, 0));
> - put_dev_pagemap(pgmap);
> + pgmap->ops->page_free(&folio->page);
> + percpu_ref_put_many(&folio->pgmap->ref, nr);
Nit: &pgmap->ref here for consistency?
Cheers,
Lance
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations
2025-10-01 6:56 ` [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations Balbir Singh
@ 2025-10-12 15:46 ` Lance Yang
2025-10-13 0:01 ` Balbir Singh
2025-10-17 14:49 ` linux-next: KVM/s390x regression (was: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations) Christian Borntraeger
1 sibling, 1 reply; 75+ messages in thread
From: Lance Yang @ 2025-10-12 15:46 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On Wed, Oct 1, 2025 at 4:20 PM Balbir Singh <balbirs@nvidia.com> wrote:
>
> Extend core huge page management functions to handle device-private THP
> entries. This enables proper handling of large device-private folios in
> fundamental MM operations.
>
> The following functions have been updated:
>
> - copy_huge_pmd(): Handle device-private entries during fork/clone
> - zap_huge_pmd(): Properly free device-private THP during munmap
> - change_huge_pmd(): Support protection changes on device-private THP
> - __pte_offset_map(): Add device-private entry awareness
>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> Cc: Rakie Kim <rakie.kim@sk.com>
> Cc: Byungchul Park <byungchul@sk.com>
> Cc: Gregory Price <gourry@gourry.net>
> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/swapops.h | 32 +++++++++++++++++++++++
> mm/huge_memory.c | 56 ++++++++++++++++++++++++++++++++++-------
> mm/pgtable-generic.c | 2 +-
> 3 files changed, 80 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 64ea151a7ae3..2687928a8146 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -594,10 +594,42 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
> }
> #endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>
> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
> +
> +/**
> + * is_pmd_device_private_entry() - Check if PMD contains a device private swap entry
> + * @pmd: The PMD to check
> + *
> + * Returns true if the PMD contains a swap entry that represents a device private
> + * page mapping. This is used for zone device private pages that have been
> + * swapped out but still need special handling during various memory management
> + * operations.
> + *
> + * Return: 1 if PMD contains device private entry, 0 otherwise
> + */
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> + return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
> +}
> +
> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> + return 0;
> +}
> +
> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> static inline int non_swap_entry(swp_entry_t entry)
> {
> return swp_type(entry) >= MAX_SWAPFILES;
> }
>
> +static inline int is_pmd_non_present_folio_entry(pmd_t pmd)
> +{
> + return is_pmd_migration_entry(pmd) || is_pmd_device_private_entry(pmd);
> +}
> +
> #endif /* CONFIG_MMU */
> #endif /* _LINUX_SWAPOPS_H */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4225..8e0a1747762d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1703,17 +1703,45 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (unlikely(is_swap_pmd(pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(pmd);
>
> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
> - if (!is_readable_migration_entry(entry)) {
> - entry = make_readable_migration_entry(
> - swp_offset(entry));
> + VM_WARN_ON(!is_pmd_non_present_folio_entry(pmd));
> +
> + if (is_writable_migration_entry(entry) ||
> + is_readable_exclusive_migration_entry(entry)) {
> + entry = make_readable_migration_entry(swp_offset(entry));
> pmd = swp_entry_to_pmd(entry);
> if (pmd_swp_soft_dirty(*src_pmd))
> pmd = pmd_swp_mksoft_dirty(pmd);
> if (pmd_swp_uffd_wp(*src_pmd))
> pmd = pmd_swp_mkuffd_wp(pmd);
> set_pmd_at(src_mm, addr, src_pmd, pmd);
> + } else if (is_device_private_entry(entry)) {
> + /*
> + * For device private entries, since there are no
> + * read exclusive entries, writable = !readable
> + */
> + if (is_writable_device_private_entry(entry)) {
> + entry = make_readable_device_private_entry(swp_offset(entry));
> + pmd = swp_entry_to_pmd(entry);
> +
> + if (pmd_swp_soft_dirty(*src_pmd))
> + pmd = pmd_swp_mksoft_dirty(pmd);
> + if (pmd_swp_uffd_wp(*src_pmd))
> + pmd = pmd_swp_mkuffd_wp(pmd);
> + set_pmd_at(src_mm, addr, src_pmd, pmd);
> + }
> +
> + src_folio = pfn_swap_entry_folio(entry);
> + VM_WARN_ON(!folio_test_large(src_folio));
> +
> + folio_get(src_folio);
> + /*
> + * folio_try_dup_anon_rmap_pmd does not fail for
> + * device private entries.
> + */
> + folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
> + dst_vma, src_vma);
> }
> +
> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> mm_inc_nr_ptes(dst_mm);
> pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> @@ -2211,15 +2239,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_remove_rmap_pmd(folio, page, vma);
> WARN_ON_ONCE(folio_mapcount(folio) < 0);
> VM_BUG_ON_PAGE(!PageHead(page), page);
> - } else if (thp_migration_supported()) {
> + } else if (is_pmd_non_present_folio_entry(orig_pmd)) {
> swp_entry_t entry;
>
> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> entry = pmd_to_swp_entry(orig_pmd);
> folio = pfn_swap_entry_folio(entry);
> flush_needed = 0;
> - } else
> - WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> + if (!thp_migration_supported())
> + WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> + }
>
> if (folio_test_anon(folio)) {
> zap_deposited_table(tlb->mm, pmd);
> @@ -2239,6 +2268,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_mark_accessed(folio);
> }
>
> + if (folio_is_device_private(folio)) {
> + folio_remove_rmap_pmd(folio, &folio->page, vma);
> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
> + folio_put(folio);
> + }
IIUC, a device-private THP is always anonymous, right? Would it make sense
to move this folio_is_device_private() block inside the folio_test_anon()
check above?
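For illustration, the move being asked about would look roughly like this
(untested; the surrounding anon-branch lines are only paraphrased):
	if (folio_test_anon(folio)) {
		zap_deposited_table(tlb->mm, pmd);
		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
		/* device-private THPs are always anonymous */
		if (folio_is_device_private(folio)) {
			folio_remove_rmap_pmd(folio, &folio->page, vma);
			WARN_ON_ONCE(folio_mapcount(folio) < 0);
			folio_put(folio);
		}
	}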
> +
> spin_unlock(ptl);
> if (flush_needed)
> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2367,7 +2402,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> struct folio *folio = pfn_swap_entry_folio(entry);
> pmd_t newpmd;
>
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> + VM_WARN_ON(!is_pmd_non_present_folio_entry(*pmd));
> if (is_writable_migration_entry(entry)) {
> /*
> * A protection check is difficult so
> @@ -2380,6 +2415,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> newpmd = swp_entry_to_pmd(entry);
> if (pmd_swp_soft_dirty(*pmd))
> newpmd = pmd_swp_mksoft_dirty(newpmd);
> + } else if (is_writable_device_private_entry(entry)) {
> + entry = make_readable_device_private_entry(swp_offset(entry));
> + newpmd = swp_entry_to_pmd(entry);
> } else {
> newpmd = *pmd;
> }
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..0c847cdf4fd3 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>
> if (pmdvalp)
> *pmdvalp = pmdval;
> - if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> + if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
> goto nomap;
> if (unlikely(pmd_trans_huge(pmdval)))
> goto nomap;
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 01/16] mm/zone_device: support large zone device private folios
2025-10-12 6:10 ` Lance Yang
@ 2025-10-12 22:54 ` Balbir Singh
0 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-12 22:54 UTC (permalink / raw)
To: Lance Yang
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast, Madhavan Srinivasan, Christophe Leroy,
Felix Kuehling, Alex Deucher, Christian König
On 10/12/25 17:10, Lance Yang wrote:
> Hi Balbir,
>
> Just one nit below :)
>
> On Wed, Oct 1, 2025 at 3:43 PM Balbir Singh <balbirs@nvidia.com> wrote:
>>
>> Add routines to support allocation of large order zone device folios
>> and helper functions for zone device folios, to check if a folio is
>> device private and helpers for setting zone device data.
>>
>> When large folios are used, the existing page_free() callback in
>> pgmap is called when the folio is freed, this is true for both
>> PAGE_SIZE and higher order pages.
>>
>> Zone device private large folios do not support deferred split and
>> scan like normal THP folios.
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Cc: Rakie Kim <rakie.kim@sk.com>
>> Cc: Byungchul Park <byungchul@sk.com>
>> Cc: Gregory Price <gourry@gourry.net>
>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
>> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: "Christian König" <christian.koenig@amd.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
>> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
>> drivers/gpu/drm/drm_pagemap.c | 2 +-
>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
>> include/linux/memremap.h | 10 ++++++++-
>> lib/test_hmm.c | 2 +-
>> mm/memremap.c | 26 ++++++++++++++----------
>> mm/rmap.c | 6 +++++-
>> 8 files changed, 34 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> index 03f8c34fa0a2..91f763410673 100644
>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> @@ -723,7 +723,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
>>
>> dpage = pfn_to_page(uvmem_pfn);
>> dpage->zone_device_data = pvt;
>> - zone_device_page_init(dpage);
>> + zone_device_page_init(dpage, 0);
>> return dpage;
>> out_clear:
>> spin_lock(&kvmppc_uvmem_bitmap_lock);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> index 79251f22b702..d0e2cae33035 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> @@ -217,7 +217,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
>> page = pfn_to_page(pfn);
>> svm_range_bo_ref(prange->svm_bo);
>> page->zone_device_data = prange->svm_bo;
>> - zone_device_page_init(page);
>> + zone_device_page_init(page, 0);
>> }
>>
>> static void
>> diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
>> index 1da55322af12..31c53f724e25 100644
>> --- a/drivers/gpu/drm/drm_pagemap.c
>> +++ b/drivers/gpu/drm/drm_pagemap.c
>> @@ -196,7 +196,7 @@ static void drm_pagemap_get_devmem_page(struct page *page,
>> struct drm_pagemap_zdd *zdd)
>> {
>> page->zone_device_data = drm_pagemap_zdd_get(zdd);
>> - zone_device_page_init(page);
>> + zone_device_page_init(page, 0);
>> }
>>
>> /**
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> index ca4932a150e3..53cc1926b9da 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> @@ -318,7 +318,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
>> return NULL;
>> }
>>
>> - zone_device_page_init(page);
>> + zone_device_page_init(page, 0);
>> return page;
>> }
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index e5951ba12a28..d2487a19cba2 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
>> }
>>
>> #ifdef CONFIG_ZONE_DEVICE
>> -void zone_device_page_init(struct page *page);
>> +void zone_device_page_init(struct page *page, unsigned int order);
>> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> void memunmap_pages(struct dev_pagemap *pgmap);
>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>> @@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
>> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>>
>> unsigned long memremap_compat_align(void);
>> +
>> +static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
>> +{
>> + zone_device_page_init(&folio->page, order);
>> + if (order)
>> + folio_set_large_rmappable(folio);
>> +}
>> +
>> #else
>> static inline void *devm_memremap_pages(struct device *dev,
>> struct dev_pagemap *pgmap)
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 83e3d8208a54..24d82121cde8 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -627,7 +627,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
>> goto error;
>> }
>>
>> - zone_device_page_init(dpage);
>> + zone_device_page_init(dpage, 0);
>> dpage->zone_device_data = rpage;
>> return dpage;
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 46cb1b0b6f72..e45dfb568710 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>> void free_zone_device_folio(struct folio *folio)
>> {
>> struct dev_pagemap *pgmap = folio->pgmap;
>> + unsigned long nr = folio_nr_pages(folio);
>> + int i;
>>
>> if (WARN_ON_ONCE(!pgmap))
>> return;
>>
>> mem_cgroup_uncharge(folio);
>>
>> - /*
>> - * Note: we don't expect anonymous compound pages yet. Once supported
>> - * and we could PTE-map them similar to THP, we'd have to clear
>> - * PG_anon_exclusive on all tail pages.
>> - */
>> if (folio_test_anon(folio)) {
>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> - __ClearPageAnonExclusive(folio_page(folio, 0));
>> + for (i = 0; i < nr; i++)
>> + __ClearPageAnonExclusive(folio_page(folio, i));
>> + } else {
>> + VM_WARN_ON_ONCE(folio_test_large(folio));
>> }
>>
>> /*
>> @@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
>> case MEMORY_DEVICE_COHERENT:
>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>> break;
>> - pgmap->ops->page_free(folio_page(folio, 0));
>> - put_dev_pagemap(pgmap);
>> + pgmap->ops->page_free(&folio->page);
>> + percpu_ref_put_many(&folio->pgmap->ref, nr);
>
> Nit: &pgmap->ref here for consistency?
Can be done, thanks!
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations
2025-10-12 15:46 ` Lance Yang
@ 2025-10-13 0:01 ` Balbir Singh
2025-10-13 1:48 ` Lance Yang
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-10-13 0:01 UTC (permalink / raw)
To: Lance Yang
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 10/13/25 02:46, Lance Yang wrote:
> On Wed, Oct 1, 2025 at 4:20 PM Balbir Singh <balbirs@nvidia.com> wrote:
>>
>> Extend core huge page management functions to handle device-private THP
>> entries. This enables proper handling of large device-private folios in
>> fundamental MM operations.
>>
>> The following functions have been updated:
>>
>> - copy_huge_pmd(): Handle device-private entries during fork/clone
>> - zap_huge_pmd(): Properly free device-private THP during munmap
>> - change_huge_pmd(): Support protection changes on device-private THP
>> - __pte_offset_map(): Add device-private entry awareness
>>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Cc: Rakie Kim <rakie.kim@sk.com>
>> Cc: Byungchul Park <byungchul@sk.com>
>> Cc: Gregory Price <gourry@gourry.net>
>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> include/linux/swapops.h | 32 +++++++++++++++++++++++
>> mm/huge_memory.c | 56 ++++++++++++++++++++++++++++++++++-------
>> mm/pgtable-generic.c | 2 +-
>> 3 files changed, 80 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 64ea151a7ae3..2687928a8146 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -594,10 +594,42 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>> }
>> #endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>
>> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>> +
>> +/**
>> + * is_pmd_device_private_entry() - Check if PMD contains a device private swap entry
>> + * @pmd: The PMD to check
>> + *
>> + * Returns true if the PMD contains a swap entry that represents a device private
>> + * page mapping. This is used for zone device private pages that have been
>> + * swapped out but still need special handling during various memory management
>> + * operations.
>> + *
>> + * Return: 1 if PMD contains device private entry, 0 otherwise
>> + */
>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>> +{
>> + return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
>> +}
>> +
>> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>> +
>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>> +{
>> + return 0;
>> +}
>> +
>> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>> +
>> static inline int non_swap_entry(swp_entry_t entry)
>> {
>> return swp_type(entry) >= MAX_SWAPFILES;
>> }
>>
>> +static inline int is_pmd_non_present_folio_entry(pmd_t pmd)
>> +{
>> + return is_pmd_migration_entry(pmd) || is_pmd_device_private_entry(pmd);
>> +}
>> +
>> #endif /* CONFIG_MMU */
>> #endif /* _LINUX_SWAPOPS_H */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 1b81680b4225..8e0a1747762d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1703,17 +1703,45 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> if (unlikely(is_swap_pmd(pmd))) {
>> swp_entry_t entry = pmd_to_swp_entry(pmd);
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
>> - if (!is_readable_migration_entry(entry)) {
>> - entry = make_readable_migration_entry(
>> - swp_offset(entry));
>> + VM_WARN_ON(!is_pmd_non_present_folio_entry(pmd));
>> +
>> + if (is_writable_migration_entry(entry) ||
>> + is_readable_exclusive_migration_entry(entry)) {
>> + entry = make_readable_migration_entry(swp_offset(entry));
>> pmd = swp_entry_to_pmd(entry);
>> if (pmd_swp_soft_dirty(*src_pmd))
>> pmd = pmd_swp_mksoft_dirty(pmd);
>> if (pmd_swp_uffd_wp(*src_pmd))
>> pmd = pmd_swp_mkuffd_wp(pmd);
>> set_pmd_at(src_mm, addr, src_pmd, pmd);
>> + } else if (is_device_private_entry(entry)) {
>> + /*
>> + * For device private entries, since there are no
>> + * read exclusive entries, writable = !readable
>> + */
>> + if (is_writable_device_private_entry(entry)) {
>> + entry = make_readable_device_private_entry(swp_offset(entry));
>> + pmd = swp_entry_to_pmd(entry);
>> +
>> + if (pmd_swp_soft_dirty(*src_pmd))
>> + pmd = pmd_swp_mksoft_dirty(pmd);
>> + if (pmd_swp_uffd_wp(*src_pmd))
>> + pmd = pmd_swp_mkuffd_wp(pmd);
>> + set_pmd_at(src_mm, addr, src_pmd, pmd);
>> + }
>> +
>> + src_folio = pfn_swap_entry_folio(entry);
>> + VM_WARN_ON(!folio_test_large(src_folio));
>> +
>> + folio_get(src_folio);
>> + /*
>> + * folio_try_dup_anon_rmap_pmd does not fail for
>> + * device private entries.
>> + */
>> + folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
>> + dst_vma, src_vma);
>> }
>> +
>> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>> mm_inc_nr_ptes(dst_mm);
>> pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> @@ -2211,15 +2239,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> folio_remove_rmap_pmd(folio, page, vma);
>> WARN_ON_ONCE(folio_mapcount(folio) < 0);
>> VM_BUG_ON_PAGE(!PageHead(page), page);
>> - } else if (thp_migration_supported()) {
>> + } else if (is_pmd_non_present_folio_entry(orig_pmd)) {
>> swp_entry_t entry;
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>> entry = pmd_to_swp_entry(orig_pmd);
>> folio = pfn_swap_entry_folio(entry);
>> flush_needed = 0;
>> - } else
>> - WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>> +
>> + if (!thp_migration_supported())
>> + WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>> + }
>>
>> if (folio_test_anon(folio)) {
>> zap_deposited_table(tlb->mm, pmd);
>> @@ -2239,6 +2268,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> folio_mark_accessed(folio);
>> }
>>
>> + if (folio_is_device_private(folio)) {
>> + folio_remove_rmap_pmd(folio, &folio->page, vma);
>> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
>> + folio_put(folio);
>> + }
>
> IIUC, a device-private THP is always anonymous, right? would it make sense
> to move this folio_is_device_private() block inside the folio_test_anon()
> check above?
>
Yes, they are; there is a discussion of file-backed mappings at
https://lwn.net/Articles/1016124/. I don't see a benefit from moving it, do you?
Balbir
[...]
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations
2025-10-13 0:01 ` Balbir Singh
@ 2025-10-13 1:48 ` Lance Yang
0 siblings, 0 replies; 75+ messages in thread
From: Lance Yang @ 2025-10-13 1:48 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 2025/10/13 08:01, Balbir Singh wrote:
> On 10/13/25 02:46, Lance Yang wrote:
>> On Wed, Oct 1, 2025 at 4:20 PM Balbir Singh <balbirs@nvidia.com> wrote:
>>>
>>> Extend core huge page management functions to handle device-private THP
>>> entries. This enables proper handling of large device-private folios in
>>> fundamental MM operations.
>>>
>>> The following functions have been updated:
>>>
>>> - copy_huge_pmd(): Handle device-private entries during fork/clone
>>> - zap_huge_pmd(): Properly free device-private THP during munmap
>>> - change_huge_pmd(): Support protection changes on device-private THP
>>> - __pte_offset_map(): Add device-private entry awareness
>>>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>> Cc: Byungchul Park <byungchul@sk.com>
>>> Cc: Gregory Price <gourry@gourry.net>
>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Acked-by: Zi Yan <ziy@nvidia.com>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>> include/linux/swapops.h | 32 +++++++++++++++++++++++
>>> mm/huge_memory.c | 56 ++++++++++++++++++++++++++++++++++-------
>>> mm/pgtable-generic.c | 2 +-
>>> 3 files changed, 80 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>>> index 64ea151a7ae3..2687928a8146 100644
>>> --- a/include/linux/swapops.h
>>> +++ b/include/linux/swapops.h
>>> @@ -594,10 +594,42 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>> }
>>> #endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>>
>>> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>>> +
>>> +/**
>>> + * is_pmd_device_private_entry() - Check if PMD contains a device private swap entry
>>> + * @pmd: The PMD to check
>>> + *
>>> + * Returns true if the PMD contains a swap entry that represents a device private
>>> + * page mapping. This is used for zone device private pages that have been
>>> + * swapped out but still need special handling during various memory management
>>> + * operations.
>>> + *
>>> + * Return: 1 if PMD contains device private entry, 0 otherwise
>>> + */
>>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>>> +{
>>> + return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
>>> +}
>>> +
>>> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>> +
>>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>>> +{
>>> + return 0;
>>> +}
>>> +
>>> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>> +
>>> static inline int non_swap_entry(swp_entry_t entry)
>>> {
>>> return swp_type(entry) >= MAX_SWAPFILES;
>>> }
>>>
>>> +static inline int is_pmd_non_present_folio_entry(pmd_t pmd)
>>> +{
>>> + return is_pmd_migration_entry(pmd) || is_pmd_device_private_entry(pmd);
>>> +}
>>> +
>>> #endif /* CONFIG_MMU */
>>> #endif /* _LINUX_SWAPOPS_H */
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 1b81680b4225..8e0a1747762d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1703,17 +1703,45 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>> if (unlikely(is_swap_pmd(pmd))) {
>>> swp_entry_t entry = pmd_to_swp_entry(pmd);
>>>
>>> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
>>> - if (!is_readable_migration_entry(entry)) {
>>> - entry = make_readable_migration_entry(
>>> - swp_offset(entry));
>>> + VM_WARN_ON(!is_pmd_non_present_folio_entry(pmd));
>>> +
>>> + if (is_writable_migration_entry(entry) ||
>>> + is_readable_exclusive_migration_entry(entry)) {
>>> + entry = make_readable_migration_entry(swp_offset(entry));
>>> pmd = swp_entry_to_pmd(entry);
>>> if (pmd_swp_soft_dirty(*src_pmd))
>>> pmd = pmd_swp_mksoft_dirty(pmd);
>>> if (pmd_swp_uffd_wp(*src_pmd))
>>> pmd = pmd_swp_mkuffd_wp(pmd);
>>> set_pmd_at(src_mm, addr, src_pmd, pmd);
>>> + } else if (is_device_private_entry(entry)) {
>>> + /*
>>> + * For device private entries, since there are no
>>> + * read exclusive entries, writable = !readable
>>> + */
>>> + if (is_writable_device_private_entry(entry)) {
>>> + entry = make_readable_device_private_entry(swp_offset(entry));
>>> + pmd = swp_entry_to_pmd(entry);
>>> +
>>> + if (pmd_swp_soft_dirty(*src_pmd))
>>> + pmd = pmd_swp_mksoft_dirty(pmd);
>>> + if (pmd_swp_uffd_wp(*src_pmd))
>>> + pmd = pmd_swp_mkuffd_wp(pmd);
>>> + set_pmd_at(src_mm, addr, src_pmd, pmd);
>>> + }
>>> +
>>> + src_folio = pfn_swap_entry_folio(entry);
>>> + VM_WARN_ON(!folio_test_large(src_folio));
>>> +
>>> + folio_get(src_folio);
>>> + /*
>>> + * folio_try_dup_anon_rmap_pmd does not fail for
>>> + * device private entries.
>>> + */
>>> + folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
>>> + dst_vma, src_vma);
>>> }
>>> +
>>> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>>> mm_inc_nr_ptes(dst_mm);
>>> pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>>> @@ -2211,15 +2239,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>> folio_remove_rmap_pmd(folio, page, vma);
>>> WARN_ON_ONCE(folio_mapcount(folio) < 0);
>>> VM_BUG_ON_PAGE(!PageHead(page), page);
>>> - } else if (thp_migration_supported()) {
>>> + } else if (is_pmd_non_present_folio_entry(orig_pmd)) {
>>> swp_entry_t entry;
>>>
>>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>>> entry = pmd_to_swp_entry(orig_pmd);
>>> folio = pfn_swap_entry_folio(entry);
>>> flush_needed = 0;
>>> - } else
>>> - WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>> +
>>> + if (!thp_migration_supported())
>>> + WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>> + }
>>>
>>> if (folio_test_anon(folio)) {
>>> zap_deposited_table(tlb->mm, pmd);
>>> @@ -2239,6 +2268,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>> folio_mark_accessed(folio);
>>> }
>>>
>>> + if (folio_is_device_private(folio)) {
>>> + folio_remove_rmap_pmd(folio, &folio->page, vma);
>>> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
>>> + folio_put(folio);
>>> + }
>>
>> IIUC, a device-private THP is always anonymous, right? would it make sense
>> to move this folio_is_device_private() block inside the folio_test_anon()
>> check above?
>>
> Yes, they are; there is a discussion of file-backed mappings at
> https://lwn.net/Articles/1016124/. I don't see a benefit from moving it, do you?
Ah, I see. Never mind :)
Cheers,
Lance
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-01 6:57 ` [v7 11/16] mm/migrate_device: add THP splitting during migration Balbir Singh
@ 2025-10-13 21:17 ` Zi Yan
2025-10-13 21:33 ` Balbir Singh
2025-10-19 8:19 ` Wei Yang
1 sibling, 1 reply; 75+ messages in thread
From: Zi Yan @ 2025-10-13 21:17 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 1 Oct 2025, at 2:57, Balbir Singh wrote:
> Implement migrate_vma_split_pages() to handle THP splitting during the
> migration process when destination cannot allocate compound pages.
>
> This addresses the common scenario where migrate_vma_setup() succeeds with
> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
> large pages during the migration phase.
>
> Key changes:
> - migrate_vma_split_pages(): Split already-isolated pages during migration
> - Enhanced folio_split() and __split_unmapped_folio() with isolated
> parameter to avoid redundant unmap/remap operations
>
> This provides a fallback mechanism to ensure migration succeeds even when
> large page allocation fails at the destination.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> Cc: Rakie Kim <rakie.kim@sk.com>
> Cc: Byungchul Park <byungchul@sk.com>
> Cc: Gregory Price <gourry@gourry.net>
> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/huge_mm.h | 11 +++++-
> lib/test_hmm.c | 9 +++++
> mm/huge_memory.c | 46 ++++++++++++----------
> mm/migrate_device.c | 85 +++++++++++++++++++++++++++++++++++------
> 4 files changed, 117 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2d669be7f1c8..a166be872628 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> vm_flags_t vm_flags);
>
> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> - unsigned int new_order);
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order, bool unmapped);
> int min_order_for_split(struct folio *folio);
> int split_folio_to_list(struct folio *folio, struct list_head *list);
> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
> bool warns);
> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
> struct list_head *list);
> +
> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order)
> +{
> + return __split_huge_page_to_list_to_order(page, list, new_order, false);
> +}
> +
> /*
> * try_folio_split - try to split a @folio at @page using non uniform split.
> * @folio: folio to be split
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 46fa9e200db8..df429670633e 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
> order = folio_order(page_folio(vmf->page));
> nr = 1 << order;
>
> + /*
> + * When folios are partially mapped, we can't rely on the folio
> + * order of vmf->page as the folio might not be fully split yet
> + */
> + if (vmf->pte) {
> + order = 0;
> + nr = 1;
> + }
> +
> /*
> * Consider a per-cpu cache of src and dst pfns, but with
> * large number of cpus that might not scale well.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8c95a658b3ec..022b0729f826 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> new_folio->mapping = folio->mapping;
> new_folio->index = folio->index + i;
>
> - /*
> - * page->private should not be set in tail pages. Fix up and warn once
> - * if private is unexpectedly set.
> - */
> - if (unlikely(new_folio->private)) {
> - VM_WARN_ON_ONCE_PAGE(true, new_head);
> - new_folio->private = NULL;
> - }
> -
> if (folio_test_swapcache(folio))
> new_folio->swap.val = folio->swap.val + i;
>
> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> * @lock_at: a page within @folio to be left locked to caller
> * @list: after-split folios will be put on it if non NULL
> * @uniform_split: perform uniform split or not (non-uniform split)
> + * @unmapped: The pages are already unmapped, they are migration entries.
> *
> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> * It is in charge of checking whether the split is supported or not and
> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> */
> static int __folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct page *lock_at,
> - struct list_head *list, bool uniform_split)
> + struct list_head *list, bool uniform_split, bool unmapped)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * is taken to serialise against parallel split or collapse
> * operations.
> */
> - anon_vma = folio_get_anon_vma(folio);
> - if (!anon_vma) {
> - ret = -EBUSY;
> - goto out;
> + if (!unmapped) {
> + anon_vma = folio_get_anon_vma(folio);
> + if (!anon_vma) {
> + ret = -EBUSY;
> + goto out;
> + }
> + anon_vma_lock_write(anon_vma);
> }
> mapping = NULL;
> - anon_vma_lock_write(anon_vma);
> } else {
> unsigned int min_order;
> gfp_t gfp;
> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> goto out_unlock;
> }
>
> - unmap_folio(folio);
> + if (!unmapped)
> + unmap_folio(folio);
>
> /* block interrupt reentry in xa_lock and spinlock */
> local_irq_disable();
> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
> next = folio_next(new_folio);
>
> + zone_device_private_split_cb(folio, new_folio);
> +
> expected_refs = folio_expected_ref_count(new_folio) + 1;
> folio_ref_unfreeze(new_folio, expected_refs);
>
> - lru_add_split_folio(folio, new_folio, lruvec, list);
> + if (!unmapped)
> + lru_add_split_folio(folio, new_folio, lruvec, list);
>
> /*
> * Anonymous folio with swap cache.
> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> __filemap_remove_folio(new_folio, NULL);
> folio_put_refs(new_folio, nr_pages);
> }
> +
> + zone_device_private_split_cb(folio, NULL);
> /*
> * Unfreeze @folio only after all page cache entries, which
> * used to point to it, have been updated with new folios.
> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
> local_irq_enable();
>
> + if (unmapped)
> + return ret;
> +
> if (nr_shmem_dropped)
> shmem_uncharge(mapping->host, nr_shmem_dropped);
>
> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * Returns -EINVAL when trying to split to an order that is incompatible
> * with the folio. Splitting to order 0 is compatible with all folios.
> */
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> - unsigned int new_order)
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order, bool unmapped)
> {
> struct folio *folio = page_folio(page);
>
> - return __folio_split(folio, new_order, &folio->page, page, list, true);
> + return __folio_split(folio, new_order, &folio->page, page, list, true,
> + unmapped);
> }
>
> /*
> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct list_head *list)
> {
> return __folio_split(folio, new_order, split_at, &folio->page, list,
> - false);
> + false, false);
> }
>
> int min_order_for_split(struct folio *folio)
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 4156fd6190d2..fa42d2ebd024 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> pgmap->owner != migrate->pgmap_owner)
> goto next;
>
> + folio = page_folio(page);
> + if (folio_test_large(folio)) {
> + int ret;
> +
> + pte_unmap_unlock(ptep, ptl);
> + ret = migrate_vma_split_folio(folio,
> + migrate->fault_page);
> +
> + if (ret) {
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> + goto next;
> + }
> +
> + addr = start;
> + goto again;
> + }
> +
> mpfn = migrate_pfn(page_to_pfn(page)) |
> MIGRATE_PFN_MIGRATE;
> if (is_writable_device_private_entry(entry))
> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> src[i] &= ~MIGRATE_PFN_MIGRATE;
> return 0;
> }
> +
> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
> + unsigned long idx, unsigned long addr,
> + struct folio *folio)
> +{
> + unsigned long i;
> + unsigned long pfn;
> + unsigned long flags;
> + int ret = 0;
> +
> + folio_get(folio);
> + split_huge_pmd_address(migrate->vma, addr, true);
> + ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
> + 0, true);
Why not just call __split_unmapped_folio() here? Then, you do not need to add
a new unmapped parameter in __folio_split().
> + if (ret)
> + return ret;
> + migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
> + flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
> + pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
> + for (i = 1; i < HPAGE_PMD_NR; i++)
> + migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
> + return ret;
> +}
> #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
> static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> unsigned long addr,
> @@ -889,6 +929,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> {
> return 0;
> }
> +
> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
> + unsigned long idx, unsigned long addr,
> + struct folio *folio)
> +{
> + return 0;
> +}
> #endif
>
> static unsigned long migrate_vma_nr_pages(unsigned long *src)
> @@ -1050,8 +1097,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> struct migrate_vma *migrate)
> {
> struct mmu_notifier_range range;
> - unsigned long i;
> + unsigned long i, j;
> bool notified = false;
> + unsigned long addr;
>
> for (i = 0; i < npages; ) {
> struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
> @@ -1093,12 +1141,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
> nr = migrate_vma_nr_pages(&src_pfns[i]);
> src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - goto next;
> + } else {
> + nr = 1;
> }
>
> - migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
> - &src_pfns[i]);
> + for (j = 0; j < nr && i + j < npages; j++) {
> + src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
> + migrate_vma_insert_page(migrate,
> + addr + j * PAGE_SIZE,
> + &dst_pfns[i+j], &src_pfns[i+j]);
> + }
> goto next;
> }
>
> @@ -1120,7 +1172,13 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> MIGRATE_PFN_COMPOUND);
> goto next;
> }
> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> + nr = 1 << folio_order(folio);
> + addr = migrate->start + i * PAGE_SIZE;
> + if (migrate_vma_split_unmapped_folio(migrate, i, addr, folio)) {
> + src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
> + MIGRATE_PFN_COMPOUND);
> + goto next;
> + }
> } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
> (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> @@ -1156,11 +1214,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>
> if (migrate && migrate->fault_page == page)
> extra_cnt = 1;
> - r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
> - if (r)
> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - else
> - folio_migrate_flags(newfolio, folio);
> + for (j = 0; j < nr && i + j < npages; j++) {
> + folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
> + newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
> +
> + r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
> + if (r)
> + src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
> + else
> + folio_migrate_flags(newfolio, folio);
> + }
> next:
> i += nr;
> }
> --
> 2.51.0
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-13 21:17 ` Zi Yan
@ 2025-10-13 21:33 ` Balbir Singh
2025-10-13 21:55 ` Zi Yan
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-10-13 21:33 UTC (permalink / raw)
To: Zi Yan
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 10/14/25 08:17, Zi Yan wrote:
> On 1 Oct 2025, at 2:57, Balbir Singh wrote:
>
>> Implement migrate_vma_split_pages() to handle THP splitting during the
>> migration process when destination cannot allocate compound pages.
>>
>> This addresses the common scenario where migrate_vma_setup() succeeds with
>> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
>> large pages during the migration phase.
>>
>> Key changes:
>> - migrate_vma_split_pages(): Split already-isolated pages during migration
>> - Enhanced folio_split() and __split_unmapped_folio() with isolated
>> parameter to avoid redundant unmap/remap operations
>>
>> This provides a fallback mechanism to ensure migration succeeds even when
>> large page allocation fails at the destination.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Cc: Rakie Kim <rakie.kim@sk.com>
>> Cc: Byungchul Park <byungchul@sk.com>
>> Cc: Gregory Price <gourry@gourry.net>
>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 11 +++++-
>> lib/test_hmm.c | 9 +++++
>> mm/huge_memory.c | 46 ++++++++++++----------
>> mm/migrate_device.c | 85 +++++++++++++++++++++++++++++++++++------
>> 4 files changed, 117 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2d669be7f1c8..a166be872628 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>> vm_flags_t vm_flags);
>>
>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> - unsigned int new_order);
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> + unsigned int new_order, bool unmapped);
>> int min_order_for_split(struct folio *folio);
>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>> bool warns);
>> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>> struct list_head *list);
>> +
>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> + unsigned int new_order)
>> +{
>> + return __split_huge_page_to_list_to_order(page, list, new_order, false);
>> +}
>> +
>> /*
>> * try_folio_split - try to split a @folio at @page using non uniform split.
>> * @folio: folio to be split
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 46fa9e200db8..df429670633e 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>> order = folio_order(page_folio(vmf->page));
>> nr = 1 << order;
>>
>> + /*
>> + * When folios are partially mapped, we can't rely on the folio
>> + * order of vmf->page as the folio might not be fully split yet
>> + */
>> + if (vmf->pte) {
>> + order = 0;
>> + nr = 1;
>> + }
>> +
>> /*
>> * Consider a per-cpu cache of src and dst pfns, but with
>> * large number of cpus that might not scale well.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8c95a658b3ec..022b0729f826 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>> new_folio->mapping = folio->mapping;
>> new_folio->index = folio->index + i;
>>
>> - /*
>> - * page->private should not be set in tail pages. Fix up and warn once
>> - * if private is unexpectedly set.
>> - */
>> - if (unlikely(new_folio->private)) {
>> - VM_WARN_ON_ONCE_PAGE(true, new_head);
>> - new_folio->private = NULL;
>> - }
>> -
>> if (folio_test_swapcache(folio))
>> new_folio->swap.val = folio->swap.val + i;
>>
>> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> * @lock_at: a page within @folio to be left locked to caller
>> * @list: after-split folios will be put on it if non NULL
>> * @uniform_split: perform uniform split or not (non-uniform split)
>> + * @unmapped: The pages are already unmapped, they are migration entries.
>> *
>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>> * It is in charge of checking whether the split is supported or not and
>> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> */
>> static int __folio_split(struct folio *folio, unsigned int new_order,
>> struct page *split_at, struct page *lock_at,
>> - struct list_head *list, bool uniform_split)
>> + struct list_head *list, bool uniform_split, bool unmapped)
>> {
>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> * is taken to serialise against parallel split or collapse
>> * operations.
>> */
>> - anon_vma = folio_get_anon_vma(folio);
>> - if (!anon_vma) {
>> - ret = -EBUSY;
>> - goto out;
>> + if (!unmapped) {
>> + anon_vma = folio_get_anon_vma(folio);
>> + if (!anon_vma) {
>> + ret = -EBUSY;
>> + goto out;
>> + }
>> + anon_vma_lock_write(anon_vma);
>> }
>> mapping = NULL;
>> - anon_vma_lock_write(anon_vma);
>> } else {
>> unsigned int min_order;
>> gfp_t gfp;
>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> goto out_unlock;
>> }
>>
>> - unmap_folio(folio);
>> + if (!unmapped)
>> + unmap_folio(folio);
>>
>> /* block interrupt reentry in xa_lock and spinlock */
>> local_irq_disable();
>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> next = folio_next(new_folio);
>>
>> + zone_device_private_split_cb(folio, new_folio);
>> +
>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>> folio_ref_unfreeze(new_folio, expected_refs);
>>
>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>> + if (!unmapped)
>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>
>> /*
>> * Anonymous folio with swap cache.
>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> __filemap_remove_folio(new_folio, NULL);
>> folio_put_refs(new_folio, nr_pages);
>> }
>> +
>> + zone_device_private_split_cb(folio, NULL);
>> /*
>> * Unfreeze @folio only after all page cache entries, which
>> * used to point to it, have been updated with new folios.
>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> local_irq_enable();
>>
>> + if (unmapped)
>> + return ret;
>> +
>> if (nr_shmem_dropped)
>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>
>> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> * Returns -EINVAL when trying to split to an order that is incompatible
>> * with the folio. Splitting to order 0 is compatible with all folios.
>> */
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> - unsigned int new_order)
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> + unsigned int new_order, bool unmapped)
>> {
>> struct folio *folio = page_folio(page);
>>
>> - return __folio_split(folio, new_order, &folio->page, page, list, true);
>> + return __folio_split(folio, new_order, &folio->page, page, list, true,
>> + unmapped);
>> }
>>
>> /*
>> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>> struct page *split_at, struct list_head *list)
>> {
>> return __folio_split(folio, new_order, split_at, &folio->page, list,
>> - false);
>> + false, false);
>> }
>>
>> int min_order_for_split(struct folio *folio)
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 4156fd6190d2..fa42d2ebd024 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> pgmap->owner != migrate->pgmap_owner)
>> goto next;
>>
>> + folio = page_folio(page);
>> + if (folio_test_large(folio)) {
>> + int ret;
>> +
>> + pte_unmap_unlock(ptep, ptl);
>> + ret = migrate_vma_split_folio(folio,
>> + migrate->fault_page);
>> +
>> + if (ret) {
>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> + goto next;
>> + }
>> +
>> + addr = start;
>> + goto again;
>> + }
>> +
>> mpfn = migrate_pfn(page_to_pfn(page)) |
>> MIGRATE_PFN_MIGRATE;
>> if (is_writable_device_private_entry(entry))
>> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>> src[i] &= ~MIGRATE_PFN_MIGRATE;
>> return 0;
>> }
>> +
>> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>> + unsigned long idx, unsigned long addr,
>> + struct folio *folio)
>> +{
>> + unsigned long i;
>> + unsigned long pfn;
>> + unsigned long flags;
>> + int ret = 0;
>> +
>> + folio_get(folio);
>> + split_huge_pmd_address(migrate->vma, addr, true);
>> + ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
>> + 0, true);
>
> Why not just call __split_unmapped_folio() here? Then, you do not need to add
> a new unmapped parameter in __folio_split().
>
>
The benefit comes from the ref count checks and the freeze/unfreeze (common code) in
__folio_split(), and also from the callbacks made to the drivers on folio split.
These paths are required for both mapped and unmapped folios.
Otherwise we'd have to replicate that logic and those checks for unmapped folios
and handle the post-split processing again.
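(In outline, and as a simplified sketch of the hunks above rather than the
literal code, the shared and skipped steps in __folio_split() are:)

	/*
	 * __folio_split(folio, new_order, split_at, lock_at, list,
	 *		 uniform_split, unmapped):
	 *
	 *	if (!unmapped)
	 *		pin anon_vma and unmap_folio()		<- skipped for migration
	 *	refcount checks + folio freeze			<- shared
	 *	__split_unmapped_folio()			<- shared
	 *	for each after-split folio:
	 *		zone_device_private_split_cb()		<- shared driver callback
	 *		folio_ref_unfreeze()			<- shared
	 *		if (!unmapped)
	 *			lru_add_split_folio()		<- skipped for migration
	 *	if (unmapped)
	 *		return					<- no remap / shmem_uncharge()
	 */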
[...]
Thanks,
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-13 21:33 ` Balbir Singh
@ 2025-10-13 21:55 ` Zi Yan
2025-10-13 22:50 ` Balbir Singh
0 siblings, 1 reply; 75+ messages in thread
From: Zi Yan @ 2025-10-13 21:55 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 13 Oct 2025, at 17:33, Balbir Singh wrote:
> On 10/14/25 08:17, Zi Yan wrote:
>> On 1 Oct 2025, at 2:57, Balbir Singh wrote:
>>
>>> Implement migrate_vma_split_pages() to handle THP splitting during the
>>> migration process when destination cannot allocate compound pages.
>>>
>>> This addresses the common scenario where migrate_vma_setup() succeeds with
>>> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
>>> large pages during the migration phase.
>>>
>>> Key changes:
>>> - migrate_vma_split_pages(): Split already-isolated pages during migration
>>> - Enhanced folio_split() and __split_unmapped_folio() with isolated
>>> parameter to avoid redundant unmap/remap operations
>>>
>>> This provides a fallback mechanism to ensure migration succeeds even when
>>> large page allocation fails at the destination.
>>>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>> Cc: Byungchul Park <byungchul@sk.com>
>>> Cc: Gregory Price <gourry@gourry.net>
>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>> include/linux/huge_mm.h | 11 +++++-
>>> lib/test_hmm.c | 9 +++++
>>> mm/huge_memory.c | 46 ++++++++++++----------
>>> mm/migrate_device.c | 85 +++++++++++++++++++++++++++++++++++------
>>> 4 files changed, 117 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2d669be7f1c8..a166be872628 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>> vm_flags_t vm_flags);
>>>
>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> - unsigned int new_order);
>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> + unsigned int new_order, bool unmapped);
>>> int min_order_for_split(struct folio *folio);
>>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>>> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> bool warns);
>>> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>> struct list_head *list);
>>> +
>>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> + unsigned int new_order)
>>> +{
>>> + return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>> +}
>>> +
>>> /*
>>> * try_folio_split - try to split a @folio at @page using non uniform split.
>>> * @folio: folio to be split
>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>> index 46fa9e200db8..df429670633e 100644
>>> --- a/lib/test_hmm.c
>>> +++ b/lib/test_hmm.c
>>> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>> order = folio_order(page_folio(vmf->page));
>>> nr = 1 << order;
>>>
>>> + /*
>>> + * When folios are partially mapped, we can't rely on the folio
>>> + * order of vmf->page as the folio might not be fully split yet
>>> + */
>>> + if (vmf->pte) {
>>> + order = 0;
>>> + nr = 1;
>>> + }
>>> +
>>> /*
>>> * Consider a per-cpu cache of src and dst pfns, but with
>>> * large number of cpus that might not scale well.
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 8c95a658b3ec..022b0729f826 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>> new_folio->mapping = folio->mapping;
>>> new_folio->index = folio->index + i;
>>>
>>> - /*
>>> - * page->private should not be set in tail pages. Fix up and warn once
>>> - * if private is unexpectedly set.
>>> - */
>>> - if (unlikely(new_folio->private)) {
>>> - VM_WARN_ON_ONCE_PAGE(true, new_head);
>>> - new_folio->private = NULL;
>>> - }
>>> -
>>> if (folio_test_swapcache(folio))
>>> new_folio->swap.val = folio->swap.val + i;
>>>
>>> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> * @lock_at: a page within @folio to be left locked to caller
>>> * @list: after-split folios will be put on it if non NULL
>>> * @uniform_split: perform uniform split or not (non-uniform split)
>>> + * @unmapped: The pages are already unmapped, they are migration entries.
>>> *
>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>> * It is in charge of checking whether the split is supported or not and
>>> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> */
>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>> struct page *split_at, struct page *lock_at,
>>> - struct list_head *list, bool uniform_split)
>>> + struct list_head *list, bool uniform_split, bool unmapped)
>>> {
>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> * is taken to serialise against parallel split or collapse
>>> * operations.
>>> */
>>> - anon_vma = folio_get_anon_vma(folio);
>>> - if (!anon_vma) {
>>> - ret = -EBUSY;
>>> - goto out;
>>> + if (!unmapped) {
>>> + anon_vma = folio_get_anon_vma(folio);
>>> + if (!anon_vma) {
>>> + ret = -EBUSY;
>>> + goto out;
>>> + }
>>> + anon_vma_lock_write(anon_vma);
>>> }
>>> mapping = NULL;
>>> - anon_vma_lock_write(anon_vma);
>>> } else {
>>> unsigned int min_order;
>>> gfp_t gfp;
>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> goto out_unlock;
>>> }
>>>
>>> - unmap_folio(folio);
>>> + if (!unmapped)
>>> + unmap_folio(folio);
>>>
>>> /* block interrupt reentry in xa_lock and spinlock */
>>> local_irq_disable();
>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> next = folio_next(new_folio);
>>>
>>> + zone_device_private_split_cb(folio, new_folio);
>>> +
>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>
>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>> + if (!unmapped)
>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>
>>> /*
>>> * Anonymous folio with swap cache.
>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> __filemap_remove_folio(new_folio, NULL);
>>> folio_put_refs(new_folio, nr_pages);
>>> }
>>> +
>>> + zone_device_private_split_cb(folio, NULL);
>>> /*
>>> * Unfreeze @folio only after all page cache entries, which
>>> * used to point to it, have been updated with new folios.
>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> local_irq_enable();
>>>
>>> + if (unmapped)
>>> + return ret;
>>> +
>>> if (nr_shmem_dropped)
>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>
>>> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> * Returns -EINVAL when trying to split to an order that is incompatible
>>> * with the folio. Splitting to order 0 is compatible with all folios.
>>> */
>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> - unsigned int new_order)
>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> + unsigned int new_order, bool unmapped)
>>> {
>>> struct folio *folio = page_folio(page);
>>>
>>> - return __folio_split(folio, new_order, &folio->page, page, list, true);
>>> + return __folio_split(folio, new_order, &folio->page, page, list, true,
>>> + unmapped);
>>> }
>>>
>>> /*
>>> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>> struct page *split_at, struct list_head *list)
>>> {
>>> return __folio_split(folio, new_order, split_at, &folio->page, list,
>>> - false);
>>> + false, false);
>>> }
>>>
>>> int min_order_for_split(struct folio *folio)
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 4156fd6190d2..fa42d2ebd024 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> pgmap->owner != migrate->pgmap_owner)
>>> goto next;
>>>
>>> + folio = page_folio(page);
>>> + if (folio_test_large(folio)) {
>>> + int ret;
>>> +
>>> + pte_unmap_unlock(ptep, ptl);
>>> + ret = migrate_vma_split_folio(folio,
>>> + migrate->fault_page);
>>> +
>>> + if (ret) {
>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>> + goto next;
>>> + }
>>> +
>>> + addr = start;
>>> + goto again;
>>> + }
>>> +
>>> mpfn = migrate_pfn(page_to_pfn(page)) |
>>> MIGRATE_PFN_MIGRATE;
>>> if (is_writable_device_private_entry(entry))
>>> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>> src[i] &= ~MIGRATE_PFN_MIGRATE;
>>> return 0;
>>> }
>>> +
>>> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>> + unsigned long idx, unsigned long addr,
>>> + struct folio *folio)
>>> +{
>>> + unsigned long i;
>>> + unsigned long pfn;
>>> + unsigned long flags;
>>> + int ret = 0;
>>> +
>>> + folio_get(folio);
>>> + split_huge_pmd_address(migrate->vma, addr, true);
>>> + ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
>>> + 0, true);
>>
>> Why not just call __split_unmapped_folio() here? Then, you do not need to add
>> a new unmapped parameter in __folio_split().
>>
>>
>
> The benefit comes from the ref count checks and the freeze/unfreeze (common code) in
> __folio_split(), and also from the callbacks made to the drivers on folio split.
> These paths are required for both mapped and unmapped folios.
>
> Otherwise we'd have to replicate that logic and those checks for unmapped folios
> and handle the post-split processing again.
Replicating the freeze/unfreeze code would be much better than adding an unmapped
parameter and a new path in __folio_split(). When it comes to adding support for
file-backed folios, are you going to use the unmapped parameter to guard the
file-backed code in __folio_split()? Just keep piling up special paths?
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-13 21:55 ` Zi Yan
@ 2025-10-13 22:50 ` Balbir Singh
0 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-13 22:50 UTC (permalink / raw)
To: Zi Yan
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 10/14/25 08:55, Zi Yan wrote:
> On 13 Oct 2025, at 17:33, Balbir Singh wrote:
>
>> On 10/14/25 08:17, Zi Yan wrote:
>>> On 1 Oct 2025, at 2:57, Balbir Singh wrote:
>>>
>>>> Implement migrate_vma_split_pages() to handle THP splitting during the
>>>> migration process when destination cannot allocate compound pages.
>>>>
>>>> This addresses the common scenario where migrate_vma_setup() succeeds with
>>>> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
>>>> large pages during the migration phase.
>>>>
>>>> Key changes:
>>>> - migrate_vma_split_pages(): Split already-isolated pages during migration
>>>> - Enhanced folio_split() and __split_unmapped_folio() with isolated
>>>> parameter to avoid redundant unmap/remap operations
>>>>
>>>> This provides a fallback mechanism to ensure migration succeeds even when
>>>> large page allocation fails at the destination.
>>>>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>>> Cc: Byungchul Park <byungchul@sk.com>
>>>> Cc: Gregory Price <gourry@gourry.net>
>>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Oscar Salvador <osalvador@suse.de>
>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>>> Cc: Nico Pache <npache@redhat.com>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Dev Jain <dev.jain@arm.com>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: David Airlie <airlied@gmail.com>
>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>> include/linux/huge_mm.h | 11 +++++-
>>>> lib/test_hmm.c | 9 +++++
>>>> mm/huge_memory.c | 46 ++++++++++++----------
>>>> mm/migrate_device.c | 85 +++++++++++++++++++++++++++++++++++------
>>>> 4 files changed, 117 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 2d669be7f1c8..a166be872628 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>> vm_flags_t vm_flags);
>>>>
>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> - unsigned int new_order);
>>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> + unsigned int new_order, bool unmapped);
>>>> int min_order_for_split(struct folio *folio);
>>>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> bool warns);
>>>> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>> struct list_head *list);
>>>> +
>>>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> + unsigned int new_order)
>>>> +{
>>>> + return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>>> +}
>>>> +
>>>> /*
>>>> * try_folio_split - try to split a @folio at @page using non uniform split.
>>>> * @folio: folio to be split
>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>> index 46fa9e200db8..df429670633e 100644
>>>> --- a/lib/test_hmm.c
>>>> +++ b/lib/test_hmm.c
>>>> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>> order = folio_order(page_folio(vmf->page));
>>>> nr = 1 << order;
>>>>
>>>> + /*
>>>> + * When folios are partially mapped, we can't rely on the folio
>>>> + * order of vmf->page as the folio might not be fully split yet
>>>> + */
>>>> + if (vmf->pte) {
>>>> + order = 0;
>>>> + nr = 1;
>>>> + }
>>>> +
>>>> /*
>>>> * Consider a per-cpu cache of src and dst pfns, but with
>>>> * large number of cpus that might not scale well.
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 8c95a658b3ec..022b0729f826 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>> new_folio->mapping = folio->mapping;
>>>> new_folio->index = folio->index + i;
>>>>
>>>> - /*
>>>> - * page->private should not be set in tail pages. Fix up and warn once
>>>> - * if private is unexpectedly set.
>>>> - */
>>>> - if (unlikely(new_folio->private)) {
>>>> - VM_WARN_ON_ONCE_PAGE(true, new_head);
>>>> - new_folio->private = NULL;
>>>> - }
>>>> -
>>>> if (folio_test_swapcache(folio))
>>>> new_folio->swap.val = folio->swap.val + i;
>>>>
>>>> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> * @lock_at: a page within @folio to be left locked to caller
>>>> * @list: after-split folios will be put on it if non NULL
>>>> * @uniform_split: perform uniform split or not (non-uniform split)
>>>> + * @unmapped: The pages are already unmapped, they are migration entries.
>>>> *
>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>> * It is in charge of checking whether the split is supported or not and
>>>> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> */
>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> struct page *split_at, struct page *lock_at,
>>>> - struct list_head *list, bool uniform_split)
>>>> + struct list_head *list, bool uniform_split, bool unmapped)
>>>> {
>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> * is taken to serialise against parallel split or collapse
>>>> * operations.
>>>> */
>>>> - anon_vma = folio_get_anon_vma(folio);
>>>> - if (!anon_vma) {
>>>> - ret = -EBUSY;
>>>> - goto out;
>>>> + if (!unmapped) {
>>>> + anon_vma = folio_get_anon_vma(folio);
>>>> + if (!anon_vma) {
>>>> + ret = -EBUSY;
>>>> + goto out;
>>>> + }
>>>> + anon_vma_lock_write(anon_vma);
>>>> }
>>>> mapping = NULL;
>>>> - anon_vma_lock_write(anon_vma);
>>>> } else {
>>>> unsigned int min_order;
>>>> gfp_t gfp;
>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> goto out_unlock;
>>>> }
>>>>
>>>> - unmap_folio(folio);
>>>> + if (!unmapped)
>>>> + unmap_folio(folio);
>>>>
>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>> local_irq_disable();
>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>> next = folio_next(new_folio);
>>>>
>>>> + zone_device_private_split_cb(folio, new_folio);
>>>> +
>>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>>
>>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>>> + if (!unmapped)
>>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>
>>>> /*
>>>> * Anonymous folio with swap cache.
>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> __filemap_remove_folio(new_folio, NULL);
>>>> folio_put_refs(new_folio, nr_pages);
>>>> }
>>>> +
>>>> + zone_device_private_split_cb(folio, NULL);
>>>> /*
>>>> * Unfreeze @folio only after all page cache entries, which
>>>> * used to point to it, have been updated with new folios.
>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>> local_irq_enable();
>>>>
>>>> + if (unmapped)
>>>> + return ret;
>>>> +
>>>> if (nr_shmem_dropped)
>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>
>>>> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> * Returns -EINVAL when trying to split to an order that is incompatible
>>>> * with the folio. Splitting to order 0 is compatible with all folios.
>>>> */
>>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> - unsigned int new_order)
>>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> + unsigned int new_order, bool unmapped)
>>>> {
>>>> struct folio *folio = page_folio(page);
>>>>
>>>> - return __folio_split(folio, new_order, &folio->page, page, list, true);
>>>> + return __folio_split(folio, new_order, &folio->page, page, list, true,
>>>> + unmapped);
>>>> }
>>>>
>>>> /*
>>>> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>>> struct page *split_at, struct list_head *list)
>>>> {
>>>> return __folio_split(folio, new_order, split_at, &folio->page, list,
>>>> - false);
>>>> + false, false);
>>>> }
>>>>
>>>> int min_order_for_split(struct folio *folio)
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 4156fd6190d2..fa42d2ebd024 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>> pgmap->owner != migrate->pgmap_owner)
>>>> goto next;
>>>>
>>>> + folio = page_folio(page);
>>>> + if (folio_test_large(folio)) {
>>>> + int ret;
>>>> +
>>>> + pte_unmap_unlock(ptep, ptl);
>>>> + ret = migrate_vma_split_folio(folio,
>>>> + migrate->fault_page);
>>>> +
>>>> + if (ret) {
>>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>> + goto next;
>>>> + }
>>>> +
>>>> + addr = start;
>>>> + goto again;
>>>> + }
>>>> +
>>>> mpfn = migrate_pfn(page_to_pfn(page)) |
>>>> MIGRATE_PFN_MIGRATE;
>>>> if (is_writable_device_private_entry(entry))
>>>> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>>> src[i] &= ~MIGRATE_PFN_MIGRATE;
>>>> return 0;
>>>> }
>>>> +
>>>> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>>> + unsigned long idx, unsigned long addr,
>>>> + struct folio *folio)
>>>> +{
>>>> + unsigned long i;
>>>> + unsigned long pfn;
>>>> + unsigned long flags;
>>>> + int ret = 0;
>>>> +
>>>> + folio_get(folio);
>>>> + split_huge_pmd_address(migrate->vma, addr, true);
>>>> + ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
>>>> + 0, true);
>>>
>>> Why not just call __split_unmapped_folio() here? Then, you do not need to add
>>> a new unmapped parameter in __folio_split().
>>>
>>>
>>
>> The benefit comes from the ref count checks and the freeze/unfreeze (common code) in
>> __folio_split(), and also from the callbacks made to the drivers on folio split.
>> These paths are required for both mapped and unmapped folios.
>>
>> Otherwise we'd have to replicate that logic and those checks for unmapped folios
>> and handle the post-split processing again.
>
> Replicating the freeze/unfreeze code would be much better than adding an unmapped
> parameter and a new path in __folio_split(). When it comes to adding support for
> file-backed folios, are you going to use the unmapped parameter to guard the
> file-backed code in __folio_split()? Just keep piling up special paths?
>
Adding file-backed support would require more code duplication, hence the aim to reuse
as much as possible. I am happy to work towards refactoring the code to separate out
the unmapped path as a follow-on patch to the series.
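(Very roughly, such a follow-on could give the already-unmapped case its own
entry point so that __folio_split() no longer needs the flag. The sketch below
is purely hypothetical: the helper name is made up, and the low-level split,
the per-folio driver callbacks and the unfreeze are elided.)

	/* Hypothetical follow-on shape, not part of this series: */
	static int split_unmapped_device_folio(struct folio *folio)
	{
		int expected_refs = folio_expected_ref_count(folio) + 1;

		/*
		 * The folio's page table entries are already migration
		 * entries, so no anon_vma pinning, unmap_folio() or LRU
		 * handling is needed; only the refcount check and freeze,
		 * the low-level split, the zone_device_private_split_cb()
		 * notifications and the unfreeze have to happen here.
		 */
		if (!folio_ref_freeze(folio, expected_refs))
			return -EAGAIN;

		/* ... low-level split + driver callbacks + unfreeze ... */

		return 0;
	}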
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-10-09 10:33 ` Matthew Brost
@ 2025-10-13 22:51 ` Balbir Singh
2025-11-11 23:43 ` Andrew Morton
1 sibling, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-13 22:51 UTC (permalink / raw)
To: Matthew Brost
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On 10/9/25 21:33, Matthew Brost wrote:
> On Thu, Oct 09, 2025 at 02:26:30PM +1100, Balbir Singh wrote:
>> On 10/9/25 14:17, Andrew Morton wrote:
>>> On Wed, 1 Oct 2025 16:56:51 +1000 Balbir Singh <balbirs@nvidia.com> wrote:
>>>
>>>> This patch series introduces support for Transparent Huge Page
>>>> (THP) migration in zone device-private memory. The implementation enables
>>>> efficient migration of large folios between system memory and
>>>> device-private memory
>>>
>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
>>> good sign.
>>>
>>
>> I hope so too, I've tried to address the comments in v6.
>>
>
> Circling back to this series, we will integrate and test this version.
>
Looking forward to your feedback.
>>>>
>>>> HMM support for large folios, patches are already posted and in
>>>> mm-unstable.
>>>
>>> Not any more. Which series was this?
>>
>> Not a series, but a patch
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?id=10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1
>
> I think this [1] means this patch is in Linus's tree?
>
> Matt
>
> [1] https://github.com/torvalds/linux/commit/10b9feee2d0dc81c44f7a9e69e7a894e33f8c4a1
>
Thanks!
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* linux-next: KVM/s390x regression (was: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations)
2025-10-01 6:56 ` [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations Balbir Singh
2025-10-12 15:46 ` Lance Yang
@ 2025-10-17 14:49 ` Christian Borntraeger
2025-10-17 14:54 ` linux-next: KVM/s390x regression David Hildenbrand
1 sibling, 1 reply; 75+ messages in thread
From: Christian Borntraeger @ 2025-10-17 14:49 UTC (permalink / raw)
To: balbirs
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, david, dev.jain, dri-devel, francois.dugast,
gourry, joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes,
lyude, matthew.brost, mpenttil, npache, osalvador, rakie.kim,
rcampbell, ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
This patch triggers a regression for s390x kvm as qemu guests can no longer start
error: kvm run failed Cannot allocate memory
PSW=mask 0000000180000000 addr 000000007fd00600
R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
Christian Borntraeger
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 14:49 ` linux-next: KVM/s390x regression (was: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations) Christian Borntraeger
@ 2025-10-17 14:54 ` David Hildenbrand
2025-10-17 15:01 ` Christian Borntraeger
0 siblings, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-10-17 14:54 UTC (permalink / raw)
To: Christian Borntraeger, balbirs
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
On 17.10.25 16:49, Christian Borntraeger wrote:
> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>
> error: kvm run failed Cannot allocate memory
> PSW=mask 0000000180000000 addr 000000007fd00600
> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>
> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
Only when running KVM guests and apart from that everything else seems
to be fine?
That's weird :)
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 14:54 ` linux-next: KVM/s390x regression David Hildenbrand
@ 2025-10-17 15:01 ` Christian Borntraeger
2025-10-17 15:07 ` David Hildenbrand
0 siblings, 1 reply; 75+ messages in thread
From: Christian Borntraeger @ 2025-10-17 15:01 UTC (permalink / raw)
To: David Hildenbrand, balbirs
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
Am 17.10.25 um 16:54 schrieb David Hildenbrand:
> On 17.10.25 16:49, Christian Borntraeger wrote:
>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>
>> error: kvm run failed Cannot allocate memory
>> PSW=mask 0000000180000000 addr 000000007fd00600
>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>
>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>
> Only when running KVM guests and apart from that everything else seems to be fine?
We have other weirdness in linux-next but in different areas. Could that somehow be
related to us disabling THP for the kvm address space?
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 15:01 ` Christian Borntraeger
@ 2025-10-17 15:07 ` David Hildenbrand
2025-10-17 15:20 ` Christian Borntraeger
0 siblings, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-10-17 15:07 UTC (permalink / raw)
To: Christian Borntraeger, balbirs
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
On 17.10.25 17:01, Christian Borntraeger wrote:
> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>
>>> error: kvm run failed Cannot allocate memory
>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>
>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>
>> Only when running KVM guests and apart from that everything else seems to be fine?
>
> We have other weirdness in linux-next but in different areas. Could that somehow be
> related to use disabling THP for the kvm address space?
Not sure ... it's a bit weird. I mean, when KVM disables THPs we
essentially just remap everything to be mapped by PTEs. So there
shouldn't be any PMDs in that whole process.
Remapping a file THP (shmem) implies zapping the THP completely.
I assume your kernel config has CONFIG_ZONE_DEVICE and
CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
I'd rule out copy_huge_pmd() and zap_huge_pmd() as well.
What happens if you revert the change in mm/pgtable-generic.c?
But the whole -ENOMEM error is a weird symptom.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 15:07 ` David Hildenbrand
@ 2025-10-17 15:20 ` Christian Borntraeger
2025-10-17 17:07 ` David Hildenbrand
0 siblings, 1 reply; 75+ messages in thread
From: Christian Borntraeger @ 2025-10-17 15:20 UTC (permalink / raw)
To: David Hildenbrand, balbirs
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
Am 17.10.25 um 17:07 schrieb David Hildenbrand:
> On 17.10.25 17:01, Christian Borntraeger wrote:
>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>
>>>> error: kvm run failed Cannot allocate memory
>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>
>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>
>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>
>> We have other weirdness in linux-next but in different areas. Could that somehow be
>> related to use disabling THP for the kvm address space?
>
> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>
> Remapping a file THP (shmem) implies zapping the THP completely.
>
>
> I assume in your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
yes.
>
> I'd rule out copy_huge_pmd(), zap_huge_pmd() a well.
>
>
> What happens if you revert the change in mm/pgtable-generic.c?
That partial revert seems to fix the issue
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0c847cdf4fd3..567e2d084071 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
if (pmdvalp)
*pmdvalp = pmdval;
- if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
+ if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
if (unlikely(pmd_trans_huge(pmdval)))
goto nomap;
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 15:20 ` Christian Borntraeger
@ 2025-10-17 17:07 ` David Hildenbrand
2025-10-17 21:56 ` Balbir Singh
0 siblings, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-10-17 17:07 UTC (permalink / raw)
To: Christian Borntraeger, balbirs
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
On 17.10.25 17:20, Christian Borntraeger wrote:
>
>
> Am 17.10.25 um 17:07 schrieb David Hildenbrand:
>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>>
>>>>> error: kvm run failed Cannot allocate memory
>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>
>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>
>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>
>>> We have other weirdness in linux-next but in different areas. Could that somehow be
>>> related to use disabling THP for the kvm address space?
>>
>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>
>> Remapping a file THP (shmem) implies zapping the THP completely.
>>
>>
>> I assume in your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>
> yes.
>
>>
>> I'd rule out copy_huge_pmd(), zap_huge_pmd() a well.
>>
>>
>> What happens if you revert the change in mm/pgtable-generic.c?
>
> That partial revert seems to fix the issue
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 0c847cdf4fd3..567e2d084071 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>
> if (pmdvalp)
> *pmdvalp = pmdval;
> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
Okay, but that means that effectively we stumble over a PMD entry that
is not a migration entry but still non-present.
And I would expect that it's a page table, because otherwise the change
wouldn't make a difference.
And the weird thing is that this only triggers sometimes, because if
it always triggered, nothing would ever work.
Is there some weird scenario where s390x might mark a page table that is
still mapped in a PMD as non-present?
Staring at the definition of pmd_present() on s390x it's really just
return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
Maybe this is happening in the gmap code only and not actually in the
core-mm code?
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 17:07 ` David Hildenbrand
@ 2025-10-17 21:56 ` Balbir Singh
2025-10-17 22:15 ` David Hildenbrand
2025-10-20 7:00 ` Christian Borntraeger
0 siblings, 2 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-17 21:56 UTC (permalink / raw)
To: David Hildenbrand, Christian Borntraeger
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
On 10/18/25 04:07, David Hildenbrand wrote:
> On 17.10.25 17:20, Christian Borntraeger wrote:
>>
>>
>> Am 17.10.25 um 17:07 schrieb David Hildenbrand:
>>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>>>
>>>>>> error: kvm run failed Cannot allocate memory
>>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>>
>>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>>
>>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>>
>>>> We have other weirdness in linux-next but in different areas. Could that somehow be
>>>> related to use disabling THP for the kvm address space?
>>>
>>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>>
>>> Remapping a file THP (shmem) implies zapping the THP completely.
>>>
>>>
>>> I assume in your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>>
>> yes.
>>
>>>
>>> I'd rule out copy_huge_pmd(), zap_huge_pmd() a well.
>>>
>>>
>>> What happens if you revert the change in mm/pgtable-generic.c?
>>
>> That partial revert seems to fix the issue
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 0c847cdf4fd3..567e2d084071 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>> if (pmdvalp)
>> *pmdvalp = pmdval;
>> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>
> Okay, but that means that effectively we stumble over a PMD entry that is not a migration entry but still non-present.
>
> And I would expect that it's a page table, because otherwise the change
> wouldn't make a difference.
>
> And the weird thing is that this only triggers sometimes, because if
> it would always trigger nothing would ever work.
>
> Is there some weird scenario where s390x might set a left page table mapped in a PMD to non-present?
>
Good point
> Staring at the definition of pmd_present() on s390x it's really just
>
> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>
>
> Maybe this is happening in the gmap code only and not actually in the core-mm code?
>
I am not an s390 expert, but just looking at the code:
The check on s390 is effectively
    segment_entry/present == false, or segment_entry_empty/invalid == true
Given that the revert works, the check changes to
    segment_entry/present == false, or pmd_migration_entry (PAGE_INVALID | PAGE_PROTECT)
So it isn't the first check (segment_entry/present == false) that is the problem;
it sounds like for s390 we would want __pte_offset_map() to allow mappings with
segment_entry_empty/invalid entries?
Any chance we can get a stack trace and a dump of the PMD entry when the issue occurs?
In the meanwhile, does this fix/workaround work?
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0c847cdf4fd3..31c1754d5bd4 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
if (pmdvalp)
*pmdvalp = pmdval;
- if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
+ if (unlikely(pmd_none(pmdval) || is_pmd_non_present_folio_entry(pmdval)))
goto nomap;
if (unlikely(pmd_trans_huge(pmdval)))
goto nomap;
Thanks David and Christian!
Balbir
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 21:56 ` Balbir Singh
@ 2025-10-17 22:15 ` David Hildenbrand
2025-10-17 22:41 ` David Hildenbrand
2025-10-20 7:00 ` Christian Borntraeger
1 sibling, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-10-17 22:15 UTC (permalink / raw)
To: Balbir Singh, Christian Borntraeger
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
On 17.10.25 23:56, Balbir Singh wrote:
> On 10/18/25 04:07, David Hildenbrand wrote:
>> On 17.10.25 17:20, Christian Borntraeger wrote:
>>>
>>>
>>> Am 17.10.25 um 17:07 schrieb David Hildenbrand:
>>>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>>>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>>>>
>>>>>>> error: kvm run failed Cannot allocate memory
>>>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>>>
>>>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>>>
>>>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>>>
>>>>> We have other weirdness in linux-next but in different areas. Could that somehow be
>>>>> related to use disabling THP for the kvm address space?
>>>>
>>>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>>>
>>>> Remapping a file THP (shmem) implies zapping the THP completely.
>>>>
>>>>
>>>> I assume in your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>>>
>>> yes.
>>>
>>>>
>>>> I'd rule out copy_huge_pmd(), zap_huge_pmd() a well.
>>>>
>>>>
>>>> What happens if you revert the change in mm/pgtable-generic.c?
>>>
>>> That partial revert seems to fix the issue
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index 0c847cdf4fd3..567e2d084071 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>> if (pmdvalp)
>>> *pmdvalp = pmdval;
>>> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>
>> Okay, but that means that effectively we stumble over a PMD entry that is not a migration entry but still non-present.
>>
>> And I would expect that it's a page table, because otherwise the change
>> wouldn't make a difference.
>>
>> And the weird thing is that this only triggers sometimes, because if
>> it would always trigger nothing would ever work.
>>
>> Is there some weird scenario where s390x might set a left page table mapped in a PMD to non-present?
>>
>
> Good point
>
>> Staring at the definition of pmd_present() on s390x it's really just
>>
>> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>
>>
>> Maybe this is happening in the gmap code only and not actually in the core-mm code?
>>
>
>
> I am not an s390 expert, but just looking at the code
>
> So the check on s390 effectively
>
> segment_entry/present = false or segment_entry_empty/invalid = true
pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set
because
return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
is the same as
return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;
But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.
I suspect that can only be the gmap tables.
Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
because it's a software managed bit for "ordinary" page tables, not gmap
tables.
Which raises the question why someone would wrongly use
pte_offset_map()/__pte_offset_map() on the gmap tables.
I cannot immediately spot any such usage in kvm/gmap code, though.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 22:15 ` David Hildenbrand
@ 2025-10-17 22:41 ` David Hildenbrand
2025-10-20 7:01 ` Christian Borntraeger
0 siblings, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-10-17 22:41 UTC (permalink / raw)
To: Balbir Singh, Christian Borntraeger
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
On 18.10.25 00:15, David Hildenbrand wrote:
> On 17.10.25 23:56, Balbir Singh wrote:
>> On 10/18/25 04:07, David Hildenbrand wrote:
>>> On 17.10.25 17:20, Christian Borntraeger wrote:
>>>>
>>>>
>>>> Am 17.10.25 um 17:07 schrieb David Hildenbrand:
>>>>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>>>>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>>>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>>>>>
>>>>>>>> error: kvm run failed Cannot allocate memory
>>>>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>>>>
>>>>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>>>>
>>>>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>>>>
>>>>>> We have other weirdness in linux-next but in different areas. Could that somehow be
>>>>>> related to use disabling THP for the kvm address space?
>>>>>
>>>>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>>>>
>>>>> Remapping a file THP (shmem) implies zapping the THP completely.
>>>>>
>>>>>
>>>>> I assume in your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>>>>
>>>> yes.
>>>>
>>>>>
>>>>> I'd rule out copy_huge_pmd(), zap_huge_pmd() a well.
>>>>>
>>>>>
>>>>> What happens if you revert the change in mm/pgtable-generic.c?
>>>>
>>>> That partial revert seems to fix the issue
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index 0c847cdf4fd3..567e2d084071 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>> if (pmdvalp)
>>>> *pmdvalp = pmdval;
>>>> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>>> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>>
>>> Okay, but that means that effectively we stumble over a PMD entry that is not a migration entry but still non-present.
>>>
>>> And I would expect that it's a page table, because otherwise the change
>>> wouldn't make a difference.
>>>
>>> And the weird thing is that this only triggers sometimes, because if
>>> it would always trigger nothing would ever work.
>>>
>>> Is there some weird scenario where s390x might set a left page table mapped in a PMD to non-present?
>>>
>>
>> Good point
>>
>>> Staring at the definition of pmd_present() on s390x it's really just
>>>
>>> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>>
>>>
>>> Maybe this is happening in the gmap code only and not actually in the core-mm code?
>>>
>>
>>
>> I am not an s390 expert, but just looking at the code
>>
>> So the check on s390 effectively
>>
>> segment_entry/present = false or segment_entry_empty/invalid = true
>
> pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set
>
> because
>
> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>
> is the same as
>
> return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;
>
> But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.
>
> I suspect that can only be the gmap tables.
>
> Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
> because it's a software managed bit for "ordinary" page tables, not gmap
> tables.
>
> Which raises the question why someone would wrongly use
> pte_offset_map()/__pte_offset_map() on the gmap tables.
>
> I cannot immediately spot any such usage in kvm/gmap code, though.
>
Ah, it's all that pte_alloc_map_lock() stuff in gmap.c.
Oh my.
So we're mapping a user PTE table that is linked into the gmap tables
through a PMD table that does not have the right SW bits set that we would
expect in a user PMD table.
What's also scary is that pte_alloc_map_lock() would try to pte_alloc()
a user page table in the gmap, which sounds completely wrong?
Yeah, when walking the gmap and wanting to lock the linked user PTE
table, we should probably never use the pte_*map variants but obtain
the lock through pte_lockptr().
All the magic we end up doing with RCU etc. in __pte_offset_map_lock()
does not apply to the gmap PMD table.
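Roughly what I have in mind, as an untested sketch (gmap details omitted, and
assuming the linked user PTE table can be reached from the gmap segment entry
the usual way):

	spinlock_t *ptl = pte_lockptr(mm, pmdp);  /* lock of the linked user PTE table */
	pte_t *ptep;

	spin_lock(ptl);
	ptep = pte_offset_kernel(pmdp, addr);     /* plain offset, no RCU mapping games */
	/* ... operate on *ptep under ptl ... */
	spin_unlock(ptl);

i.e. take the PTE table's lock directly instead of going through
pte_offset_map_lock(), which now (correctly) refuses !pmd_present() entries.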
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-01 6:57 ` [v7 11/16] mm/migrate_device: add THP splitting during migration Balbir Singh
2025-10-13 21:17 ` Zi Yan
@ 2025-10-19 8:19 ` Wei Yang
2025-10-19 22:49 ` Balbir Singh
1 sibling, 1 reply; 75+ messages in thread
From: Wei Yang @ 2025-10-19 8:19 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
[...]
> static int __folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct page *lock_at,
>- struct list_head *list, bool uniform_split)
>+ struct list_head *list, bool uniform_split, bool unmapped)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>@@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * is taken to serialise against parallel split or collapse
> * operations.
> */
>- anon_vma = folio_get_anon_vma(folio);
>- if (!anon_vma) {
>- ret = -EBUSY;
>- goto out;
>+ if (!unmapped) {
>+ anon_vma = folio_get_anon_vma(folio);
>+ if (!anon_vma) {
>+ ret = -EBUSY;
>+ goto out;
>+ }
>+ anon_vma_lock_write(anon_vma);
> }
> mapping = NULL;
>- anon_vma_lock_write(anon_vma);
> } else {
> unsigned int min_order;
> gfp_t gfp;
>@@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> goto out_unlock;
> }
>
>- unmap_folio(folio);
>+ if (!unmapped)
>+ unmap_folio(folio);
>
> /* block interrupt reentry in xa_lock and spinlock */
> local_irq_disable();
>@@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
> next = folio_next(new_folio);
>
>+ zone_device_private_split_cb(folio, new_folio);
>+
> expected_refs = folio_expected_ref_count(new_folio) + 1;
> folio_ref_unfreeze(new_folio, expected_refs);
>
>- lru_add_split_folio(folio, new_folio, lruvec, list);
>+ if (!unmapped)
>+ lru_add_split_folio(folio, new_folio, lruvec, list);
>
> /*
> * Anonymous folio with swap cache.
>@@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> __filemap_remove_folio(new_folio, NULL);
> folio_put_refs(new_folio, nr_pages);
> }
>+
>+ zone_device_private_split_cb(folio, NULL);
> /*
> * Unfreeze @folio only after all page cache entries, which
> * used to point to it, have been updated with new folios.
>@@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
> local_irq_enable();
>
>+ if (unmapped)
>+ return ret;
As the comments of __folio_split() and __split_huge_page_to_list_to_order()
mention:
* The large folio must be locked
* After splitting, the after-split folio containing @lock_at remains locked
But here we seem to change the prerequisites.
Hmm.. I am not sure this is correct.
>+
> if (nr_shmem_dropped)
> shmem_uncharge(mapping->host, nr_shmem_dropped);
>
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-19 8:19 ` Wei Yang
@ 2025-10-19 22:49 ` Balbir Singh
2025-10-19 22:59 ` Zi Yan
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-10-19 22:49 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 10/19/25 19:19, Wei Yang wrote:
> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
> [...]
>> static int __folio_split(struct folio *folio, unsigned int new_order,
>> struct page *split_at, struct page *lock_at,
>> - struct list_head *list, bool uniform_split)
>> + struct list_head *list, bool uniform_split, bool unmapped)
>> {
>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> * is taken to serialise against parallel split or collapse
>> * operations.
>> */
>> - anon_vma = folio_get_anon_vma(folio);
>> - if (!anon_vma) {
>> - ret = -EBUSY;
>> - goto out;
>> + if (!unmapped) {
>> + anon_vma = folio_get_anon_vma(folio);
>> + if (!anon_vma) {
>> + ret = -EBUSY;
>> + goto out;
>> + }
>> + anon_vma_lock_write(anon_vma);
>> }
>> mapping = NULL;
>> - anon_vma_lock_write(anon_vma);
>> } else {
>> unsigned int min_order;
>> gfp_t gfp;
>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> goto out_unlock;
>> }
>>
>> - unmap_folio(folio);
>> + if (!unmapped)
>> + unmap_folio(folio);
>>
>> /* block interrupt reentry in xa_lock and spinlock */
>> local_irq_disable();
>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> next = folio_next(new_folio);
>>
>> + zone_device_private_split_cb(folio, new_folio);
>> +
>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>> folio_ref_unfreeze(new_folio, expected_refs);
>>
>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>> + if (!unmapped)
>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>
>> /*
>> * Anonymous folio with swap cache.
>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> __filemap_remove_folio(new_folio, NULL);
>> folio_put_refs(new_folio, nr_pages);
>> }
>> +
>> + zone_device_private_split_cb(folio, NULL);
>> /*
>> * Unfreeze @folio only after all page cache entries, which
>> * used to point to it, have been updated with new folios.
>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> local_irq_enable();
>>
>> + if (unmapped)
>> + return ret;
>
> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
> mentioned:
>
> * The large folio must be locked
> * After splitting, the after-split folio containing @lock_at remains locked
>
> But here we seems to change the prerequisites.
>
> Hmm.. I am not sure this is correct.
>
The code is correct, but you are right in that the documentation needs to be updated.
When "unmapped", we do want to leave the folios locked after the split.
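Something along these lines for the comment block (suggested wording only, not
yet part of the series):

 * The large folio must be locked.
 * After splitting, the after-split folio containing @lock_at remains locked.
 * In the unmapped (migration) case the folio is already unmapped on entry,
 *   and all after-split folios are returned locked; the caller is then
 *   responsible for unlocking them.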
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-19 22:49 ` Balbir Singh
@ 2025-10-19 22:59 ` Zi Yan
2025-10-21 21:34 ` Balbir Singh
0 siblings, 1 reply; 75+ messages in thread
From: Zi Yan @ 2025-10-19 22:59 UTC (permalink / raw)
To: Balbir Singh
Cc: Wei Yang, linux-kernel, dri-devel, linux-mm, akpm,
David Hildenbrand, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Matthew Brost, Francois Dugast
On 19 Oct 2025, at 18:49, Balbir Singh wrote:
> On 10/19/25 19:19, Wei Yang wrote:
>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>> [...]
>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>> struct page *split_at, struct page *lock_at,
>>> - struct list_head *list, bool uniform_split)
>>> + struct list_head *list, bool uniform_split, bool unmapped)
>>> {
>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> * is taken to serialise against parallel split or collapse
>>> * operations.
>>> */
>>> - anon_vma = folio_get_anon_vma(folio);
>>> - if (!anon_vma) {
>>> - ret = -EBUSY;
>>> - goto out;
>>> + if (!unmapped) {
>>> + anon_vma = folio_get_anon_vma(folio);
>>> + if (!anon_vma) {
>>> + ret = -EBUSY;
>>> + goto out;
>>> + }
>>> + anon_vma_lock_write(anon_vma);
>>> }
>>> mapping = NULL;
>>> - anon_vma_lock_write(anon_vma);
>>> } else {
>>> unsigned int min_order;
>>> gfp_t gfp;
>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> goto out_unlock;
>>> }
>>>
>>> - unmap_folio(folio);
>>> + if (!unmapped)
>>> + unmap_folio(folio);
>>>
>>> /* block interrupt reentry in xa_lock and spinlock */
>>> local_irq_disable();
>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> next = folio_next(new_folio);
>>>
>>> + zone_device_private_split_cb(folio, new_folio);
>>> +
>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>
>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>> + if (!unmapped)
>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>
>>> /*
>>> * Anonymous folio with swap cache.
>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> __filemap_remove_folio(new_folio, NULL);
>>> folio_put_refs(new_folio, nr_pages);
>>> }
>>> +
>>> + zone_device_private_split_cb(folio, NULL);
>>> /*
>>> * Unfreeze @folio only after all page cache entries, which
>>> * used to point to it, have been updated with new folios.
>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> local_irq_enable();
>>>
>>> + if (unmapped)
>>> + return ret;
>>
>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>> mentioned:
>>
>> * The large folio must be locked
>> * After splitting, the after-split folio containing @lock_at remains locked
>>
>> But here we seems to change the prerequisites.
>>
>> Hmm.. I am not sure this is correct.
>>
>
> The code is correct, but you are right in that the documentation needs to be updated.
> When "unmapped", we do want to leave the folios locked after the split.
Sigh, this "unmapped" code needs so many special branches and a different locking
requirement. It should be a separate function to avoid confusion.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 21:56 ` Balbir Singh
2025-10-17 22:15 ` David Hildenbrand
@ 2025-10-20 7:00 ` Christian Borntraeger
2025-10-20 8:41 ` David Hildenbrand
1 sibling, 1 reply; 75+ messages in thread
From: Christian Borntraeger @ 2025-10-20 7:00 UTC (permalink / raw)
To: Balbir Singh, David Hildenbrand, Claudio Imbrenda
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
Am 17.10.25 um 23:56 schrieb Balbir Singh:
> In the meanwhile, does this fix/workaround work?
>
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 0c847cdf4fd3..31c1754d5bd4 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>
> if (pmdvalp)
> *pmdvalp = pmdval;
> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
> + if (unlikely(pmd_none(pmdval) || is_pmd_non_present_folio_entry(pmdval)))
> goto nomap;
> if (unlikely(pmd_trans_huge(pmdval)))
> goto nomap;
>
Yes, this seems to work.
CC Claudio.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-17 22:41 ` David Hildenbrand
@ 2025-10-20 7:01 ` Christian Borntraeger
0 siblings, 0 replies; 75+ messages in thread
From: Christian Borntraeger @ 2025-10-20 7:01 UTC (permalink / raw)
To: David Hildenbrand, Balbir Singh, Claudio Imbrenda
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
Am 18.10.25 um 00:41 schrieb David Hildenbrand:
> On 18.10.25 00:15, David Hildenbrand wrote:
>> On 17.10.25 23:56, Balbir Singh wrote:
>>> On 10/18/25 04:07, David Hildenbrand wrote:
>>>> On 17.10.25 17:20, Christian Borntraeger wrote:
>>>>>
>>>>>
>>>>> Am 17.10.25 um 17:07 schrieb David Hildenbrand:
>>>>>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>>>>>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>>>>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>>>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>>>>>>
>>>>>>>>> error: kvm run failed Cannot allocate memory
>>>>>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>>>>>
>>>>>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>>>>>
>>>>>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>>>>>
>>>>>>> We have other weirdness in linux-next but in different areas. Could that somehow be
>>>>>>> related to use disabling THP for the kvm address space?
>>>>>>
>>>>>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>>>>>
>>>>>> Remapping a file THP (shmem) implies zapping the THP completely.
>>>>>>
>>>>>>
>>>>>> I assume in your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>>>>>
>>>>> yes.
>>>>>
>>>>>>
>>>>>> I'd rule out copy_huge_pmd(), zap_huge_pmd() a well.
>>>>>>
>>>>>>
>>>>>> What happens if you revert the change in mm/pgtable-generic.c?
>>>>>
>>>>> That partial revert seems to fix the issue
>>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>>> index 0c847cdf4fd3..567e2d084071 100644
>>>>> --- a/mm/pgtable-generic.c
>>>>> +++ b/mm/pgtable-generic.c
>>>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>>> if (pmdvalp)
>>>>> *pmdvalp = pmdval;
>>>>> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>>>> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>>>
>>>> Okay, but that means that effectively we stumble over a PMD entry that is not a migration entry but still non-present.
>>>>
>>>> And I would expect that it's a page table, because otherwise the change
>>>> wouldn't make a difference.
>>>>
>>>> And the weird thing is that this only triggers sometimes, because if
>>>> it would always trigger nothing would ever work.
>>>>
>>>> Is there some weird scenario where s390x might set a left page table mapped in a PMD to non-present?
>>>>
>>>
>>> Good point
>>>
>>>> Staring at the definition of pmd_present() on s390x it's really just
>>>>
>>>> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>>>
>>>>
>>>> Maybe this is happening in the gmap code only and not actually in the core-mm code?
>>>>
>>>
>>>
>>> I am not an s390 expert, but just looking at the code
>>>
>>> So the check on s390 effectively
>>>
>>> segment_entry/present = false or segment_entry_empty/invalid = true
>>
>> pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set
>>
>> because
>>
>> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>
>> is the same as
>>
>> return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;
>>
>> But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.
>>
>> I suspect that can only be the gmap tables.
>>
>> Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
>> because it's a software managed bit for "ordinary" page tables, not gmap
>> tables.
>>
>> Which raises the question why someone would wrongly use
>> pte_offset_map()/__pte_offset_map() on the gmap tables.
>>
>> I cannot immediately spot any such usage in kvm/gmap code, though.
>>
>
> Ah, it's all that pte_alloc_map_lock() stuff in gmap.c.
>
> Oh my.
>
> So we're mapping a user PTE table that is linked into the gmap tables through a PMD table that does not have the right sw bits set we would expect in a user PMD table.
>
> What's also scary is that pte_alloc_map_lock() would try to pte_alloc() a user page table in the gmap, which sounds completely wrong?
>
> Yeah, when walking the gmap and wanting to lock the linked user PTE table, we should probably never use the pte_*map variants but obtain
> the lock through pte_lockptr().
>
> All magic we end up doing with RCU etc in __pte_offset_map_lock()
> does not apply to the gmap PMD table.
>
CC Claudio.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-20 7:00 ` Christian Borntraeger
@ 2025-10-20 8:41 ` David Hildenbrand
2025-10-20 9:04 ` Claudio Imbrenda
2025-10-27 16:47 ` Claudio Imbrenda
0 siblings, 2 replies; 75+ messages in thread
From: David Hildenbrand @ 2025-10-20 8:41 UTC (permalink / raw)
To: Christian Borntraeger, Balbir Singh, Claudio Imbrenda
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next
On 20.10.25 09:00, Christian Borntraeger wrote:
> Am 17.10.25 um 23:56 schrieb Balbir Singh:
>
>> In the meanwhile, does this fix/workaround work?
>>
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 0c847cdf4fd3..31c1754d5bd4 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>
>> if (pmdvalp)
>> *pmdvalp = pmdval;
>> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>> + if (unlikely(pmd_none(pmdval) || is_pmd_non_present_folio_entry(pmdval)))
>> goto nomap;
>> if (unlikely(pmd_trans_huge(pmdval)))
>> goto nomap;
>>
>
> Yes, this seems to work.
Right, but that's not what we will want here. We'll have to adjust s390x
gmap code (which is getting redesigned either way) to only take the page
lock.
In the end, what we'll eventually want here is just a single
if (!pmd_present(pmdval))
goto nomap;
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-20 8:41 ` David Hildenbrand
@ 2025-10-20 9:04 ` Claudio Imbrenda
2025-10-27 16:47 ` Claudio Imbrenda
1 sibling, 0 replies; 75+ messages in thread
From: Claudio Imbrenda @ 2025-10-20 9:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: Christian Borntraeger, Balbir Singh, Liam.Howlett, airlied, akpm,
apopple, baohua, baolin.wang, byungchul, dakr, dev.jain,
dri-devel, francois.dugast, gourry, joshua.hahnjy, linux-kernel,
linux-mm, lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next
On Mon, 20 Oct 2025 10:41:28 +0200
David Hildenbrand <david@redhat.com> wrote:
> On 20.10.25 09:00, Christian Borntraeger wrote:
> > Am 17.10.25 um 23:56 schrieb Balbir Singh:
> >
> >> In the meanwhile, does this fix/workaround work?
> >>
> >> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> >> index 0c847cdf4fd3..31c1754d5bd4 100644
> >> --- a/mm/pgtable-generic.c
> >> +++ b/mm/pgtable-generic.c
> >> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> >>
> >> if (pmdvalp)
> >> *pmdvalp = pmdval;
> >> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
> >> + if (unlikely(pmd_none(pmdval) || is_pmd_non_present_folio_entry(pmdval)))
> >> goto nomap;
> >> if (unlikely(pmd_trans_huge(pmdval)))
> >> goto nomap;
> >>
> >
> > Yes, this seems to work.
>
> Right, but that's not what we will want here. We'll have to adjust s390x
I'm looking into that
> gmap code (which is getting redesigned either way) to only take the page
unfortunately the rework won't make it in 6.18, so I'll have to quickly
cobble together a fix
> lock.
>
> In the end, we'll want here later a single
>
> if (!pmd_present(pmdval))
> goto nomap;
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-19 22:59 ` Zi Yan
@ 2025-10-21 21:34 ` Balbir Singh
2025-10-22 2:59 ` Zi Yan
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-10-21 21:34 UTC (permalink / raw)
To: Zi Yan
Cc: Wei Yang, linux-kernel, dri-devel, linux-mm, akpm,
David Hildenbrand, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Matthew Brost, Francois Dugast
On 10/20/25 09:59, Zi Yan wrote:
> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>
>> On 10/19/25 19:19, Wei Yang wrote:
>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>> [...]
>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> struct page *split_at, struct page *lock_at,
>>>> - struct list_head *list, bool uniform_split)
>>>> + struct list_head *list, bool uniform_split, bool unmapped)
>>>> {
>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> * is taken to serialise against parallel split or collapse
>>>> * operations.
>>>> */
>>>> - anon_vma = folio_get_anon_vma(folio);
>>>> - if (!anon_vma) {
>>>> - ret = -EBUSY;
>>>> - goto out;
>>>> + if (!unmapped) {
>>>> + anon_vma = folio_get_anon_vma(folio);
>>>> + if (!anon_vma) {
>>>> + ret = -EBUSY;
>>>> + goto out;
>>>> + }
>>>> + anon_vma_lock_write(anon_vma);
>>>> }
>>>> mapping = NULL;
>>>> - anon_vma_lock_write(anon_vma);
>>>> } else {
>>>> unsigned int min_order;
>>>> gfp_t gfp;
>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> goto out_unlock;
>>>> }
>>>>
>>>> - unmap_folio(folio);
>>>> + if (!unmapped)
>>>> + unmap_folio(folio);
>>>>
>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>> local_irq_disable();
>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>> next = folio_next(new_folio);
>>>>
>>>> + zone_device_private_split_cb(folio, new_folio);
>>>> +
>>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>>
>>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>>> + if (!unmapped)
>>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>
>>>> /*
>>>> * Anonymous folio with swap cache.
>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> __filemap_remove_folio(new_folio, NULL);
>>>> folio_put_refs(new_folio, nr_pages);
>>>> }
>>>> +
>>>> + zone_device_private_split_cb(folio, NULL);
>>>> /*
>>>> * Unfreeze @folio only after all page cache entries, which
>>>> * used to point to it, have been updated with new folios.
>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>> local_irq_enable();
>>>>
>>>> + if (unmapped)
>>>> + return ret;
>>>
>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>> mentioned:
>>>
>>> * The large folio must be locked
>>> * After splitting, the after-split folio containing @lock_at remains locked
>>>
>>> But here we seem to change the prerequisites.
>>>
>>> Hmm.. I am not sure this is correct.
>>>
>>
>> The code is correct, but you are right in that the documentation needs to be updated.
>> When "unmapped", we do want to leave the folios locked after the split.
>
> Sigh, this "unmapped" code needs so many special branches and a different locking
> requirement. It should be a separate function to avoid confusions.
>
Yep, I have a patch for it, I am also waiting on Matthew's feedback, FYI, here is
a WIP patch that can be applied on top of the series
---
include/linux/huge_mm.h | 5 +-
mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++------
mm/migrate_device.c | 3 +-
3 files changed, 120 insertions(+), 25 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c4a811958cda..86e1cefaf391 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order, bool unmapped);
+ unsigned int new_order);
+int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order)
{
- return __split_huge_page_to_list_to_order(page, list, new_order, false);
+ return __split_huge_page_to_list_to_order(page, list, new_order);
}
/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8c82a0ac6e69..e20cbf68d037 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
* @lock_at: a page within @folio to be left locked to caller
* @list: after-split folios will be put on it if non NULL
* @uniform_split: perform uniform split or not (non-uniform split)
- * @unmapped: The pages are already unmapped, they are migration entries.
*
* It calls __split_unmapped_folio() to perform uniform and non-uniform split.
* It is in charge of checking whether the split is supported or not and
@@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
*/
static int __folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct page *lock_at,
- struct list_head *list, bool uniform_split, bool unmapped)
+ struct list_head *list, bool uniform_split)
{
struct deferred_split *ds_queue;
XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* is taken to serialise against parallel split or collapse
* operations.
*/
- if (!unmapped) {
- anon_vma = folio_get_anon_vma(folio);
- if (!anon_vma) {
- ret = -EBUSY;
- goto out;
- }
- anon_vma_lock_write(anon_vma);
+ anon_vma = folio_get_anon_vma(folio);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
}
+ anon_vma_lock_write(anon_vma);
mapping = NULL;
} else {
unsigned int min_order;
@@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
goto out_unlock;
}
- if (!unmapped)
- unmap_folio(folio);
+ unmap_folio(folio);
/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
@@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
expected_refs = folio_expected_ref_count(new_folio) + 1;
folio_ref_unfreeze(new_folio, expected_refs);
- if (!unmapped)
- lru_add_split_folio(folio, new_folio, lruvec, list);
+ lru_add_split_folio(folio, new_folio, lruvec, list);
/*
* Anonymous folio with swap cache.
@@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
local_irq_enable();
- if (unmapped)
- return ret;
-
if (nr_shmem_dropped)
shmem_uncharge(mapping->host, nr_shmem_dropped);
@@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
return ret;
}
+/*
+ * This function is a helper for splitting folios that have already been unmapped.
+ * The use case is that the device or the CPU can refuse to migrate THP pages in
+ * the middle of migration, due to allocation issues on either side
+ *
+ * The high level code is copied from __folio_split, since the pages are anonymous
+ * and are already isolated from the LRU, the code has been simplified to not
+ * burden __folio_split with unmapped sprinkled into the code.
+ *
+ * None of the split folios are unlocked
+ */
+int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order)
+{
+ int extra_pins;
+ int ret = 0;
+ struct folio *new_folio, *next;
+ struct folio *end_folio = folio_next(folio);
+ struct deferred_split *ds_queue;
+ int old_order = folio_order(folio);
+
+ VM_WARN_ON_FOLIO(folio_mapped(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
+
+ if (!can_split_folio(folio, 1, &extra_pins)) {
+ ret = -EAGAIN;
+ goto err;
+ }
+
+ local_irq_disable();
+ /* Prevent deferred_split_scan() touching ->_refcount */
+ ds_queue = folio_split_queue_lock(folio);
+ if (folio_ref_freeze(folio, 1 + extra_pins)) {
+ int expected_refs;
+ struct swap_cluster_info *ci = NULL;
+
+ if (old_order > 1) {
+ if (!list_empty(&folio->_deferred_list)) {
+ ds_queue->split_queue_len--;
+ /*
+ * Reinitialize page_deferred_list after
+ * removing the page from the split_queue,
+ * otherwise a subsequent split will see list
+ * corruption when checking the
+ * page_deferred_list.
+ */
+ list_del_init(&folio->_deferred_list);
+ }
+ if (folio_test_partially_mapped(folio)) {
+ folio_clear_partially_mapped(folio);
+ mod_mthp_stat(old_order,
+ MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+ }
+ /*
+ * Reinitialize page_deferred_list after removing the
+ * page from the split_queue, otherwise a subsequent
+ * split will see list corruption when checking the
+ * page_deferred_list.
+ */
+ list_del_init(&folio->_deferred_list);
+ }
+ split_queue_unlock(ds_queue);
+
+ if (folio_test_swapcache(folio))
+ ci = swap_cluster_get_and_lock(folio);
+
+ ret = __split_unmapped_folio(folio, new_order, &folio->page,
+ NULL, NULL, true);
+
+ /*
+ * Unfreeze after-split folios
+ */
+ for (new_folio = folio_next(folio); new_folio != end_folio;
+ new_folio = next) {
+ next = folio_next(new_folio);
+
+ zone_device_private_split_cb(folio, new_folio);
+
+ expected_refs = folio_expected_ref_count(new_folio) + 1;
+ folio_ref_unfreeze(new_folio, expected_refs);
+ if (ci)
+ __swap_cache_replace_folio(ci, folio, new_folio);
+ }
+
+ zone_device_private_split_cb(folio, NULL);
+ /*
+ * Unfreeze @folio only after all page cache entries, which
+ * used to point to it, have been updated with new folios.
+ * Otherwise, a parallel folio_try_get() can grab @folio
+ * and its caller can see stale page cache entries.
+ */
+ expected_refs = folio_expected_ref_count(folio) + 1;
+ folio_ref_unfreeze(folio, expected_refs);
+
+ if (ci)
+ swap_cluster_unlock(ci);
+ } else {
+ split_queue_unlock(ds_queue);
+ ret = -EAGAIN;
+ }
+ local_irq_enable();
+err:
+ return ret;
+}
+
/*
* This function splits a large folio into smaller folios of order @new_order.
* @page can point to any page of the large folio to split. The split operation
@@ -4105,12 +4202,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* with the folio. Splitting to order 0 is compatible with all folios.
*/
int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order, bool unmapped)
+ unsigned int new_order)
{
struct folio *folio = page_folio(page);
- return __folio_split(folio, new_order, &folio->page, page, list, true,
- unmapped);
+ return __folio_split(folio, new_order, &folio->page, page, list, true);
}
/*
@@ -4138,8 +4234,7 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
int folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct list_head *list)
{
- return __folio_split(folio, new_order, split_at, &folio->page, list,
- false, false);
+ return __folio_split(folio, new_order, split_at, &folio->page, list, false);
}
int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index c869b272e85a..23515f3ffc35 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -918,8 +918,7 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
folio_get(folio);
split_huge_pmd_address(migrate->vma, addr, true);
- ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
- 0, true);
+ ret = split_unmapped_folio_to_order(folio, 0);
if (ret)
return ret;
migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
--
2.51.0
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-21 21:34 ` Balbir Singh
@ 2025-10-22 2:59 ` Zi Yan
2025-10-22 7:16 ` Balbir Singh
0 siblings, 1 reply; 75+ messages in thread
From: Zi Yan @ 2025-10-22 2:59 UTC (permalink / raw)
To: Balbir Singh
Cc: Wei Yang, linux-kernel, dri-devel, linux-mm, akpm,
David Hildenbrand, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Matthew Brost, Francois Dugast
On 21 Oct 2025, at 17:34, Balbir Singh wrote:
> On 10/20/25 09:59, Zi Yan wrote:
>> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>>
>>> On 10/19/25 19:19, Wei Yang wrote:
>>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>>> [...]
>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> struct page *split_at, struct page *lock_at,
>>>>> - struct list_head *list, bool uniform_split)
>>>>> + struct list_head *list, bool uniform_split, bool unmapped)
>>>>> {
>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> * is taken to serialise against parallel split or collapse
>>>>> * operations.
>>>>> */
>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>> - if (!anon_vma) {
>>>>> - ret = -EBUSY;
>>>>> - goto out;
>>>>> + if (!unmapped) {
>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>> + if (!anon_vma) {
>>>>> + ret = -EBUSY;
>>>>> + goto out;
>>>>> + }
>>>>> + anon_vma_lock_write(anon_vma);
>>>>> }
>>>>> mapping = NULL;
>>>>> - anon_vma_lock_write(anon_vma);
>>>>> } else {
>>>>> unsigned int min_order;
>>>>> gfp_t gfp;
>>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> goto out_unlock;
>>>>> }
>>>>>
>>>>> - unmap_folio(folio);
>>>>> + if (!unmapped)
>>>>> + unmap_folio(folio);
>>>>>
>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>> local_irq_disable();
>>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>
>>>>> next = folio_next(new_folio);
>>>>>
>>>>> + zone_device_private_split_cb(folio, new_folio);
>>>>> +
>>>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>>>
>>>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>> + if (!unmapped)
>>>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>
>>>>> /*
>>>>> * Anonymous folio with swap cache.
>>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> __filemap_remove_folio(new_folio, NULL);
>>>>> folio_put_refs(new_folio, nr_pages);
>>>>> }
>>>>> +
>>>>> + zone_device_private_split_cb(folio, NULL);
>>>>> /*
>>>>> * Unfreeze @folio only after all page cache entries, which
>>>>> * used to point to it, have been updated with new folios.
>>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>
>>>>> local_irq_enable();
>>>>>
>>>>> + if (unmapped)
>>>>> + return ret;
>>>>
>>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>>> mentioned:
>>>>
>>>> * The large folio must be locked
>>>> * After splitting, the after-split folio containing @lock_at remains locked
>>>>
>>>> But here we seem to change the prerequisites.
>>>>
>>>> Hmm.. I am not sure this is correct.
>>>>
>>>
>>> The code is correct, but you are right in that the documentation needs to be updated.
>>> When "unmapped", we do want to leave the folios locked after the split.
>>
>> Sigh, this "unmapped" code needs so many special branches and a different locking
>> requirement. It should be a separate function to avoid confusions.
>>
>
> Yep, I have a patch for it, I am also waiting on Matthew's feedback, FYI, here is
> a WIP patch that can be applied on top of the series
Nice cleanup! Thanks.
>
> ---
> include/linux/huge_mm.h | 5 +-
> mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++------
> mm/migrate_device.c | 3 +-
> 3 files changed, 120 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c4a811958cda..86e1cefaf391 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>
> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> - unsigned int new_order, bool unmapped);
> + unsigned int new_order);
> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
> int min_order_for_split(struct folio *folio);
> int split_folio_to_list(struct folio *folio, struct list_head *list);
> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> @@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
> static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> unsigned int new_order)
> {
> - return __split_huge_page_to_list_to_order(page, list, new_order, false);
> + return __split_huge_page_to_list_to_order(page, list, new_order);
> }
>
> /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8c82a0ac6e69..e20cbf68d037 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> * @lock_at: a page within @folio to be left locked to caller
> * @list: after-split folios will be put on it if non NULL
> * @uniform_split: perform uniform split or not (non-uniform split)
> - * @unmapped: The pages are already unmapped, they are migration entries.
> *
> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> * It is in charge of checking whether the split is supported or not and
> @@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> */
> static int __folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct page *lock_at,
> - struct list_head *list, bool uniform_split, bool unmapped)
> + struct list_head *list, bool uniform_split)
> {
> struct deferred_split *ds_queue;
> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> @@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * is taken to serialise against parallel split or collapse
> * operations.
> */
> - if (!unmapped) {
> - anon_vma = folio_get_anon_vma(folio);
> - if (!anon_vma) {
> - ret = -EBUSY;
> - goto out;
> - }
> - anon_vma_lock_write(anon_vma);
> + anon_vma = folio_get_anon_vma(folio);
> + if (!anon_vma) {
> + ret = -EBUSY;
> + goto out;
> }
> + anon_vma_lock_write(anon_vma);
> mapping = NULL;
> } else {
> unsigned int min_order;
> @@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> goto out_unlock;
> }
>
> - if (!unmapped)
> - unmap_folio(folio);
> + unmap_folio(folio);
>
> /* block interrupt reentry in xa_lock and spinlock */
> local_irq_disable();
> @@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> expected_refs = folio_expected_ref_count(new_folio) + 1;
> folio_ref_unfreeze(new_folio, expected_refs);
>
> - if (!unmapped)
> - lru_add_split_folio(folio, new_folio, lruvec, list);
> + lru_add_split_folio(folio, new_folio, lruvec, list);
>
> /*
> * Anonymous folio with swap cache.
> @@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
> local_irq_enable();
>
> - if (unmapped)
> - return ret;
> -
> if (nr_shmem_dropped)
> shmem_uncharge(mapping->host, nr_shmem_dropped);
>
> @@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> return ret;
> }
>
> +/*
> + * This function is a helper for splitting folios that have already been unmapped.
> + * The use case is that the device or the CPU can refuse to migrate THP pages in
> + * the middle of migration, due to allocation issues on either side
> + *
> + * The high level code is copied from __folio_split, since the pages are anonymous
> + * and are already isolated from the LRU, the code has been simplified to not
> + * burden __folio_split with unmapped sprinkled into the code.
I wonder if it makes sense to remove CPU side folio from both deferred_split queue
and swap cache before migration to further simplify split_unmapped_folio_to_order().
Basically require that device private folios cannot be on deferred_split queue nor
swap cache.
> + *
> + * None of the split folios are unlocked
> + */
> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order)
> +{
> + int extra_pins;
> + int ret = 0;
> + struct folio *new_folio, *next;
> + struct folio *end_folio = folio_next(folio);
> + struct deferred_split *ds_queue;
> + int old_order = folio_order(folio);
> +
> + VM_WARN_ON_FOLIO(folio_mapped(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
> +
> + if (!can_split_folio(folio, 1, &extra_pins)) {
> + ret = -EAGAIN;
> + goto err;
> + }
> +
> + local_irq_disable();
> + /* Prevent deferred_split_scan() touching ->_refcount */
> + ds_queue = folio_split_queue_lock(folio);
> + if (folio_ref_freeze(folio, 1 + extra_pins)) {
> + int expected_refs;
> + struct swap_cluster_info *ci = NULL;
> +
> + if (old_order > 1) {
> + if (!list_empty(&folio->_deferred_list)) {
> + ds_queue->split_queue_len--;
> + /*
> + * Reinitialize page_deferred_list after
> + * removing the page from the split_queue,
> + * otherwise a subsequent split will see list
> + * corruption when checking the
> + * page_deferred_list.
> + */
> + list_del_init(&folio->_deferred_list);
> + }
> + if (folio_test_partially_mapped(folio)) {
> + folio_clear_partially_mapped(folio);
> + mod_mthp_stat(old_order,
> + MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> + }
> + /*
> + * Reinitialize page_deferred_list after removing the
> + * page from the split_queue, otherwise a subsequent
> + * split will see list corruption when checking the
> + * page_deferred_list.
> + */
> + list_del_init(&folio->_deferred_list);
> + }
> + split_queue_unlock(ds_queue);
> +
> + if (folio_test_swapcache(folio))
> + ci = swap_cluster_get_and_lock(folio);
> +
> + ret = __split_unmapped_folio(folio, new_order, &folio->page,
> + NULL, NULL, true);
> +
> + /*
> + * Unfreeze after-split folios
> + */
> + for (new_folio = folio_next(folio); new_folio != end_folio;
> + new_folio = next) {
> + next = folio_next(new_folio);
> +
> + zone_device_private_split_cb(folio, new_folio);
> +
> + expected_refs = folio_expected_ref_count(new_folio) + 1;
> + folio_ref_unfreeze(new_folio, expected_refs);
> + if (ci)
> + __swap_cache_replace_folio(ci, folio, new_folio);
> + }
> +
> + zone_device_private_split_cb(folio, NULL);
> + /*
> + * Unfreeze @folio only after all page cache entries, which
> + * used to point to it, have been updated with new folios.
> + * Otherwise, a parallel folio_try_get() can grab @folio
> + * and its caller can see stale page cache entries.
> + */
> + expected_refs = folio_expected_ref_count(folio) + 1;
> + folio_ref_unfreeze(folio, expected_refs);
> +
> + if (ci)
> + swap_cluster_unlock(ci);
> + } else {
> + split_queue_unlock(ds_queue);
> + ret = -EAGAIN;
> + }
> + local_irq_enable();
> +err:
> + return ret;
> +}
> +
> /*
> * This function splits a large folio into smaller folios of order @new_order.
> * @page can point to any page of the large folio to split. The split operation
> @@ -4105,12 +4202,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * with the folio. Splitting to order 0 is compatible with all folios.
> */
> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> - unsigned int new_order, bool unmapped)
> + unsigned int new_order)
> {
> struct folio *folio = page_folio(page);
>
> - return __folio_split(folio, new_order, &folio->page, page, list, true,
> - unmapped);
> + return __folio_split(folio, new_order, &folio->page, page, list, true);
> }
>
> /*
> @@ -4138,8 +4234,7 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
> int folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct list_head *list)
> {
> - return __folio_split(folio, new_order, split_at, &folio->page, list,
> - false, false);
> + return __folio_split(folio, new_order, split_at, &folio->page, list, false);
> }
>
> int min_order_for_split(struct folio *folio)
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index c869b272e85a..23515f3ffc35 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -918,8 +918,7 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>
> folio_get(folio);
> split_huge_pmd_address(migrate->vma, addr, true);
> - ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
> - 0, true);
> + ret = split_unmapped_folio_to_order(folio, 0);
> if (ret)
> return ret;
> migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
> --
> 2.51.0
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-22 2:59 ` Zi Yan
@ 2025-10-22 7:16 ` Balbir Singh
2025-10-22 15:26 ` Zi Yan
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-10-22 7:16 UTC (permalink / raw)
To: Zi Yan
Cc: Wei Yang, linux-kernel, dri-devel, linux-mm, akpm,
David Hildenbrand, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Matthew Brost, Francois Dugast
On 10/22/25 13:59, Zi Yan wrote:
> On 21 Oct 2025, at 17:34, Balbir Singh wrote:
>
>> On 10/20/25 09:59, Zi Yan wrote:
>>> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>>>
>>>> On 10/19/25 19:19, Wei Yang wrote:
>>>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>>>> [...]
>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> struct page *split_at, struct page *lock_at,
>>>>>> - struct list_head *list, bool uniform_split)
>>>>>> + struct list_head *list, bool uniform_split, bool unmapped)
>>>>>> {
>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> * is taken to serialise against parallel split or collapse
>>>>>> * operations.
>>>>>> */
>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>> - if (!anon_vma) {
>>>>>> - ret = -EBUSY;
>>>>>> - goto out;
>>>>>> + if (!unmapped) {
>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>> + if (!anon_vma) {
>>>>>> + ret = -EBUSY;
>>>>>> + goto out;
>>>>>> + }
>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>> }
>>>>>> mapping = NULL;
>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>> } else {
>>>>>> unsigned int min_order;
>>>>>> gfp_t gfp;
>>>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> goto out_unlock;
>>>>>> }
>>>>>>
>>>>>> - unmap_folio(folio);
>>>>>> + if (!unmapped)
>>>>>> + unmap_folio(folio);
>>>>>>
>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>> local_irq_disable();
>>>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>
>>>>>> next = folio_next(new_folio);
>>>>>>
>>>>>> + zone_device_private_split_cb(folio, new_folio);
>>>>>> +
>>>>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>>>>
>>>>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>> + if (!unmapped)
>>>>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>
>>>>>> /*
>>>>>> * Anonymous folio with swap cache.
>>>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> __filemap_remove_folio(new_folio, NULL);
>>>>>> folio_put_refs(new_folio, nr_pages);
>>>>>> }
>>>>>> +
>>>>>> + zone_device_private_split_cb(folio, NULL);
>>>>>> /*
>>>>>> * Unfreeze @folio only after all page cache entries, which
>>>>>> * used to point to it, have been updated with new folios.
>>>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>
>>>>>> local_irq_enable();
>>>>>>
>>>>>> + if (unmapped)
>>>>>> + return ret;
>>>>>
>>>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>>>> mentioned:
>>>>>
>>>>> * The large folio must be locked
>>>>> * After splitting, the after-split folio containing @lock_at remains locked
>>>>>
>>>>> But here we seem to change the prerequisites.
>>>>>
>>>>> Hmm.. I am not sure this is correct.
>>>>>
>>>>
>>>> The code is correct, but you are right in that the documentation needs to be updated.
>>>> When "unmapped", we do want to leave the folios locked after the split.
>>>
>>> Sigh, this "unmapped" code needs so many special branches and a different locking
>>> requirement. It should be a separate function to avoid confusions.
>>>
>>
>> Yep, I have a patch for it, I am also waiting on Matthew's feedback, FYI, here is
>> a WIP patch that can be applied on top of the series
>
> Nice cleanup! Thanks.
>
>>
>> ---
>> include/linux/huge_mm.h | 5 +-
>> mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++------
>> mm/migrate_device.c | 3 +-
>> 3 files changed, 120 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index c4a811958cda..86e1cefaf391 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>
>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> - unsigned int new_order, bool unmapped);
>> + unsigned int new_order);
>> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
>> int min_order_for_split(struct folio *folio);
>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> @@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>> static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> unsigned int new_order)
>> {
>> - return __split_huge_page_to_list_to_order(page, list, new_order, false);
>> + return __split_huge_page_to_list_to_order(page, list, new_order);
>> }
>>
>> /*
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8c82a0ac6e69..e20cbf68d037 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> * @lock_at: a page within @folio to be left locked to caller
>> * @list: after-split folios will be put on it if non NULL
>> * @uniform_split: perform uniform split or not (non-uniform split)
>> - * @unmapped: The pages are already unmapped, they are migration entries.
>> *
>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>> * It is in charge of checking whether the split is supported or not and
>> @@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> */
>> static int __folio_split(struct folio *folio, unsigned int new_order,
>> struct page *split_at, struct page *lock_at,
>> - struct list_head *list, bool uniform_split, bool unmapped)
>> + struct list_head *list, bool uniform_split)
>> {
>> struct deferred_split *ds_queue;
>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> * is taken to serialise against parallel split or collapse
>> * operations.
>> */
>> - if (!unmapped) {
>> - anon_vma = folio_get_anon_vma(folio);
>> - if (!anon_vma) {
>> - ret = -EBUSY;
>> - goto out;
>> - }
>> - anon_vma_lock_write(anon_vma);
>> + anon_vma = folio_get_anon_vma(folio);
>> + if (!anon_vma) {
>> + ret = -EBUSY;
>> + goto out;
>> }
>> + anon_vma_lock_write(anon_vma);
>> mapping = NULL;
>> } else {
>> unsigned int min_order;
>> @@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> goto out_unlock;
>> }
>>
>> - if (!unmapped)
>> - unmap_folio(folio);
>> + unmap_folio(folio);
>>
>> /* block interrupt reentry in xa_lock and spinlock */
>> local_irq_disable();
>> @@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>> folio_ref_unfreeze(new_folio, expected_refs);
>>
>> - if (!unmapped)
>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>
>> /*
>> * Anonymous folio with swap cache.
>> @@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> local_irq_enable();
>>
>> - if (unmapped)
>> - return ret;
>> -
>> if (nr_shmem_dropped)
>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>
>> @@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> return ret;
>> }
>>
>> +/*
>> + * This function is a helper for splitting folios that have already been unmapped.
>> + * The use case is that the device or the CPU can refuse to migrate THP pages in
>> + * the middle of migration, due to allocation issues on either side
>> + *
>> + * The high level code is copied from __folio_split, since the pages are anonymous
>> + * and are already isolated from the LRU, the code has been simplified to not
>> + * burden __folio_split with unmapped sprinkled into the code.
>
> I wonder if it makes sense to remove CPU side folio from both deferred_split queue
> and swap cache before migration to further simplify split_unmapped_folio_to_order().
> Basically require that device private folios cannot be on deferred_split queue nor
> swap cache.
>
This API can be called for non-device private folios as well. Device private folios are
already not on the deferred queue. The use case is
1. Migrate a large folio page from CPU to Device
2. SRC - CPU has a THP (large folio page)
3. DST - Device cannot allocate a large page, hence split the SRC page
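To make step 3 concrete, here is a rough sketch of the driver-side fallback.
It is only a fragment: "migrate" is the struct migrate_vma being populated,
the dmirror_alloc_*() helpers are hypothetical stand-ins for whatever
allocator the driver uses, and error/tail-entry handling is omitted; only the
MIGRATE_PFN_* handling reflects the series.

	for (i = 0; i < migrate->npages; i++) {
		struct page *dpage = NULL;

		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		if (migrate->src[i] & MIGRATE_PFN_COMPOUND) {
			/* hypothetical helper for a large device page */
			dpage = dmirror_alloc_huge();
			if (dpage) {
				migrate->dst[i] = migrate_pfn(page_to_pfn(dpage)) |
						  MIGRATE_PFN_COMPOUND;
				continue;
			}
			/*
			 * Large allocation failed: leave MIGRATE_PFN_COMPOUND
			 * clear in dst, so the core code splits the SRC folio
			 * (see migrate_vma_split_unmapped_folio() above).
			 */
		}

		/* hypothetical helper for a base device page */
		dpage = dmirror_alloc_page();
		if (dpage)
			migrate->dst[i] = migrate_pfn(page_to_pfn(dpage));
	}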
[...]
Thanks for the review!
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 04/16] mm/rmap: extend rmap and migration support device-private entries
2025-10-01 6:56 ` [v7 04/16] mm/rmap: extend rmap and migration support device-private entries Balbir Singh
@ 2025-10-22 11:54 ` Lance Yang
0 siblings, 0 replies; 75+ messages in thread
From: Lance Yang @ 2025-10-22 11:54 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, dri-devel, linux-mm, akpm, David Hildenbrand,
Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast, SeongJae Park
On Wed, Oct 1, 2025 at 3:25 PM Balbir Singh <balbirs@nvidia.com> wrote:
>
> Add device-private THP support to reverse mapping infrastructure, enabling
> proper handling during migration and walk operations.
>
> The key changes are:
> - add_migration_pmd()/remove_migration_pmd(): Handle device-private
> entries during folio migration and splitting
> - page_vma_mapped_walk(): Recognize device-private THP entries during
> VMA traversal operations
>
> This change supports folio splitting and migration operations on
> device-private entries.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> Cc: Rakie Kim <rakie.kim@sk.com>
> Cc: Byungchul Park <byungchul@sk.com>
> Cc: Gregory Price <gourry@gourry.net>
> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> Reviewed-by: SeongJae Park <sj@kernel.org>
> ---
> mm/damon/ops-common.c | 20 +++++++++++++++++---
> mm/huge_memory.c | 16 +++++++++++++++-
> mm/page_idle.c | 7 +++++--
> mm/page_vma_mapped.c | 7 +++++++
> mm/rmap.c | 24 ++++++++++++++++++++----
> 5 files changed, 64 insertions(+), 10 deletions(-)
>
> diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> index 998c5180a603..ac54bf5b2623 100644
> --- a/mm/damon/ops-common.c
> +++ b/mm/damon/ops-common.c
> @@ -75,12 +75,24 @@ void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr
> void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr)
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - struct folio *folio = damon_get_folio(pmd_pfn(pmdp_get(pmd)));
> + pmd_t pmdval = pmdp_get(pmd);
> + struct folio *folio;
> + bool young = false;
> + unsigned long pfn;
> +
> + if (likely(pmd_present(pmdval)))
> + pfn = pmd_pfn(pmdval);
> + else
> + pfn = swp_offset_pfn(pmd_to_swp_entry(pmdval));
>
> + folio = damon_get_folio(pfn);
> if (!folio)
> return;
>
> - if (pmdp_clear_young_notify(vma, addr, pmd))
> + if (likely(pmd_present(pmdval)))
> + young |= pmdp_clear_young_notify(vma, addr, pmd);
> + young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + HPAGE_PMD_SIZE);
> + if (young)
> folio_set_young(folio);
>
> folio_set_idle(folio);
> @@ -203,7 +215,9 @@ static bool damon_folio_young_one(struct folio *folio,
> mmu_notifier_test_young(vma->vm_mm, addr);
> } else {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - *accessed = pmd_young(pmdp_get(pvmw.pmd)) ||
> + pmd_t pmd = pmdp_get(pvmw.pmd);
> +
> + *accessed = (pmd_present(pmd) && pmd_young(pmd)) ||
> !folio_test_idle(folio) ||
> mmu_notifier_test_young(vma->vm_mm, addr);
> #else
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8e0a1747762d..483b8341ce22 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4628,7 +4628,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> return 0;
>
> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> - pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + if (unlikely(!pmd_present(*pvmw->pmd)))
> + pmdval = pmdp_huge_get_and_clear(vma->vm_mm, address, pvmw->pmd);
> + else
> + pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>
> /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
> anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
> @@ -4678,6 +4681,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> entry = pmd_to_swp_entry(*pvmw->pmd);
> folio_get(folio);
> pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
> +
> + if (folio_is_device_private(folio)) {
> + if (pmd_write(pmde))
> + entry = make_writable_device_private_entry(
> + page_to_pfn(new));
> + else
> + entry = make_readable_device_private_entry(
> + page_to_pfn(new));
> + pmde = swp_entry_to_pmd(entry);
> + }
> +
> if (pmd_swp_soft_dirty(*pvmw->pmd))
> pmde = pmd_mksoft_dirty(pmde);
> if (is_writable_migration_entry(entry))
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index a82b340dc204..d4299de81031 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -71,8 +71,11 @@ static bool page_idle_clear_pte_refs_one(struct folio *folio,
> referenced |= ptep_test_and_clear_young(vma, addr, pvmw.pte);
> referenced |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> - if (pmdp_clear_young_notify(vma, addr, pvmw.pmd))
> - referenced = true;
> + pmd_t pmdval = pmdp_get(pvmw.pmd);
> +
> + if (likely(pmd_present(pmdval)))
> + referenced |= pmdp_clear_young_notify(vma, addr, pvmw.pmd);
> + referenced |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE);
> } else {
> /* unexpected pmd-mapped page? */
> WARN_ON_ONCE(1);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index c498a91b6706..137ce27ff68c 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -277,6 +277,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cannot return prematurely, while zap_huge_pmd() has
> * cleared *pmd but not decremented compound_mapcount().
> */
> + swp_entry_t entry = pmd_to_swp_entry(pmde);
> +
> + if (is_device_private_entry(entry)) {
> + pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> + return true;
> + }
> +
We could make this simpler:
if (is_device_private_entry(pmd_to_swp_entry(pmde))) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
return true;
}
Thanks,
Lance
> if ((pvmw->flags & PVMW_SYNC) &&
> thp_vma_suitable_order(vma, pvmw->address,
> PMD_ORDER) &&
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9bab13429975..c3fc30cf3636 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1046,9 +1046,16 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
> } else {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> pmd_t *pmd = pvmw->pmd;
> - pmd_t entry;
> + pmd_t entry = pmdp_get(pmd);
>
> - if (!pmd_dirty(*pmd) && !pmd_write(*pmd))
> + /*
> + * Please see the comment above (!pte_present).
> + * A non present PMD is not writable from a CPU
> + * perspective.
> + */
> + if (!pmd_present(entry))
> + continue;
> + if (!pmd_dirty(entry) && !pmd_write(entry))
> continue;
>
> flush_cache_range(vma, address,
> @@ -2343,6 +2350,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> while (page_vma_mapped_walk(&pvmw)) {
> /* PMD-mapped THP migration entry */
> if (!pvmw.pte) {
> + __maybe_unused unsigned long pfn;
> + __maybe_unused pmd_t pmdval;
> +
> if (flags & TTU_SPLIT_HUGE_PMD) {
> split_huge_pmd_locked(vma, pvmw.address,
> pvmw.pmd, true);
> @@ -2351,8 +2361,14 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> break;
> }
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> - subpage = folio_page(folio,
> - pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> + pmdval = pmdp_get(pvmw.pmd);
> + if (likely(pmd_present(pmdval)))
> + pfn = pmd_pfn(pmdval);
> + else
> + pfn = swp_offset_pfn(pmd_to_swp_entry(pmdval));
> +
> + subpage = folio_page(folio, pfn - folio_pfn(folio));
> +
> VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
> !folio_test_pmd_mappable(folio), folio);
>
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-22 7:16 ` Balbir Singh
@ 2025-10-22 15:26 ` Zi Yan
2025-10-28 9:32 ` Balbir Singh
0 siblings, 1 reply; 75+ messages in thread
From: Zi Yan @ 2025-10-22 15:26 UTC (permalink / raw)
To: Balbir Singh
Cc: Wei Yang, linux-kernel, dri-devel, linux-mm, akpm,
David Hildenbrand, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Matthew Brost, Francois Dugast
On 22 Oct 2025, at 3:16, Balbir Singh wrote:
> On 10/22/25 13:59, Zi Yan wrote:
>> On 21 Oct 2025, at 17:34, Balbir Singh wrote:
>>
>>> On 10/20/25 09:59, Zi Yan wrote:
>>>> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>>>>
>>>>> On 10/19/25 19:19, Wei Yang wrote:
>>>>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>>>>> [...]
>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>> + struct list_head *list, bool uniform_split, bool unmapped)
>>>>>>> {
>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>> * operations.
>>>>>>> */
>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>> - if (!anon_vma) {
>>>>>>> - ret = -EBUSY;
>>>>>>> - goto out;
>>>>>>> + if (!unmapped) {
>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>> + if (!anon_vma) {
>>>>>>> + ret = -EBUSY;
>>>>>>> + goto out;
>>>>>>> + }
>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>> }
>>>>>>> mapping = NULL;
>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>> } else {
>>>>>>> unsigned int min_order;
>>>>>>> gfp_t gfp;
>>>>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> goto out_unlock;
>>>>>>> }
>>>>>>>
>>>>>>> - unmap_folio(folio);
>>>>>>> + if (!unmapped)
>>>>>>> + unmap_folio(folio);
>>>>>>>
>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>> local_irq_disable();
>>>>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>
>>>>>>> next = folio_next(new_folio);
>>>>>>>
>>>>>>> + zone_device_private_split_cb(folio, new_folio);
>>>>>>> +
>>>>>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>>>>>
>>>>>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>> + if (!unmapped)
>>>>>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>>
>>>>>>> /*
>>>>>>> * Anonymous folio with swap cache.
>>>>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> __filemap_remove_folio(new_folio, NULL);
>>>>>>> folio_put_refs(new_folio, nr_pages);
>>>>>>> }
>>>>>>> +
>>>>>>> + zone_device_private_split_cb(folio, NULL);
>>>>>>> /*
>>>>>>> * Unfreeze @folio only after all page cache entries, which
>>>>>>> * used to point to it, have been updated with new folios.
>>>>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>
>>>>>>> local_irq_enable();
>>>>>>>
>>>>>>> + if (unmapped)
>>>>>>> + return ret;
>>>>>>
>>>>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>>>>> mentioned:
>>>>>>
>>>>>> * The large folio must be locked
>>>>>> * After splitting, the after-split folio containing @lock_at remains locked
>>>>>>
>>>>>> But here we seem to change the prerequisites.
>>>>>>
>>>>>> Hmm.. I am not sure this is correct.
>>>>>>
>>>>>
>>>>> The code is correct, but you are right in that the documentation needs to be updated.
>>>>> When "unmapped", we do want to leave the folios locked after the split.
>>>>
>>>> Sigh, this "unmapped" code needs so many special branches and a different locking
>>>> requirement. It should be a separate function to avoid confusions.
>>>>
>>>
>>> Yep, I have a patch for it, I am also waiting on Matthew's feedback, FYI, here is
>>> a WIP patch that can be applied on top of the series
>>
>> Nice cleanup! Thanks.
>>
>>>
>>> ---
>>> include/linux/huge_mm.h | 5 +-
>>> mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++------
>>> mm/migrate_device.c | 3 +-
>>> 3 files changed, 120 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index c4a811958cda..86e1cefaf391 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>
>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> - unsigned int new_order, bool unmapped);
>>> + unsigned int new_order);
>>> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
>>> int min_order_for_split(struct folio *folio);
>>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>>> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> @@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>> static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> unsigned int new_order)
>>> {
>>> - return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>> + return __split_huge_page_to_list_to_order(page, list, new_order);
>>> }
>>>
>>> /*
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 8c82a0ac6e69..e20cbf68d037 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> * @lock_at: a page within @folio to be left locked to caller
>>> * @list: after-split folios will be put on it if non NULL
>>> * @uniform_split: perform uniform split or not (non-uniform split)
>>> - * @unmapped: The pages are already unmapped, they are migration entries.
>>> *
>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>> * It is in charge of checking whether the split is supported or not and
>>> @@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> */
>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>> struct page *split_at, struct page *lock_at,
>>> - struct list_head *list, bool uniform_split, bool unmapped)
>>> + struct list_head *list, bool uniform_split)
>>> {
>>> struct deferred_split *ds_queue;
>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> * is taken to serialise against parallel split or collapse
>>> * operations.
>>> */
>>> - if (!unmapped) {
>>> - anon_vma = folio_get_anon_vma(folio);
>>> - if (!anon_vma) {
>>> - ret = -EBUSY;
>>> - goto out;
>>> - }
>>> - anon_vma_lock_write(anon_vma);
>>> + anon_vma = folio_get_anon_vma(folio);
>>> + if (!anon_vma) {
>>> + ret = -EBUSY;
>>> + goto out;
>>> }
>>> + anon_vma_lock_write(anon_vma);
>>> mapping = NULL;
>>> } else {
>>> unsigned int min_order;
>>> @@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> goto out_unlock;
>>> }
>>>
>>> - if (!unmapped)
>>> - unmap_folio(folio);
>>> + unmap_folio(folio);
>>>
>>> /* block interrupt reentry in xa_lock and spinlock */
>>> local_irq_disable();
>>> @@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>
>>> - if (!unmapped)
>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>
>>> /*
>>> * Anonymous folio with swap cache.
>>> @@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> local_irq_enable();
>>>
>>> - if (unmapped)
>>> - return ret;
>>> -
>>> if (nr_shmem_dropped)
>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>
>>> @@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> return ret;
>>> }
>>>
>>> +/*
>>> + * This function is a helper for splitting folios that have already been unmapped.
>>> + * The use case is that the device or the CPU can refuse to migrate THP pages in
>>> + * the middle of migration, due to allocation issues on either side
>>> + *
>>> + * The high level code is copied from __folio_split, since the pages are anonymous
>>> + * and are already isolated from the LRU, the code has been simplified to not
>>> + * burden __folio_split with unmapped sprinkled into the code.
>>
>> I wonder if it makes sense to remove CPU side folio from both deferred_split queue
>> and swap cache before migration to further simplify split_unmapped_folio_to_order().
>> Basically require that device private folios cannot be on deferred_split queue nor
>> swap cache.
>>
>
> This API can be called for non-device private folios as well. Device private folios are
> already not on the deferred queue. The use case is
>
> 1. Migrate a large folio page from CPU to Device
> 2. SRC - CPU has a THP (large folio page)
> 3. DST - Device cannot allocate a large page, hence split the SRC page
Right, that is what I am talking about; sorry I was not clear.
I mean that when migrating a large folio from the CPU to the device, the CPU
large folio can first be removed from the deferred_split queue and the swap
cache (if it is there), and only then does the migration process begin, so
that the CPU large folio is always off the deferred_split queue and out of
the swap cache. As a result, this split function would not need to handle
these two situations.
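A minimal sketch of that idea, as I read it (the function name is made up; it
reuses the split-queue helpers from the WIP patch above, and the swap cache
side is not shown):

/* Called on the CPU folio before migrate_vma_setup() unmaps it. */
static void migrate_prepare_large_folio(struct folio *folio)
{
	struct deferred_split *ds_queue;
	int order = folio_order(folio);

	if (order <= 1)
		return;

	local_irq_disable();
	ds_queue = folio_split_queue_lock(folio);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		list_del_init(&folio->_deferred_list);
	}
	if (folio_test_partially_mapped(folio)) {
		folio_clear_partially_mapped(folio);
		mod_mthp_stat(order,
			      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
	}
	split_queue_unlock(ds_queue);
	local_irq_enable();
}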
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-20 8:41 ` David Hildenbrand
2025-10-20 9:04 ` Claudio Imbrenda
@ 2025-10-27 16:47 ` Claudio Imbrenda
2025-10-27 16:59 ` David Hildenbrand
2025-10-27 17:06 ` Christian Borntraeger
1 sibling, 2 replies; 75+ messages in thread
From: Claudio Imbrenda @ 2025-10-27 16:47 UTC (permalink / raw)
To: David Hildenbrand
Cc: Christian Borntraeger, Balbir Singh, Liam.Howlett, airlied, akpm,
apopple, baohua, baolin.wang, byungchul, dakr, dev.jain,
dri-devel, francois.dugast, gourry, joshua.hahnjy, linux-kernel,
linux-mm, lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next
On Mon, 20 Oct 2025 10:41:28 +0200
David Hildenbrand <david@redhat.com> wrote:
> On 20.10.25 09:00, Christian Borntraeger wrote:
> > Am 17.10.25 um 23:56 schrieb Balbir Singh:
> >
> >> In the meanwhile, does this fix/workaround work?
> >>
> >> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> >> index 0c847cdf4fd3..31c1754d5bd4 100644
> >> --- a/mm/pgtable-generic.c
> >> +++ b/mm/pgtable-generic.c
> >> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> >>
> >> if (pmdvalp)
> >> *pmdvalp = pmdval;
> >> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
> >> + if (unlikely(pmd_none(pmdval) || is_pmd_non_present_folio_entry(pmdval)))
> >> goto nomap;
> >> if (unlikely(pmd_trans_huge(pmdval)))
> >> goto nomap;
> >>
> >
> > Yes, this seems to work.
>
> Right, but that's not what we will want here. We'll have to adjust s390x
> gmap code (which is getting redesigned either way) to only take the page
> lock.
>
> In the end, we'll want here later a single
>
> if (!pmd_present(pmdval))
> goto nomap;
>
this seems to do the trick:
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 8ff6bba107e8..22c448b32340 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -599,8 +599,9 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
| _SEGMENT_ENTRY_GMAP_UC
| _SEGMENT_ENTRY;
} else
- *table = pmd_val(*pmd) &
- _SEGMENT_ENTRY_HARDWARE_BITS;
+ *table = (pmd_val(*pmd) &
+ _SEGMENT_ENTRY_HARDWARE_BITS)
+ | _SEGMENT_ENTRY;
}
} else if (*table & _SEGMENT_ENTRY_PROTECT &&
!(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
it marks non-leaf gmap segment (pmd) entries as present, just as normal
pmds would be.
I think it's a good enough fix for now, pending the rewrite, which I
hope to get in the next merge window
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-27 16:47 ` Claudio Imbrenda
@ 2025-10-27 16:59 ` David Hildenbrand
2025-10-27 17:06 ` Christian Borntraeger
1 sibling, 0 replies; 75+ messages in thread
From: David Hildenbrand @ 2025-10-27 16:59 UTC (permalink / raw)
To: Claudio Imbrenda
Cc: Christian Borntraeger, Balbir Singh, Liam.Howlett, airlied, akpm,
apopple, baohua, baolin.wang, byungchul, dakr, dev.jain,
dri-devel, francois.dugast, gourry, joshua.hahnjy, linux-kernel,
linux-mm, lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next
On 27.10.25 17:47, Claudio Imbrenda wrote:
> On Mon, 20 Oct 2025 10:41:28 +0200
> David Hildenbrand <david@redhat.com> wrote:
>
>> On 20.10.25 09:00, Christian Borntraeger wrote:
>>> Am 17.10.25 um 23:56 schrieb Balbir Singh:
>>>
>>>> In the meanwhile, does this fix/workaround work?
>>>>
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index 0c847cdf4fd3..31c1754d5bd4 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>>
>>>> if (pmdvalp)
>>>> *pmdvalp = pmdval;
>>>> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>>> + if (unlikely(pmd_none(pmdval) || is_pmd_non_present_folio_entry(pmdval)))
>>>> goto nomap;
>>>> if (unlikely(pmd_trans_huge(pmdval)))
>>>> goto nomap;
>>>>
>>>
>>> Yes, this seems to work.
>>
>> Right, but that's not what we will want here. We'll have to adjust s390x
>> gmap code (which is getting redesigned either way) to only take the page
>> lock.
>>
>> In the end, we'll want here later a single
>>
>> if (!pmd_present(pmdval))
>> goto nomap;
>>
>
> this seems to do the trick:
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 8ff6bba107e8..22c448b32340 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -599,8 +599,9 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
> | _SEGMENT_ENTRY_GMAP_UC
> | _SEGMENT_ENTRY;
> } else
> - *table = pmd_val(*pmd) &
> - _SEGMENT_ENTRY_HARDWARE_BITS;
> + *table = (pmd_val(*pmd) &
> + _SEGMENT_ENTRY_HARDWARE_BITS)
> + | _SEGMENT_ENTRY;
Probably worth adding a comment. I remember we don't reuse this bit as a
SW bit in gmap code, right?
> }
> } else if (*table & _SEGMENT_ENTRY_PROTECT &&
> !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
>
>
>
> it marks non-leaf gmap segment (pmd) entries as present, just as normal
> pmds would be.
Yeah, I looked into hand-coding the PTL lookup but it just gets nasty
real quick.
>
> I think it's a good enough fix for now, pending the rewrite, which I
> hope to get in the next merge window
Agreed.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-27 16:47 ` Claudio Imbrenda
2025-10-27 16:59 ` David Hildenbrand
@ 2025-10-27 17:06 ` Christian Borntraeger
2025-10-28 9:24 ` Balbir Singh
2025-10-28 13:01 ` [PATCH v1 0/1] KVM: s390: Fix missing present bit for gmap puds Claudio Imbrenda
1 sibling, 2 replies; 75+ messages in thread
From: Christian Borntraeger @ 2025-10-27 17:06 UTC (permalink / raw)
To: Claudio Imbrenda, David Hildenbrand
Cc: Balbir Singh, Liam.Howlett, airlied, akpm, apopple, baohua,
baolin.wang, byungchul, dakr, dev.jain, dri-devel,
francois.dugast, gourry, joshua.hahnjy, linux-kernel, linux-mm,
lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev
Am 27.10.25 um 17:47 schrieb Claudio Imbrenda:
> On Mon, 20 Oct 2025 10:41:28 +0200
> David Hildenbrand <david@redhat.com> wrote:
>
>> On 20.10.25 09:00, Christian Borntraeger wrote:
>>> Am 17.10.25 um 23:56 schrieb Balbir Singh:
>>>
>>>> In the meanwhile, does this fix/workaround work?
>>>>
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index 0c847cdf4fd3..31c1754d5bd4 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>>
>>>> if (pmdvalp)
>>>> *pmdvalp = pmdval;
>>>> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>>> + if (unlikely(pmd_none(pmdval) || is_pmd_non_present_folio_entry(pmdval)))
>>>> goto nomap;
>>>> if (unlikely(pmd_trans_huge(pmdval)))
>>>> goto nomap;
>>>>
>>>
>>> Yes, this seems to work.
>>
>> Right, but that's not what we will want here. We'll have to adjust s390x
>> gmap code (which is getting redesigned either way) to only take the page
>> lock.
>>
>> In the end, we'll want here later a single
>>
>> if (!pmd_present(pmdval))
>> goto nomap;
>>
>
> this seems to do the trick:
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 8ff6bba107e8..22c448b32340 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -599,8 +599,9 @@ int __gmap_link(struct gmap *gmap, unsigned long
> gaddr, unsigned long vmaddr) | _SEGMENT_ENTRY_GMAP_UC
> | _SEGMENT_ENTRY;
> } else
> - *table = pmd_val(*pmd) &
> - _SEGMENT_ENTRY_HARDWARE_BITS;
> + *table = (pmd_val(*pmd) &
> + _SEGMENT_ENTRY_HARDWARE_BITS)
> + | _SEGMENT_ENTRY;
> }
> } else if (*table & _SEGMENT_ENTRY_PROTECT &&
> !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
>
>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
can you send a proper patch? I guess we should add it to Andrew's mm tree to keep it close to the patch that uncovered the issue.
s390 maintainers cced.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: linux-next: KVM/s390x regression
2025-10-27 17:06 ` Christian Borntraeger
@ 2025-10-28 9:24 ` Balbir Singh
2025-10-28 13:01 ` [PATCH v1 0/1] KVM: s390: Fix missing present bit for gmap puds Claudio Imbrenda
1 sibling, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-28 9:24 UTC (permalink / raw)
To: Christian Borntraeger, Claudio Imbrenda, David Hildenbrand
Cc: Liam.Howlett, airlied, akpm, apopple, baohua, baolin.wang,
byungchul, dakr, dev.jain, dri-devel, francois.dugast, gourry,
joshua.hahnjy, linux-kernel, linux-mm, lorenzo.stoakes, lyude,
matthew.brost, mpenttil, npache, osalvador, rakie.kim, rcampbell,
ryan.roberts, simona, ying.huang, ziy, kvm, linux-s390,
linux-next, Heiko Carstens, Vasily Gorbik, Alexander Gordeev
On 10/28/25 04:06, Christian Borntraeger wrote:
> Am 27.10.25 um 17:47 schrieb Claudio Imbrenda:
>> On Mon, 20 Oct 2025 10:41:28 +0200
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 20.10.25 09:00, Christian Borntraeger wrote:
>>>> Am 17.10.25 um 23:56 schrieb Balbir Singh:
>>>>
>>>>> In the meanwhile, does this fix/workaround work?
>>>>>
>>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>>> index 0c847cdf4fd3..31c1754d5bd4 100644
>>>>> --- a/mm/pgtable-generic.c
>>>>> +++ b/mm/pgtable-generic.c
>>>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>>> if (pmdvalp)
>>>>> *pmdvalp = pmdval;
>>>>> - if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>>>> + if (unlikely(pmd_none(pmdval) || is_pmd_non_present_folio_entry(pmdval)))
>>>>> goto nomap;
>>>>> if (unlikely(pmd_trans_huge(pmdval)))
>>>>> goto nomap;
>>>>>
>>>>
>>>> Yes, this seems to work.
>>>
>>> Right, but that's not what we will want here. We'll have to adjust s390x
>>> gmap code (which is getting redesigned either way) to only take the page
>>> lock.
>>>
>>> In the end, we'll want here later a single
>>>
>>> if (!pmd_present(pmdval))
>>> goto nomap;
>>>
>>
>> this seems to do the trick:
>>
>> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
>> index 8ff6bba107e8..22c448b32340 100644
>> --- a/arch/s390/mm/gmap.c
>> +++ b/arch/s390/mm/gmap.c
>> @@ -599,8 +599,9 @@ int __gmap_link(struct gmap *gmap, unsigned long
>> gaddr, unsigned long vmaddr) | _SEGMENT_ENTRY_GMAP_UC
>> | _SEGMENT_ENTRY;
>> } else
>> - *table = pmd_val(*pmd) &
>> - _SEGMENT_ENTRY_HARDWARE_BITS;
>> + *table = (pmd_val(*pmd) &
>> + _SEGMENT_ENTRY_HARDWARE_BITS)
>> + | _SEGMENT_ENTRY;
>> }
>> } else if (*table & _SEGMENT_ENTRY_PROTECT &&
>> !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
>>
>>
>
> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>
> can you send a proper patch? I guess we should add it to Andrew's mm tree to keep it close to the patch that uncovered the issue.
> s390 maintainers cced.
Thanks for finding the fix. Ideally, we want this fix just before my series if possible!
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
2025-10-22 15:26 ` Zi Yan
@ 2025-10-28 9:32 ` Balbir Singh
0 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-28 9:32 UTC (permalink / raw)
To: Zi Yan
Cc: Wei Yang, linux-kernel, dri-devel, linux-mm, akpm,
David Hildenbrand, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Matthew Brost, Francois Dugast
On 10/23/25 02:26, Zi Yan wrote:
> On 22 Oct 2025, at 3:16, Balbir Singh wrote:
>
>> On 10/22/25 13:59, Zi Yan wrote:
>>> On 21 Oct 2025, at 17:34, Balbir Singh wrote:
>>>
>>>> On 10/20/25 09:59, Zi Yan wrote:
>>>>> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>>>>>
>>>>>> On 10/19/25 19:19, Wei Yang wrote:
>>>>>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>>>>>> [...]
>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>> + struct list_head *list, bool uniform_split, bool unmapped)
>>>>>>>> {
>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>> * operations.
>>>>>>>> */
>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>> - if (!anon_vma) {
>>>>>>>> - ret = -EBUSY;
>>>>>>>> - goto out;
>>>>>>>> + if (!unmapped) {
>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>> + if (!anon_vma) {
>>>>>>>> + ret = -EBUSY;
>>>>>>>> + goto out;
>>>>>>>> + }
>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>> }
>>>>>>>> mapping = NULL;
>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>> } else {
>>>>>>>> unsigned int min_order;
>>>>>>>> gfp_t gfp;
>>>>>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> goto out_unlock;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - unmap_folio(folio);
>>>>>>>> + if (!unmapped)
>>>>>>>> + unmap_folio(folio);
>>>>>>>>
>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>> local_irq_disable();
>>>>>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>
>>>>>>>> next = folio_next(new_folio);
>>>>>>>>
>>>>>>>> + zone_device_private_split_cb(folio, new_folio);
>>>>>>>> +
>>>>>>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>>>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>>>>>>
>>>>>>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>>> + if (!unmapped)
>>>>>>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>>>
>>>>>>>> /*
>>>>>>>> * Anonymous folio with swap cache.
>>>>>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> __filemap_remove_folio(new_folio, NULL);
>>>>>>>> folio_put_refs(new_folio, nr_pages);
>>>>>>>> }
>>>>>>>> +
>>>>>>>> + zone_device_private_split_cb(folio, NULL);
>>>>>>>> /*
>>>>>>>> * Unfreeze @folio only after all page cache entries, which
>>>>>>>> * used to point to it, have been updated with new folios.
>>>>>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>
>>>>>>>> local_irq_enable();
>>>>>>>>
>>>>>>>> + if (unmapped)
>>>>>>>> + return ret;
>>>>>>>
>>>>>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>>>>>> mentioned:
>>>>>>>
>>>>>>> * The large folio must be locked
>>>>>>> * After splitting, the after-split folio containing @lock_at remains locked
>>>>>>>
>>>>>>> But here we seems to change the prerequisites.
>>>>>>>
>>>>>>> Hmm.. I am not sure this is correct.
>>>>>>>
>>>>>>
>>>>>> The code is correct, but you are right in that the documentation needs to be updated.
>>>>>> When "unmapped", we do want to leave the folios locked after the split.
>>>>>
>>>>> Sigh, this "unmapped" code needs so many special branches and a different locking
>>>>> requirement. It should be a separate function to avoid confusions.
>>>>>
>>>>
>>>> Yep, I have a patch for it, I am also waiting on Matthew's feedback, FYI, here is
>>>> a WIP patch that can be applied on top of the series
>>>
>>> Nice cleanup! Thanks.
>>>
>>>>
>>>> ---
>>>> include/linux/huge_mm.h | 5 +-
>>>> mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++------
>>>> mm/migrate_device.c | 3 +-
>>>> 3 files changed, 120 insertions(+), 25 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index c4a811958cda..86e1cefaf391 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>
>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> - unsigned int new_order, bool unmapped);
>>>> + unsigned int new_order);
>>>> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
>>>> int min_order_for_split(struct folio *folio);
>>>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> @@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>> static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> unsigned int new_order)
>>>> {
>>>> - return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>>> + return __split_huge_page_to_list_to_order(page, list, new_order);
>>>> }
>>>>
>>>> /*
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 8c82a0ac6e69..e20cbf68d037 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> * @lock_at: a page within @folio to be left locked to caller
>>>> * @list: after-split folios will be put on it if non NULL
>>>> * @uniform_split: perform uniform split or not (non-uniform split)
>>>> - * @unmapped: The pages are already unmapped, they are migration entries.
>>>> *
>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>> * It is in charge of checking whether the split is supported or not and
>>>> @@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> */
>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> struct page *split_at, struct page *lock_at,
>>>> - struct list_head *list, bool uniform_split, bool unmapped)
>>>> + struct list_head *list, bool uniform_split)
>>>> {
>>>> struct deferred_split *ds_queue;
>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> @@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> * is taken to serialise against parallel split or collapse
>>>> * operations.
>>>> */
>>>> - if (!unmapped) {
>>>> - anon_vma = folio_get_anon_vma(folio);
>>>> - if (!anon_vma) {
>>>> - ret = -EBUSY;
>>>> - goto out;
>>>> - }
>>>> - anon_vma_lock_write(anon_vma);
>>>> + anon_vma = folio_get_anon_vma(folio);
>>>> + if (!anon_vma) {
>>>> + ret = -EBUSY;
>>>> + goto out;
>>>> }
>>>> + anon_vma_lock_write(anon_vma);
>>>> mapping = NULL;
>>>> } else {
>>>> unsigned int min_order;
>>>> @@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> goto out_unlock;
>>>> }
>>>>
>>>> - if (!unmapped)
>>>> - unmap_folio(folio);
>>>> + unmap_folio(folio);
>>>>
>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>> local_irq_disable();
>>>> @@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>> folio_ref_unfreeze(new_folio, expected_refs);
>>>>
>>>> - if (!unmapped)
>>>> - lru_add_split_folio(folio, new_folio, lruvec, list);
>>>> + lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>
>>>> /*
>>>> * Anonymous folio with swap cache.
>>>> @@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>> local_irq_enable();
>>>>
>>>> - if (unmapped)
>>>> - return ret;
>>>> -
>>>> if (nr_shmem_dropped)
>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>
>>>> @@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> return ret;
>>>> }
>>>>
>>>> +/*
>>>> + * This function is a helper for splitting folios that have already been unmapped.
>>>> + * The use case is that the device or the CPU can refuse to migrate THP pages in
>>>> + * the middle of migration, due to allocation issues on either side
>>>> + *
>>>> + * The high level code is copied from __folio_split, since the pages are anonymous
>>>> + * and are already isolated from the LRU, the code has been simplified to not
>>>> + * burden __folio_split with unmapped sprinkled into the code.
>>>
>>> I wonder if it makes sense to remove CPU side folio from both deferred_split queue
>>> and swap cache before migration to further simplify split_unmapped_folio_to_order().
>>> Basically require that device private folios cannot be on deferred_split queue nor
>>> swap cache.
>>>
>>
>> This API can be called for non-device private folios as well. Device private folios are
>> already not on the deferred queue. The use case is
>>
>> 1. Migrate a large folio page from CPU to Device
>> 2. SRC - CPU has a THP (large folio page)
>> 3. DST - Device cannot allocate a large page, hence split the SRC page
>
> Right. That is what I am talking about, sorry I was not clear.
> I mean when migrating a large folio from CPU to device, the CPU large folio
> can be first removed from deferred_split queue and swap cache, if it is there,
> then the migration process begins, so that the CPU large folio will always
> be out of deferred_split queue and not in swap cache. As a result, this split
> function does not need to handle these two situations.
It leads to more specialization for LRU and zone device-private folios.
I looked at giving it a go and ended up with some amount of duplication:
taking the lock, freezing the refs, and, prior to that, finding the
extra_refs with can_split_folio().
Balbir Singh
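To illustrate the fallback described above, a minimal sketch of the
driver-side allocation loop (MIGRATE_PFN_COMPOUND is the flag this series
introduces; the demo_alloc_*() helpers are hypothetical stand-ins for a
driver's allocator):

	static void demo_migrate_alloc_dst(struct migrate_vma *args)
	{
		unsigned long i;

		for (i = 0; i < args->npages; ) {
			struct page *dpage;

			if (args->src[i] & MIGRATE_PFN_COMPOUND) {
				dpage = demo_alloc_device_thp();	/* hypothetical */
				if (dpage) {
					args->dst[i] = migrate_pfn(page_to_pfn(dpage)) |
						       MIGRATE_PFN_COMPOUND;
					i += HPAGE_PMD_NR;
					continue;
				}
				/*
				 * No large destination page: leave
				 * MIGRATE_PFN_COMPOUND clear in dst[], so the
				 * core code splits the source folio and the
				 * remaining entries are handled as base pages.
				 */
			}
			dpage = demo_alloc_device_page();		/* hypothetical */
			args->dst[i] = dpage ? migrate_pfn(page_to_pfn(dpage)) : 0;
			i++;
		}
	}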
^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v1 0/1] KVM: s390: Fix missing present bit for gmap puds
2025-10-27 17:06 ` Christian Borntraeger
2025-10-28 9:24 ` Balbir Singh
@ 2025-10-28 13:01 ` Claudio Imbrenda
2025-10-28 13:01 ` [PATCH v1 1/1] " Claudio Imbrenda
2025-10-28 22:53 ` [PATCH v1 0/1] " Andrew Morton
1 sibling, 2 replies; 75+ messages in thread
From: Claudio Imbrenda @ 2025-10-28 13:01 UTC (permalink / raw)
To: akpm
Cc: balbirs, borntraeger, david, Liam.Howlett, airlied, apopple,
baohua, baolin.wang, byungchul, dakr, dev.jain, dri-devel,
francois.dugast, gourry, joshua.hahnjy, linux-kernel, linux-mm,
lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next, hca, gor, agordeev
This patch solves the issue uncovered by patch caf527048be8
("mm/huge_memory: add device-private THP support to PMD operations"),
which is at the moment in -next.
@Andrew: do you think it's possible to squeeze this patch in -next
_before_ the patches that introduce the issue? This will guarantee that
the patch is merged first, and will not break bisections once merged.
Claudio Imbrenda (1):
KVM: s390: Fix missing present bit for gmap puds
arch/s390/mm/gmap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v1 1/1] KVM: s390: Fix missing present bit for gmap puds
2025-10-28 13:01 ` [PATCH v1 0/1] KVM: s390: Fix missing present bit for gmap puds Claudio Imbrenda
@ 2025-10-28 13:01 ` Claudio Imbrenda
2025-10-28 21:23 ` Balbir Singh
2025-10-29 10:00 ` David Hildenbrand
2025-10-28 22:53 ` [PATCH v1 0/1] " Andrew Morton
1 sibling, 2 replies; 75+ messages in thread
From: Claudio Imbrenda @ 2025-10-28 13:01 UTC (permalink / raw)
To: akpm
Cc: balbirs, borntraeger, david, Liam.Howlett, airlied, apopple,
baohua, baolin.wang, byungchul, dakr, dev.jain, dri-devel,
francois.dugast, gourry, joshua.hahnjy, linux-kernel, linux-mm,
lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next, hca, gor, agordeev
For hugetlbs, gmap puds have the present bit set. For normal puds
(which point to ptes), the bit is not set. This is in contrast to the
normal userspace puds, which always have the bit set for present pmds.
This causes issues when ___pte_offset_map() is modified to only check
for the present bit.
The solution to the problem is simply to always set the present bit for
present gmap pmds.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/lkml/20251017144924.10034-1-borntraeger@linux.ibm.com/
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
---
arch/s390/mm/gmap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 8ff6bba107e8..22c448b32340 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -599,8 +599,9 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
| _SEGMENT_ENTRY_GMAP_UC
| _SEGMENT_ENTRY;
} else
- *table = pmd_val(*pmd) &
- _SEGMENT_ENTRY_HARDWARE_BITS;
+ *table = (pmd_val(*pmd) &
+ _SEGMENT_ENTRY_HARDWARE_BITS)
+ | _SEGMENT_ENTRY;
}
} else if (*table & _SEGMENT_ENTRY_PROTECT &&
!(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
--
2.51.0
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH v1 1/1] KVM: s390: Fix missing present bit for gmap puds
2025-10-28 13:01 ` [PATCH v1 1/1] " Claudio Imbrenda
@ 2025-10-28 21:23 ` Balbir Singh
2025-10-29 10:00 ` David Hildenbrand
1 sibling, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-10-28 21:23 UTC (permalink / raw)
To: Claudio Imbrenda, akpm
Cc: borntraeger, david, Liam.Howlett, airlied, apopple, baohua,
baolin.wang, byungchul, dakr, dev.jain, dri-devel,
francois.dugast, gourry, joshua.hahnjy, linux-kernel, linux-mm,
lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next, hca, gor, agordeev
On 10/29/25 00:01, Claudio Imbrenda wrote:
> For hugetlbs, gmap puds have the present bit set. For normal puds
> (which point to ptes), the bit is not set. This is in contrast to the
> normal userspace puds, which always have the bit set for present pmds.
>
> This causes issues when ___pte_offset_map() is modified to only check
> for the present bit.
>
> The solution to the problem is simply to always set the present bit for
> present gmap pmds.
>
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Link: https://lore.kernel.org/lkml/20251017144924.10034-1-borntraeger@linux.ibm.com/
> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> ---
> arch/s390/mm/gmap.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 8ff6bba107e8..22c448b32340 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -599,8 +599,9 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
> | _SEGMENT_ENTRY_GMAP_UC
> | _SEGMENT_ENTRY;
> } else
> - *table = pmd_val(*pmd) &
> - _SEGMENT_ENTRY_HARDWARE_BITS;
> + *table = (pmd_val(*pmd) &
> + _SEGMENT_ENTRY_HARDWARE_BITS)
> + | _SEGMENT_ENTRY;
> }
> } else if (*table & _SEGMENT_ENTRY_PROTECT &&
> !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
Acked-by: Balbir Singh <balbirs@nvidia.com>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v1 0/1] KVM: s390: Fix missing present bit for gmap puds
2025-10-28 13:01 ` [PATCH v1 0/1] KVM: s390: Fix missing present bit for gmap puds Claudio Imbrenda
2025-10-28 13:01 ` [PATCH v1 1/1] " Claudio Imbrenda
@ 2025-10-28 22:53 ` Andrew Morton
1 sibling, 0 replies; 75+ messages in thread
From: Andrew Morton @ 2025-10-28 22:53 UTC (permalink / raw)
To: Claudio Imbrenda
Cc: balbirs, borntraeger, david, Liam.Howlett, airlied, apopple,
baohua, baolin.wang, byungchul, dakr, dev.jain, dri-devel,
francois.dugast, gourry, joshua.hahnjy, linux-kernel, linux-mm,
lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next, hca, gor, agordeev
On Tue, 28 Oct 2025 14:01:49 +0100 Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> @Andrew: do you think it's possible to squeeze this patch in -next
> _before_ the patches that introduce the issue? This will guarantee that
> the patch is merged first, and will not break bisections once merged.
no problem, thanks.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v1 1/1] KVM: s390: Fix missing present bit for gmap puds
2025-10-28 13:01 ` [PATCH v1 1/1] " Claudio Imbrenda
2025-10-28 21:23 ` Balbir Singh
@ 2025-10-29 10:00 ` David Hildenbrand
2025-10-29 10:20 ` Claudio Imbrenda
1 sibling, 1 reply; 75+ messages in thread
From: David Hildenbrand @ 2025-10-29 10:00 UTC (permalink / raw)
To: Claudio Imbrenda, akpm
Cc: balbirs, borntraeger, Liam.Howlett, airlied, apopple, baohua,
baolin.wang, byungchul, dakr, dev.jain, dri-devel,
francois.dugast, gourry, joshua.hahnjy, linux-kernel, linux-mm,
lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next, hca, gor, agordeev
On 28.10.25 14:01, Claudio Imbrenda wrote:
> For hugetlbs, gmap puds have the present bit set. For normal puds
> (which point to ptes), the bit is not set. This is in contrast to the
> normal userspace puds, which always have the bit set for present pmds.
>
> This causes issues when ___pte_offset_map() is modified to only check
> for the present bit.
>
> The solution to the problem is simply to always set the present bit for
> present gmap pmds.
>
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Link: https://lore.kernel.org/lkml/20251017144924.10034-1-borntraeger@linux.ibm.com/
> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> ---
> arch/s390/mm/gmap.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 8ff6bba107e8..22c448b32340 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -599,8 +599,9 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
> | _SEGMENT_ENTRY_GMAP_UC
> | _SEGMENT_ENTRY;
> } else
> - *table = pmd_val(*pmd) &
> - _SEGMENT_ENTRY_HARDWARE_BITS;
I'd add a comment here like
/* Make sure that pmd_present() will work on these entries. */
> + *table = (pmd_val(*pmd) &
> + _SEGMENT_ENTRY_HARDWARE_BITS)
> + | _SEGMENT_ENTRY;
> }
> } else if (*table & _SEGMENT_ENTRY_PROTECT &&
> !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v1 1/1] KVM: s390: Fix missing present bit for gmap puds
2025-10-29 10:00 ` David Hildenbrand
@ 2025-10-29 10:20 ` Claudio Imbrenda
0 siblings, 0 replies; 75+ messages in thread
From: Claudio Imbrenda @ 2025-10-29 10:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, balbirs, borntraeger, Liam.Howlett, airlied, apopple,
baohua, baolin.wang, byungchul, dakr, dev.jain, dri-devel,
francois.dugast, gourry, joshua.hahnjy, linux-kernel, linux-mm,
lorenzo.stoakes, lyude, matthew.brost, mpenttil, npache,
osalvador, rakie.kim, rcampbell, ryan.roberts, simona, ying.huang,
ziy, kvm, linux-s390, linux-next, hca, gor, agordeev
On Wed, 29 Oct 2025 11:00:14 +0100
David Hildenbrand <david@redhat.com> wrote:
> On 28.10.25 14:01, Claudio Imbrenda wrote:
> > For hugetlbs, gmap puds have the present bit set. For normal puds
> > (which point to ptes), the bit is not set. This is in contrast to the
> > normal userspace puds, which always have the bit set for present pmds.
> >
> > This causes issues when ___pte_offset_map() is modified to only check
> > for the present bit.
> >
> > The solution to the problem is simply to always set the present bit for
> > present gmap pmds.
> >
> > Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > Link: https://lore.kernel.org/lkml/20251017144924.10034-1-borntraeger@linux.ibm.com/
> > Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> > Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> > ---
> > arch/s390/mm/gmap.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > index 8ff6bba107e8..22c448b32340 100644
> > --- a/arch/s390/mm/gmap.c
> > +++ b/arch/s390/mm/gmap.c
> > @@ -599,8 +599,9 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
> > | _SEGMENT_ENTRY_GMAP_UC
> > | _SEGMENT_ENTRY;
> > } else
> > - *table = pmd_val(*pmd) &
> > - _SEGMENT_ENTRY_HARDWARE_BITS;
>
> I'd add a comment here like
>
> /* Make sure that pmd_present() will work on these entries. */
the whole file is going away very soon anyway
>
> > + *table = (pmd_val(*pmd) &
> > + _SEGMENT_ENTRY_HARDWARE_BITS)
> > + | _SEGMENT_ENTRY;
> > }
> > } else if (*table & _SEGMENT_ENTRY_PROTECT &&
> > !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
>
> Reviewed-by: David Hildenbrand <david@redhat.com>
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-10-09 10:33 ` Matthew Brost
2025-10-13 22:51 ` Balbir Singh
@ 2025-11-11 23:43 ` Andrew Morton
2025-11-11 23:52 ` Balbir Singh
1 sibling, 1 reply; 75+ messages in thread
From: Andrew Morton @ 2025-11-11 23:43 UTC (permalink / raw)
To: Matthew Brost
Cc: Balbir Singh, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> > >> This patch series introduces support for Transparent Huge Page
> > >> (THP) migration in zone device-private memory. The implementation enables
> > >> efficient migration of large folios between system memory and
> > >> device-private memory
> > >
> > > Lots of chatter for the v6 series, but none for v7. I hope that's a
> > > good sign.
> > >
> >
> > I hope so too, I've tried to address the comments in v6.
> >
>
> Circling back to this series, we will integrate and test this version.
How'd it go?
Balbir, what's the status here? It's been a month and this series
still has a "needs a new version" feeling to it. If so, very soon
please.
TODOs which I have noted are
https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
plus a general re-read of the
mm-migrate_device-add-thp-splitting-during-migration.patch review
discussion.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-11-11 23:43 ` Andrew Morton
@ 2025-11-11 23:52 ` Balbir Singh
2025-11-12 0:24 ` Andrew Morton
2025-11-20 2:40 ` Matthew Brost
0 siblings, 2 replies; 75+ messages in thread
From: Balbir Singh @ 2025-11-11 23:52 UTC (permalink / raw)
To: Andrew Morton, Matthew Brost
Cc: linux-kernel, dri-devel, linux-mm, David Hildenbrand, Zi Yan,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Ralph Campbell, Mika Penttilä, Francois Dugast
On 11/12/25 10:43, Andrew Morton wrote:
> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>
>>>>> This patch series introduces support for Transparent Huge Page
>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>> efficient migration of large folios between system memory and
>>>>> device-private memory
>>>>
>>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
>>>> good sign.
>>>>
>>>
>>> I hope so too, I've tried to address the comments in v6.
>>>
>>
>> Circling back to this series, we will integrate and test this version.
>
> How'd it go?
>
> Balbir, what's the status here? It's been a month and this series
> still has a "needs a new version" feeling to it. If so, very soon
> please.
>
I don't think this needs a new revision; I've been testing frequently
at my end to see if I can catch any regressions. I have a patch update for
mm-migrate_device-add-thp-splitting-during-migration.patch; it can be applied
on top, or I can send a new version of the patch. I was waiting
for feedback before sending the patch out, but I'll do it now.
> TODOs which I have noted are
>
> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
This was a clarification on the HMM patch mentioned in the changelog
> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
That's a minor comment about not using a temporary declaration; I don't think we need it, but let me know if you feel strongly.
> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
I have a patch for this, which I posted (the one mentioned above); I can update and resend it if required.
> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
>
I can't seem to open this
> plus a general re-read of the
> mm-migrate_device-add-thp-splitting-during-migration.patch review
> discussion.
>
That's the patch I have
Thanks for following up
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-11-11 23:52 ` Balbir Singh
@ 2025-11-12 0:24 ` Andrew Morton
2025-11-12 0:36 ` Balbir Singh
2025-11-20 2:40 ` Matthew Brost
1 sibling, 1 reply; 75+ messages in thread
From: Andrew Morton @ 2025-11-12 0:24 UTC (permalink / raw)
To: Balbir Singh
Cc: Matthew Brost, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On Wed, 12 Nov 2025 10:52:43 +1100 Balbir Singh <balbirs@nvidia.com> wrote:
> > https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> >
>
> I can't seem to open this
https://lore.kernel.org/all/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@nvidia.com/T/#u
> > plus a general re-read of the
> > mm-migrate_device-add-thp-splitting-during-migration.patch review
> > discussion.
> >
> That's the patch I have
>
What I meant was - please re-read the review against that patch (it was
fairly long) and double-check that everything was addressed.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-11-12 0:24 ` Andrew Morton
@ 2025-11-12 0:36 ` Balbir Singh
0 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-11-12 0:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Brost, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On 11/12/25 11:24, Andrew Morton wrote:
> On Wed, 12 Nov 2025 10:52:43 +1100 Balbir Singh <balbirs@nvidia.com> wrote:
>
>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
>>>
>>
>> I can't seem to open this
>
> https://lore.kernel.org/all/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@nvidia.com/T/#u
>
>>> plus a general re-read of the
>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
>>> discussion.
>>>
>> That's the patch I have
>>
>
> What I meant was - please re-read the review against that patch (it was
> fairly long) and double-check that everything was addressed.
>
The last discussion ended at https://lore.kernel.org/all/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com/T/#m2368cc4ed7d88b2ae99157a649a9a5877585f006
In summary, moving the cluster_info and deferred_list handling
into another function led to duplication, so I've avoided
going down that route. I'll send the new patch out shortly.
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-11-11 23:52 ` Balbir Singh
2025-11-12 0:24 ` Andrew Morton
@ 2025-11-20 2:40 ` Matthew Brost
2025-11-20 2:50 ` Balbir Singh
1 sibling, 1 reply; 75+ messages in thread
From: Matthew Brost @ 2025-11-20 2:40 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> On 11/12/25 10:43, Andrew Morton wrote:
> > On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> >
> >>>>> This patch series introduces support for Transparent Huge Page
> >>>>> (THP) migration in zone device-private memory. The implementation enables
> >>>>> efficient migration of large folios between system memory and
> >>>>> device-private memory
> >>>>
> >>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
> >>>> good sign.
> >>>>
> >>>
> >>> I hope so too, I've tried to address the comments in v6.
> >>>
> >>
> >> Circling back to this series, we will integrate and test this version.
> >
> > How'd it go?
> >
My apologies for the delay—I got distracted by other tasks in Xe (my
driver) and was out for a bit. Unfortunately, this series breaks
something in the existing core MM code for the Xe SVM implementation. I
have an extensive test case that hammers on SVM, which fully passes
prior to applying this series, but fails randomly with the series
applied (to drm-tip-rc6) due to the below kernel lockup.
I've tried to trace where the migration PTE gets installed but not
removed, and to isolate a test case that causes this failure, but no luck
so far. I'll keep digging as I have time.
Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
the same issue as above eventually occurs), but I do need two additional
core MM patches: one is new code required for Xe, and the other could be
considered a bug fix. Those patches can be included when Xe merges SVM THP
support, but at a minimum we must not break Xe SVM before this series merges.
Stack trace:
INFO: task kworker/u65:2:1642 blocked for more than 30
seconds.
[ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
[ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
[ 212.638288] Workqueue: xe_page_fault_work_queue
xe_pagefault_queue_work [xe]
[ 212.638323] Call Trace:
[ 212.638324] <TASK>
[ 212.638325] __schedule+0x4b0/0x990
[ 212.638330] schedule+0x22/0xd0
[ 212.638331] io_schedule+0x41/0x60
[ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
[ 212.638336] ? __pfx_wake_page_function+0x10/0x10
[ 212.638339] migration_entry_wait+0xd2/0xe0
[ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
[ 212.638343] walk_pgd_range+0x51d/0xa40
[ 212.638345] __walk_page_range+0x75/0x1e0
[ 212.638347] walk_page_range_mm+0x138/0x1f0
[ 212.638349] hmm_range_fault+0x59/0xa0
[ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
[ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
[ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
[ 212.638375] ? update_load_avg+0x7f/0x6c0
[ 212.638377] ? update_curr+0x13d/0x170
[ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
[ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
[ 212.638420] process_one_work+0x16e/0x2e0
[ 212.638422] worker_thread+0x284/0x410
[ 212.638423] ? __pfx_worker_thread+0x10/0x10
[ 212.638425] kthread+0xec/0x210
[ 212.638427] ? __pfx_kthread+0x10/0x10
[ 212.638428] ? __pfx_kthread+0x10/0x10
[ 212.638430] ret_from_fork+0xbd/0x100
[ 212.638433] ? __pfx_kthread+0x10/0x10
[ 212.638434] ret_from_fork_asm+0x1a/0x30
[ 212.638436] </TASK>
Matt
> > Balbir, what's the status here? It's been a month and this series
> > still has a "needs a new version" feeling to it. If so, very soon
> > please.
> >
>
> I don't think this needs a new revision, I've been testing frequently
> at my end to see if I can catch any regressions. I have a patch update for
> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
> on top or I can send a new version of the patch. I was waiting
> on any feedback before I sent the patch out, but I'll do it now.
>
> > TODOs which I have noted are
> >
> > https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
>
> This was a clarification on the HMM patch mentioned in the changelog
>
> > https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
>
> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
>
> > https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
>
> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
>
> > https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> >
>
> I can't seem to open this
>
> > plus a general re-read of the
> > mm-migrate_device-add-thp-splitting-during-migration.patch review
> > discussion.
> >
> That's the patch I have
>
> Thanks for following up
> Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-11-20 2:40 ` Matthew Brost
@ 2025-11-20 2:50 ` Balbir Singh
2025-11-20 2:59 ` Balbir Singh
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-11-20 2:50 UTC (permalink / raw)
To: Matthew Brost
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On 11/20/25 13:40, Matthew Brost wrote:
> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
>> On 11/12/25 10:43, Andrew Morton wrote:
>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>>>
>>>>>>> This patch series introduces support for Transparent Huge Page
>>>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>>>> efficient migration of large folios between system memory and
>>>>>>> device-private memory
>>>>>>
>>>>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
>>>>>> good sign.
>>>>>>
>>>>>
>>>>> I hope so too, I've tried to address the comments in v6.
>>>>>
>>>>
>>>> Circling back to this series, we will integrate and test this version.
>>>
>>> How'd it go?
>>>
>
> My apologies for the delay—I got distracted by other tasks in Xe (my
> driver) and was out for a bit. Unfortunately, this series breaks
> something in the existing core MM code for the Xe SVM implementation. I
> have an extensive test case that hammers on SVM, which fully passes
> prior to applying this series, but fails randomly with the series
> applied (to drm-tip-rc6) due to the below kernel lockup.
>
> I've tried to trace where the migration PTE gets installed but not
> removed or isolate a test case which causes this failure but no luck so
> far. I'll keep digging as I have time.
>
> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
> the same issue as above eventually occurs), but I do need two additional
> core MM patches—one is new code required for Xe, and the other could be
> considered a bug fix. Those patches can included when Xe merges SVM THP
> support but we need at least not break Xe SVM before this series merges.
>
> Stack trace:
>
> INFO: task kworker/u65:2:1642 blocked for more than 30
> seconds.
> [ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
> [ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
> tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
> [ 212.638288] Workqueue: xe_page_fault_work_queue
> xe_pagefault_queue_work [xe]
> [ 212.638323] Call Trace:
> [ 212.638324] <TASK>
> [ 212.638325] __schedule+0x4b0/0x990
> [ 212.638330] schedule+0x22/0xd0
> [ 212.638331] io_schedule+0x41/0x60
> [ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
> [ 212.638336] ? __pfx_wake_page_function+0x10/0x10
> [ 212.638339] migration_entry_wait+0xd2/0xe0
> [ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
> [ 212.638343] walk_pgd_range+0x51d/0xa40
> [ 212.638345] __walk_page_range+0x75/0x1e0
> [ 212.638347] walk_page_range_mm+0x138/0x1f0
> [ 212.638349] hmm_range_fault+0x59/0xa0
> [ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
> [ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
> [ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
> [ 212.638375] ? update_load_avg+0x7f/0x6c0
> [ 212.638377] ? update_curr+0x13d/0x170
> [ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
> [ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
> [ 212.638420] process_one_work+0x16e/0x2e0
> [ 212.638422] worker_thread+0x284/0x410
> [ 212.638423] ? __pfx_worker_thread+0x10/0x10
> [ 212.638425] kthread+0xec/0x210
> [ 212.638427] ? __pfx_kthread+0x10/0x10
> [ 212.638428] ? __pfx_kthread+0x10/0x10
> [ 212.638430] ret_from_fork+0xbd/0x100
> [ 212.638433] ? __pfx_kthread+0x10/0x10
> [ 212.638434] ret_from_fork_asm+0x1a/0x30
> [ 212.638436] </TASK>
>
Hi, Matt
Thanks for the report, two questions:
1. Are you using mm/mm-unstable? We've got some fixes in there (including fixes to remove_migration_pmd()).
   - Generally, a left-behind migration entry is a symptom of a failed migration that did not clean up
     after itself.
2. The stack trace is from hmm_range_fault(), not something that this code touches.
The stack trace shows your code is seeing a migration entry and waiting on it.
Can you please provide a reproducer for the issue, in the form of a test in hmm-tests.c?
Have you been able to bisect the issue?
Balbir
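For reference, a skeleton of what such an hmm-tests.c reproducer could look
like (illustrative only: it reuses the existing hmm_buffer,
hmm_migrate_sys_to_dev() and hmm_buffer_free() helpers from
tools/testing/selftests/mm/hmm-tests.c, alignment handling is omitted, and
the concurrent access that actually triggers the hang still has to be
filled in):

	TEST_F(hmm, migrate_thp_reproducer_skeleton)
	{
		struct hmm_buffer *buffer;
		unsigned long size = 2UL << 20;	/* one PMD-sized THP */
		unsigned long npages = size >> self->page_shift;
		unsigned long i;
		int *ptr;
		int ret;

		buffer = malloc(sizeof(*buffer));
		ASSERT_NE(buffer, NULL);
		buffer->fd = -1;
		buffer->size = size;
		buffer->mirror = malloc(size);
		ASSERT_NE(buffer->mirror, NULL);

		/* THP-eligible anonymous mapping */
		buffer->ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
		ASSERT_NE(buffer->ptr, MAP_FAILED);
		madvise(buffer->ptr, size, MADV_HUGEPAGE);

		/* Fault in and initialize the range */
		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
			ptr[i] = i;

		/* Migrate the range to device private memory */
		ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
		ASSERT_EQ(ret, 0);

		/*
		 * The racing CPU access / hmm_range_fault() that reproduces
		 * the reported hang would go here, e.g. from a second thread.
		 */

		hmm_buffer_free(buffer);
	}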
> Matt
>
>>> Balbir, what's the status here? It's been a month and this series
>>> still has a "needs a new version" feeling to it. If so, very soon
>>> please.
>>>
>>
>> I don't think this needs a new revision, I've been testing frequently
>> at my end to see if I can catch any regressions. I have a patch update for
>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
>> on top or I can send a new version of the patch. I was waiting
>> on any feedback before I sent the patch out, but I'll do it now.
>>
>>> TODOs which I have noted are
>>>
>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
>>
>> This was a clarification on the HMM patch mentioned in the changelog
>>
>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
>>
>> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
>>
>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
>>
>> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
>>
>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
>>>
>>
>> I can't seem to open this
>>
>>> plus a general re-read of the
>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
>>> discussion.
>>>
>> That's the patch I have
>>
>> Thanks for following up
>> Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-11-20 2:50 ` Balbir Singh
@ 2025-11-20 2:59 ` Balbir Singh
2025-11-20 3:15 ` Matthew Brost
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-11-20 2:59 UTC (permalink / raw)
To: Matthew Brost
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On 11/20/25 13:50, Balbir Singh wrote:
> On 11/20/25 13:40, Matthew Brost wrote:
>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
>>> On 11/12/25 10:43, Andrew Morton wrote:
>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>>>>
>>>>>>>> This patch series introduces support for Transparent Huge Page
>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>>>>> efficient migration of large folios between system memory and
>>>>>>>> device-private memory
>>>>>>>
>>>>>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
>>>>>>> good sign.
>>>>>>>
>>>>>>
>>>>>> I hope so too, I've tried to address the comments in v6.
>>>>>>
>>>>>
>>>>> Circling back to this series, we will integrate and test this version.
>>>>
>>>> How'd it go?
>>>>
>>
>> My apologies for the delay—I got distracted by other tasks in Xe (my
>> driver) and was out for a bit. Unfortunately, this series breaks
>> something in the existing core MM code for the Xe SVM implementation. I
>> have an extensive test case that hammers on SVM, which fully passes
>> prior to applying this series, but fails randomly with the series
>> applied (to drm-tip-rc6) due to the below kernel lockup.
>>
>> I've tried to trace where the migration PTE gets installed but not
>> removed or isolate a test case which causes this failure but no luck so
>> far. I'll keep digging as I have time.
>>
>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
>> the same issue as above eventually occurs), but I do need two additional
>> core MM patches—one is new code required for Xe, and the other could be
>> considered a bug fix. Those patches can included when Xe merges SVM THP
>> support but we need at least not break Xe SVM before this series merges.
>>
>> Stack trace:
>>
>> INFO: task kworker/u65:2:1642 blocked for more than 30
>> seconds.
>> [ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
>> [ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
>> tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
>> [ 212.638288] Workqueue: xe_page_fault_work_queue
>> xe_pagefault_queue_work [xe]
>> [ 212.638323] Call Trace:
>> [ 212.638324] <TASK>
>> [ 212.638325] __schedule+0x4b0/0x990
>> [ 212.638330] schedule+0x22/0xd0
>> [ 212.638331] io_schedule+0x41/0x60
>> [ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
>> [ 212.638336] ? __pfx_wake_page_function+0x10/0x10
>> [ 212.638339] migration_entry_wait+0xd2/0xe0
>> [ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
>> [ 212.638343] walk_pgd_range+0x51d/0xa40
>> [ 212.638345] __walk_page_range+0x75/0x1e0
>> [ 212.638347] walk_page_range_mm+0x138/0x1f0
>> [ 212.638349] hmm_range_fault+0x59/0xa0
>> [ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
>> [ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
>> [ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
>> [ 212.638375] ? update_load_avg+0x7f/0x6c0
>> [ 212.638377] ? update_curr+0x13d/0x170
>> [ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
>> [ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
>> [ 212.638420] process_one_work+0x16e/0x2e0
>> [ 212.638422] worker_thread+0x284/0x410
>> [ 212.638423] ? __pfx_worker_thread+0x10/0x10
>> [ 212.638425] kthread+0xec/0x210
>> [ 212.638427] ? __pfx_kthread+0x10/0x10
>> [ 212.638428] ? __pfx_kthread+0x10/0x10
>> [ 212.638430] ret_from_fork+0xbd/0x100
>> [ 212.638433] ? __pfx_kthread+0x10/0x10
>> [ 212.638434] ret_from_fork_asm+0x1a/0x30
>> [ 212.638436] </TASK>
>>
>
> Hi, Matt
>
> Thanks for the report, two questions
>
> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
> - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
> after itself.
> 2. The stack trace is from hmm_range_fault(), not something that this code touches.
>
> The stack trace shows your code is seeing a migration entry and waiting on it.
> Can you please provide a reproducer for the issue? In the form of a test in hmm-tests.c
>
> Have you been able to bisect the issue?
Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
reverted?
>
> Balbir
>
>
>> Matt
>>
>>>> Balbir, what's the status here? It's been a month and this series
>>>> still has a "needs a new version" feeling to it. If so, very soon
>>>> please.
>>>>
>>>
>>> I don't think this needs a new revision, I've been testing frequently
>>> at my end to see if I can catch any regressions. I have a patch update for
>>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
>>> on top or I can send a new version of the patch. I was waiting
>>> on any feedback before I sent the patch out, but I'll do it now.
>>>
>>>> TODOs which I have noted are
>>>>
>>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
>>>
>>> This was a clarification on the HMM patch mentioned in the changelog
>>>
>>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
>>>
>>> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
>>>
>>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
>>>
>>> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
>>>
>>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
>>>>
>>>
>>> I can't seem to open this
>>>
>>>> plus a general re-read of the
>>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
>>>> discussion.
>>>>
>>> That's the patch I have
>>>
>>> Thanks for following up
>>> Balbir
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [v7 00/16] mm: support device-private THP
2025-11-20 2:59 ` Balbir Singh
@ 2025-11-20 3:15 ` Matthew Brost
2025-11-20 3:58 ` Balbir Singh
0 siblings, 1 reply; 75+ messages in thread
From: Matthew Brost @ 2025-11-20 3:15 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
> On 11/20/25 13:50, Balbir Singh wrote:
> > On 11/20/25 13:40, Matthew Brost wrote:
> >> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> >>> On 11/12/25 10:43, Andrew Morton wrote:
> >>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> >>>>
> >>>>>>>> This patch series introduces support for Transparent Huge Page
> >>>>>>>> (THP) migration in zone device-private memory. The implementation enables
> >>>>>>>> efficient migration of large folios between system memory and
> >>>>>>>> device-private memory
> >>>>>>>
> >>>>>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
> >>>>>>> good sign.
> >>>>>>>
> >>>>>>
> >>>>>> I hope so too, I've tried to address the comments in v6.
> >>>>>>
> >>>>>
> >>>>> Circling back to this series, we will integrate and test this version.
> >>>>
> >>>> How'd it go?
> >>>>
> >>
> >> My apologies for the delay—I got distracted by other tasks in Xe (my
> >> driver) and was out for a bit. Unfortunately, this series breaks
> >> something in the existing core MM code for the Xe SVM implementation. I
> >> have an extensive test case that hammers on SVM, which fully passes
> >> prior to applying this series, but fails randomly with the series
> >> applied (to drm-tip-rc6) due to the below kernel lockup.
> >>
> >> I've tried to trace where the migration PTE gets installed but not
> >> removed or isolate a test case which causes this failure but no luck so
> >> far. I'll keep digging as I have time.
> >>
> >> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
> >> the same issue as above eventually occurs), but I do need two additional
> >> core MM patches—one is new code required for Xe, and the other could be
> >> considered a bug fix. Those patches can included when Xe merges SVM THP
> >> support but we need at least not break Xe SVM before this series merges.
> >>
> >> Stack trace:
> >>
> >> INFO: task kworker/u65:2:1642 blocked for more than 30
> >> seconds.
> >> [ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
> >> [ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
> >> tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
> >> [ 212.638288] Workqueue: xe_page_fault_work_queue
> >> xe_pagefault_queue_work [xe]
> >> [ 212.638323] Call Trace:
> >> [ 212.638324] <TASK>
> >> [ 212.638325] __schedule+0x4b0/0x990
> >> [ 212.638330] schedule+0x22/0xd0
> >> [ 212.638331] io_schedule+0x41/0x60
> >> [ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
> >> [ 212.638336] ? __pfx_wake_page_function+0x10/0x10
> >> [ 212.638339] migration_entry_wait+0xd2/0xe0
> >> [ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
> >> [ 212.638343] walk_pgd_range+0x51d/0xa40
> >> [ 212.638345] __walk_page_range+0x75/0x1e0
> >> [ 212.638347] walk_page_range_mm+0x138/0x1f0
> >> [ 212.638349] hmm_range_fault+0x59/0xa0
> >> [ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
> >> [ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
> >> [ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
> >> [ 212.638375] ? update_load_avg+0x7f/0x6c0
> >> [ 212.638377] ? update_curr+0x13d/0x170
> >> [ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
> >> [ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
> >> [ 212.638420] process_one_work+0x16e/0x2e0
> >> [ 212.638422] worker_thread+0x284/0x410
> >> [ 212.638423] ? __pfx_worker_thread+0x10/0x10
> >> [ 212.638425] kthread+0xec/0x210
> >> [ 212.638427] ? __pfx_kthread+0x10/0x10
> >> [ 212.638428] ? __pfx_kthread+0x10/0x10
> >> [ 212.638430] ret_from_fork+0xbd/0x100
> >> [ 212.638433] ? __pfx_kthread+0x10/0x10
> >> [ 212.638434] ret_from_fork_asm+0x1a/0x30
> >> [ 212.638436] </TASK>
> >>
> >
> > Hi, Matt
> >
> > Thanks for the report, two questions
> >
> > 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
remove_migration_pmd - the entry here is a PTE migration entry, not a PMD one.
> > - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
> > after itself.
I'm on drm-tip as I generally need the latest version of my driver
because of the speed we move at.
Yes, I agree it looks like somehow a migration PTE is not getting
properly removed.
I'm happy to cherry pick any patches that you think might be helpful
into my tree.
> > 2. The stack trace is from hmm_range_fault(), not something that this code touches.
> >
Agree this is a symptom of the above issue.
> > The stack trace shows your code is seeing a migration entry and waiting on it.
> > Can you please provide a reproducer for the issue? In the form of a test in hmm-tests.c
> >
That will be my plan. Right now I'm opening up my test suite, which runs 1000s
of variations of SVM tests, and the test that hangs is not consistent.
Some of these are threaded or multi-process, so it might possibly be a
timing issue that could be hard to reproduce in hmm-tests.c. I'll do my
best here.
> > Have you been able to bisect the issue?
>
That is my next step along with isolating a test case.
> Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
> reverted?
>
I can try, but I highly doubt this is related. The hanging HMM code is in the
PTE walk step after this, and I am not even enabling THP device pages
in my SVM code when reproducing this.
Matt
> >
> > Balbir
> >
> >
> >> Matt
> >>
> >>>> Balbir, what's the status here? It's been a month and this series
> >>>> still has a "needs a new version" feeling to it. If so, very soon
> >>>> please.
> >>>>
> >>>
> >>> I don't think this needs a new revision, I've been testing frequently
> >>> at my end to see if I can catch any regressions. I have a patch update for
> >>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
> >>> on top or I can send a new version of the patch. I was waiting
> >>> on any feedback before I sent the patch out, but I'll do it now.
> >>>
> >>>> TODOs which I have noted are
> >>>>
> >>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
> >>>
> >>> This was a clarification on the HMM patch mentioned in the changelog
> >>>
> >>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
> >>>
> >>> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
> >>>
> >>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
> >>>
> >>> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
> >>>
> >>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> >>>>
> >>>
> >>> I can't seem to open this
> >>>
> >>>> plus a general re-read of the
> >>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
> >>>> discussion.
> >>>>
> >>> That's the patch I have
> >>>
> >>> Thanks for following up
> >>> Balbir
> >
>
* Re: [v7 00/16] mm: support device-private THP
2025-11-20 3:15 ` Matthew Brost
@ 2025-11-20 3:58 ` Balbir Singh
2025-11-20 5:46 ` Balbir Singh
2025-11-20 5:53 ` Matthew Brost
0 siblings, 2 replies; 75+ messages in thread
From: Balbir Singh @ 2025-11-20 3:58 UTC (permalink / raw)
To: Matthew Brost
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On 11/20/25 14:15, Matthew Brost wrote:
> On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
>> On 11/20/25 13:50, Balbir Singh wrote:
>>> On 11/20/25 13:40, Matthew Brost wrote:
>>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
>>>>> On 11/12/25 10:43, Andrew Morton wrote:
>>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>>>>>>
>>>>>>>>>> This patch series introduces support for Transparent Huge Page
>>>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>>>>>>> efficient migration of large folios between system memory and
>>>>>>>>>> device-private memory
>>>>>>>>>
>>>>>>>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
>>>>>>>>> good sign.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I hope so too, I've tried to address the comments in v6.
>>>>>>>>
>>>>>>>
>>>>>>> Circling back to this series, we will itegrate and test this version.
>>>>>>
>>>>>> How'd it go?
>>>>>>
>>>>
>>>> My apologies for the delay—I got distracted by other tasks in Xe (my
>>>> driver) and was out for a bit. Unfortunately, this series breaks
>>>> something in the existing core MM code for the Xe SVM implementation. I
>>>> have an extensive test case that hammers on SVM, which fully passes
>>>> prior to applying this series, but fails randomly with the series
>>>> applied (to drm-tip-rc6) due to the below kernel lockup.
>>>>
>>>> I've tried to trace where the migration PTE gets installed but not
>>>> removed or isolate a test case which causes this failure but no luck so
>>>> far. I'll keep digging as I have time.
>>>>
>>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
>>>> the same issue as above eventually occurs), but I do need two additional
>>>> core MM patches—one is new code required for Xe, and the other could be
>>>> considered a bug fix. Those patches can included when Xe merges SVM THP
>>>> support but we need at least not break Xe SVM before this series merges.
>>>>
>>>> Stack trace:
>>>>
>>>> INFO: task kworker/u65:2:1642 blocked for more than 30
>>>> seconds.
>>>> [ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
>>>> [ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>> disables this message.
>>>> [ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
>>>> tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
>>>> [ 212.638288] Workqueue: xe_page_fault_work_queue
>>>> xe_pagefault_queue_work [xe]
>>>> [ 212.638323] Call Trace:
>>>> [ 212.638324] <TASK>
>>>> [ 212.638325] __schedule+0x4b0/0x990
>>>> [ 212.638330] schedule+0x22/0xd0
>>>> [ 212.638331] io_schedule+0x41/0x60
>>>> [ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
>>>> [ 212.638336] ? __pfx_wake_page_function+0x10/0x10
>>>> [ 212.638339] migration_entry_wait+0xd2/0xe0
>>>> [ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
>>>> [ 212.638343] walk_pgd_range+0x51d/0xa40
>>>> [ 212.638345] __walk_page_range+0x75/0x1e0
>>>> [ 212.638347] walk_page_range_mm+0x138/0x1f0
>>>> [ 212.638349] hmm_range_fault+0x59/0xa0
>>>> [ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
>>>> [ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
>>>> [ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
>>>> [ 212.638375] ? update_load_avg+0x7f/0x6c0
>>>> [ 212.638377] ? update_curr+0x13d/0x170
>>>> [ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
>>>> [ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
>>>> [ 212.638420] process_one_work+0x16e/0x2e0
>>>> [ 212.638422] worker_thread+0x284/0x410
>>>> [ 212.638423] ? __pfx_worker_thread+0x10/0x10
>>>> [ 212.638425] kthread+0xec/0x210
>>>> [ 212.638427] ? __pfx_kthread+0x10/0x10
>>>> [ 212.638428] ? __pfx_kthread+0x10/0x10
>>>> [ 212.638430] ret_from_fork+0xbd/0x100
>>>> [ 212.638433] ? __pfx_kthread+0x10/0x10
>>>> [ 212.638434] ret_from_fork_asm+0x1a/0x30
>>>> [ 212.638436] </TASK>
>>>>
>>>
>>> Hi, Matt
>>>
>>> Thanks for the report, two questions
>>>
>>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
>
> remove_migration_pmd - This is a PTE migration entry.
>
I don't have your symbols; I thought we were hitting the following condition in the walk:
if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
But it sounds like you are not, since PMD/THP has not been enabled in this case.
>>> - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
>>> after itself.
>
> I'm on drm-tip as I generally need the latest version of my driver
> because of the speed we move at.
>
> Yes, I agree it looks like somehow a migration PTE is not getting
> properly removed.
>
> I'm happy to cherry pick any patches that you think might be helpful
> into my tree.
>
Could you try the mm/mm-new tree with the current xe driver?
In general, w.r.t. the failure, I would check the following:
1. Are the dst_pfns in migrate_vma_pages() set up correctly by the device driver?
2. Any failures in folio_migrate_mapping()?
3. In migrate_vma_finalize(), check whether remove_migration_ptes() failed
If (3) fails, that would explain the leftover migration entries.
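For (1), a minimal driver-side sanity check might look like the sketch below;
the function name check_dst_pfns and the args/pr_warn details are made up for
illustration, only struct migrate_vma and the MIGRATE_PFN_* flags are from the
existing API. It would run after the driver fills dst, just before
migrate_vma_pages():

#include <linux/migrate.h>
#include <linux/printk.h>

/* Sketch only: every src entry still flagged MIGRATE_PFN_MIGRATE should
 * have a dst entry the driver populated with MIGRATE_PFN_VALID.
 */
static void check_dst_pfns(struct migrate_vma *args)
{
	unsigned long i;

	for (i = 0; i < args->npages; i++) {
		if (!(args->src[i] & MIGRATE_PFN_MIGRATE))
			continue;	/* core mm chose not to migrate this entry */
		if (!(args->dst[i] & MIGRATE_PFN_VALID))
			pr_warn("no dst pfn for index %lu (src=0x%lx)\n",
				i, args->src[i]);
	}
}

After migrate_vma_pages() returns, a src entry whose MIGRATE_PFN_MIGRATE bit has
been cleared is, as far as I recall, the sign that the mapping step, i.e. (2),
went wrong for that entry.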
>>> 2. The stack trace is from hmm_range_fault(), not something that this code touches.
>>>
>
> Agree this is a symptom of the above issue.
>
>>> The stack trace shows your code is seeing a migration entry and waiting on it.
>>> Can you please provide a reproducer for the issue? In the form of a test in hmm-tests.c
>>>
>
> That will be my plan. Right now I'm opening my test up which runs 1000s
> of variations of SVM tests and the test that hangs is not consistent.
> Some of these are threaded or multi-process so it might possibly be a
> timing issue which could be hard to reproduce in hmm-tests.c. I'll do my
> best here.
>
>>> Have you been able to bisect the issue?
>>
>
> That is my next step along with isolating a test case.
>
>> Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
>> reverted?
>>
>
> I can try but I highly doubt this is related. The hanging HMM code in is
> PTE walk step after this, also I am not even enabling THP device pages
> in my SVM code to reproduce this.
>
Thanks, do regular hmm-tests pass for you in that setup/environment?
Balbir
> Matt
>
>>>
>>> Balbir
>>>
>>>
>>>> Matt
>>>>
>>>>>> Balbir, what's the status here? It's been a month and this series
>>>>>> still has a "needs a new version" feeling to it. If so, very soon
>>>>>> please.
>>>>>>
>>>>>
>>>>> I don't think this needs a new revision, I've been testing frequently
>>>>> at my end to see if I can catch any regressions. I have a patch update for
>>>>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
>>>>> on top or I can send a new version of the patch. I was waiting
>>>>> on any feedback before I sent the patch out, but I'll do it now.
>>>>>
>>>>>> TODOs which I have noted are
>>>>>>
>>>>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
>>>>>
>>>>> This was a clarification on the HMM patch mentioned in the changelog
>>>>>
>>>>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
>>>>>
>>>>> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
>>>>>
>>>>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
>>>>>
>>>>> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
>>>>>
>>>>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
>>>>>>
>>>>>
>>>>> I can't seem to open this
>>>>>
>>>>>> plus a general re-read of the
>>>>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
>>>>>> discussion.
>>>>>>
>>>>> That's the patch I have
>>>>>
>>>>> Thanks for following up
>>>>> Balbir
>>>
>>
* Re: [v7 00/16] mm: support device-private THP
2025-11-20 3:58 ` Balbir Singh
@ 2025-11-20 5:46 ` Balbir Singh
2025-11-20 5:53 ` Matthew Brost
1 sibling, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2025-11-20 5:46 UTC (permalink / raw)
To: Matthew Brost
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
[...]>>>>>>>
>>>>>>> How'd it go?
>>>>>>>
>>>>>
>>>>> My apologies for the delay—I got distracted by other tasks in Xe (my
>>>>> driver) and was out for a bit. Unfortunately, this series breaks
>>>>> something in the existing core MM code for the Xe SVM implementation. I
>>>>> have an extensive test case that hammers on SVM, which fully passes
>>>>> prior to applying this series, but fails randomly with the series
>>>>> applied (to drm-tip-rc6) due to the below kernel lockup.
>>>>>
>>>>> I've tried to trace where the migration PTE gets installed but not
>>>>> removed or isolate a test case which causes this failure but no luck so
>>>>> far. I'll keep digging as I have time.
>>>>>
>>>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
>>>>> the same issue as above eventually occurs), but I do need two additional
>>>>> core MM patches—one is new code required for Xe, and the other could be
>>>>> considered a bug fix. Those patches can included when Xe merges SVM THP
>>>>> support but we need at least not break Xe SVM before this series merges.
>>>>>
>>>>> Stack trace:
>>>>>
>>>>> INFO: task kworker/u65:2:1642 blocked for more than 30
>>>>> seconds.
>>>>> [ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
>>>>> [ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
>>>>> tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
>>>>> [ 212.638288] Workqueue: xe_page_fault_work_queue
>>>>> xe_pagefault_queue_work [xe]
>>>>> [ 212.638323] Call Trace:
>>>>> [ 212.638324] <TASK>
>>>>> [ 212.638325] __schedule+0x4b0/0x990
>>>>> [ 212.638330] schedule+0x22/0xd0
>>>>> [ 212.638331] io_schedule+0x41/0x60
>>>>> [ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
>>>>> [ 212.638336] ? __pfx_wake_page_function+0x10/0x10
>>>>> [ 212.638339] migration_entry_wait+0xd2/0xe0
>>>>> [ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
>>>>> [ 212.638343] walk_pgd_range+0x51d/0xa40
>>>>> [ 212.638345] __walk_page_range+0x75/0x1e0
>>>>> [ 212.638347] walk_page_range_mm+0x138/0x1f0
>>>>> [ 212.638349] hmm_range_fault+0x59/0xa0
>>>>> [ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
>>>>> [ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
>>>>> [ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
>>>>> [ 212.638375] ? update_load_avg+0x7f/0x6c0
>>>>> [ 212.638377] ? update_curr+0x13d/0x170
>>>>> [ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
>>>>> [ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
>>>>> [ 212.638420] process_one_work+0x16e/0x2e0
>>>>> [ 212.638422] worker_thread+0x284/0x410
>>>>> [ 212.638423] ? __pfx_worker_thread+0x10/0x10
>>>>> [ 212.638425] kthread+0xec/0x210
>>>>> [ 212.638427] ? __pfx_kthread+0x10/0x10
>>>>> [ 212.638428] ? __pfx_kthread+0x10/0x10
>>>>> [ 212.638430] ret_from_fork+0xbd/0x100
>>>>> [ 212.638433] ? __pfx_kthread+0x10/0x10
>>>>> [ 212.638434] ret_from_fork_asm+0x1a/0x30
>>>>> [ 212.638436] </TASK>
>>>>>
>>>>
>>>> Hi, Matt
>>>>
>>>> Thanks for the report, two questions
>>>>
>>>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
>>
>> remove_migration_pmd - This is a PTE migration entry.
>>
>
> I don't have your symbols, I thought we were hitting, the following condition in the walk
>
> if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
>
> But sounds like you are not, PMD/THP has not been enabled in this case
>
>
>>>> - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
>>>> after itself.
>>
>> I'm on drm-tip as I generally need the latest version of my driver
>> because of the speed we move at.
>>
>> Yes, I agree it looks like somehow a migration PTE is not getting
>> properly removed.
>>
>> I'm happy to cherry pick any patches that you think might be helpful
>> into my tree.
>>
>
> Could you try the mm/mm-new tree with the current xe driver?
>
> In general, w.r.t failure, I would check for the following
>
> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
> 2. Any failures in folio_migrate_mapping()?
> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
>
> If (3) fails that will explain the left over migration entries
>
Just thought of two other places to look at:
1. split_folio(): do you have a large entry on the CPU side that needs to be split
prior to migration?
2. Any partial munmap() paths, because those can cause a PMD split while the folio
is not fully split yet
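For (2), a rough userspace illustration of how that state arises (an assumed
minimal pattern, not your actual test) would be:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define SZ_2M	(2UL << 20)

int main(void)
{
	/* Over-allocate so we can pick a 2MiB-aligned start. */
	char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *buf = (char *)(((unsigned long)raw + SZ_2M - 1) & ~(SZ_2M - 1));

	if (raw == MAP_FAILED)
		return 1;
	madvise(buf, SZ_2M, MADV_HUGEPAGE);
	memset(buf, 1, SZ_2M);			/* fault in, ideally as one THP */
	munmap(buf + SZ_2M / 2, SZ_2M / 2);	/* partial unmap: the PMD mapping splits */
	/* ...now drive the device migration path over buf[0 .. 1MiB)... */
	return 0;
}

After the munmap() the folio can still be large while only half of it stays
mapped, which is exactly the partially mapped case the collection code has to
cope with.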
I also have a patch for debugging migrations via trace-points (to be updated):
https://patchew.org/linux/20251016054619.3174997-1-balbirs@nvidia.com/
Maybe it'll help you figure out if something failed to migrate.
>>>> 2. The stack trace is from hmm_range_fault(), not something that this code touches.
>>>>
>>
>> Agree this is a symptom of the above issue.
>>
>>>> The stack trace shows your code is seeing a migration entry and waiting on it.
>>>> Can you please provide a reproducer for the issue? In the form of a test in hmm-tests.c
>>>>
>>
>> That will be my plan. Right now I'm opening my test up which runs 1000s
>> of variations of SVM tests and the test that hangs is not consistent.
>> Some of these are threaded or multi-process so it might possibly be a
>> timing issue which could be hard to reproduce in hmm-tests.c. I'll do my
>> best here.
>>
>>>> Have you been able to bisect the issue?
>>>
>>
>> That is my next step along with isolating a test case.
>>
>>> Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
>>> reverted?
>>>
>>
>> I can try but I highly doubt this is related. The hanging HMM code in is
>> PTE walk step after this, also I am not even enabling THP device pages
>> in my SVM code to reproduce this.
>>
>
> Thanks, do regular hmm-tests pass for you in that setup/environment?
>
> Balbir
>
[..]
Balbir
* Re: [v7 00/16] mm: support device-private THP
2025-11-20 3:58 ` Balbir Singh
2025-11-20 5:46 ` Balbir Singh
@ 2025-11-20 5:53 ` Matthew Brost
2025-11-20 6:03 ` Balbir Singh
1 sibling, 1 reply; 75+ messages in thread
From: Matthew Brost @ 2025-11-20 5:53 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On Thu, Nov 20, 2025 at 02:58:58PM +1100, Balbir Singh wrote:
> On 11/20/25 14:15, Matthew Brost wrote:
> > On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
> >> On 11/20/25 13:50, Balbir Singh wrote:
> >>> On 11/20/25 13:40, Matthew Brost wrote:
> >>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> >>>>> On 11/12/25 10:43, Andrew Morton wrote:
> >>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> >>>>>>
> >>>>>>>>>> This patch series introduces support for Transparent Huge Page
> >>>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
> >>>>>>>>>> efficient migration of large folios between system memory and
> >>>>>>>>>> device-private memory
> >>>>>>>>>
> >>>>>>>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
> >>>>>>>>> good sign.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I hope so too, I've tried to address the comments in v6.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Circling back to this series, we will itegrate and test this version.
> >>>>>>
> >>>>>> How'd it go?
> >>>>>>
> >>>>
> >>>> My apologies for the delay—I got distracted by other tasks in Xe (my
> >>>> driver) and was out for a bit. Unfortunately, this series breaks
> >>>> something in the existing core MM code for the Xe SVM implementation. I
> >>>> have an extensive test case that hammers on SVM, which fully passes
> >>>> prior to applying this series, but fails randomly with the series
> >>>> applied (to drm-tip-rc6) due to the below kernel lockup.
> >>>>
> >>>> I've tried to trace where the migration PTE gets installed but not
> >>>> removed or isolate a test case which causes this failure but no luck so
> >>>> far. I'll keep digging as I have time.
> >>>>
> >>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
> >>>> the same issue as above eventually occurs), but I do need two additional
> >>>> core MM patches—one is new code required for Xe, and the other could be
> >>>> considered a bug fix. Those patches can included when Xe merges SVM THP
> >>>> support but we need at least not break Xe SVM before this series merges.
> >>>>
> >>>> Stack trace:
> >>>>
> >>>> INFO: task kworker/u65:2:1642 blocked for more than 30
> >>>> seconds.
> >>>> [ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
> >>>> [ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>>> disables this message.
> >>>> [ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
> >>>> tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
> >>>> [ 212.638288] Workqueue: xe_page_fault_work_queue
> >>>> xe_pagefault_queue_work [xe]
> >>>> [ 212.638323] Call Trace:
> >>>> [ 212.638324] <TASK>
> >>>> [ 212.638325] __schedule+0x4b0/0x990
> >>>> [ 212.638330] schedule+0x22/0xd0
> >>>> [ 212.638331] io_schedule+0x41/0x60
> >>>> [ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
> >>>> [ 212.638336] ? __pfx_wake_page_function+0x10/0x10
> >>>> [ 212.638339] migration_entry_wait+0xd2/0xe0
> >>>> [ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
> >>>> [ 212.638343] walk_pgd_range+0x51d/0xa40
> >>>> [ 212.638345] __walk_page_range+0x75/0x1e0
> >>>> [ 212.638347] walk_page_range_mm+0x138/0x1f0
> >>>> [ 212.638349] hmm_range_fault+0x59/0xa0
> >>>> [ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
> >>>> [ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
> >>>> [ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
> >>>> [ 212.638375] ? update_load_avg+0x7f/0x6c0
> >>>> [ 212.638377] ? update_curr+0x13d/0x170
> >>>> [ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
> >>>> [ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
> >>>> [ 212.638420] process_one_work+0x16e/0x2e0
> >>>> [ 212.638422] worker_thread+0x284/0x410
> >>>> [ 212.638423] ? __pfx_worker_thread+0x10/0x10
> >>>> [ 212.638425] kthread+0xec/0x210
> >>>> [ 212.638427] ? __pfx_kthread+0x10/0x10
> >>>> [ 212.638428] ? __pfx_kthread+0x10/0x10
> >>>> [ 212.638430] ret_from_fork+0xbd/0x100
> >>>> [ 212.638433] ? __pfx_kthread+0x10/0x10
> >>>> [ 212.638434] ret_from_fork_asm+0x1a/0x30
> >>>> [ 212.638436] </TASK>
> >>>>
> >>>
> >>> Hi, Matt
> >>>
> >>> Thanks for the report, two questions
> >>>
> >>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
> >
> > remove_migration_pmd - This is a PTE migration entry.
> >
>
> I don't have your symbols, I thought we were hitting, the following condition in the walk
>
> if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
>
> But sounds like you are not, PMD/THP has not been enabled in this case
>
No, migration_entry_wait rather than pmd_migration_entry_wait.
>
> >>> - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
> >>> after itself.
> >
> > I'm on drm-tip as I generally need the latest version of my driver
> > because of the speed we move at.
> >
> > Yes, I agree it looks like somehow a migration PTE is not getting
> > properly removed.
> >
> > I'm happy to cherry pick any patches that you think might be helpful
> > into my tree.
> >
>
> Could you try the mm/mm-new tree with the current xe driver?
>
Unfortunately, this is a tough one. We land a lot of patches in Xe/DRM,
so bringing the driver up to date with an MM branch is difficult, and
I’m not an expert at merging branches. It would be nice if, in the DRM
flow, we could merge patches from outside our subsystem into a
bleeding-edge kernel for the things we typically care about—but we’d
need a maintainer to sign up for that.
> In general, w.r.t failure, I would check for the following
>
> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
> 2. Any failures in folio_migrate_mapping()?
> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
>
> If (3) fails that will explain the left over migration entries
>
Good tips, but I think I got it via bisect.
The offending patch is:
'mm/migrate_device: handle partially mapped folios during collection'
The failing test case involves some remap-related issue. It’s a
parameterized test, so I honestly couldn’t tell you exactly what it’s
doing beyond the fact that it seems nonsensical but stresses remap. I
thought commit '66d81853fa3d selftests/mm/hmm-tests: partial unmap,
mremap and anon_write tests' would catch this, but it looks like I need
to make the remap HMM test cases a bit more robust—similar to my
driver-side tests. I can take an action item to follow up on this.
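For what it's worth, the kind of pattern that leaves a THP PTE-mapped in the
middle of a PMD is easy to sketch in userspace; the snippet below is a guessed
reconstruction (mremap() of a THP to a page-aligned but not PMD-aligned
address), not the actual parameterized test:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define SZ_4K	(4UL << 10)
#define SZ_2M	(2UL << 20)

int main(void)
{
	char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *src = (char *)(((unsigned long)raw + SZ_2M - 1) & ~(SZ_2M - 1));
	char *hole, *dst;

	if (raw == MAP_FAILED)
		return 1;
	madvise(src, SZ_2M, MADV_HUGEPAGE);
	memset(src, 1, SZ_2M);		/* fault in as a THP */

	/* Reserve a window, pick a destination one page past PMD alignment,
	 * and move the THP there: the large folio ends up PTE-mapped
	 * starting in the middle of a PMD.
	 */
	hole = mmap(NULL, 2 * SZ_2M, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (hole == MAP_FAILED)
		return 1;
	dst = (char *)((((unsigned long)hole + SZ_2M - 1) & ~(SZ_2M - 1)) + SZ_4K);
	dst = mremap(src, SZ_2M, SZ_2M, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	return dst == MAP_FAILED;
}

Migrating a range that covers only part of that folio then exercises the
mid-PMD collection path that the patch below deals with.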
Good news: I can tell you how to fix this...
In 'mm/migrate_device: handle partially mapped folios during collection':
109 +#if 0
110 + folio = page ? page_folio(page) : NULL;
111 + if (folio && folio_test_large(folio)) {
112 + int ret;
113 +
114 + pte_unmap_unlock(ptep, ptl);
115 + ret = migrate_vma_split_folio(folio,
116 + migrate->fault_page);
117 +
118 + if (ret) {
119 + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
120 + goto next;
121 + }
122 +
123 + addr = start;
124 + goto again;
125 + }
126 +#endif
You can probably just delete this and use my patch below, but if you want
to try fixing it, from a quick look: if migrate_vma_split_folio() fails, you
probably need to collect a hole; on success, you likely want to continue
executing the remainder of the loop. I can try playing with this tomorrow,
but it's late here.
I had privately sent you a version of this patch as a fix for Xe, and
this one seems to work:
[PATCH] mm/migrate: Split THP found in middle of PMD during page collection
The migrate layer is not coded to handle a THP found in the middle of a
PMD. This can occur if a user manipulates mappings with mremap(). If a
THP is found mid-PMD during page collection, split it.
Cc: Balbir Singh <balbirs@nvidia.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
mm/migrate_device.c | 37 +++++++++++++++++++++++++++++++++++--
1 file changed, 35 insertions(+), 2 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index abd9f6850db6..9ffc025bad50 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -65,6 +65,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
struct vm_area_struct *vma = walk->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr = start, unmapped = 0;
+ struct folio *split_folio = NULL;
spinlock_t *ptl;
pte_t *ptep;
@@ -107,10 +108,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
}
}
- ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
if (!ptep)
goto again;
arch_enter_lazy_mmu_mode();
+ ptep += (addr - start) / PAGE_SIZE;
for (; addr < end; addr += PAGE_SIZE, ptep++) {
struct dev_pagemap *pgmap;
@@ -209,6 +211,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
bool anon_exclusive;
pte_t swp_pte;
+ if (folio_order(folio)) {
+ split_folio = folio;
+ goto split;
+ }
+
flush_cache_page(vma, addr, pte_pfn(pte));
anon_exclusive = folio_test_anon(folio) &&
PageAnonExclusive(page);
@@ -287,8 +294,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (unmapped)
flush_tlb_range(walk->vma, start, end);
+split:
arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(ptep - 1, ptl);
+ pte_unmap_unlock(ptep - 1 + !!split_folio, ptl);
+
+ /*
+ * XXX: No clean way to support higher-order folios that don't match PMD
+ * boundaries for now — split them instead. Once mTHP support lands, add
+ * proper support for this case.
+ *
+ * The test, which exposed this as problematic, remapped (memremap) a
+ * large folio to an unaligned address, resulting in the folio being
+ * found in the middle of the PTEs. The requested number of pages was
+ * less than the folio size. Likely to be handled gracefully by upper
+ * layers eventually, but not yet.
+ */
+ if (split_folio) {
+ int ret;
+
+ ret = split_folio(split_folio);
+ if (fault_folio != split_folio)
+ folio_unlock(split_folio);
+ folio_put(split_folio);
+ if (ret)
+ return migrate_vma_collect_skip(addr, end, walk);
+
+ split_folio = NULL;
+ goto again;
+ }
return 0;
}
If I apply the #if 0 change along with my patch above (plus one core
MM patch needed for Xe that adds a support function), Xe SVM fully
passes our test cases with both THP enabled and disabled.
> >>> 2. The stack trace is from hmm_range_fault(), not something that this code touches.
> >>>
> >
> > Agree this is a symptom of the above issue.
> >
> >>> The stack trace shows your code is seeing a migration entry and waiting on it.
> >>> Can you please provide a reproducer for the issue? In the form of a test in hmm-tests.c
> >>>
> >
> > That will be my plan. Right now I'm opening my test up which runs 1000s
> > of variations of SVM tests and the test that hangs is not consistent.
> > Some of these are threaded or multi-process so it might possibly be a
> > timing issue which could be hard to reproduce in hmm-tests.c. I'll do my
> > best here.
> >
> >>> Have you been able to bisect the issue?
> >>
> >
> > That is my next step along with isolating a test case.
> >
> >> Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
> >> reverted?
> >>
> >
> > I can try but I highly doubt this is related. The hanging HMM code in is
> > PTE walk step after this, also I am not even enabling THP device pages
> > in my SVM code to reproduce this.
> >
>
> Thanks, do regular hmm-tests pass for you in that setup/environment?
>
Yes. As noted above, I need to make the remap HMM case a bit more
robust. I’ll try to get to this before the Thanksgiving break in the US
(next Thursday-Friday).
Matt
> Balbir
>
> > Matt
> >
> >>>
> >>> Balbir
> >>>
> >>>
> >>>> Matt
> >>>>
> >>>>>> Balbir, what's the status here? It's been a month and this series
> >>>>>> still has a "needs a new version" feeling to it. If so, very soon
> >>>>>> please.
> >>>>>>
> >>>>>
> >>>>> I don't think this needs a new revision, I've been testing frequently
> >>>>> at my end to see if I can catch any regressions. I have a patch update for
> >>>>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
> >>>>> on top or I can send a new version of the patch. I was waiting
> >>>>> on any feedback before I sent the patch out, but I'll do it now.
> >>>>>
> >>>>>> TODOs which I have noted are
> >>>>>>
> >>>>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
> >>>>>
> >>>>> This was a clarification on the HMM patch mentioned in the changelog
> >>>>>
> >>>>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
> >>>>>
> >>>>> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
> >>>>>
> >>>>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
> >>>>>
> >>>>> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
> >>>>>
> >>>>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> >>>>>>
> >>>>>
> >>>>> I can't seem to open this
> >>>>>
> >>>>>> plus a general re-read of the
> >>>>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
> >>>>>> discussion.
> >>>>>>
> >>>>> That's the patch I have
> >>>>>
> >>>>> Thanks for following up
> >>>>> Balbir
> >>>
> >>
>
* Re: [v7 00/16] mm: support device-private THP
2025-11-20 5:53 ` Matthew Brost
@ 2025-11-20 6:03 ` Balbir Singh
2025-11-20 17:27 ` Matthew Brost
0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2025-11-20 6:03 UTC (permalink / raw)
To: Matthew Brost
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On 11/20/25 16:53, Matthew Brost wrote:
> On Thu, Nov 20, 2025 at 02:58:58PM +1100, Balbir Singh wrote:
>> On 11/20/25 14:15, Matthew Brost wrote:
>>> On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
>>>> On 11/20/25 13:50, Balbir Singh wrote:
>>>>> On 11/20/25 13:40, Matthew Brost wrote:
>>>>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
>>>>>>> On 11/12/25 10:43, Andrew Morton wrote:
>>>>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>>>>>>>>
>>>>>>>>>>>> This patch series introduces support for Transparent Huge Page
>>>>>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
>>>>>>>>>>>> efficient migration of large folios between system memory and
>>>>>>>>>>>> device-private memory
>>>>>>>>>>>
>>>>>>>>>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
>>>>>>>>>>> good sign.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope so too, I've tried to address the comments in v6.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Circling back to this series, we will itegrate and test this version.
>>>>>>>>
>>>>>>>> How'd it go?
>>>>>>>>
>>>>>>
>>>>>> My apologies for the delay—I got distracted by other tasks in Xe (my
>>>>>> driver) and was out for a bit. Unfortunately, this series breaks
>>>>>> something in the existing core MM code for the Xe SVM implementation. I
>>>>>> have an extensive test case that hammers on SVM, which fully passes
>>>>>> prior to applying this series, but fails randomly with the series
>>>>>> applied (to drm-tip-rc6) due to the below kernel lockup.
>>>>>>
>>>>>> I've tried to trace where the migration PTE gets installed but not
>>>>>> removed or isolate a test case which causes this failure but no luck so
>>>>>> far. I'll keep digging as I have time.
>>>>>>
>>>>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
>>>>>> the same issue as above eventually occurs), but I do need two additional
>>>>>> core MM patches—one is new code required for Xe, and the other could be
>>>>>> considered a bug fix. Those patches can included when Xe merges SVM THP
>>>>>> support but we need at least not break Xe SVM before this series merges.
>>>>>>
>>>>>> Stack trace:
>>>>>>
>>>>>> INFO: task kworker/u65:2:1642 blocked for more than 30
>>>>>> seconds.
>>>>>> [ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
>>>>>> [ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this message.
>>>>>> [ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
>>>>>> tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
>>>>>> [ 212.638288] Workqueue: xe_page_fault_work_queue
>>>>>> xe_pagefault_queue_work [xe]
>>>>>> [ 212.638323] Call Trace:
>>>>>> [ 212.638324] <TASK>
>>>>>> [ 212.638325] __schedule+0x4b0/0x990
>>>>>> [ 212.638330] schedule+0x22/0xd0
>>>>>> [ 212.638331] io_schedule+0x41/0x60
>>>>>> [ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
>>>>>> [ 212.638336] ? __pfx_wake_page_function+0x10/0x10
>>>>>> [ 212.638339] migration_entry_wait+0xd2/0xe0
>>>>>> [ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
>>>>>> [ 212.638343] walk_pgd_range+0x51d/0xa40
>>>>>> [ 212.638345] __walk_page_range+0x75/0x1e0
>>>>>> [ 212.638347] walk_page_range_mm+0x138/0x1f0
>>>>>> [ 212.638349] hmm_range_fault+0x59/0xa0
>>>>>> [ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
>>>>>> [ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
>>>>>> [ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
>>>>>> [ 212.638375] ? update_load_avg+0x7f/0x6c0
>>>>>> [ 212.638377] ? update_curr+0x13d/0x170
>>>>>> [ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
>>>>>> [ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
>>>>>> [ 212.638420] process_one_work+0x16e/0x2e0
>>>>>> [ 212.638422] worker_thread+0x284/0x410
>>>>>> [ 212.638423] ? __pfx_worker_thread+0x10/0x10
>>>>>> [ 212.638425] kthread+0xec/0x210
>>>>>> [ 212.638427] ? __pfx_kthread+0x10/0x10
>>>>>> [ 212.638428] ? __pfx_kthread+0x10/0x10
>>>>>> [ 212.638430] ret_from_fork+0xbd/0x100
>>>>>> [ 212.638433] ? __pfx_kthread+0x10/0x10
>>>>>> [ 212.638434] ret_from_fork_asm+0x1a/0x30
>>>>>> [ 212.638436] </TASK>
>>>>>>
>>>>>
>>>>> Hi, Matt
>>>>>
>>>>> Thanks for the report, two questions
>>>>>
>>>>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
>>>
>>> remove_migration_pmd - This is a PTE migration entry.
>>>
>>
>> I don't have your symbols, I thought we were hitting, the following condition in the walk
>>
>> if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
>>
>> But sounds like you are not, PMD/THP has not been enabled in this case
>>
>
> No, migration_entry_wait rather than pmd_migration_entry_wait.
>
>>
>>>>> - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
>>>>> after itself.
>>>
>>> I'm on drm-tip as I generally need the latest version of my driver
>>> because of the speed we move at.
>>>
>>> Yes, I agree it looks like somehow a migration PTE is not getting
>>> properly removed.
>>>
>>> I'm happy to cherry pick any patches that you think might be helpful
>>> into my tree.
>>>
>>
>> Could you try the mm/mm-new tree with the current xe driver?
>>
>
> Unfortunately, this is a tough one. We land a lot of patches in Xe/DRM,
> so bringing the driver up to date with an MM branch is difficult, and
> I’m not an expert at merging branches. It would be nice if, in the DRM
> flow, we could merge patches from outside our subsystem into a
> bleeding-edge kernel for the things we typically care about—but we’d
> need a maintainer to sign up for that.
>
>> In general, w.r.t failure, I would check for the following
>>
>> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
>> 2. Any failures in folio_migrate_mapping()?
>> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
>>
>> If (3) fails that will explain the left over migration entries
>>
>
> Good tips, but think I got it via biscet.
>
> Offending patch is:
>
> 'mm/migrate_device: handle partially mapped folios during collection'
>
> The failing test case involves some remap-related issue. It’s a
> parameterized test, so I honestly couldn’t tell you exactly what it’s
> doing beyond the fact that it seems nonsensical but stresses remap. I
> thought commit '66d81853fa3d selftests/mm/hmm-tests: partial unmap,
> mremap and anon_write tests' would catch this, but it looks like I need
> to make the remap HMM test cases a bit more robust—similar to my
> driver-side tests. I can take an action item to follow up on this.
>
> Good news, I can tell you how to fix this...
>
> In 'mm/migrate_device: handle partially mapped folios during collection':
>
> 109 +#if 0
> 110 + folio = page ? page_folio(page) : NULL;
> 111 + if (folio && folio_test_large(folio)) {
> 112 + int ret;
> 113 +
> 114 + pte_unmap_unlock(ptep, ptl);
> 115 + ret = migrate_vma_split_folio(folio,
> 116 + migrate->fault_page);
> 117 +
> 118 + if (ret) {
> 119 + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> 120 + goto next;
> 121 + }
> 122 +
> 123 + addr = start;
> 124 + goto again;
> 125 + }
> 126 +#endif
>
> You can probably just delete this and use my patch below, but if you
> want to try fixing it with a quick look: if migrate_vma_split_folio
> fails, you probably need to collect a hole. On success, you likely want
> to continue executing the remainder of the loop. I can try playing with
> this tomorrow, but it’s late here.
>
> I had privately sent you a version of this patch as a fix for Xe, and
> this one seems to work:
>
> [PATCH] mm/migrate: Split THP found in middle of PMD during page collection
>
> The migrate layer is not coded to handle a THP found in the middle of a
> PMD. This can occur if a user manipulates mappings with mremap(). If a
> THP is found mid-PMD during page collection, split it.
>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> mm/migrate_device.c | 37 +++++++++++++++++++++++++++++++++++--
> 1 file changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index abd9f6850db6..9ffc025bad50 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -65,6 +65,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> struct vm_area_struct *vma = walk->vma;
> struct mm_struct *mm = vma->vm_mm;
> unsigned long addr = start, unmapped = 0;
> + struct folio *split_folio = NULL;
> spinlock_t *ptl;
> pte_t *ptep;
>
> @@ -107,10 +108,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> }
> }
>
> - ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> + ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
> if (!ptep)
> goto again;
> arch_enter_lazy_mmu_mode();
> + ptep += (addr - start) / PAGE_SIZE;
>
> for (; addr < end; addr += PAGE_SIZE, ptep++) {
> struct dev_pagemap *pgmap;
> @@ -209,6 +211,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> bool anon_exclusive;
> pte_t swp_pte;
>
> + if (folio_order(folio)) {
> + split_folio = folio;
> + goto split;
> + }
> +
> flush_cache_page(vma, addr, pte_pfn(pte));
> anon_exclusive = folio_test_anon(folio) &&
> PageAnonExclusive(page);
> @@ -287,8 +294,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (unmapped)
> flush_tlb_range(walk->vma, start, end);
>
> +split:
> arch_leave_lazy_mmu_mode();
> - pte_unmap_unlock(ptep - 1, ptl);
> + pte_unmap_unlock(ptep - 1 + !!split_folio, ptl);
> +
> + /*
> + * XXX: No clean way to support higher-order folios that don't match PMD
> + * boundaries for now — split them instead. Once mTHP support lands, add
> + * proper support for this case.
> + *
> + * The test, which exposed this as problematic, remapped (memremap) a
> + * large folio to an unaligned address, resulting in the folio being
> + * found in the middle of the PTEs. The requested number of pages was
> + * less than the folio size. Likely to be handled gracefully by upper
> + * layers eventually, but not yet.
> + */
> + if (split_folio) {
> + int ret;
> +
> + ret = split_folio(split_folio);
> + if (fault_folio != split_folio)
> + folio_unlock(split_folio);
> + folio_put(split_folio);
> + if (ret)
> + return migrate_vma_collect_skip(addr, end, walk);
> +
> + split_folio = NULL;
> + goto again;
> + }
>
> return 0;
> }
>
> If I apply the #if 0 change along with my patch above (plus one core
> MM patch needed for Xe that adds a support function), Xe SVM fully
> passes our test cases with both THP enabled and disabled.
>
Excellent work! Since you found this, do you mind sending the fix to Andrew as a fixup
to the original patch? Since I don't have the test case, I have no way of validating that
the change, or any change on top of it, would continue to work.
FYI: the original code does something similar; I might be missing the
migrate_vma_collect_skip() bits.
Thanks!
Balbir
* Re: [v7 00/16] mm: support device-private THP
2025-11-20 6:03 ` Balbir Singh
@ 2025-11-20 17:27 ` Matthew Brost
0 siblings, 0 replies; 75+ messages in thread
From: Matthew Brost @ 2025-11-20 17:27 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, linux-kernel, dri-devel, linux-mm,
David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
Francois Dugast
On Thu, Nov 20, 2025 at 05:03:36PM +1100, Balbir Singh wrote:
> On 11/20/25 16:53, Matthew Brost wrote:
> > On Thu, Nov 20, 2025 at 02:58:58PM +1100, Balbir Singh wrote:
> >> On 11/20/25 14:15, Matthew Brost wrote:
> >>> On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
> >>>> On 11/20/25 13:50, Balbir Singh wrote:
> >>>>> On 11/20/25 13:40, Matthew Brost wrote:
> >>>>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> >>>>>>> On 11/12/25 10:43, Andrew Morton wrote:
> >>>>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> >>>>>>>>
> >>>>>>>>>>>> This patch series introduces support for Transparent Huge Page
> >>>>>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
> >>>>>>>>>>>> efficient migration of large folios between system memory and
> >>>>>>>>>>>> device-private memory
> >>>>>>>>>>>
> >>>>>>>>>>> Lots of chatter for the v6 series, but none for v7. I hope that's a
> >>>>>>>>>>> good sign.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I hope so too, I've tried to address the comments in v6.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Circling back to this series, we will itegrate and test this version.
> >>>>>>>>
> >>>>>>>> How'd it go?
> >>>>>>>>
> >>>>>>
> >>>>>> My apologies for the delay—I got distracted by other tasks in Xe (my
> >>>>>> driver) and was out for a bit. Unfortunately, this series breaks
> >>>>>> something in the existing core MM code for the Xe SVM implementation. I
> >>>>>> have an extensive test case that hammers on SVM, which fully passes
> >>>>>> prior to applying this series, but fails randomly with the series
> >>>>>> applied (to drm-tip-rc6) due to the below kernel lockup.
> >>>>>>
> >>>>>> I've tried to trace where the migration PTE gets installed but not
> >>>>>> removed or isolate a test case which causes this failure but no luck so
> >>>>>> far. I'll keep digging as I have time.
> >>>>>>
> >>>>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
> >>>>>> the same issue as above eventually occurs), but I do need two additional
> >>>>>> core MM patches—one is new code required for Xe, and the other could be
> >>>>>> considered a bug fix. Those patches can included when Xe merges SVM THP
> >>>>>> support but we need at least not break Xe SVM before this series merges.
> >>>>>>
> >>>>>> Stack trace:
> >>>>>>
> >>>>>> INFO: task kworker/u65:2:1642 blocked for more than 30
> >>>>>> seconds.
> >>>>>> [ 212.624286] Tainted: G S W 6.18.0-rc6-xe+ #1719
> >>>>>> [ 212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>>>>> disables this message.
> >>>>>> [ 212.638285] task:kworker/u65:2 state:D stack:0 pid:1642
> >>>>>> tgid:1642 ppid:2 task_flags:0x4208060 flags:0x00080000
> >>>>>> [ 212.638288] Workqueue: xe_page_fault_work_queue
> >>>>>> xe_pagefault_queue_work [xe]
> >>>>>> [ 212.638323] Call Trace:
> >>>>>> [ 212.638324] <TASK>
> >>>>>> [ 212.638325] __schedule+0x4b0/0x990
> >>>>>> [ 212.638330] schedule+0x22/0xd0
> >>>>>> [ 212.638331] io_schedule+0x41/0x60
> >>>>>> [ 212.638333] migration_entry_wait_on_locked+0x1d8/0x2d0
> >>>>>> [ 212.638336] ? __pfx_wake_page_function+0x10/0x10
> >>>>>> [ 212.638339] migration_entry_wait+0xd2/0xe0
> >>>>>> [ 212.638341] hmm_vma_walk_pmd+0x7c9/0x8d0
> >>>>>> [ 212.638343] walk_pgd_range+0x51d/0xa40
> >>>>>> [ 212.638345] __walk_page_range+0x75/0x1e0
> >>>>>> [ 212.638347] walk_page_range_mm+0x138/0x1f0
> >>>>>> [ 212.638349] hmm_range_fault+0x59/0xa0
> >>>>>> [ 212.638351] drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
> >>>>>> [ 212.638354] drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
> >>>>>> [ 212.638355] __xe_svm_handle_pagefault+0x259/0x900 [xe]
> >>>>>> [ 212.638375] ? update_load_avg+0x7f/0x6c0
> >>>>>> [ 212.638377] ? update_curr+0x13d/0x170
> >>>>>> [ 212.638379] xe_svm_handle_pagefault+0x37/0x90 [xe]
> >>>>>> [ 212.638396] xe_pagefault_queue_work+0x2da/0x3c0 [xe]
> >>>>>> [ 212.638420] process_one_work+0x16e/0x2e0
> >>>>>> [ 212.638422] worker_thread+0x284/0x410
> >>>>>> [ 212.638423] ? __pfx_worker_thread+0x10/0x10
> >>>>>> [ 212.638425] kthread+0xec/0x210
> >>>>>> [ 212.638427] ? __pfx_kthread+0x10/0x10
> >>>>>> [ 212.638428] ? __pfx_kthread+0x10/0x10
> >>>>>> [ 212.638430] ret_from_fork+0xbd/0x100
> >>>>>> [ 212.638433] ? __pfx_kthread+0x10/0x10
> >>>>>> [ 212.638434] ret_from_fork_asm+0x1a/0x30
> >>>>>> [ 212.638436] </TASK>
> >>>>>>
> >>>>>
> >>>>> Hi, Matt
> >>>>>
> >>>>> Thanks for the report, two questions
> >>>>>
> >>>>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
> >>>
> >>> remove_migration_pmd - This is a PTE migration entry.
> >>>
> >>
> >> I don't have your symbols, I thought we were hitting, the following condition in the walk
> >>
> >> if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
> >>
> >> But sounds like you are not, PMD/THP has not been enabled in this case
> >>
> >
> > No, migration_entry_wait rather than pmd_migration_entry_wait.
> >
> >>
> >>>>> - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
> >>>>> after itself.
> >>>
> >>> I'm on drm-tip as I generally need the latest version of my driver
> >>> because of the speed we move at.
> >>>
> >>> Yes, I agree it looks like somehow a migration PTE is not getting
> >>> properly removed.
> >>>
> >>> I'm happy to cherry pick any patches that you think might be helpful
> >>> into my tree.
> >>>
> >>
> >> Could you try the mm/mm-new tree with the current xe driver?
> >>
> >
> > Unfortunately, this is a tough one. We land a lot of patches in Xe/DRM,
> > so bringing the driver up to date with an MM branch is difficult, and
> > I’m not an expert at merging branches. It would be nice if, in the DRM
> > flow, we could merge patches from outside our subsystem into a
> > bleeding-edge kernel for the things we typically care about—but we’d
> > need a maintainer to sign up for that.
> >
> >> In general, w.r.t failure, I would check for the following
> >>
> >> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
> >> 2. Any failures in folio_migrate_mapping()?
> >> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
> >>
> >> If (3) fails that will explain the left over migration entries
> >>
> >
> > Good tips, but think I got it via biscet.
> >
> > Offending patch is:
> >
> > 'mm/migrate_device: handle partially mapped folios during collection'
> >
> > The failing test case involves some remap-related issue. It’s a
> > parameterized test, so I honestly couldn’t tell you exactly what it’s
> > doing beyond the fact that it seems nonsensical but stresses remap. I
> > thought commit '66d81853fa3d selftests/mm/hmm-tests: partial unmap,
> > mremap and anon_write tests' would catch this, but it looks like I need
> > to make the remap HMM test cases a bit more robust—similar to my
> > driver-side tests. I can take an action item to follow up on this.
> >
> > Good news, I can tell you how to fix this...
> >
> > In 'mm/migrate_device: handle partially mapped folios during collection':
> >
> > 109 +#if 0
> > 110 + folio = page ? page_folio(page) : NULL;
> > 111 + if (folio && folio_test_large(folio)) {
> > 112 + int ret;
> > 113 +
> > 114 + pte_unmap_unlock(ptep, ptl);
> > 115 + ret = migrate_vma_split_folio(folio,
> > 116 + migrate->fault_page);
> > 117 +
> > 118 + if (ret) {
> > 119 + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > 120 + goto next;
> > 121 + }
> > 122 +
> > 123 + addr = start;
> > 124 + goto again;
> > 125 + }
> > 126 +#endif
> >
> > You can probably just delete this and use my patch below, but if you
> > want to try fixing it with a quick look: if migrate_vma_split_folio
> > fails, you probably need to collect a hole. On success, you likely want
> > to continue executing the remainder of the loop. I can try playing with
> > this tomorrow, but it’s late here.
> >
> > I had privately sent you a version of this patch as a fix for Xe, and
> > this one seems to work:
> >
> > [PATCH] mm/migrate: Split THP found in middle of PMD during page collection
> >
> > The migrate layer is not coded to handle a THP found in the middle of a
> > PMD. This can occur if a user manipulates mappings with mremap(). If a
> > THP is found mid-PMD during page collection, split it.
> >
> > Cc: Balbir Singh <balbirs@nvidia.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > mm/migrate_device.c | 37 +++++++++++++++++++++++++++++++++++--
> > 1 file changed, 35 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index abd9f6850db6..9ffc025bad50 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -65,6 +65,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > struct vm_area_struct *vma = walk->vma;
> > struct mm_struct *mm = vma->vm_mm;
> > unsigned long addr = start, unmapped = 0;
> > + struct folio *split_folio = NULL;
> > spinlock_t *ptl;
> > pte_t *ptep;
> >
> > @@ -107,10 +108,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > }
> > }
> >
> > - ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > + ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
> > if (!ptep)
> > goto again;
> > arch_enter_lazy_mmu_mode();
> > + ptep += (addr - start) / PAGE_SIZE;
> >
> > for (; addr < end; addr += PAGE_SIZE, ptep++) {
> > struct dev_pagemap *pgmap;
> > @@ -209,6 +211,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > bool anon_exclusive;
> > pte_t swp_pte;
> >
> > + if (folio_order(folio)) {
> > + split_folio = folio;
> > + goto split;
> > + }
> > +
> > flush_cache_page(vma, addr, pte_pfn(pte));
> > anon_exclusive = folio_test_anon(folio) &&
> > PageAnonExclusive(page);
> > @@ -287,8 +294,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > if (unmapped)
> > flush_tlb_range(walk->vma, start, end);
> >
> > +split:
> > arch_leave_lazy_mmu_mode();
> > - pte_unmap_unlock(ptep - 1, ptl);
> > + pte_unmap_unlock(ptep - 1 + !!split_folio, ptl);
> > +
> > + /*
> > + * XXX: No clean way to support higher-order folios that don't match PMD
> > + * boundaries for now — split them instead. Once mTHP support lands, add
> > + * proper support for this case.
> > + *
> > + * The test that exposed this as problematic remapped (mremap()) a
> > + * large folio to an unaligned address, resulting in the folio being
> > + * found in the middle of the PTEs. The requested number of pages was
> > + * less than the folio size. Likely to be handled gracefully by upper
> > + * layers eventually, but not yet.
> > + */
> > + if (split_folio) {
> > + int ret;
> > +
> > + ret = split_folio(split_folio);
> > + if (fault_folio != split_folio)
> > + folio_unlock(split_folio);
> > + folio_put(split_folio);
> > + if (ret)
> > + return migrate_vma_collect_skip(addr, end, walk);
> > +
> > + split_folio = NULL;
> > + goto again;
> > + }
> >
> > return 0;
> > }
> >
> > If I apply the #if 0 change along with my patch above (plus one core
> > MM patch needed for Xe that adds a support function), Xe SVM fully
> > passes our test cases with both THP enabled and disabled.
> >
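For anyone trying to reproduce the shape of mapping the commit message above
describes (a large folio found mid-PMD after mremap()), a hypothetical
userspace sketch is below. The sizes, the scratch target mapping, and the
lack of error checking are illustrative assumptions, not taken from the
thread or the selftests.

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define SZ_2M   (2UL << 20)

int main(void)
{
        /* Over-allocate so a 2 MiB-aligned huge page fits inside. */
        char *buf = mmap(NULL, 4 * SZ_2M, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *huge = (char *)(((unsigned long)buf + SZ_2M - 1) &
                              ~(SZ_2M - 1));
        /* Scratch target for the move. */
        char *dst = mmap(NULL, SZ_2M, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        madvise(huge, SZ_2M, MADV_HUGEPAGE);
        memset(huge, 1, SZ_2M);         /* fault in, ideally as one THP */

        /*
         * Move the second half of the huge page elsewhere. The moved PTEs
         * map the tail of a large folio at an address that is not
         * PMD-aligned, so a later migrate_vma collection finds the THP in
         * the middle of a PMD.
         */
        mremap(huge + SZ_2M / 2, SZ_2M / 2, SZ_2M / 2,
               MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        return 0;
}
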
> Excellent work! Since you found this, do you mind sending the fix to Andrew as a fixup
Done. Here is a dri-devel patchworks link [1] to the patch.
Matt
[1] https://patchwork.freedesktop.org/series/157859/
> to the original patch? Since I don't have the test case, I have no way of
> validating that the change, or any change on top of it, would continue to
> work.
>
> FYI: The original code does something similar, I might be missing the
> migrate_vma_collect_skip() bits.
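
For reference, the migrate_vma_collect_skip() bits mentioned here amount to
recording the remainder of the range as holes; in mainline the helper looks
roughly like:

static int migrate_vma_collect_skip(unsigned long start,
                                    unsigned long end,
                                    struct mm_walk *walk)
{
        struct migrate_vma *migrate = walk->private;
        unsigned long addr;

        /* Mark every remaining page in the range as not migrating. */
        for (addr = start; addr < end; addr += PAGE_SIZE) {
                migrate->dst[migrate->npages] = 0;
                migrate->src[migrate->npages++] = 0;
        }

        return 0;
}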
>
> Thanks!
> Balbir
>
>
Thread overview: 75+ messages (newest: 2025-11-20 17:27 UTC)
2025-10-01 6:56 [v7 00/16] mm: support device-private THP Balbir Singh
2025-10-01 6:56 ` [v7 01/16] mm/zone_device: support large zone device private folios Balbir Singh
2025-10-12 6:10 ` Lance Yang
2025-10-12 22:54 ` Balbir Singh
2025-10-01 6:56 ` [v7 02/16] mm/zone_device: Rename page_free callback to folio_free Balbir Singh
2025-10-01 6:56 ` [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations Balbir Singh
2025-10-12 15:46 ` Lance Yang
2025-10-13 0:01 ` Balbir Singh
2025-10-13 1:48 ` Lance Yang
2025-10-17 14:49 ` linux-next: KVM/s390x regression (was: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations) Christian Borntraeger
2025-10-17 14:54 ` linux-next: KVM/s390x regression David Hildenbrand
2025-10-17 15:01 ` Christian Borntraeger
2025-10-17 15:07 ` David Hildenbrand
2025-10-17 15:20 ` Christian Borntraeger
2025-10-17 17:07 ` David Hildenbrand
2025-10-17 21:56 ` Balbir Singh
2025-10-17 22:15 ` David Hildenbrand
2025-10-17 22:41 ` David Hildenbrand
2025-10-20 7:01 ` Christian Borntraeger
2025-10-20 7:00 ` Christian Borntraeger
2025-10-20 8:41 ` David Hildenbrand
2025-10-20 9:04 ` Claudio Imbrenda
2025-10-27 16:47 ` Claudio Imbrenda
2025-10-27 16:59 ` David Hildenbrand
2025-10-27 17:06 ` Christian Borntraeger
2025-10-28 9:24 ` Balbir Singh
2025-10-28 13:01 ` [PATCH v1 0/1] KVM: s390: Fix missing present bit for gmap puds Claudio Imbrenda
2025-10-28 13:01 ` [PATCH v1 1/1] " Claudio Imbrenda
2025-10-28 21:23 ` Balbir Singh
2025-10-29 10:00 ` David Hildenbrand
2025-10-29 10:20 ` Claudio Imbrenda
2025-10-28 22:53 ` [PATCH v1 0/1] " Andrew Morton
2025-10-01 6:56 ` [v7 04/16] mm/rmap: extend rmap and migration support device-private entries Balbir Singh
2025-10-22 11:54 ` Lance Yang
2025-10-01 6:56 ` [v7 05/16] mm/huge_memory: implement device-private THP splitting Balbir Singh
2025-10-01 6:56 ` [v7 06/16] mm/migrate_device: handle partially mapped folios during collection Balbir Singh
2025-10-01 6:56 ` [v7 07/16] mm/migrate_device: implement THP migration of zone device pages Balbir Singh
2025-10-01 6:56 ` [v7 08/16] mm/memory/fault: add THP fault handling for zone device private pages Balbir Singh
2025-10-01 6:57 ` [v7 09/16] lib/test_hmm: add zone device private THP test infrastructure Balbir Singh
2025-10-01 6:57 ` [v7 10/16] mm/memremap: add driver callback support for folio splitting Balbir Singh
2025-10-01 6:57 ` [v7 11/16] mm/migrate_device: add THP splitting during migration Balbir Singh
2025-10-13 21:17 ` Zi Yan
2025-10-13 21:33 ` Balbir Singh
2025-10-13 21:55 ` Zi Yan
2025-10-13 22:50 ` Balbir Singh
2025-10-19 8:19 ` Wei Yang
2025-10-19 22:49 ` Balbir Singh
2025-10-19 22:59 ` Zi Yan
2025-10-21 21:34 ` Balbir Singh
2025-10-22 2:59 ` Zi Yan
2025-10-22 7:16 ` Balbir Singh
2025-10-22 15:26 ` Zi Yan
2025-10-28 9:32 ` Balbir Singh
2025-10-01 6:57 ` [v7 12/16] lib/test_hmm: add large page allocation failure testing Balbir Singh
2025-10-01 6:57 ` [v7 13/16] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-10-01 6:57 ` [v7 14/16] selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests Balbir Singh
2025-10-01 6:57 ` [v7 15/16] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
2025-10-01 6:57 ` [v7 16/16] gpu/drm/nouveau: enable THP support for GPU memory migration Balbir Singh
2025-10-09 3:17 ` [v7 00/16] mm: support device-private THP Andrew Morton
2025-10-09 3:26 ` Balbir Singh
2025-10-09 10:33 ` Matthew Brost
2025-10-13 22:51 ` Balbir Singh
2025-11-11 23:43 ` Andrew Morton
2025-11-11 23:52 ` Balbir Singh
2025-11-12 0:24 ` Andrew Morton
2025-11-12 0:36 ` Balbir Singh
2025-11-20 2:40 ` Matthew Brost
2025-11-20 2:50 ` Balbir Singh
2025-11-20 2:59 ` Balbir Singh
2025-11-20 3:15 ` Matthew Brost
2025-11-20 3:58 ` Balbir Singh
2025-11-20 5:46 ` Balbir Singh
2025-11-20 5:53 ` Matthew Brost
2025-11-20 6:03 ` Balbir Singh
2025-11-20 17:27 ` Matthew Brost