public inbox for linux-mm@kvack.org
* [PATCH 0/2] mm/huge_memory: optimize migration when huge PMD needs split
@ 2026-04-15  1:08 Wei Yang
  2026-04-15  1:08 ` [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry Wei Yang
  2026-04-15  1:08 ` [PATCH 2/2] mm/selftests: add split_shared_pmd() Wei Yang
  0 siblings, 2 replies; 11+ messages in thread
From: Wei Yang @ 2026-04-15  1:08 UTC (permalink / raw)
  To: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, riel, vbabka, harry,
	jannh, rppt, surenb, mhocko, shuah
  Cc: linux-mm, Wei Yang, Gavin Guo

This is a follow-up optimization for commit 939080834fef ("mm/huge_memory:
fix early failure try_to_migrate() when split huge pmd for shared THP").

When split_huge_pmd_locked() successfully splits the PMD entry with @freeze
= true, each PTE entry is properly set to a migration entry, and we can
return from try_to_migrate_one() directly.

Currently this is done in a sub-optimal way: we always restart the walk and
go through each PTE entry via page_vma_mapped_walk(), which then skips all of
them when the PMD has been split to migration entries.

Let split_huge_pmd_locked() indicate whether it split the PMD to migration
entries, so that migration can be optimized when the huge PMD needs to be
split.
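
In try_to_migrate_one() the flow then becomes roughly the following (a
sketch; patch 1 has the actual diff):

			if (flags & TTU_SPLIT_HUGE_PMD) {
				if (split_huge_pmd_locked(vma, pvmw.address,
							  pvmw.pmd, true)) {
					/* every PTE holds a migration entry */
					page_vma_mapped_walk_done(&pvmw);
					break;
				}
				/* folio still mapped via PTEs: retry the walk */
				flags &= ~TTU_SPLIT_HUGE_PMD;
				page_vma_mapped_walk_restart(&pvmw);
				continue;
			}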

Also add a selftest to check that the bug fixed in commit 939080834fef
("mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for
shared THP") is not reintroduced.

Cc: Gavin Guo <gavinguo@igalia.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lance Yang <lance.yang@linux.dev>

Wei Yang (2):
  mm/huge_memory: return true if split_huge_pmd_locked() split PMD to
    migration entry
  mm/selftests: add split_shared_pmd()

 include/linux/huge_mm.h                       |  9 ++-
 mm/huge_memory.c                              | 21 ++++--
 mm/rmap.c                                     | 11 ++-
 .../selftests/mm/split_huge_page_test.c       | 73 ++++++++++++++++++-
 4 files changed, 99 insertions(+), 15 deletions(-)

-- 
2.34.1




* [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-04-15  1:08 [PATCH 0/2] mm/huge_memory: optimize migration when huge PMD needs split Wei Yang
@ 2026-04-15  1:08 ` Wei Yang
  2026-04-24 19:29   ` David Hildenbrand (Arm)
  2026-04-15  1:08 ` [PATCH 2/2] mm/selftests: add split_shared_pmd() Wei Yang
  1 sibling, 1 reply; 11+ messages in thread
From: Wei Yang @ 2026-04-15  1:08 UTC (permalink / raw)
  To: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, riel, vbabka, harry,
	jannh, rppt, surenb, mhocko, shuah
  Cc: linux-mm, Wei Yang, Gavin Guo

When @freeze is set to true, split_huge_pmd_locked() is intended to
split the PMD to migration entries. But if it doesn't manage to clear
PageAnonExclusive(), it just splits the PMD and leaves the folio mapped
through PTEs.

This patch lets split_huge_pmd_locked() return true to indicate that it did
split the PMD to migration entries. With this knowledge, we can return
directly in try_to_migrate_one() if it did.
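
In other words, the contract becomes roughly:

	/*
	 * Returns true only when the PMD was split and every resulting PTE
	 * now holds a migration entry.  Returns false when nothing was split
	 * or the folio is left mapped through present PTEs (e.g. because
	 * PageAnonExclusive() could not be cleared).
	 */
	bool split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
				   pmd_t *pmd, bool freeze);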

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Gavin Guo <gavinguo@igalia.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lance Yang <lance.yang@linux.dev>
---
 include/linux/huge_mm.h |  9 ++++++---
 mm/huge_memory.c        | 21 +++++++++++++--------
 mm/rmap.c               | 11 ++++++++---
 3 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2949e5acff35..6ae423b8dbc0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -561,7 +561,7 @@ static inline bool thp_migration_supported(void)
 	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
 }
 
-void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+bool split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze);
 bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 			   pmd_t *pmdp, struct folio *folio);
@@ -658,9 +658,12 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze) {}
 static inline void split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze) {}
-static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
+static inline bool split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
-					 bool freeze) {}
+					 bool freeze)
+{
+	return false;
+}
 
 static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long addr, pmd_t *pmdp,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..ec84bb4a0cc3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3087,7 +3087,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
-static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
+static bool __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long haddr, bool freeze)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -3096,7 +3096,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool soft_dirty, uffd_wp = false, young = false, write = false;
-	bool anon_exclusive = false, dirty = false;
+	bool anon_exclusive = false, dirty = false, ret = false;
 	unsigned long addr;
 	pte_t *pte;
 	int i;
@@ -3118,13 +3118,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
 		if (vma_is_special_huge(vma))
-			return;
+			return ret;
 		if (unlikely(pmd_is_migration_entry(old_pmd))) {
 			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
 
 			folio = softleaf_to_folio(old_entry);
 		} else if (is_huge_zero_pmd(old_pmd)) {
-			return;
+			return ret;
 		} else {
 			page = pmd_page(old_pmd);
 			folio = page_folio(page);
@@ -3136,7 +3136,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			folio_put(folio);
 		}
 		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
-		return;
+		return ret;
 	}
 
 	if (is_huge_zero_pmd(*pmd)) {
@@ -3149,7 +3149,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * small page also write protected so it does not seems useful
 		 * to invalidate secondary mmu at this time.
 		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+		__split_huge_zero_page_pmd(vma, haddr, pmd);
+		return ret;
 	}
 
 	if (pmd_is_migration_entry(*pmd)) {
@@ -3309,6 +3310,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 			set_pte_at(mm, addr, pte + i, entry);
 		}
+		ret = true;
 	} else if (pmd_is_device_private_entry(old_pmd)) {
 		pte_t entry;
 		swp_entry_t swp_entry;
@@ -3366,14 +3368,17 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
+	return ret;
 }
 
-void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+bool split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze)
 {
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
-		__split_huge_pmd_locked(vma, pmd, address, freeze);
+		return __split_huge_pmd_locked(vma, pmd, address, freeze);
+	else
+		return false;
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
diff --git a/mm/rmap.c b/mm/rmap.c
index 78b7fb5f367c..91fb495bebbe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2464,13 +2464,18 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*
-				 * split_huge_pmd_locked() might leave the
+				 * If split_huge_pmd_locked() does split PMD
+				 * to migration entry, we are done.
+				 * If split_huge_pmd_locked() leave the
 				 * folio mapped through PTEs. Retry the walk
 				 * so we can detect this scenario and properly
 				 * abort the walk.
 				 */
-				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, true);
+				if (split_huge_pmd_locked(vma, pvmw.address,
+						      pvmw.pmd, true)) {
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
 				flags &= ~TTU_SPLIT_HUGE_PMD;
 				page_vma_mapped_walk_restart(&pvmw);
 				continue;
-- 
2.34.1




* [PATCH 2/2] mm/selftests: add split_shared_pmd()
  2026-04-15  1:08 [PATCH 0/2] mm/huge_memory: optimize migration when huge PMD needs split Wei Yang
  2026-04-15  1:08 ` [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry Wei Yang
@ 2026-04-15  1:08 ` Wei Yang
  1 sibling, 0 replies; 11+ messages in thread
From: Wei Yang @ 2026-04-15  1:08 UTC (permalink / raw)
  To: akpm, david, ljs, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, riel, vbabka, harry,
	jannh, rppt, surenb, mhocko, shuah
  Cc: linux-mm, Wei Yang, Gavin Guo

Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
split_huge_pmd_locked()") introduced a bug that made try_to_migrate() fail
early, by returning false unconditionally after split_huge_pmd_locked(),
when the large pmd is shared by multiple processes.

This is fixed by commit 939080834fef ("mm/huge_memory: fix
early failure try_to_migrate() when split huge pmd for shared THP").

Let's add a selftest to check that this is not broken again.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Gavin Guo <gavinguo@igalia.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lance Yang <lance.yang@linux.dev>
---
 .../selftests/mm/split_huge_page_test.c       | 73 ++++++++++++++++++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index 500d07c4938b..9d1de67f9929 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -16,6 +16,7 @@
 #include <sys/mman.h>
 #include <sys/mount.h>
 #include <sys/param.h>
+#include <sys/wait.h>
 #include <malloc.h>
 #include <stdbool.h>
 #include <time.h>
@@ -332,6 +333,74 @@ static void split_pmd_zero_pages(void)
 	free(one_page);
 }
 
+static void split_shared_pmd(void)
+{
+	char *one_page;
+	int nr_pmds = 1;
+	size_t len = nr_pmds * pmd_pagesize;
+	size_t i;
+	pid_t pid;
+	int status;
+	int ret = 0, level = 0;
+
+	one_page = memalign(pmd_pagesize, len);
+	if (!one_page)
+		ksft_exit_fail_msg("Fail to allocate memory: %s\n", strerror(errno));
+
+	madvise(one_page, len, MADV_HUGEPAGE);
+
+	for (i = 0; i < len; i++)
+		one_page[i] = (char)i;
+
+	if (!check_huge_anon(one_page, nr_pmds, pmd_pagesize))
+		ksft_exit_fail_msg("No THP is allocated\n");
+
+	for (;;) {
+		pid = fork();
+
+		if (pid < 0) {
+			perror("Error: fork\n");
+			exit(KSFT_SKIP);
+		}
+
+		if (pid != 0)
+			break;
+
+		/*
+		 * Current /sys/kernel/debug/split_huge_pages interface would
+		 * call folio_split() for each page in the range. So we need
+		 * to create one more mapping of the PMD, otherwise it still splits
+		 * successfully after 512 = (pmd_pagesize / pagesize) trials.
+		 */
+		if (++level == (pmd_pagesize / pagesize)) {
+			/* split THPs */
+			write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
+				(uint64_t)one_page + len, 0);
+
+			memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+			expected_orders[0] = nr_pmds << pmd_order;
+
+			if (check_after_split_folio_orders(one_page, len, pagemap_fd,
+							   kpageflags_fd, expected_orders,
+							   (pmd_order + 1)))
+				exit(KSFT_FAIL);
+
+			exit(KSFT_PASS);
+		}
+	}
+
+	wait(&status);
+	free(one_page);
+
+	if (WIFEXITED(status))
+		ret = WEXITSTATUS(status);
+
+	if (level != 0)
+		exit(ret);
+
+	ksft_test_result_report(ret, "Split shared pmd\n");
+}
+
 static void split_pmd_thp_to_order(int order)
 {
 	char *one_page;
@@ -777,7 +846,7 @@ int main(int argc, char **argv)
 	if (!expected_orders)
 		ksft_exit_fail_msg("Fail to allocate memory: %s\n", strerror(errno));
 
-	tests = 2 + (pmd_order - 1) + (2 * pmd_order) + (pmd_order - 1) * 4 + 2;
+	tests = 3 + (pmd_order - 1) + (2 * pmd_order) + (pmd_order - 1) * 4 + 2;
 	ksft_set_plan(tests);
 
 	pagemap_fd = open(pagemap_proc, O_RDONLY);
@@ -792,6 +861,8 @@ int main(int argc, char **argv)
 
 	split_pmd_zero_pages();
 
+	split_shared_pmd();
+
 	for (i = 0; i < pmd_order; i++)
 		if (i != 1)
 			split_pmd_thp_to_order(i);
-- 
2.34.1




* Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-04-15  1:08 ` [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry Wei Yang
@ 2026-04-24 19:29   ` David Hildenbrand (Arm)
  2026-04-26  9:19     ` Wei Yang
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-24 19:29 UTC (permalink / raw)
  To: Wei Yang, akpm, ljs, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, riel, vbabka, harry,
	jannh, rppt, surenb, mhocko, shuah
  Cc: linux-mm, Gavin Guo

On 4/15/26 03:08, Wei Yang wrote:
> When @freeze is set to true, split_huge_pmd_locked() is intended to
> split the PMD to migration entry. But if it doesn't manage to clear
> PageAnonExclusive(), it just split PMD and leave the folio mapped
> through PTE.
> 
> This patch let split_huge_pmd_locked() return true to indicate it does
> split PMD to migration entry. With this knowledge, we can return
> directly in try_to_migrate_one() if it does.
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> Cc: Gavin Guo <gavinguo@igalia.com>
> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Lance Yang <lance.yang@linux.dev>
> ---

[...]

>  static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>  					 unsigned long addr, pmd_t *pmdp,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 970e077019b7..ec84bb4a0cc3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3087,7 +3087,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	pmd_populate(mm, pmd, pgtable);
>  }
>  
> -static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> +static bool __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long haddr, bool freeze)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -3096,7 +3096,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	pgtable_t pgtable;
>  	pmd_t old_pmd, _pmd;
>  	bool soft_dirty, uffd_wp = false, young = false, write = false;
> -	bool anon_exclusive = false, dirty = false;
> +	bool anon_exclusive = false, dirty = false, ret = false;
>  	unsigned long addr;
>  	pte_t *pte;
>  	int i;
> @@ -3118,13 +3118,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		if (arch_needs_pgtable_deposit())
>  			zap_deposited_table(mm, pmd);
>  		if (vma_is_special_huge(vma))
> -			return;
> +			return ret;

Why not "return false" in these cases where it really can only ever be false?

>  		if (unlikely(pmd_is_migration_entry(old_pmd))) {
>  			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
>  
>  			folio = softleaf_to_folio(old_entry);
>  		} else if (is_huge_zero_pmd(old_pmd)) {
> -			return;
> +			return ret;
>  		} else {
>  			page = pmd_page(old_pmd);
>  			folio = page_folio(page);
> @@ -3136,7 +3136,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			folio_put(folio);
>  		}
>  		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
> -		return;
> +		return ret;
>  	}
>  
>  	if (is_huge_zero_pmd(*pmd)) {
> @@ -3149,7 +3149,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		 * small page also write protected so it does not seems useful
>  		 * to invalidate secondary mmu at this time.
>  		 */
> -		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> +		__split_huge_zero_page_pmd(vma, haddr, pmd);
> +		return ret;
>  	}
>  
>  	if (pmd_is_migration_entry(*pmd)) {
> @@ -3309,6 +3310,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>  			set_pte_at(mm, addr, pte + i, entry);
>  		}
> +		ret = true;
>  	} else if (pmd_is_device_private_entry(old_pmd)) {
>  		pte_t entry;
>  		swp_entry_t swp_entry;
> @@ -3366,14 +3368,17 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  	smp_wmb(); /* make pte visible before pmd */
>  	pmd_populate(mm, pmd, pgtable);
> +	return ret;
>  }
>  
> -void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> +bool split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>  			   pmd_t *pmd, bool freeze)
>  {
>  	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>  	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
> -		__split_huge_pmd_locked(vma, pmd, address, freeze);
> +		return __split_huge_pmd_locked(vma, pmd, address, freeze);
> +	else
> +		return false;

No need for the "else".

>  }
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 78b7fb5f367c..91fb495bebbe 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2464,13 +2464,18 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  
>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>  				/*
> -				 * split_huge_pmd_locked() might leave the
> +				 * If split_huge_pmd_locked() does split PMD
> +				 * to migration entry, we are done.
> +				 * If split_huge_pmd_locked() leave the
>  				 * folio mapped through PTEs. Retry the walk
>  				 * so we can detect this scenario and properly
>  				 * abort the walk.

Couldn't we just abort right away, based on the return value?

-- 
Cheers,

David



* Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-04-24 19:29   ` David Hildenbrand (Arm)
@ 2026-04-26  9:19     ` Wei Yang
  2026-04-28  8:24       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 11+ messages in thread
From: Wei Yang @ 2026-04-26  9:19 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Wei Yang, akpm, ljs, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, riel, vbabka, harry,
	jannh, rppt, surenb, mhocko, shuah, linux-mm, Gavin Guo

On Fri, Apr 24, 2026 at 09:29:18PM +0200, David Hildenbrand (Arm) wrote:
>On 4/15/26 03:08, Wei Yang wrote:
>> When @freeze is set to true, split_huge_pmd_locked() is intended to
>> split the PMD to migration entry. But if it doesn't manage to clear
>> PageAnonExclusive(), it just split PMD and leave the folio mapped
>> through PTE.
>> 
>> This patch let split_huge_pmd_locked() return true to indicate it does
>> split PMD to migration entry. With this knowledge, we can return
>> directly in try_to_migrate_one() if it does.
>> 
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> Cc: Gavin Guo <gavinguo@igalia.com>
>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Lance Yang <lance.yang@linux.dev>
>> ---
>
>[...]
>
>>  static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>>  					 unsigned long addr, pmd_t *pmdp,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 970e077019b7..ec84bb4a0cc3 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3087,7 +3087,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>  	pmd_populate(mm, pmd, pgtable);
>>  }
>>  
>> -static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> +static bool __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  		unsigned long haddr, bool freeze)
>>  {
>>  	struct mm_struct *mm = vma->vm_mm;
>> @@ -3096,7 +3096,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  	pgtable_t pgtable;
>>  	pmd_t old_pmd, _pmd;
>>  	bool soft_dirty, uffd_wp = false, young = false, write = false;
>> -	bool anon_exclusive = false, dirty = false;
>> +	bool anon_exclusive = false, dirty = false, ret = false;
>>  	unsigned long addr;
>>  	pte_t *pte;
>>  	int i;
>> @@ -3118,13 +3118,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  		if (arch_needs_pgtable_deposit())
>>  			zap_deposited_table(mm, pmd);
>>  		if (vma_is_special_huge(vma))
>> -			return;
>> +			return ret;
>
>Why not "return false" in these cases where it really can always only false?
>

Will adjust related places.

>>  		if (unlikely(pmd_is_migration_entry(old_pmd))) {
>>  			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
>>  
>>  			folio = softleaf_to_folio(old_entry);
>>  		} else if (is_huge_zero_pmd(old_pmd)) {
>> -			return;
>> +			return ret;
>>  		} else {
>>  			page = pmd_page(old_pmd);
>>  			folio = page_folio(page);
>> @@ -3136,7 +3136,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  			folio_put(folio);
>>  		}
>>  		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
>> -		return;
>> +		return ret;
>>  	}
>>  
>>  	if (is_huge_zero_pmd(*pmd)) {
>> @@ -3149,7 +3149,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  		 * small page also write protected so it does not seems useful
>>  		 * to invalidate secondary mmu at this time.
>>  		 */
>> -		return __split_huge_zero_page_pmd(vma, haddr, pmd);
>> +		__split_huge_zero_page_pmd(vma, haddr, pmd);
>> +		return ret;
>>  	}
>>  
>>  	if (pmd_is_migration_entry(*pmd)) {
>> @@ -3309,6 +3310,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>>  			set_pte_at(mm, addr, pte + i, entry);
>>  		}
>> +		ret = true;
>>  	} else if (pmd_is_device_private_entry(old_pmd)) {
>>  		pte_t entry;
>>  		swp_entry_t swp_entry;
>> @@ -3366,14 +3368,17 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  
>>  	smp_wmb(); /* make pte visible before pmd */
>>  	pmd_populate(mm, pmd, pgtable);
>> +	return ret;
>>  }
>>  
>> -void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> +bool split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>  			   pmd_t *pmd, bool freeze)
>>  {
>>  	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>>  	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
>> -		__split_huge_pmd_locked(vma, pmd, address, freeze);
>> +		return __split_huge_pmd_locked(vma, pmd, address, freeze);
>> +	else
>> +		return false;
>
>No need for the "else".
>

Got it.

>>  }
>>  
>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 78b7fb5f367c..91fb495bebbe 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2464,13 +2464,18 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>  
>>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>>  				/*
>> -				 * split_huge_pmd_locked() might leave the
>> +				 * If split_huge_pmd_locked() does split PMD
>> +				 * to migration entry, we are done.
>> +				 * If split_huge_pmd_locked() leave the
>>  				 * folio mapped through PTEs. Retry the walk
>>  				 * so we can detect this scenario and properly
>>  				 * abort the walk.
>
>Couldn't we just abort right away, based on the return value?
>

Here is my understanding.

We get here when page_vma_mapped_walk() touches a pmd entry, with three cases:

  * pmd_trans_huge()
  * pmd_is_migration_entry()
  * pmd_is_device_private_entry()

For the first two cases, we grab pmd_lock() and then check that the condition
is still valid before returning. But for case 3, after grabbing pmd_lock(), it
returns directly.

This may give another thread a chance to split the device-private PMD and
leave the folio pte mapped, IIUC. In that case, we should restart the walk
here.
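
For reference, the device-private branch currently takes the pmd lock and
returns right away, without re-checking the entry (trimmed):

			if (softleaf_is_device_private(entry)) {
				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
				return true;
			}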

Not sure I got it correctly.

-- 
Wei Yang
Help you, Help me



* Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-04-26  9:19     ` Wei Yang
@ 2026-04-28  8:24       ` David Hildenbrand (Arm)
  2026-04-29  2:49         ` Wei Yang
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-28  8:24 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, ljs, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, riel, vbabka, harry, jannh, rppt,
	surenb, mhocko, shuah, linux-mm, Gavin Guo

On 4/26/26 11:19, Wei Yang wrote:
> On Fri, Apr 24, 2026 at 09:29:18PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/15/26 03:08, Wei Yang wrote:
>>> When @freeze is set to true, split_huge_pmd_locked() is intended to
>>> split the PMD to migration entry. But if it doesn't manage to clear
>>> PageAnonExclusive(), it just split PMD and leave the folio mapped
>>> through PTE.
>>>
>>> This patch let split_huge_pmd_locked() return true to indicate it does
>>> split PMD to migration entry. With this knowledge, we can return
>>> directly in try_to_migrate_one() if it does.
>>>
>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>> Cc: Gavin Guo <gavinguo@igalia.com>
>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Lance Yang <lance.yang@linux.dev>
>>> ---
>>
>> [...]
>>
>>>  static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>>>  					 unsigned long addr, pmd_t *pmdp,
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 970e077019b7..ec84bb4a0cc3 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3087,7 +3087,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>  	pmd_populate(mm, pmd, pgtable);
>>>  }
>>>  
>>> -static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> +static bool __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  		unsigned long haddr, bool freeze)
>>>  {
>>>  	struct mm_struct *mm = vma->vm_mm;
>>> @@ -3096,7 +3096,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  	pgtable_t pgtable;
>>>  	pmd_t old_pmd, _pmd;
>>>  	bool soft_dirty, uffd_wp = false, young = false, write = false;
>>> -	bool anon_exclusive = false, dirty = false;
>>> +	bool anon_exclusive = false, dirty = false, ret = false;
>>>  	unsigned long addr;
>>>  	pte_t *pte;
>>>  	int i;
>>> @@ -3118,13 +3118,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  		if (arch_needs_pgtable_deposit())
>>>  			zap_deposited_table(mm, pmd);
>>>  		if (vma_is_special_huge(vma))
>>> -			return;
>>> +			return ret;
>>
>> Why not "return false" in these cases where it really can always only false?
>>
> 
> Will adjust related places.
> 
>>>  		if (unlikely(pmd_is_migration_entry(old_pmd))) {
>>>  			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
>>>  
>>>  			folio = softleaf_to_folio(old_entry);
>>>  		} else if (is_huge_zero_pmd(old_pmd)) {
>>> -			return;
>>> +			return ret;
>>>  		} else {
>>>  			page = pmd_page(old_pmd);
>>>  			folio = page_folio(page);
>>> @@ -3136,7 +3136,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  			folio_put(folio);
>>>  		}
>>>  		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
>>> -		return;
>>> +		return ret;
>>>  	}
>>>  
>>>  	if (is_huge_zero_pmd(*pmd)) {
>>> @@ -3149,7 +3149,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  		 * small page also write protected so it does not seems useful
>>>  		 * to invalidate secondary mmu at this time.
>>>  		 */
>>> -		return __split_huge_zero_page_pmd(vma, haddr, pmd);
>>> +		__split_huge_zero_page_pmd(vma, haddr, pmd);
>>> +		return ret;
>>>  	}
>>>  
>>>  	if (pmd_is_migration_entry(*pmd)) {
>>> @@ -3309,6 +3310,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>>>  			set_pte_at(mm, addr, pte + i, entry);
>>>  		}
>>> +		ret = true;
>>>  	} else if (pmd_is_device_private_entry(old_pmd)) {
>>>  		pte_t entry;
>>>  		swp_entry_t swp_entry;
>>> @@ -3366,14 +3368,17 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  
>>>  	smp_wmb(); /* make pte visible before pmd */
>>>  	pmd_populate(mm, pmd, pgtable);
>>> +	return ret;
>>>  }
>>>  
>>> -void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>> +bool split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>>  			   pmd_t *pmd, bool freeze)
>>>  {
>>>  	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>>>  	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
>>> -		__split_huge_pmd_locked(vma, pmd, address, freeze);
>>> +		return __split_huge_pmd_locked(vma, pmd, address, freeze);
>>> +	else
>>> +		return false;
>>
>> No need for the "else".
>>
> 
> Got it.
> 
>>>  }
>>>  
>>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 78b7fb5f367c..91fb495bebbe 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2464,13 +2464,18 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>  
>>>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>  				/*
>>> -				 * split_huge_pmd_locked() might leave the
>>> +				 * If split_huge_pmd_locked() does split PMD
>>> +				 * to migration entry, we are done.
>>> +				 * If split_huge_pmd_locked() leave the
>>>  				 * folio mapped through PTEs. Retry the walk
>>>  				 * so we can detect this scenario and properly
>>>  				 * abort the walk.
>>
>> Couldn't we just abort right away, based on the return value?
>>
> 
> Here is my understanding.
> 
> We get here when page_vma_mapped_walk() touch a pmd entry, with three cases:
> 
>   * pmd_trans_huge()
>   * pmd_is_migration_entry()
>   * pmd_is_device_private_entry()
> 
> For the first two cases, we grab pmd_lock() and then check the condition is
> still valid before return. But for case 3, after grab pmd_lock(), it return
> directly.
> 
> This may give chance for another thread to split pmd_is_device_private_entry()
> to pte mapped, IIUC. For this case, we should restart the walk here.


So what you are saying is that we should re-validate in page_vma_mapped_walk()
that we indeed still have a device-private entry after grabbing the lock?

That's what we do in map_pte() through the pmd_same() check.

Likely we should apply the same model here!


-- 
Cheers,

David



* Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-04-28  8:24       ` David Hildenbrand (Arm)
@ 2026-04-29  2:49         ` Wei Yang
  2026-04-29  6:55           ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 11+ messages in thread
From: Wei Yang @ 2026-04-29  2:49 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Wei Yang, akpm, ljs, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, riel, vbabka, harry,
	jannh, rppt, surenb, mhocko, shuah, linux-mm, Gavin Guo

On Tue, Apr 28, 2026 at 10:24:42AM +0200, David Hildenbrand (Arm) wrote:
>On 4/26/26 11:19, Wei Yang wrote:
>> On Fri, Apr 24, 2026 at 09:29:18PM +0200, David Hildenbrand (Arm) wrote:
>>> On 4/15/26 03:08, Wei Yang wrote:
>>>> When @freeze is set to true, split_huge_pmd_locked() is intended to
>>>> split the PMD to migration entry. But if it doesn't manage to clear
>>>> PageAnonExclusive(), it just split PMD and leave the folio mapped
>>>> through PTE.
>>>>
>>>> This patch let split_huge_pmd_locked() return true to indicate it does
>>>> split PMD to migration entry. With this knowledge, we can return
>>>> directly in try_to_migrate_one() if it does.
>>>>
>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>>> Cc: Gavin Guo <gavinguo@igalia.com>
>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>
>>> [...]
>>>
>>>>  static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>>>>  					 unsigned long addr, pmd_t *pmdp,
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 970e077019b7..ec84bb4a0cc3 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3087,7 +3087,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>  }
>>>>  
>>>> -static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>> +static bool __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  		unsigned long haddr, bool freeze)
>>>>  {
>>>>  	struct mm_struct *mm = vma->vm_mm;
>>>> @@ -3096,7 +3096,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  	pgtable_t pgtable;
>>>>  	pmd_t old_pmd, _pmd;
>>>>  	bool soft_dirty, uffd_wp = false, young = false, write = false;
>>>> -	bool anon_exclusive = false, dirty = false;
>>>> +	bool anon_exclusive = false, dirty = false, ret = false;
>>>>  	unsigned long addr;
>>>>  	pte_t *pte;
>>>>  	int i;
>>>> @@ -3118,13 +3118,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  		if (arch_needs_pgtable_deposit())
>>>>  			zap_deposited_table(mm, pmd);
>>>>  		if (vma_is_special_huge(vma))
>>>> -			return;
>>>> +			return ret;
>>>
>>> Why not "return false" in these cases where it really can always only false?
>>>
>> 
>> Will adjust related places.
>> 
>>>>  		if (unlikely(pmd_is_migration_entry(old_pmd))) {
>>>>  			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
>>>>  
>>>>  			folio = softleaf_to_folio(old_entry);
>>>>  		} else if (is_huge_zero_pmd(old_pmd)) {
>>>> -			return;
>>>> +			return ret;
>>>>  		} else {
>>>>  			page = pmd_page(old_pmd);
>>>>  			folio = page_folio(page);
>>>> @@ -3136,7 +3136,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  			folio_put(folio);
>>>>  		}
>>>>  		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
>>>> -		return;
>>>> +		return ret;
>>>>  	}
>>>>  
>>>>  	if (is_huge_zero_pmd(*pmd)) {
>>>> @@ -3149,7 +3149,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  		 * small page also write protected so it does not seems useful
>>>>  		 * to invalidate secondary mmu at this time.
>>>>  		 */
>>>> -		return __split_huge_zero_page_pmd(vma, haddr, pmd);
>>>> +		__split_huge_zero_page_pmd(vma, haddr, pmd);
>>>> +		return ret;
>>>>  	}
>>>>  
>>>>  	if (pmd_is_migration_entry(*pmd)) {
>>>> @@ -3309,6 +3310,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>>>>  			set_pte_at(mm, addr, pte + i, entry);
>>>>  		}
>>>> +		ret = true;
>>>>  	} else if (pmd_is_device_private_entry(old_pmd)) {
>>>>  		pte_t entry;
>>>>  		swp_entry_t swp_entry;
>>>> @@ -3366,14 +3368,17 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  
>>>>  	smp_wmb(); /* make pte visible before pmd */
>>>>  	pmd_populate(mm, pmd, pgtable);
>>>> +	return ret;
>>>>  }
>>>>  
>>>> -void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>>> +bool split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>>>  			   pmd_t *pmd, bool freeze)
>>>>  {
>>>>  	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>>>>  	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
>>>> -		__split_huge_pmd_locked(vma, pmd, address, freeze);
>>>> +		return __split_huge_pmd_locked(vma, pmd, address, freeze);
>>>> +	else
>>>> +		return false;
>>>
>>> No need for the "else".
>>>
>> 
>> Got it.
>> 
>>>>  }
>>>>  
>>>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 78b7fb5f367c..91fb495bebbe 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -2464,13 +2464,18 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>  
>>>>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>  				/*
>>>> -				 * split_huge_pmd_locked() might leave the
>>>> +				 * If split_huge_pmd_locked() does split PMD
>>>> +				 * to migration entry, we are done.
>>>> +				 * If split_huge_pmd_locked() leave the
>>>>  				 * folio mapped through PTEs. Retry the walk
>>>>  				 * so we can detect this scenario and properly
>>>>  				 * abort the walk.
>>>
>>> Couldn't we just abort right away, based on the return value?
>>>
>> 
>> Here is my understanding.
>> 
>> We get here when page_vma_mapped_walk() touch a pmd entry, with three cases:
>> 
>>   * pmd_trans_huge()
>>   * pmd_is_migration_entry()
>>   * pmd_is_device_private_entry()
>> 
>> For the first two cases, we grab pmd_lock() and then check the condition is
>> still valid before return. But for case 3, after grab pmd_lock(), it return
>> directly.
>> 
>> This may give chance for another thread to split pmd_is_device_private_entry()
>> to pte mapped, IIUC. For this case, we should restart the walk here.
>
>
>So what you are saying is that we should re-validate in page_vma_mapped_walk()
>that we indeed still have a device-private entry after grabbing the lock?
>
>That's what we do in map_pte() through pmd_same() check.
>
>Likely we should apply the same model here!
>

Below is my proposed change:

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a4d52fdb3056..6e915d35ae54 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -273,17 +273,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 
 			if (softleaf_is_device_private(entry)) {
 				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-				return true;
+				if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))
+					return true;
+				/* THP pmd was split under us: handle on pte level */
+				spin_unlock(pvmw->ptl);
+				pvmw->ptl = NULL;
+			} else {
+				if ((pvmw->flags & PVMW_SYNC) &&
+				    thp_vma_suitable_order(vma, pvmw->address,
+							   PMD_ORDER) &&
+				    (pvmw->nr_pages >= HPAGE_PMD_NR))
+					sync_with_folio_pmd_zap(mm, pvmw->pmd);
+
+				step_forward(pvmw, PMD_SIZE);
+				continue;
 			}
-
-			if ((pvmw->flags & PVMW_SYNC) &&
-			    thp_vma_suitable_order(vma, pvmw->address,
-						   PMD_ORDER) &&
-			    (pvmw->nr_pages >= HPAGE_PMD_NR))
-				sync_with_folio_pmd_zap(mm, pvmw->pmd);
-
-			step_forward(pvmw, PMD_SIZE);
-			continue;
 		}
 		if (!map_pte(pvmw, &pmde, &ptl)) {
 			if (!pvmw->pte)

After this, we could simplify the logic in try_to_migrate_one() as:

@@ -2471,14 +2471,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
                                 * so we can detect this scenario and properly
                                 * abort the walk.
                                 */
-                               if (split_huge_pmd_locked(vma, pvmw.address,
-                                                     pvmw.pmd, true)) {
-                                       page_vma_mapped_walk_done(&pvmw);
-                                       break;
-                               }
-                               flags &= ~TTU_SPLIT_HUGE_PMD;
-                               page_vma_mapped_walk_restart(&pvmw);
-                               continue;
+                               ret = split_huge_pmd_locked(vma, pvmw.address,
+                                                     pvmw.pmd, true);
+                               page_vma_mapped_walk_done(&pvmw);
+                               break;
                        }
-- 
Wei Yang
Help you, Help me



* Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-04-29  2:49         ` Wei Yang
@ 2026-04-29  6:55           ` David Hildenbrand (Arm)
  2026-05-03  0:38             ` Wei Yang
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-29  6:55 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, ljs, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, riel, vbabka, harry, jannh, rppt,
	surenb, mhocko, shuah, linux-mm, Gavin Guo

On 4/29/26 04:49, Wei Yang wrote:
> On Tue, Apr 28, 2026 at 10:24:42AM +0200, David Hildenbrand (Arm) wrote:
>> On 4/26/26 11:19, Wei Yang wrote:
>>>
>>> Will adjust related places.
>>>
>>>
>>> Got it.
>>>
>>>
>>> Here is my understanding.
>>>
>>> We get here when page_vma_mapped_walk() touch a pmd entry, with three cases:
>>>
>>>   * pmd_trans_huge()
>>>   * pmd_is_migration_entry()
>>>   * pmd_is_device_private_entry()
>>>
>>> For the first two cases, we grab pmd_lock() and then check the condition is
>>> still valid before return. But for case 3, after grab pmd_lock(), it return
>>> directly.
>>>
>>> This may give chance for another thread to split pmd_is_device_private_entry()
>>> to pte mapped, IIUC. For this case, we should restart the walk here.
>>
>>
>> So what you are saying is that we should re-validate in page_vma_mapped_walk()
>> that we indeed still have a device-private entry after grabbing the lock?
>>
>> That's what we do in map_pte() through pmd_same() check.
>>
>> Likely we should apply the same model here!
>>
> 
> Below is my proposed change:
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index a4d52fdb3056..6e915d35ae54 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -273,17 +273,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  
>  			if (softleaf_is_device_private(entry)) {
>  				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> -				return true;
> +				if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))
> +					return true;

As we have a softleaf entry, I assume we wouldn't expect to get any other bits
(access/dirty) set until we grab the lock. Verifying
softleaf_is_device_private() again would be cleaner, though.

But really, I do wonder if we should just have a "goto retry" back to the "pmde
= pmdp_get_lockless(pvmw->pmd);" instead?


And now I wonder why we don't have check_pmd() handling in there? :/

Should we check for the pfn here?

> +				/* THP pmd was split under us: handle on pte level */
> +				spin_unlock(pvmw->ptl);
> +				pvmw->ptl = NULL;



> +			} else {
> +				if ((pvmw->flags & PVMW_SYNC) &&
> +				    thp_vma_suitable_order(vma, pvmw->address,
> +							   PMD_ORDER) &&
> +				    (pvmw->nr_pages >= HPAGE_PMD_NR))
> +					sync_with_folio_pmd_zap(mm, pvmw->pmd);
> +
> +				step_forward(pvmw, PMD_SIZE);
> +				continue;
>  			}
> -
> -			if ((pvmw->flags & PVMW_SYNC) &&
> -			    thp_vma_suitable_order(vma, pvmw->address,
> -						   PMD_ORDER) &&
> -			    (pvmw->nr_pages >= HPAGE_PMD_NR))
> -				sync_with_folio_pmd_zap(mm, pvmw->pmd);
> -
> -			step_forward(pvmw, PMD_SIZE);
> -			continue;
>  		}
>  		if (!map_pte(pvmw, &pmde, &ptl)) {
>  			if (!pvmw->pte)
> 
> After this, we could simplify the logic in try_to_migrate_one() as:
> 
> @@ -2471,14 +2471,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>                                  * so we can detect this scenario and properly
>                                  * abort the walk.
>                                  */
> -                               if (split_huge_pmd_locked(vma, pvmw.address,
> -                                                     pvmw.pmd, true)) {
> -                                       page_vma_mapped_walk_done(&pvmw);
> -                                       break;
> -                               }
> -                               flags &= ~TTU_SPLIT_HUGE_PMD;
> -                               page_vma_mapped_walk_restart(&pvmw);
> -                               continue;
> +                               ret = split_huge_pmd_locked(vma, pvmw.address,
> +                                                     pvmw.pmd, true);
> +                               page_vma_mapped_walk_done(&pvmw);
> +                               break;
>                         }

Right. But just to be clear: Let's split the page_vma_mapped_walk() validation
(which looks like a bugfix to me) from the other optimization.

-- 
Cheers,

David



* Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-04-29  6:55           ` David Hildenbrand (Arm)
@ 2026-05-03  0:38             ` Wei Yang
  2026-05-04 12:44               ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 11+ messages in thread
From: Wei Yang @ 2026-05-03  0:38 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Wei Yang, akpm, ljs, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, riel, vbabka, harry,
	jannh, rppt, surenb, mhocko, shuah, linux-mm, Gavin Guo

On Wed, Apr 29, 2026 at 08:55:07AM +0200, David Hildenbrand (Arm) wrote:
>On 4/29/26 04:49, Wei Yang wrote:
>> On Tue, Apr 28, 2026 at 10:24:42AM +0200, David Hildenbrand (Arm) wrote:
>>> On 4/26/26 11:19, Wei Yang wrote:
>>>>
>>>> Will adjust related places.
>>>>
>>>>
>>>> Got it.
>>>>
>>>>
>>>> Here is my understanding.
>>>>
>>>> We get here when page_vma_mapped_walk() touch a pmd entry, with three cases:
>>>>
>>>>   * pmd_trans_huge()
>>>>   * pmd_is_migration_entry()
>>>>   * pmd_is_device_private_entry()
>>>>
>>>> For the first two cases, we grab pmd_lock() and then check the condition is
>>>> still valid before return. But for case 3, after grab pmd_lock(), it return
>>>> directly.
>>>>
>>>> This may give chance for another thread to split pmd_is_device_private_entry()
>>>> to pte mapped, IIUC. For this case, we should restart the walk here.
>>>
>>>
>>> So what you are saying is that we should re-validate in page_vma_mapped_walk()
>>> that we indeed still have a device-private entry after grabbing the lock?
>>>
>>> That's what we do in map_pte() through pmd_same() check.
>>>
>>> Likely we should apply the same model here!
>>>
>> 
>> Below is my proposed change:
>> 
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index a4d52fdb3056..6e915d35ae54 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -273,17 +273,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>  
>>  			if (softleaf_is_device_private(entry)) {
>>  				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>> -				return true;
>> +				if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))
>> +					return true;
>
>As we have a softleaf entry, I assume we wouldn't expect to get any other bits
>(access/dirty) set until we grab the lock. Verifying
>softleaf_is_device_private() again would be better cleaner, though.
>

Got it.

>But really, I do wonder if we should just have a "goto retry" back to the "pmde
>= pmdp_get_lockless(pvmw->pmd);" instead?
>

Sounds reasonable. See below.

>
>And now I wonder why we don't have a check_pmd() handling in there? :/
>
>Should we check for the pfn here?

Thanks for pointing that out. I think you are right.

After re-reading the code, more questions came to my mind. I am afraid we need
more cleanup in page_vma_mapped_walk().

Below are my findings based on my current understanding:

1. thp_migration_supported() seems unnecessary

   Reaching this code means pmd_is_migration_entry() returned true, which
   means CONFIG_ARCH_ENABLE_THP_MIGRATION is set; otherwise
   softleaf_from_pmd() would return softleaf_mk_none(), which is not a
   migration softleaf.

   CONFIG_ARCH_ENABLE_THP_MIGRATION being set in turn means
   CONFIG_TRANSPARENT_HUGEPAGE is set, so thp_migration_supported() must
   return true.
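
   For reference, the helper is simply (as also visible in the patch 1
   context above):

	static inline bool thp_migration_supported(void)
	{
		return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
	}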

2. if the migration entry changes under us, we may need to handle it at pte level

   In the pmd_is_migration_entry() -> !pmd_present() branch, we have:

		if (!softleaf_is_migration(entry) ||
		    !check_pmd(softleaf_to_pfn(entry), pvmw))
			return not_found(pvmw);
		return true;

   But I think we need to do this:

		if (softleaf_is_migration(entry)) {
			if (!check_pmd(softleaf_to_pfn(entry), pvmw))
				return not_found(pvmw);
			return true;
		}

   Per my understanding, if the migration entry changes under us, we need to
   handle it at the pte level, just like in the pmd_trans_huge() case. Breaking
   the loop and returning false does not seem consistent.

3. add a proper check for the device private entry

   For a device private entry, currently we just grab the lock and return.
   Following the handling of pmd_trans_huge() and pmd_is_migration_entry(),
   we should:

     * re-validate that it is still a device private entry after pmd_lock()
     * check PVMW_MIGRATION
     * check_pmd()

4. consolidate pmd entry handling

   Per my understanding, there are 4 cases for pmd entry handling:

     * pmd_trans_huge()
     * pmd_is_migration_entry()
     * pmd_is_device_private_entry()
     * !pmd_present()

   Now we handle them with a mix of state checks, which complicates the logic.
   And the first three share similar logic (if my analysis above is correct):

     * grab pmd_lock()
     * re-validate pmde
     * check PVMW_MIGRATION
     * check_pmd

   Here I would like to take a bolder step: consolidate the handling for
   these three cases.

   Below is what it would look like.

	pmde = pmdp_get_lockless(pvmw->pmd);

	if (pmd_trans_huge(pmde) || pmd_is_valid_softleaf(pmde)) {
		unsigned long pfn;
		bool is_migration;
		bool for_migration;

		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd))) {
			is_migration = pmd_is_migration_entry(pmde);
			for_migration = !!(pvmw->flags & PVMW_MIGRATION);

			if (is_migration != for_migration)
				return not_found(pvmw);

			if (pmd_trans_huge(pmde))
				pfn = pmd_pfn(pmde);
			else
				pfn = softleaf_to_pfn(softleaf_from_pmd(pmde));

			if (!check_pmd(pfn, pvmw))
				return not_found(pvmw);

			return true;
		}
		/* THP pmd was split under us: handle on pte level */
		spin_unlock(pvmw->ptl);
		pvmw->ptl = NULL;
	} else if (!pmd_present(pmde)) {
		if ((pvmw->flags & PVMW_SYNC) &&
		    thp_vma_suitable_order(vma, pvmw->address,
					   PMD_ORDER) &&
		    (pvmw->nr_pages >= HPAGE_PMD_NR))
			sync_with_folio_pmd_zap(mm, pvmw->pmd);

		step_forward(pvmw, PMD_SIZE);
		continue;
	}

5. use "goto retry"

   As you mentioned above, instead of "handle on pte level", go back to the
   pmdp_get_lockless() read and retry. This looks more reasonable to me.
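
   A rough sketch of that variant (only to illustrate the idea; the label
   placement and the re-check under the lock are assumptions, untested):

	retry:
		pmde = pmdp_get_lockless(pvmw->pmd);
		...
		if (softleaf_is_device_private(entry)) {
			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
			if (softleaf_is_device_private(
					softleaf_from_pmd(pmdp_get_lockless(pvmw->pmd))))
				return true;
			/* entry changed under us: drop the lock and re-read the pmd */
			spin_unlock(pvmw->ptl);
			pvmw->ptl = NULL;
			goto retry;
		}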
>
>> +				/* THP pmd was split under us: handle on pte level */
>> +				spin_unlock(pvmw->ptl);
>> +				pvmw->ptl = NULL;
>
>
>
>> +			} else {
>> +				if ((pvmw->flags & PVMW_SYNC) &&
>> +				    thp_vma_suitable_order(vma, pvmw->address,
>> +							   PMD_ORDER) &&
>> +				    (pvmw->nr_pages >= HPAGE_PMD_NR))
>> +					sync_with_folio_pmd_zap(mm, pvmw->pmd);
>> +
>> +				step_forward(pvmw, PMD_SIZE);
>> +				continue;
>>  			}
>> -
>> -			if ((pvmw->flags & PVMW_SYNC) &&
>> -			    thp_vma_suitable_order(vma, pvmw->address,
>> -						   PMD_ORDER) &&
>> -			    (pvmw->nr_pages >= HPAGE_PMD_NR))
>> -				sync_with_folio_pmd_zap(mm, pvmw->pmd);
>> -
>> -			step_forward(pvmw, PMD_SIZE);
>> -			continue;
>>  		}
>>  		if (!map_pte(pvmw, &pmde, &ptl)) {
>>  			if (!pvmw->pte)
>> 
>> After this, we could simplify the logic in try_to_migrate_one() as:
>> 
>> @@ -2471,14 +2471,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>                                  * so we can detect this scenario and properly
>>                                  * abort the walk.
>>                                  */
>> -                               if (split_huge_pmd_locked(vma, pvmw.address,
>> -                                                     pvmw.pmd, true)) {
>> -                                       page_vma_mapped_walk_done(&pvmw);
>> -                                       break;
>> -                               }
>> -                               flags &= ~TTU_SPLIT_HUGE_PMD;
>> -                               page_vma_mapped_walk_restart(&pvmw);
>> -                               continue;
>> +                               ret = split_huge_pmd_locked(vma, pvmw.address,
>> +                                                     pvmw.pmd, true);
>> +                               page_vma_mapped_walk_done(&pvmw);
>> +                               break;
>>                         }
>
>Right. But just to be clear: Let's split the page_vma_mapped_walk() validation
>(which looks like a bugfix to me) from the other optimization.
>

Sure, maybe we can split the page_vma_mapped_walk() cleanup out into another
patch set for easier review?

>-- 
>Cheers,
>
>David

-- 
Wei Yang
Help you, Help me



* Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-05-03  0:38             ` Wei Yang
@ 2026-05-04 12:44               ` David Hildenbrand (Arm)
  2026-05-05  3:15                 ` Wei Yang
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-04 12:44 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, ljs, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, riel, vbabka, harry, jannh, rppt,
	surenb, mhocko, shuah, linux-mm, Gavin Guo

On 5/3/26 02:38, Wei Yang wrote:
> On Wed, Apr 29, 2026 at 08:55:07AM +0200, David Hildenbrand (Arm) wrote:
>> On 4/29/26 04:49, Wei Yang wrote:
>>>
>>> Below is my proposed change:
>>>
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index a4d52fdb3056..6e915d35ae54 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -273,17 +273,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>  
>>>  			if (softleaf_is_device_private(entry)) {
>>>  				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>> -				return true;
>>> +				if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))
>>> +					return true;
>>
>> As we have a softleaf entry, I assume we wouldn't expect to get any other bits
>> (access/dirty) set until we grab the lock. Verifying
>> softleaf_is_device_private() again would be better cleaner, though.
>>
> 
> Got it.
> 
>> But really, I do wonder if we should just have a "goto retry" back to the "pmde
>> = pmdp_get_lockless(pvmw->pmd);" instead?
>>
> 
> Sounds reasonable. See below.
> 
>>
>> And now I wonder why we don't have a check_pmd() handling in there? :/
>>
>> Should we check for the pfn here?
> 
> Thanks for pointing out. I think you are right.
> 
> After re-read the code, more questions come up my mind. I am afraid we need
> more cleanup for page_vma_mapped_walk().
> 
> Below is my finding based on current understanding:
> 
> 1. thp_migration_supported() seems not necessary
> 
>    The code reaches here means pmd_is_migration_entry() return true, which
>    means CONFIG_ARCH_ENABLE_THP_MIGRATION is set, otherwise
>    softleaf_from_pmd() should return softleaf_mk_none() which is not a
>    migration softleaf.
>    
>    CONFIG_ARCH_ENABLE_THP_MIGRATION is set in turn means
>    CONFIG_TRANSPARENT_HUGEPAGE is set, so thp_migration_supported() must
>    returns true.
> 
> 2. if migration entry change under us, we may need to handle on pte level
> 
>    In pmd_is_migration_entry() -> !pmd_present() branch, we have:
> 
> 		if (!softleaf_is_migration(entry) ||
> 		    !check_pmd(softleaf_to_pfn(entry), pvmw))
> 			return not_found(pvmw);
> 		return true;
> 
>    But I think we need do this:
> 
> 		if (softleaf_is_migration(entry)) {
> 			if (check_pmd(softleaf_to_pfn(entry), pvmw))
> 				return not_found(pvmw);
> 			return true;
> 		}
> 
>    Per my understanding, if the migration entry changes under us, we need to
>    handle it on the pte level, just like in the pmd_trans_huge() case.
>    Breaking the loop and returning false seems inconsistent.
> 
> 3. add proper checks for the device private entry
> 
>    For a device private entry, currently we just grab the lock and return.
>    But following the handling of pmd_trans_huge() and pmd_is_migration_entry(),
>    we should:
> 
>      * re-validate it is still a device private entry after pmd_lock()
>      * check PVMW_MIGRATION
>      * check_pmd()
> 
> 4. consolidate pmd entry handling
> 
>    Per my understanding, there are 4 cases for pmd entry handling:
> 
>      * pmd_trans_huge()
>      * pmd_is_migration_entry()
>      * pmd_is_device_private_entry()
>      * !pmd_present()
> 
>    Now we handle them with mixed state checks, which complicates the logic.
>    And the first three share similar logic (if my analysis above is correct):
> 
>      * grab pmd_lock()
>      * re-validate pmde
>      * check PVMW_MIGRATION
>      * check_pmd
> 
>    Here I would like to take a bolder step: consolidate the handling of these
>    three cases.
> 
>    Below is what it would look like.
> 
> 	pmde = pmdp_get_lockless(pvmw->pmd);
> 
> 	if (pmd_trans_huge(pmde) || pmd_is_valid_softleaf(pmde)) {
> 		unsigned long pfn;
> 		bool is_migration;
> 		bool for_migration;
> 
> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> 		if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd))) {
> 			is_migration = pmd_is_migration_entry(pmde);
> 			for_migration = !!(pvmw->flags & PVMW_MIGRATION);
> 
> 			if (is_migration != for_migration)
> 				return not_found(pvmw);
> 
> 			if (pmd_trans_huge(pmde))
> 				pfn = pmd_pfn(pmde);
> 			else
> 				pfn = softleaf_to_pfn(softleaf_from_pmd(pmde));
> 
> 			if (!check_pmd(pfn, pvmw))
> 				return not_found(pvmw);
> 
> 			return true;
> 		}
> 		/* THP pmd was split under us: handle on pte level */
> 		spin_unlock(pvmw->ptl);
> 		pvmw->ptl = NULL;
> 	} else if (!pmd_present(pmde)) {
> 		if ((pvmw->flags & PVMW_SYNC) &&
> 		    thp_vma_suitable_order(vma, pvmw->address,
> 					   PMD_ORDER) &&
> 		    (pvmw->nr_pages >= HPAGE_PMD_NR))
> 			sync_with_folio_pmd_zap(mm, pvmw->pmd);
> 
> 		step_forward(pvmw, PMD_SIZE);
> 		continue;
> 	}
> 
> 5. use "goto retry"
> 
>    As you mentioned above: instead of "handle on pte level", go back to
>    pmdp_get_lockless() and retry. This looks more reasonable to me.
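> 
>    A rough sketch of the retry flow (the label name is just for
>    illustration, and pmd_is_valid_softleaf() is the helper assumed in
>    item 4 above):
> 
> retry:
> 	pmde = pmdp_get_lockless(pvmw->pmd);
> 
> 	if (pmd_trans_huge(pmde) || pmd_is_valid_softleaf(pmde)) {
> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> 		if (!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd))) {
> 			/* the PMD changed under us: drop the lock and re-read it */
> 			spin_unlock(pvmw->ptl);
> 			pvmw->ptl = NULL;
> 			goto retry;
> 		}
> 		/* PVMW_MIGRATION and check_pmd() checks as in item 4 above */
> 	}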
>>
>>> +				/* THP pmd was split under us: handle on pte level */
>>> +				spin_unlock(pvmw->ptl);
>>> +				pvmw->ptl = NULL;
>>
>>
>>
>>> +			} else {
>>> +				if ((pvmw->flags & PVMW_SYNC) &&
>>> +				    thp_vma_suitable_order(vma, pvmw->address,
>>> +							   PMD_ORDER) &&
>>> +				    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>> +					sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>> +
>>> +				step_forward(pvmw, PMD_SIZE);
>>> +				continue;
>>>  			}
>>> -
>>> -			if ((pvmw->flags & PVMW_SYNC) &&
>>> -			    thp_vma_suitable_order(vma, pvmw->address,
>>> -						   PMD_ORDER) &&
>>> -			    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>> -				sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>> -
>>> -			step_forward(pvmw, PMD_SIZE);
>>> -			continue;
>>>  		}
>>>  		if (!map_pte(pvmw, &pmde, &ptl)) {
>>>  			if (!pvmw->pte)
>>>
>>> After this, we could simplify the logic in try_to_migrate_one() as:
>>>
>>> @@ -2471,14 +2471,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>                                  * so we can detect this scenario and properly
>>>                                  * abort the walk.
>>>                                  */
>>> -                               if (split_huge_pmd_locked(vma, pvmw.address,
>>> -                                                     pvmw.pmd, true)) {
>>> -                                       page_vma_mapped_walk_done(&pvmw);
>>> -                                       break;
>>> -                               }
>>> -                               flags &= ~TTU_SPLIT_HUGE_PMD;
>>> -                               page_vma_mapped_walk_restart(&pvmw);
>>> -                               continue;
>>> +                               ret = split_huge_pmd_locked(vma, pvmw.address,
>>> +                                                     pvmw.pmd, true);
>>> +                               page_vma_mapped_walk_done(&pvmw);
>>> +                               break;
>>>                         }
>>
>> Right. But just to be clear: Let's split the page_vma_mapped_walk() validation
>> (which looks like a bugfix to me) from the other optimization.
>>
> 
> Sure, maybe we can split the page_vma_mapped_walk() cleanup out into another
> patch set for easier review?

Yes, but I assume it could even be sent as fixes?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
  2026-05-04 12:44               ` David Hildenbrand (Arm)
@ 2026-05-05  3:15                 ` Wei Yang
  0 siblings, 0 replies; 11+ messages in thread
From: Wei Yang @ 2026-05-05  3:15 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Wei Yang, akpm, ljs, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, riel, vbabka, harry,
	jannh, rppt, surenb, mhocko, shuah, linux-mm, Gavin Guo

On Mon, May 04, 2026 at 02:44:43PM +0200, David Hildenbrand (Arm) wrote:
>On 5/3/26 02:38, Wei Yang wrote:
>> On Wed, Apr 29, 2026 at 08:55:07AM +0200, David Hildenbrand (Arm) wrote:
>>> On 4/29/26 04:49, Wei Yang wrote:
>>>>
>>>> Below is my proposed change:
>>>>
>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>> index a4d52fdb3056..6e915d35ae54 100644
>>>> --- a/mm/page_vma_mapped.c
>>>> +++ b/mm/page_vma_mapped.c
>>>> @@ -273,17 +273,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>  
>>>>  			if (softleaf_is_device_private(entry)) {
>>>>  				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>>> -				return true;
>>>> +				if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))
>>>> +					return true;
>>>
>>> As we have a softleaf entry, I assume we wouldn't expect to get any other bits
>>> (access/dirty) set until we grab the lock. Verifying
>>> softleaf_is_device_private() again would be cleaner, though.
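>>>
>>> Just as a sketch of what I mean (reusing the local entry variable; only an
>>> illustration, not tested):
>>>
>>> 	if (softleaf_is_device_private(entry)) {
>>> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>> 		/* re-read and re-classify the entry now that we hold the lock */
>>> 		entry = softleaf_from_pmd(pmdp_get(pvmw->pmd));
>>> 		if (softleaf_is_device_private(entry))
>>> 			return true;
>>> 		spin_unlock(pvmw->ptl);
>>> 		pvmw->ptl = NULL;
>>> 	}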
>>>
>> 
>> Got it.
>> 
>>> But really, I do wonder if we should just have a "goto retry" back to the "pmde
>>> = pmdp_get_lockless(pvmw->pmd);" instead?
>>>
>> 
>> Sounds reasonable. See below.
>> 
>>>
>>> And now I wonder why we don't have any check_pmd() handling in there? :/
>>>
>>> Should we check for the pfn here?
>> 
>> Thanks for pointing out. I think you are right.
>> 
>> After re-reading the code, more questions came to my mind. I am afraid we need
>> more cleanup for page_vma_mapped_walk().
>> 
>> Below are my findings based on my current understanding:
>> 
>> 1. the thp_migration_supported() check seems unnecessary
>> 
>>    Reaching this code means pmd_is_migration_entry() returned true, which
>>    means CONFIG_ARCH_ENABLE_THP_MIGRATION is set; otherwise
>>    softleaf_from_pmd() would return softleaf_mk_none(), which is not a
>>    migration softleaf.
>>    
>>    CONFIG_ARCH_ENABLE_THP_MIGRATION being set in turn means
>>    CONFIG_TRANSPARENT_HUGEPAGE is set, so thp_migration_supported() must
>>    return true.
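>> 
>>    Just to illustrate (a sketch, assuming the helper is still defined this
>>    way in include/linux/huge_mm.h): thp_migration_supported() merely
>>    reflects the config option, so it cannot return false on this path.
>> 
>> 	/* sketch of the current definition, for reference */
>> 	static inline bool thp_migration_supported(void)
>> 	{
>> 		return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
>> 	}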
>> 
>> 2. if the migration entry changes under us, we may need to handle it on the pte level
>> 
>>    In the pmd_is_migration_entry() -> !pmd_present() branch, we have:
>> 
>> 		if (!softleaf_is_migration(entry) ||
>> 		    !check_pmd(softleaf_to_pfn(entry), pvmw))
>> 			return not_found(pvmw);
>> 		return true;
>> 
>>    But I think we need to do this:
>> 
>> 		if (softleaf_is_migration(entry)) {
>> 			if (!check_pmd(softleaf_to_pfn(entry), pvmw))
>> 				return not_found(pvmw);
>> 			return true;
>> 		}
>> 
>>    Per my understanding, if the migration entry changes under us, we need to
>>    handle it on the pte level, just like in the pmd_trans_huge() case.
>>    Breaking the loop and returning false seems inconsistent.
>> 
>> 3. add proper checks for the device private entry
>> 
>>    For a device private entry, currently we just grab the lock and return.
>>    But following the handling of pmd_trans_huge() and pmd_is_migration_entry(),
>>    we should:
>> 
>>      * re-validate it is still a device private entry after pmd_lock()
>>      * check PVMW_MIGRATION
>>      * check_pmd()
>> 
>> 4. consolidate pmd entry handling
>> 
>>    Per my understanding, there are 4 cases for pmd entry handling:
>> 
>>      * pmd_trans_huge()
>>      * pmd_is_migration_entry()
>>      * pmd_is_device_private_entry()
>>      * !pmd_present()
>> 
>>    Now we handle them with mixed state checks, which complicates the logic.
>>    And the first three share similar logic (if my analysis above is correct):
>> 
>>      * grab pmd_lock()
>>      * re-validate pmde
>>      * check PVMW_MIGRATION
>>      * check_pmd
>> 
>>    Here I would like to take a bolder step: consolidate the handling of these
>>    three cases.
>> 
>>    Below is what it would look like.
>> 
>> 	pmde = pmdp_get_lockless(pvmw->pmd);
>> 
>> 	if (pmd_trans_huge(pmde) || pmd_is_valid_softleaf(pmde)) {
>> 		unsigned long pfn;
>> 		bool is_migration;
>> 		bool for_migration;
>> 
>> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>> 		if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd))) {
>> 			is_migration = pmd_is_migration_entry(pmde);
>> 			for_migration = !!(pvmw->flags & PVMW_MIGRATION);
>> 
>> 			if (is_migration != for_migration)
>> 				return not_found(pvmw);
>> 
>> 			if (pmd_trans_huge(pmde))
>> 				pfn = pmd_pfn(pmde);
>> 			else
>> 				pfn = softleaf_to_pfn(softleaf_from_pmd(pmde));
>> 
>> 			if (!check_pmd(pfn, pvmw))
>> 				return not_found(pvmw);
>> 
>> 			return true;
>> 		}
>> 		/* THP pmd was split under us: handle on pte level */
>> 		spin_unlock(pvmw->ptl);
>> 		pvmw->ptl = NULL;
>> 	} else if (!pmd_present(pmde)) {
>> 		if ((pvmw->flags & PVMW_SYNC) &&
>> 		    thp_vma_suitable_order(vma, pvmw->address,
>> 					   PMD_ORDER) &&
>> 		    (pvmw->nr_pages >= HPAGE_PMD_NR))
>> 			sync_with_folio_pmd_zap(mm, pvmw->pmd);
>> 
>> 		step_forward(pvmw, PMD_SIZE);
>> 		continue;
>> 	}
>> 
>> 5. use "goto retry"
>> 
>>    As you mentioned above: instead of "handle on pte level", go back to
>>    pmdp_get_lockless() and retry. This looks more reasonable to me.
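>> 
>>    A rough sketch of the retry flow (the label name is just for
>>    illustration, and pmd_is_valid_softleaf() is the helper assumed in
>>    item 4 above):
>> 
>> retry:
>> 	pmde = pmdp_get_lockless(pvmw->pmd);
>> 
>> 	if (pmd_trans_huge(pmde) || pmd_is_valid_softleaf(pmde)) {
>> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>> 		if (!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd))) {
>> 			/* the PMD changed under us: drop the lock and re-read it */
>> 			spin_unlock(pvmw->ptl);
>> 			pvmw->ptl = NULL;
>> 			goto retry;
>> 		}
>> 		/* PVMW_MIGRATION and check_pmd() checks as in item 4 above */
>> 	}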
>>>
>>>> +				/* THP pmd was split under us: handle on pte level */
>>>> +				spin_unlock(pvmw->ptl);
>>>> +				pvmw->ptl = NULL;
>>>
>>>
>>>
>>>> +			} else {
>>>> +				if ((pvmw->flags & PVMW_SYNC) &&
>>>> +				    thp_vma_suitable_order(vma, pvmw->address,
>>>> +							   PMD_ORDER) &&
>>>> +				    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>>> +					sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>>> +
>>>> +				step_forward(pvmw, PMD_SIZE);
>>>> +				continue;
>>>>  			}
>>>> -
>>>> -			if ((pvmw->flags & PVMW_SYNC) &&
>>>> -			    thp_vma_suitable_order(vma, pvmw->address,
>>>> -						   PMD_ORDER) &&
>>>> -			    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>>> -				sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>>> -
>>>> -			step_forward(pvmw, PMD_SIZE);
>>>> -			continue;
>>>>  		}
>>>>  		if (!map_pte(pvmw, &pmde, &ptl)) {
>>>>  			if (!pvmw->pte)
>>>>
>>>> After this, we could simplify the logic in try_to_migrate_one() as:
>>>>
>>>> @@ -2471,14 +2471,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>                                  * so we can detect this scenario and properly
>>>>                                  * abort the walk.
>>>>                                  */
>>>> -                               if (split_huge_pmd_locked(vma, pvmw.address,
>>>> -                                                     pvmw.pmd, true)) {
>>>> -                                       page_vma_mapped_walk_done(&pvmw);
>>>> -                                       break;
>>>> -                               }
>>>> -                               flags &= ~TTU_SPLIT_HUGE_PMD;
>>>> -                               page_vma_mapped_walk_restart(&pvmw);
>>>> -                               continue;
>>>> +                               ret = split_huge_pmd_locked(vma, pvmw.address,
>>>> +                                                     pvmw.pmd, true);
>>>> +                               page_vma_mapped_walk_done(&pvmw);
>>>> +                               break;
>>>>                         }
>>>
>>> Right. But just to be clear: Let's split the page_vma_mapped_walk() validation
>>> (which looks like a bugfix to me) from the other optimization.
>>>
>> 
>> Sure, maybe we can split the page_vma_mapped_walk() cleanup out into another
>> patch set for easier review?
>
>Yes, but I assume it could even be sent as fixes?

Agree.

Among the proposed changes above, #2 and #3 are proper candidates for fixes.

 2. if the migration entry changes under us, we may need to handle it on the pte level
 3. add proper checks for the device private entry

The corresponding commits to fix are:

   commit 616b8371539a6c487404c3b8fb04078016dab4ba
   Author: Zi Yan <zi.yan@cs.rutgers.edu>
   Date:   Fri Sep 8 16:10:57 2017 -0700
   
       mm: thp: enable thp migration in generic path

   commit 65edfda6f3f2e58f757485a056e4f1775a1404a8
   Author: Balbir Singh <balbirs@nvidia.com>
   Date:   Wed Oct 1 16:56:55 2025 +1000

       mm/rmap: extend rmap and migration support device-private entries
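
If it helps, the corresponding Fixes: tags would presumably be (abbreviated
hashes, just for reference):

   Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
   Fixes: 65edfda6f3f2 ("mm/rmap: extend rmap and migration support device-private entries")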

As Andrew suggested, I will send the fixes and cleanups separately.

After all of these have settled down, I will respin this series.

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-05-05  3:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-15  1:08 [PATCH 0/2] mm/huge_memory: optimize migration when huge PMD needs split Wei Yang
2026-04-15  1:08 ` [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry Wei Yang
2026-04-24 19:29   ` David Hildenbrand (Arm)
2026-04-26  9:19     ` Wei Yang
2026-04-28  8:24       ` David Hildenbrand (Arm)
2026-04-29  2:49         ` Wei Yang
2026-04-29  6:55           ` David Hildenbrand (Arm)
2026-05-03  0:38             ` Wei Yang
2026-05-04 12:44               ` David Hildenbrand (Arm)
2026-05-05  3:15                 ` Wei Yang
2026-04-15  1:08 ` [PATCH 2/2] mm/selftests: add split_shared_pmd() Wei Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox