* [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
@ 2026-02-26 11:23 Usama Arif
  2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
                   ` (21 more replies)
  0 siblings, 22 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages, it
pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
page table sits unused in a deposit list for the lifetime of the THP
mapping, only to be withdrawn when the PMD is split or zapped. Every
anonymous THP therefore wastes 4KB of memory unconditionally. On large
servers where hundreds of gigabytes of memory are mapped as THPs, this
adds up: roughly 200MB wasted per 100GB of THP memory. This memory
could otherwise satisfy other allocations, including the very PTE page
table allocations needed when splits eventually occur.
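As a sanity check on that figure, the per-deposit overhead can be computed directly (assuming 4KB base pages and 2MB PMD-size THPs; the helper below is purely illustrative):

```c
#include <assert.h>

#define PTE_TABLE_SIZE	4096UL		/* one order-0 page per deposit */
#define HPAGE_PMD_SIZE	(2UL << 20)	/* 2MB PMD-mapped THP */

/* Bytes of deposited PTE page tables backing @thp_bytes of THP memory. */
static unsigned long deposit_overhead(unsigned long thp_bytes)
{
	return (thp_bytes / HPAGE_PMD_SIZE) * PTE_TABLE_SIZE;
}
```

100GB of THP memory corresponds to 51200 PMD mappings, i.e. roughly 200MB of deposited page tables.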

This series removes the pre-deposit and allocates the PTE page table
lazily — only when a PMD split actually happens. Since a large number
of THPs are never split (they are zapped wholesale when processes exit or
munmap the full range), the allocation is avoided entirely in the common
case.

The pre-deposit pattern exists because split_huge_pmd was designed as an
operation that must never fail: if the kernel decides to split, it needs
a PTE page table, so one is deposited in advance. But "must never fail"
is an unnecessarily strong requirement. A PMD split is typically triggered
by a partial operation on a sub-PMD range — partial munmap, partial
mprotect, partial mremap and so on.
Most of these operations already have well-defined error handling for
allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
fail and propagating the error through these existing paths is the natural
thing to do.  Furthermore, a split can only fail when an order-0
page table allocation fails, which is extremely unlikely.
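In sketch form, the calling convention the series moves to looks like this (the function bodies are hypothetical stand-ins, not the kernel implementation):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simulates whether the order-0 page table allocation succeeds. */
static bool pte_alloc_fails;

/* Stand-in for split_huge_pmd(): 0 on success, -ENOMEM on failure. */
static int split_huge_pmd_sketch(void)
{
	return pte_alloc_fails ? -ENOMEM : 0;
}

/*
 * Stand-in for a caller such as mprotect's change_pmd_range(): the
 * split error is propagated through the caller's existing -ENOMEM path.
 */
static int partial_op_sketch(void)
{
	int err = split_huge_pmd_sketch();

	if (err)
		return err;
	/* ... operate on the now PTE-mapped range ... */
	return 0;
}
```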

Designing functions like split_huge_pmd as operations that cannot fail
has a subtle but real cost to code quality. It forces a pre-allocation
pattern - every THP creation path must deposit a page table, and every
split or zap path must withdraw one, creating a hidden coupling between
widely separated code paths.

This also serves as a code cleanup. On every architecture except powerpc
with hash MMU, the deposit/withdraw machinery becomes dead code. The
series removes the generic implementations in pgtable-generic.c and the
s390/sparc overrides, replacing them with no-op stubs guarded by
arch_needs_pgtable_deposit(), which evaluates to false at compile time
on all non-powerpc architectures.
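A sketch of the resulting stub arrangement (a simplified userspace model; the real guard and deposit-list handling live in the arch headers and pgtable-generic.c):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Constant false on non-powerpc builds, so the compiler can drop the
 * deposit/withdraw calls as dead code.
 */
static inline bool arch_needs_pgtable_deposit(void)
{
	return false;	/* powerpc with hash MMU would return true here */
}

static int deposited_tables;	/* stand-in for the per-mm deposit list */

static void pgtable_deposit_stub(void)
{
	if (arch_needs_pgtable_deposit())
		deposited_tables++;	/* unreachable when guard is false */
}
```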

The series is structured as follows:

Patches 1-2:    Error infrastructure — make split functions return int
                and propagate errors from vma_adjust_trans_huge()
                through __split_vma, vma_shrink, and commit_merge.

Patches 3-12:   Handle split failure at every call site — copy_huge_pmd,
                do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
                change_pmd_range (mprotect), follow_pmd_mask (GUP),
                walk_pmd_range (pagewalk), move_page_tables (mremap),
                move_pages (userfaultfd), and device migration.
                These changes become effective in patch 14, when the
                split functions start returning -ENOMEM.

Patch 13:       Add __must_check to __split_huge_pmd(), split_huge_pmd()
                and split_huge_pmd_address() so the compiler warns on
                unchecked return values.

Patch 14:       The actual change — allocate PTE page tables lazily at
                split time instead of pre-depositing at THP creation.
                This is when split functions will actually start returning
                -ENOMEM.

Patch 15:       Remove the now-dead deposit/withdraw code on
                non-powerpc architectures.

Patch 16:       Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
                split failures.

Patches 17-21:  Selftests covering partial munmap, mprotect, mlock,
                mremap, and MADV_DONTNEED on THPs to exercise the
                split paths.

The error handling patches are placed before the lazy allocation patch so
that every call site is already prepared to handle split failures before
the failure mode is introduced. This makes each patch independently safe
to apply and bisect through.

The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
enabled. The test results are below:

TAP version 13
1..5
# Starting 5 tests from 1 test cases.
#  RUN           thp_pmd_split.partial_munmap ...
# thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
# thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_munmap
ok 1 thp_pmd_split.partial_munmap
#  RUN           thp_pmd_split.partial_mprotect ...
# thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
# thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mprotect
ok 2 thp_pmd_split.partial_mprotect
#  RUN           thp_pmd_split.partial_mlock ...
# thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
# thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mlock
ok 3 thp_pmd_split.partial_mlock
#  RUN           thp_pmd_split.partial_mremap ...
# thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
# thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mremap
ok 4 thp_pmd_split.partial_mremap
#  RUN           thp_pmd_split.partial_madv_dontneed ...
# thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
# thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_madv_dontneed
ok 5 thp_pmd_split.partial_madv_dontneed
# PASSED: 5 / 5 tests passed.
# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
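Each test follows the same shape: sample a /proc/vmstat counter, perform a partial operation on a PMD-mapped THP, and verify the counter advanced. Below is a simplified version of the counter parsing such a test needs (the actual selftest helper may differ):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the value recorded for @name in a /proc/vmstat style buffer
 * ("name value\n" per line), or -1 if the counter is absent.  The real
 * selftest would read the buffer from /proc/vmstat; it is passed in
 * here so the parsing can be exercised standalone.
 */
static long vmstat_value(const char *buf, const char *name)
{
	char key[64];
	long val;

	while (sscanf(buf, "%63s %ld", key, &val) == 2) {
		if (strcmp(key, name) == 0)
			return val;
		buf = strchr(buf, '\n');
		if (!buf)
			break;
		buf++;
	}
	return -1;
}
```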

The patches are based off of 957a3fab8811b455420128ea5f41c51fd23eb6c7 from
mm-unstable as of 25 Feb (7.0.0-rc1).


RFC v1 -> v2: https://lore.kernel.org/all/20260211125507.4175026-1-usama.arif@linux.dev/
- Change counter name to THP_SPLIT_PMD_FAILED (David)
- remove pgtable_trans_huge_{deposit/withdraw} when not needed and
  make them arch specific (David)
- make split functions return error code and have callers handle them
  (David and Kiryl)
- Add test cases for splitting

Usama Arif (21):
  mm: thp: make split_huge_pmd functions return int for error
    propagation
  mm: thp: propagate split failure from vma_adjust_trans_huge()
  mm: thp: handle split failure in copy_huge_pmd()
  mm: thp: handle split failure in do_huge_pmd_wp_page()
  mm: thp: handle split failure in zap_pmd_range()
  mm: thp: handle split failure in wp_huge_pmd()
  mm: thp: retry on split failure in change_pmd_range()
  mm: thp: handle split failure in follow_pmd_mask()
  mm: handle walk_page_range() failure from THP split
  mm: thp: handle split failure in mremap move_page_tables()
  mm: thp: handle split failure in userfaultfd move_pages()
  mm: thp: handle split failure in device migration
  mm: huge_mm: Make sure all split_huge_pmd calls are checked
  mm: thp: allocate PTE page tables lazily at split time
  mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
  mm: thp: add THP_SPLIT_PMD_FAILED counter
  selftests/mm: add THP PMD split test infrastructure
  selftests/mm: add partial_mprotect test for change_pmd_range
  selftests/mm: add partial_mlock test
  selftests/mm: add partial_mremap test for move_page_tables
  selftests/mm: add madv_dontneed_partial test

 arch/powerpc/include/asm/book3s/64/pgtable.h  |  12 +-
 arch/s390/include/asm/pgtable.h               |   6 -
 arch/s390/mm/pgtable.c                        |  41 ---
 arch/sparc/include/asm/pgtable_64.h           |   6 -
 arch/sparc/mm/tlb.c                           |  36 ---
 include/linux/huge_mm.h                       |  51 +--
 include/linux/pgtable.h                       |  16 +-
 include/linux/vm_event_item.h                 |   1 +
 mm/debug_vm_pgtable.c                         |   4 +-
 mm/gup.c                                      |  10 +-
 mm/huge_memory.c                              | 208 +++++++++----
 mm/khugepaged.c                               |   7 +-
 mm/memory.c                                   |  26 +-
 mm/migrate_device.c                           |  33 +-
 mm/mprotect.c                                 |  11 +-
 mm/mremap.c                                   |   8 +-
 mm/pagewalk.c                                 |   8 +-
 mm/pgtable-generic.c                          |  32 --
 mm/rmap.c                                     |  42 ++-
 mm/userfaultfd.c                              |   8 +-
 mm/vma.c                                      |  37 ++-
 mm/vmstat.c                                   |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../testing/selftests/mm/thp_pmd_split_test.c | 290 ++++++++++++++++++
 tools/testing/vma/include/stubs.h             |   9 +-
 25 files changed, 645 insertions(+), 259 deletions(-)
 create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c

-- 
2.47.3




* [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Currently a PMD split cannot fail, but future patches will add lazy PTE
page table allocation: at THP split time, __split_huge_pmd() will call
pte_alloc_one(), which can fail if an order-0 allocation cannot be
satisfied.

Split functions currently return void, so callers have no way to detect
this failure.  The PMD would remain huge while callers assume the split
succeeded and proceed on that basis; interpreting a huge PMD entry as a
page table pointer could result in a kernel bug.

Change __split_huge_pmd(), split_huge_pmd(), split_huge_pmd_if_needed()
and split_huge_pmd_address() to return 0 on success (-ENOMEM on
allocation failure in later patch).  Convert the split_huge_pmd macro
to a static inline function that propagates the return value. The return
values will be handled by the callers in future commits.

The CONFIG_TRANSPARENT_HUGEPAGE=n stubs are changed to return 0.

No behaviour change is expected with this patch.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h | 34 ++++++++++++++++++----------------
 mm/huge_memory.c        | 16 ++++++++++------
 2 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfdea..e4cbf5afdbe7e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -419,7 +419,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped);
 void reparent_deferred_split_queue(struct mem_cgroup *memcg);
 #endif
 
-void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
 
 /**
@@ -448,15 +448,15 @@ static inline bool pmd_is_huge(pmd_t pmd)
 	return false;
 }
 
-#define split_huge_pmd(__vma, __pmd, __address)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		if (pmd_is_huge(*____pmd))				\
-			__split_huge_pmd(__vma, __pmd, __address,	\
-					 false);			\
-	}  while (0)
+static inline int split_huge_pmd(struct vm_area_struct *vma,
+					     pmd_t *pmd, unsigned long address)
+{
+	if (pmd_is_huge(*pmd))
+		return __split_huge_pmd(vma, pmd, address, false);
+	return 0;
+}
 
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze);
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
@@ -651,13 +651,15 @@ static inline int try_folio_split_to_order(struct folio *folio,
 
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
 static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
-#define split_huge_pmd(__vma, __pmd, __address)	\
-	do { } while (0)
-
-static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long address, bool freeze) {}
-static inline void split_huge_pmd_address(struct vm_area_struct *vma,
-		unsigned long address, bool freeze) {}
+static inline int split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+				unsigned long address)
+{
+	return 0;
+}
+static inline int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address, bool freeze) { return 0; }
+static inline int split_huge_pmd_address(struct vm_area_struct *vma,
+		unsigned long address, bool freeze) { return 0; }
 static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
 					 bool freeze) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8003d3a498220..125ff36f475de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3273,7 +3273,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 }
 
-void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze)
 {
 	spinlock_t *ptl;
@@ -3287,20 +3287,22 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	split_huge_pmd_locked(vma, range.start, pmd, freeze);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
+
+	return 0;
 }
 
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze)
 {
 	pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
 
 	if (!pmd)
-		return;
+		return 0;
 
-	__split_huge_pmd(vma, pmd, address, freeze);
+	return __split_huge_pmd(vma, pmd, address, freeze);
 }
 
-static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned long address)
+static inline int split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned long address)
 {
 	/*
 	 * If the new address isn't hpage aligned and it could previously
@@ -3309,7 +3311,9 @@ static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned
 	if (!IS_ALIGNED(address, HPAGE_PMD_SIZE) &&
 	    range_in_vma(vma, ALIGN_DOWN(address, HPAGE_PMD_SIZE),
 			 ALIGN(address, HPAGE_PMD_SIZE)))
-		split_huge_pmd_address(vma, address, false);
+		return split_huge_pmd_address(vma, address, false);
+
+	return 0;
 }
 
 void vma_adjust_trans_huge(struct vm_area_struct *vma,
-- 
2.47.3




* [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
  2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

With lazy PTE page table allocation, split_huge_pmd_if_needed() and
thus vma_adjust_trans_huge() can now fail if the order-0 page table
allocation fails during the split.  This failure must be checked to
prevent a huge PMD from straddling a VMA boundary.

The vma_adjust_trans_huge() call is moved before vma_prepare() in all
three callers (__split_vma, vma_shrink, commit_merge). Previously it sat
between vma_prepare() and vma_complete(), where there is no mechanism to
abort - once vma_prepare() has been called, we must reach vma_complete().
By moving the call earlier, a split failure can return -ENOMEM cleanly
without needing to undo VMA preparation.

This move is safe because vma_adjust_trans_huge() acquires its own
pmd_lock() internally and does not depend on any locks or state changes
from vma_prepare(). The VMA boundaries are also unchanged at the new
call site, satisfying __split_huge_pmd_locked()'s requirement that the
VMA covers the full PMD range.

All three callers (__split_vma, vma_shrink, commit_merge) already
return -ENOMEM on allocation failures for other reasons (for example, a
failure in vma_iter_prealloc()); this follows the same pattern.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h           | 13 ++++++-----
 mm/huge_memory.c                  | 21 +++++++++++++-----
 mm/vma.c                          | 37 +++++++++++++++++++++----------
 tools/testing/vma/include/stubs.h |  9 ++++----
 4 files changed, 53 insertions(+), 27 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e4cbf5afdbe7e..207bf7cd95c78 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -484,8 +484,8 @@ int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
 		     int advice);
 int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		     unsigned long end, bool *lock_dropped);
-void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
-			   unsigned long end, struct vm_area_struct *next);
+int vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
+			  unsigned long end, struct vm_area_struct *next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
 spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma);
 
@@ -687,11 +687,12 @@ static inline int madvise_collapse(struct vm_area_struct *vma,
 	return -EINVAL;
 }
 
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
-					 unsigned long start,
-					 unsigned long end,
-					 struct vm_area_struct *next)
+static inline int vma_adjust_trans_huge(struct vm_area_struct *vma,
+					unsigned long start,
+					unsigned long end,
+					struct vm_area_struct *next)
 {
+	return 0;
 }
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 125ff36f475de..a979aa5bd2995 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3316,20 +3316,31 @@ static inline int split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned
 	return 0;
 }
 
-void vma_adjust_trans_huge(struct vm_area_struct *vma,
+int vma_adjust_trans_huge(struct vm_area_struct *vma,
 			   unsigned long start,
 			   unsigned long end,
 			   struct vm_area_struct *next)
 {
+	int err;
+
 	/* Check if we need to split start first. */
-	split_huge_pmd_if_needed(vma, start);
+	err = split_huge_pmd_if_needed(vma, start);
+	if (err)
+		return err;
 
 	/* Check if we need to split end next. */
-	split_huge_pmd_if_needed(vma, end);
+	err = split_huge_pmd_if_needed(vma, end);
+	if (err)
+		return err;
 
 	/* If we're incrementing next->vm_start, we might need to split it. */
-	if (next)
-		split_huge_pmd_if_needed(next, end);
+	if (next) {
+		err = split_huge_pmd_if_needed(next, end);
+		if (err)
+			return err;
+	}
+
+	return 0;
 }
 
 static void unmap_folio(struct folio *folio)
diff --git a/mm/vma.c b/mm/vma.c
index be64f781a3aa7..f50b1f291ab7c 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -510,6 +510,15 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			return err;
 	}
 
+	/*
+	 * Split any THP straddling the split boundary before splitting
+	 * the VMA itself. Do this before vma_prepare() so we can
+	 * cleanly fail without undoing VMA preparation.
+	 */
+	err = vma_adjust_trans_huge(vma, vma->vm_start, addr, NULL);
+	if (err)
+		return err;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
@@ -547,11 +556,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vp.insert = new;
 	vma_prepare(&vp);
 
-	/*
-	 * Get rid of huge pages and shared page tables straddling the split
-	 * boundary.
-	 */
-	vma_adjust_trans_huge(vma, vma->vm_start, addr, NULL);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_split(vma, addr);
 
@@ -729,6 +733,7 @@ static int commit_merge(struct vma_merge_struct *vmg)
 {
 	struct vm_area_struct *vma;
 	struct vma_prepare vp;
+	int err;
 
 	if (vmg->__adjust_next_start) {
 		/* We manipulate middle and adjust next, which is the target. */
@@ -740,6 +745,16 @@ static int commit_merge(struct vma_merge_struct *vmg)
 		vma_iter_config(vmg->vmi, vmg->start, vmg->end);
 	}
 
+	/*
+	 * THP pages may need to do additional splits if we increase
+	 * middle->vm_start. Do this before vma_prepare() so we can
+	 * cleanly fail without undoing VMA preparation.
+	 */
+	err = vma_adjust_trans_huge(vma, vmg->start, vmg->end,
+				  vmg->__adjust_middle_start ? vmg->middle : NULL);
+	if (err)
+		return err;
+
 	init_multi_vma_prep(&vp, vma, vmg);
 
 	/*
@@ -752,12 +767,6 @@ static int commit_merge(struct vma_merge_struct *vmg)
 		return -ENOMEM;
 
 	vma_prepare(&vp);
-	/*
-	 * THP pages may need to do additional splits if we increase
-	 * middle->vm_start.
-	 */
-	vma_adjust_trans_huge(vma, vmg->start, vmg->end,
-			      vmg->__adjust_middle_start ? vmg->middle : NULL);
 	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
 	vmg_adjust_set_range(vmg);
 	vma_iter_store_overwrite(vmg->vmi, vmg->target);
@@ -1229,9 +1238,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	       unsigned long start, unsigned long end, pgoff_t pgoff)
 {
 	struct vma_prepare vp;
+	int err;
 
 	WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
 
+	err = vma_adjust_trans_huge(vma, start, end, NULL);
+	if (err)
+		return err;
+
 	if (vma->vm_start < start)
 		vma_iter_config(vmi, vma->vm_start, start);
 	else
@@ -1244,7 +1258,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	init_vma_prep(&vp, vma);
 	vma_prepare(&vp);
-	vma_adjust_trans_huge(vma, start, end, NULL);
 
 	vma_iter_clear(vmi);
 	vma_set_range(vma, start, end, pgoff);
diff --git a/tools/testing/vma/include/stubs.h b/tools/testing/vma/include/stubs.h
index 947a3a0c25665..171986f9c9fcd 100644
--- a/tools/testing/vma/include/stubs.h
+++ b/tools/testing/vma/include/stubs.h
@@ -418,11 +418,12 @@ static inline int vma_dup_policy(struct vm_area_struct *src, struct vm_area_stru
 	return 0;
 }
 
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
-					 unsigned long start,
-					 unsigned long end,
-					 struct vm_area_struct *next)
+static inline int vma_adjust_trans_huge(struct vm_area_struct *vma,
+					unsigned long start,
+					unsigned long end,
+					struct vm_area_struct *next)
 {
+	return 0;
 }
 
 static inline void hugetlb_split(struct vm_area_struct *, unsigned long) {}
-- 
2.47.3




* [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
  2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
  2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

copy_huge_pmd() splits the source PMD when a folio is pinned and can't
be COW-shared at PMD granularity.  It then returns -EAGAIN so
copy_pmd_range() falls through to copy_pte_range().

If the split fails, the PMD is still huge.  Returning -EAGAIN would cause
copy_pmd_range() to call copy_pte_range(), which would dereference the
huge PMD entry as if it were a pointer to a PTE page table.
Return -ENOMEM on split failure instead (copy_huge_pmd() already does
this when pte_alloc_one() fails), causing copy_page_range() to abort
the fork with -ENOMEM, similar to how copy_pmd_range() is aborted when
pmd_alloc() or copy_pte_range() fails.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a979aa5bd2995..d9fb5875fa59e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1929,7 +1929,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
-		__split_huge_pmd(src_vma, src_pmd, addr, false);
+		/*
+		 * If split fails, the PMD is still huge so copy_pte_range
+		 * (via -EAGAIN) would misinterpret it as a page table
+		 * pointer.  Return -ENOMEM directly to copy_pmd_range.
+		 */
+		if (__split_huge_pmd(src_vma, src_pmd, addr, false))
+			return -ENOMEM;
 		return -EAGAIN;
 	}
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-- 
2.47.3




* [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (2 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

do_huge_pmd_wp_page() splits the PMD when a COW of the entire huge page
fails (e.g., can't allocate a new folio or the folio is pinned).  It then
returns VM_FAULT_FALLBACK so the fault can be retried at PTE granularity.

If the split fails, the PMD is still huge.  Returning VM_FAULT_FALLBACK
would re-enter the PTE fault path, which expects a PTE page table at the
PMD entry — not a huge PMD.

Return VM_FAULT_OOM on split failure, which signals the fault handler to
invoke the OOM killer or return -ENOMEM to userspace.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9fb5875fa59e..e82b8435a0b7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2153,7 +2153,13 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	folio_unlock(folio);
 	spin_unlock(vmf->ptl);
 fallback:
-	__split_huge_pmd(vma, vmf->pmd, vmf->address, false);
+	/*
+	 * Split failure means the PMD is still huge; returning
+	 * VM_FAULT_FALLBACK would re-enter the PTE path with a
+	 * huge PMD, causing incorrect behavior.
+	 */
+	if (__split_huge_pmd(vma, vmf->pmd, vmf->address, false))
+		return VM_FAULT_OOM;
 	return VM_FAULT_FALLBACK;
 }
 
-- 
2.47.3




* [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (3 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

zap_pmd_range() splits a huge PMD when the zap range doesn't cover the
full PMD (partial unmap).  If the split fails, the PMD stays huge.
Falling through to zap_pte_range() would dereference the huge PMD entry
as a PTE page table pointer.

Skip the range covered by the PMD on split failure instead.

The skip is safe across all call paths into zap_pmd_range():

- exit_mmap() and OOM reaper: the zap range covers entire VMAs, so
  every PMD is fully covered (next - addr == HPAGE_PMD_SIZE).  The
  zap_huge_pmd() branch handles these without splitting.  The split
  failure path is unreachable.

- munmap / mmap overlay: vma_adjust_trans_huge() (called from
  __split_vma) splits any PMD straddling the VMA boundary before the
  VMA is split.  If that PMD split fails, __split_vma() returns
  -ENOMEM and the munmap is aborted before reaching zap_pmd_range().
  The split failure path is unreachable.

- MADV_DONTNEED: advisory hint, the kernel is allowed to ignore it.
  The pages remain valid and accessible.  A subsequent access returns
  existing data without faulting.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/memory.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9385842c35034..7ba1221c63792 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1983,9 +1983,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_is_huge(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE)
-				__split_huge_pmd(vma, pmd, addr, false);
-			else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
+			if (next - addr != HPAGE_PMD_SIZE) {
+				/*
+				 * If split fails, the PMD stays huge.
+				 * Skip the range to avoid falling through
+				 * to zap_pte_range, which would treat the
+				 * huge PMD entry as a page table pointer.
+				 */
+				if (__split_huge_pmd(vma, pmd, addr, false)) {
+					addr = next;
+					continue;
+				}
+			} else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
 				addr = next;
 				continue;
 			}
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (4 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

wp_huge_pmd() splits the PMD when COW or write-notify must be handled at
PTE level (e.g., shared/file VMAs, userfaultfd).  It then returns
VM_FAULT_FALLBACK so the fault handler retries at PTE granularity.
If the split fails, the PMD is still huge and the PTE fault path cannot
handle a huge PMD entry.
Return VM_FAULT_OOM on split failure, which signals the fault handler to
invoke the OOM killer or return -ENOMEM to userspace. This mirrors what
__handle_mm_fault() does when p4d_alloc() or pud_alloc() fails.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/memory.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7ba1221c63792..51d2717e3f1b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6161,8 +6161,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 	}
 
 split:
-	/* COW or write-notify handled on pte level: split pmd. */
-	__split_huge_pmd(vma, vmf->pmd, vmf->address, false);
+	/*
+	 * COW or write-notify handled on pte level: split pmd.
+	 * If split fails, the PMD is still huge so falling back
+	 * to PTE handling would be incorrect.
+	 */
+	if (__split_huge_pmd(vma, vmf->pmd, vmf->address, false))
+		return VM_FAULT_OOM;
 
 	return VM_FAULT_FALLBACK;
 }
-- 
2.47.3




* [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (5 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

change_pmd_range() splits a huge PMD when mprotect() targets a sub-PMD
range or when VMA flags require per-PTE protection bits that can't be
represented at PMD granularity.

If pte_alloc_one() fails inside __split_huge_pmd(), the huge PMD remains
intact. Without this change, change_pte_range() would return -EAGAIN
because pte_offset_map_lock() returns NULL for a huge PMD, sending the
code back to the 'again' label to retry the split without ever calling
cond_resched().

Now that __split_huge_pmd() returns an error code, handle it explicitly:
yield the CPU with cond_resched() and retry via goto again, giving other
tasks a chance to free memory.

Returning an error all the way up to change_protection_range() would not
work: part of the range would be left with the new protections and the
rest unchanged, with no easy way to roll back the already modified
entries (and previous splits). __split_huge_pmd() only requires an
order-0 allocation and is extremely unlikely to fail.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/mprotect.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9681f055b9fca..599d80a7d6969 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -477,7 +477,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_is_huge(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    pgtable_split_needed(vma, cp_flags)) {
-				__split_huge_pmd(vma, pmd, addr, false);
+				ret = __split_huge_pmd(vma, pmd, addr, false);
+				if (ret) {
+					/*
+					 * Yield and retry. Other tasks
+					 * may free memory while we
+					 * reschedule.
+					 */
+					cond_resched();
+					goto again;
+				}
 				/*
 				 * For file-backed, the pmd could have been
 				 * cleared; make sure pmd populated if
-- 
2.47.3




* [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (6 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

follow_pmd_mask() splits a huge PMD when FOLL_SPLIT_PMD is set, so GUP
can pin individual pages at PTE granularity.

If the split fails, the PMD is still huge and follow_page_pte() cannot
process it. Propagate the error via ERR_PTR() on split failure, so the
GUP caller gets -ENOMEM. follow_pmd_mask() already returns -ENOMEM when
pte_alloc_one() fails, which is also the reason split_huge_pmd() can
fail, so callers see no new error case.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/gup.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8e7dc2c6ee738..792b2e7319dd0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -928,8 +928,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		return follow_page_pte(vma, address, pmd, flags);
 	}
 	if (pmd_trans_huge(pmdval) && (flags & FOLL_SPLIT_PMD)) {
+		int ret;
+
 		spin_unlock(ptl);
-		split_huge_pmd(vma, pmd, address);
+		/*
+		 * If split fails, the PMD is still huge and
+		 * we cannot proceed to follow_page_pte.
+		 */
+		ret = split_huge_pmd(vma, pmd, address);
+		if (ret)
+			return ERR_PTR(ret);
 		/* If pmd was left empty, stuff a page table in there quickly */
 		return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
 			follow_page_pte(vma, address, pmd, flags);
-- 
2.47.3




* [RFC v2 09/21] mm: handle walk_page_range() failure from THP split
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (7 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

walk_pmd_range() splits a huge PMD when a page table walker with
pte_entry or install_pte callbacks needs PTE-level granularity. If
the split fails due to memory allocation failure in pte_alloc_one(),
walk_pte_range() would encounter a huge PMD instead of a PTE page
table.

Break out of the loop on split failure and return -ENOMEM to the
walker's caller. Callers that reach this path (those with pte_entry
or install_pte set), such as mincore(), hmm_range_fault() and
queue_pages_range(), already handle negative return values from
walk_page_range(). A similar approach is already taken when
__pte_alloc() fails in walk_pmd_range().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/pagewalk.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index a94c401ab2cfe..1ee9df7a4461d 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -147,9 +147,11 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 				continue;
 		}
 
-		if (walk->vma)
-			split_huge_pmd(walk->vma, pmd, addr);
-		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
+		if (walk->vma) {
+			err = split_huge_pmd(walk->vma, pmd, addr);
+			if (err)
+				break;
+		} else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
 			continue; /* Nothing to do. */
 
 		err = walk_pte_range(pmd, addr, next, walk);
-- 
2.47.3




* [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (8 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

move_page_tables() splits a huge PMD when the extent is smaller than
HPAGE_PMD_SIZE and the PMD can't be moved at PMD granularity.

If the split fails, the PMD stays huge and move_ptes() can't operate on
individual PTEs.

Break out of the loop on split failure, which causes mremap() to return
however much was moved so far (partial move).  This is consistent with
other allocation failures in the same loop (e.g., alloc_new_pmd(),
pte_alloc()).

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/mremap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 2be876a70cc0d..d067c9fbf140b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -855,7 +855,13 @@ unsigned long move_page_tables(struct pagetable_move_control *pmc)
 			if (extent == HPAGE_PMD_SIZE &&
 			    move_pgt_entry(pmc, HPAGE_PMD, old_pmd, new_pmd))
 				continue;
-			split_huge_pmd(pmc->old, old_pmd, pmc->old_addr);
+			/*
+			 * If split fails, the PMD stays huge and move_ptes
+			 * can't operate on it.  Break out so the caller
+			 * can handle the partial move.
+			 */
+			if (split_huge_pmd(pmc->old, old_pmd, pmc->old_addr))
+				break;
 		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
 			   extent == PMD_SIZE) {
 			/*
-- 
2.47.3




* [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (9 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

The UFFDIO_MOVE ioctl's move_pages() loop splits a huge PMD when the
folio is pinned and can't be moved at PMD granularity.

If the split fails, the PMD stays huge and move_pages_pte() can't
process individual pages. Break out of the loop on split failure
and return -ENOMEM to the caller. This is similar to how other
allocation failures (__pte_alloc, mm_alloc_pmd) are handled in
move_pages().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/userfaultfd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e19872e518785..2728102e00c72 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1870,7 +1870,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 				}
 
 				spin_unlock(ptl);
-				split_huge_pmd(src_vma, src_pmd, src_addr);
+				/*
+				 * If split fails, the PMD stays huge and
+				 * move_pages_pte can't process it.
+				 */
+				err = split_huge_pmd(src_vma, src_pmd, src_addr);
+				if (err)
+					break;
 				/* The folio will be split by move_pages_pte() */
 				continue;
 			}
-- 
2.47.3




* [RFC v2 12/21] mm: thp: handle split failure in device migration
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (10 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-03-02 21:20   ` Nico Pache
  2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Device memory migration has two call sites that split huge PMDs:

migrate_vma_split_unmapped_folio():
  Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
  destination that doesn't support compound pages.  It splits the PMD
  then splits the folio via folio_split_unmapped().

  If the PMD split fails, folio_split_unmapped() would operate on an
  unsplit folio with inconsistent page table state.  Propagate -ENOMEM
  to skip this page's migration.  This is safe: a folio_split_unmapped()
  failure is propagated the same way.

migrate_vma_insert_page():
  Called from migrate_vma_pages() when inserting a page into a VMA
  during migration back from device memory.  If a huge zero PMD exists
  at the target address, it must be split before PTE insertion.

  If the split fails, the subsequent pte_alloc() and set_pte_at() would
  operate on a PMD slot still occupied by the huge zero entry.  Use
  goto abort, consistent with other allocation failures in this function.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/migrate_device.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 78c7acf024615..bc53e06fd9735 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
 	int ret = 0;
 
 	folio_get(folio);
-	split_huge_pmd_address(migrate->vma, addr, true);
+	/*
+	 * If PMD split fails, folio_split_unmapped would operate on an
+	 * unsplit folio with inconsistent page table state.
+	 */
+	ret = split_huge_pmd_address(migrate->vma, addr, true);
+	if (ret)
+		return ret;
 	ret = folio_split_unmapped(folio, 0);
 	if (ret)
 		return ret;
@@ -1005,7 +1011,13 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		if (pmd_trans_huge(*pmdp)) {
 			if (!is_huge_zero_pmd(*pmdp))
 				goto abort;
-			split_huge_pmd(vma, pmdp, addr);
+			/*
+			 * If split fails, the huge zero PMD remains and
+			 * pte_alloc/PTE insertion that follows would be
+			 * incorrect.
+			 */
+			if (split_huge_pmd(vma, pmdp, addr))
+				goto abort;
 		} else if (pmd_leaf(*pmdp))
 			goto abort;
 	}
-- 
2.47.3




* [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (11 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 16:32   ` kernel test robot
  2026-02-27 12:11   ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
                   ` (8 subsequent siblings)
  21 siblings, 2 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Mark __split_huge_pmd(), split_huge_pmd() and split_huge_pmd_address()
with __must_check so the compiler warns if any caller ignores the return
value. Ignoring the return value and proceeding as if the PMD were
split could result in a kernel bug. An order-0 page table allocation
is very unlikely to fail, but the failure must still be handled
correctly.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 207bf7cd95c78..b4c2fd4252097 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -419,7 +419,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped);
 void reparent_deferred_split_queue(struct mem_cgroup *memcg);
 #endif
 
-int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __must_check __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
 
 /**
@@ -448,7 +448,7 @@ static inline bool pmd_is_huge(pmd_t pmd)
 	return false;
 }
 
-static inline int split_huge_pmd(struct vm_area_struct *vma,
+static inline int __must_check split_huge_pmd(struct vm_area_struct *vma,
 					     pmd_t *pmd, unsigned long address)
 {
 	if (pmd_is_huge(*pmd))
@@ -456,7 +456,7 @@ static inline int split_huge_pmd(struct vm_area_struct *vma,
 	return 0;
 }
 
-int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int __must_check split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze);
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-- 
2.47.3




* [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (12 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages,
it pre-allocates a PTE page table and deposits it via
pgtable_trans_huge_deposit(). This deposited table is withdrawn during
PMD split or zap. The rationale was that a split must never fail: if
the kernel decides to split a THP, it needs a PTE table to populate.

However, every anon THP wastes 4KB (one page table page) that sits
unused in the deposit list for the lifetime of the mapping. On large
servers that can easily map hundreds of gigabytes as THPs, this adds
up: roughly 200MB per 100GB of THP memory. The original rationale is
also not a real constraint: it is fine for a split to fail, and if the
kernel cannot satisfy an order-0 allocation for the split, it has much
bigger problems. The saved memory can serve any other use, including
the page table allocations required during split.

This patch removes the pre-deposit for anonymous pages on architectures
where arch_needs_pgtable_deposit() returns false (every architecture
apart from powerpc, which needs the deposit only when the hash MMU is
in use, i.e. radix is not enabled) and allocates the PTE table lazily,
only when a split actually occurs. The split path is modified to accept
a caller-provided page table.

PowerPC exception:

It would have been great to remove the page table deposit code
entirely, which would have made this commit mostly a cleanup.
Unfortunately the powerpc hash MMU stores hash slot information in the
deposited page table, so the pre-deposit remains necessary there. All
deposit/withdraw paths are guarded by arch_needs_pgtable_deposit(), so
powerpc behavior is unchanged by this patch. On the bright side,
arch_needs_pgtable_deposit() evaluates to false at compile time on
non-powerpc architectures, so the pre-deposit code is not compiled in.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h |   4 +-
 mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
 mm/khugepaged.c         |   7 +-
 mm/migrate_device.c     |  15 +++--
 mm/rmap.c               |  39 ++++++++++-
 5 files changed, 156 insertions(+), 53 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b4c2fd4252097..ed4c97734b335 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze);
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
 bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 			   pmd_t *pmdp, struct folio *folio);
 void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
@@ -662,7 +662,7 @@ static inline int split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze) { return 0; }
 static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
-					 bool freeze) {}
+					 bool freeze, pgtable_t pgtable) {}
 
 static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long addr, pmd_t *pmdp,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e82b8435a0b7f..a10cb136000d1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1325,17 +1325,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
-	pgtable_t pgtable;
+	pgtable_t pgtable = NULL;
 	vm_fault_t ret = 0;
 
 	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
 	if (unlikely(!folio))
 		return VM_FAULT_FALLBACK;
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable)) {
-		ret = VM_FAULT_OOM;
-		goto release;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable)) {
+			ret = VM_FAULT_OOM;
+			goto release;
+		}
 	}
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1350,14 +1352,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		if (userfaultfd_missing(vma)) {
 			spin_unlock(vmf->ptl);
 			folio_put(folio);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			ret = handle_userfault(vmf, VM_UFFD_MISSING);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
 		}
-		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		if (pgtable) {
+			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+						   pgtable);
+			mm_inc_nr_ptes(vma->vm_mm);
+		}
 		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
-		mm_inc_nr_ptes(vma->vm_mm);
 		spin_unlock(vmf->ptl);
 	}
 
@@ -1453,9 +1459,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
 	entry = pmd_mkspecial(entry);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (pgtable) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		mm_inc_nr_ptes(mm);
+	}
 	set_pmd_at(mm, haddr, pmd, entry);
-	mm_inc_nr_ptes(mm);
 }
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
@@ -1474,16 +1482,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
-		pgtable_t pgtable;
+		pgtable_t pgtable = NULL;
 		struct folio *zero_folio;
 		vm_fault_t ret;
 
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (unlikely(!pgtable))
-			return VM_FAULT_OOM;
+		if (arch_needs_pgtable_deposit()) {
+			pgtable = pte_alloc_one(vma->vm_mm);
+			if (unlikely(!pgtable))
+				return VM_FAULT_OOM;
+		}
 		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
 		if (unlikely(!zero_folio)) {
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
 			return VM_FAULT_FALLBACK;
 		}
@@ -1493,10 +1504,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			ret = check_stable_address_space(vma->vm_mm);
 			if (ret) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
@@ -1507,7 +1520,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			}
 		} else {
 			spin_unlock(vmf->ptl);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 		}
 		return ret;
 	}
@@ -1839,8 +1853,10 @@ static void copy_huge_non_present_pmd(
 	}
 
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1880,9 +1896,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
 
-	pgtable = pte_alloc_one(dst_mm);
-	if (unlikely(!pgtable))
-		goto out;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(dst_mm);
+		if (unlikely(!pgtable))
+			goto out;
+	}
 
 	dst_ptl = pmd_lock(dst_mm, dst_pmd);
 	src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1900,7 +1918,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	if (unlikely(!pmd_trans_huge(pmd))) {
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
 	/*
@@ -1926,7 +1945,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
 		/* Page maybe pinned: split and retry the fault on PTEs. */
 		folio_put(src_folio);
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		/*
@@ -1940,8 +1960,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
@@ -2379,7 +2401,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
-		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
+		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else {
@@ -2404,7 +2426,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		}
 
 		if (folio_test_anon(folio)) {
-			zap_deposited_table(tlb->mm, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		} else {
 			if (arch_needs_pgtable_deposit())
@@ -2505,7 +2528,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			force_flush = true;
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
-		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
+		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
+		    arch_needs_pgtable_deposit()) {
 			pgtable_t pgtable;
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
@@ -2813,8 +2837,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	}
 	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
 
-	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
-	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	}
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
 	/* unblock rmap walks */
@@ -2956,10 +2982,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
+		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
 	pmd_t _pmd, old_pmd;
 	unsigned long addr;
 	pte_t *pte;
@@ -2975,7 +3000,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	 */
 	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -2997,12 +3031,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr, bool freeze)
+		unsigned long haddr, bool freeze, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct folio *folio;
 	struct page *page;
-	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool soft_dirty, uffd_wp = false, young = false, write = false;
 	bool anon_exclusive = false, dirty = false;
@@ -3026,6 +3059,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
+		if (pgtable)
+			pte_free(mm, pgtable);
 		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
 			return;
 		if (unlikely(pmd_is_migration_entry(old_pmd))) {
@@ -3058,7 +3093,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * small page also write protected so it does not seems useful
 		 * to invalidate secondary mmu at this time.
 		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
 	}
 
 	if (pmd_is_migration_entry(*pmd)) {
@@ -3182,7 +3217,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Withdraw the table only after we mark the pmd entry invalid.
 	 * This's critical for some architectures (Power).
 	 */
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -3278,11 +3322,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze)
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
 {
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
-		__split_huge_pmd_locked(vma, pmd, address, freeze);
+		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
+	else if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
 }
 
 int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3290,13 +3336,24 @@ int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
+	pgtable_t pgtable = NULL;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address & HPAGE_PMD_MASK,
 				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
+
+	/* allocate pagetable before acquiring pmd lock */
+	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable) {
+			mmu_notifier_invalidate_range_end(&range);
+			return -ENOMEM;
+		}
+	}
+
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	split_huge_pmd_locked(vma, range.start, pmd, freeze);
+	split_huge_pmd_locked(vma, range.start, pmd, freeze, pgtable);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 
@@ -3432,7 +3489,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	}
 
 	folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma);
-	zap_deposited_table(mm, pmdp);
+	if (arch_needs_pgtable_deposit())
+		zap_deposited_table(mm, pmdp);
 	add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 	if (vma->vm_flags & VM_LOCKED)
 		mlock_drain_local();
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c85d7381adb5f..735d7ee5bbab2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1224,7 +1224,12 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	} else {
+		mm_dec_nr_ptes(mm);
+		pte_free(mm, pgtable);
+	}
 	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
 	spin_unlock(pmd_ptl);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index bc53e06fd9735..1adb5abccfb70 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -823,9 +823,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 
 	__folio_mark_uptodate(folio);
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable))
-		goto abort;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable))
+			goto abort;
+	} else {
+		pgtable = NULL;
+	}
 
 	if (folio_is_device_private(folio)) {
 		swp_entry_t swp_entry;
@@ -873,10 +877,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 	folio_get(folio);
 
 	if (flush) {
-		pte_free(vma->vm_mm, pgtable);
+		if (pgtable)
+			pte_free(vma->vm_mm, pgtable);
 		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
 		pmdp_invalidate(vma, addr, pmdp);
-	} else {
+	} else if (pgtable) {
 		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
 		mm_inc_nr_ptes(vma->vm_mm);
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index bff8f222004e4..2519d579bc1d8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -76,6 +76,7 @@
 #include <linux/mm_inline.h>
 #include <linux/oom.h>
 
+#include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 #define CREATE_TRACE_POINTS
@@ -1975,6 +1976,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	unsigned long pfn;
 	unsigned long hsz = 0;
 	int ptes = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2009,6 +2011,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/*
 		 * If the folio is in an mlock()d vma, we must not swap it out.
@@ -2058,12 +2064,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			}
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				/*
 				 * We temporarily have to drop the PTL and
 				 * restart so we can process the PTE-mapped THP.
 				 */
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, false);
+						      pvmw.pmd, false, pgtable);
 				flags &= ~TTU_SPLIT_HUGE_PMD;
 				page_vma_mapped_walk_restart(&pvmw);
 				continue;
@@ -2343,6 +2358,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		break;
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
@@ -2402,6 +2420,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
 	unsigned long hsz = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2436,6 +2455,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
@@ -2443,8 +2466,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			__maybe_unused pmd_t pmdval;
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, true);
+						      pvmw.pmd, true, pgtable);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
@@ -2695,6 +2727,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		folio_put(folio);
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (13 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

The previous commit made deposit/withdraw necessary only on architectures
where arch_needs_pgtable_deposit() returns true (currently only the powerpc
hash MMU). The generic implementation in pgtable-generic.c and the s390/sparc
overrides are therefore dead code: every call site is guarded by
arch_needs_pgtable_deposit(), which is compile-time false on those
architectures. Remove them entirely and replace the extern declarations with
static inline no-op stubs for the default case.

pgtable_trans_huge_{deposit,withdraw}() are renamed to
arch_pgtable_trans_huge_{deposit,withdraw}().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 12 +++---
 arch/s390/include/asm/pgtable.h              |  6 ---
 arch/s390/mm/pgtable.c                       | 41 --------------------
 arch/sparc/include/asm/pgtable_64.h          |  6 ---
 arch/sparc/mm/tlb.c                          | 36 -----------------
 include/linux/pgtable.h                      | 16 +++++---
 mm/debug_vm_pgtable.c                        |  4 +-
 mm/huge_memory.c                             | 26 ++++++-------
 mm/khugepaged.c                              |  2 +-
 mm/memory.c                                  |  2 +-
 mm/migrate_device.c                          |  2 +-
 mm/pgtable-generic.c                         | 32 ---------------
 12 files changed, 35 insertions(+), 150 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 1a91762b455d9..e0dd2a83b9e05 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1360,18 +1360,18 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pud_t *pudp, int full);
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-static inline void pgtable_trans_huge_deposit(struct mm_struct *mm,
-					      pmd_t *pmdp, pgtable_t pgtable)
+#define arch_pgtable_trans_huge_deposit arch_pgtable_trans_huge_deposit
+static inline void arch_pgtable_trans_huge_deposit(struct mm_struct *mm,
+						   pmd_t *pmdp, pgtable_t pgtable)
 {
 	if (radix_enabled())
 		return radix__pgtable_trans_huge_deposit(mm, pmdp, pgtable);
 	return hash__pgtable_trans_huge_deposit(mm, pmdp, pgtable);
 }
 
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
-						    pmd_t *pmdp)
+#define arch_pgtable_trans_huge_withdraw arch_pgtable_trans_huge_withdraw
+static inline pgtable_t arch_pgtable_trans_huge_withdraw(struct mm_struct *mm,
+							 pmd_t *pmdp)
 {
 	if (radix_enabled())
 		return radix__pgtable_trans_huge_withdraw(mm, pmdp);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1c3c3be93be9c..6bffe88b297b8 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1659,12 +1659,6 @@ pud_t pudp_xchg_direct(struct mm_struct *, unsigned long, pud_t *, pud_t);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable);
-
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 
 #define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
 static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 4acd8b140c4bd..c9a9ab2c7d937 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -312,44 +312,3 @@ pud_t pudp_xchg_direct(struct mm_struct *mm, unsigned long addr,
 	return old;
 }
 EXPORT_SYMBOL(pudp_xchg_direct);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	struct list_head *lh = (struct list_head *) pgtable;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(lh);
-	else
-		list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	struct list_head *lh;
-	pgtable_t pgtable;
-	pte_t *ptep;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	lh = (struct list_head *) pgtable;
-	if (list_empty(lh))
-		pmd_huge_pte(mm, pmdp) = NULL;
-	else {
-		pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
-		list_del(lh);
-	}
-	ptep = (pte_t *) pgtable;
-	set_pte(ptep, __pte(_PAGE_INVALID));
-	ptep++;
-	set_pte(ptep, __pte(_PAGE_INVALID));
-	return pgtable;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 74ede706fb325..60861560f8c40 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -987,12 +987,6 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable);
-
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #endif
 
 /*
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 6d9dd5eb13287..9049d54e6e2cb 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -275,40 +275,4 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 	return old;
 }
 
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	struct list_head *lh = (struct list_head *) pgtable;
-
-	assert_spin_locked(&mm->page_table_lock);
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(lh);
-	else
-		list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	struct list_head *lh;
-	pgtable_t pgtable;
-
-	assert_spin_locked(&mm->page_table_lock);
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	lh = (struct list_head *) pgtable;
-	if (list_empty(lh))
-		pmd_huge_pte(mm, pmdp) = NULL;
-	else {
-		pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
-		list_del(lh);
-	}
-	pte_val(pgtable[0]) = 0;
-	pte_val(pgtable[1]) = 0;
-
-	return pgtable;
-}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 776993d4567b4..6e3b66d17ccf0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1171,13 +1171,19 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				       pgtable_t pgtable);
+#ifndef arch_pgtable_trans_huge_deposit
+static inline void arch_pgtable_trans_huge_deposit(struct mm_struct *mm,
+						   pmd_t *pmdp, pgtable_t pgtable)
+{
+}
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+#ifndef arch_pgtable_trans_huge_withdraw
+static inline pgtable_t arch_pgtable_trans_huge_withdraw(struct mm_struct *mm,
+							 pmd_t *pmdp)
+{
+	return NULL;
+}
 #endif
 
 #ifndef arch_needs_pgtable_deposit
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 83cf07269f134..2f811c5a083ce 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -240,7 +240,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
 	/* Align the address wrt HPAGE_PMD_SIZE */
 	vaddr &= HPAGE_PMD_MASK;
 
-	pgtable_trans_huge_deposit(args->mm, args->pmdp, args->start_ptep);
+	arch_pgtable_trans_huge_deposit(args->mm, args->pmdp, args->start_ptep);
 
 	pmd = pfn_pmd(args->pmd_pfn, args->page_prot);
 	set_pmd_at(args->mm, vaddr, args->pmdp, pmd);
@@ -276,7 +276,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
 
 	/*  Clear the pte entries  */
 	pmdp_huge_get_and_clear(args->mm, vaddr, args->pmdp);
-	pgtable_trans_huge_withdraw(args->mm, args->pmdp);
+	arch_pgtable_trans_huge_withdraw(args->mm, args->pmdp);
 }
 
 static void __init pmd_leaf_tests(struct pgtable_debug_args *args)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a10cb136000d1..55b14ba244b1b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1359,7 +1359,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			return ret;
 		}
 		if (pgtable) {
-			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+			arch_pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
 						   pgtable);
 			mm_inc_nr_ptes(vma->vm_mm);
 		}
@@ -1460,7 +1460,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
 	entry = pmd_mkspecial(entry);
 	if (pgtable) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		mm_inc_nr_ptes(mm);
 	}
 	set_pmd_at(mm, haddr, pmd, entry);
@@ -1593,7 +1593,7 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	if (pgtable) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		mm_inc_nr_ptes(mm);
 		pgtable = NULL;
 	}
@@ -1855,7 +1855,7 @@ static void copy_huge_non_present_pmd(
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 	if (pgtable) {
 		mm_inc_nr_ptes(dst_mm);
-		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	}
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
@@ -1962,7 +1962,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 out_zero_page:
 	if (pgtable) {
 		mm_inc_nr_ptes(dst_mm);
-		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	}
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
@@ -2370,7 +2370,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t pgtable;
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	pte_free(mm, pgtable);
 	mm_dec_nr_ptes(mm);
 }
@@ -2389,7 +2389,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	/*
 	 * For architectures like ppc64 we look at deposited pgtable
 	 * when calling pmdp_huge_get_and_clear. So do the
-	 * pgtable_trans_huge_withdraw after finishing pmdp related
+	 * arch_pgtable_trans_huge_withdraw after finishing pmdp related
 	 * operations.
 	 */
 	orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
@@ -2531,8 +2531,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
 		    arch_needs_pgtable_deposit()) {
 			pgtable_t pgtable;
-			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
-			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
+			pgtable = arch_pgtable_trans_huge_withdraw(mm, old_pmd);
+			arch_pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
 		}
 		pmd = move_soft_dirty_pmd(pmd);
 		if (vma_has_uffd_without_event_remap(vma))
@@ -2838,8 +2838,8 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
 
 	if (arch_needs_pgtable_deposit()) {
-		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
-		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+		src_pgtable = arch_pgtable_trans_huge_withdraw(mm, src_pmd);
+		arch_pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
 	}
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
@@ -3001,7 +3001,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 
 	if (arch_needs_pgtable_deposit()) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	} else {
 		VM_BUG_ON(!pgtable);
 		/*
@@ -3218,7 +3218,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * This's critical for some architectures (Power).
 	 */
 	if (arch_needs_pgtable_deposit()) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	} else {
 		VM_BUG_ON(!pgtable);
 		/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 735d7ee5bbab2..2b426bcd16977 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1225,7 +1225,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	if (arch_needs_pgtable_deposit()) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	} else {
 		mm_dec_nr_ptes(mm);
 		pte_free(mm, pgtable);
diff --git a/mm/memory.c b/mm/memory.c
index 51d2717e3f1b4..4ec1ae909baf4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5384,7 +5384,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+	arch_pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
 	/*
 	 * We are going to consume the prealloc table,
 	 * count that as nr_ptes.
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 1adb5abccfb70..be84ace37b88f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -882,7 +882,7 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
 		pmdp_invalidate(vma, addr, pmdp);
 	} else if (pgtable) {
-		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+		arch_pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
 		mm_inc_nr_ptes(vma->vm_mm);
 	}
 	set_pmd_at(vma->vm_mm, addr, pmdp, entry);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index af7966169d695..d8d5875d66fed 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -162,38 +162,6 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(&pgtable->lru);
-	else
-		list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-#endif
-
-#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-/* no "address" argument so destroys page coloring of some arch */
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	pgtable_t pgtable;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	pmd_huge_pte(mm, pmdp) = list_first_entry_or_null(&pgtable->lru,
-							  struct page, lru);
-	if (pmd_huge_pte(mm, pmdp))
-		list_del(&pgtable->lru);
-	return pgtable;
-}
-#endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
-- 
2.47.3




* [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (14 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 13:56   ` kernel test robot
                     ` (2 more replies)
  2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
                   ` (5 subsequent siblings)
  21 siblings, 3 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add a vmstat counter to track PTE allocation failures during PMD split.
This enables monitoring of split failures due to memory pressure after
the lazy PTE page table allocation change.

The counter is incremented in three places:
- __split_huge_pmd(): Main entry point for splitting a PMD
- try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
- try_to_migrate_one(): When migration needs to split a PMD-mapped THP
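
With the counter in place, /proc/vmstat exposes thp_split_pmd_failed next to
thp_split_pmd. A minimal sketch of how a monitoring script might pull both
values; the here-doc stands in for /proc/vmstat, since the failed counter
only exists on a kernel with this series applied:

```shell
# Parse thp_split_pmd and thp_split_pmd_failed from vmstat-style output.
# On a patched kernel, replace the here-doc with: < /proc/vmstat
awk '$1 == "thp_split_pmd" { split_pmd = $2 }
     $1 == "thp_split_pmd_failed" { failed = $2 }
     END { printf "splits=%d failed=%d\n", split_pmd, failed }' <<'EOF'
thp_split_page 10
thp_split_pmd 42
thp_split_pmd_failed 3
EOF
# prints: splits=42 failed=3
```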

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/vm_event_item.h | 1 +
 mm/huge_memory.c              | 1 +
 mm/rmap.c                     | 3 +++
 mm/vmstat.c                   | 1 +
 4 files changed, 6 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a0201..ce696cf7d6321 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -98,6 +98,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_UNDERUSED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
+		THP_SPLIT_PMD_FAILED,
 		THP_SCAN_EXCEED_NONE_PTE,
 		THP_SCAN_EXCEED_SWAP_PTE,
 		THP_SCAN_EXCEED_SHARED_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55b14ba244b1b..fc0a5e91b4d40 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3347,6 +3347,7 @@ int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
 		if (!pgtable) {
+			count_vm_event(THP_SPLIT_PMD_FAILED);
 			mmu_notifier_invalidate_range_end(&range);
 			return -ENOMEM;
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index 2519d579bc1d8..2dae46fff08ae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2067,8 +2067,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				pgtable_t pgtable = prealloc_pte;
 
 				prealloc_pte = NULL;
+
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+					count_vm_event(THP_SPLIT_PMD_FAILED);
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
@@ -2471,6 +2473,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				prealloc_pte = NULL;
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+					count_vm_event(THP_SPLIT_PMD_FAILED);
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 667474773dbc7..da276ef0072ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1408,6 +1408,7 @@ const char * const vmstat_text[] = {
 	[I(THP_DEFERRED_SPLIT_PAGE)]		= "thp_deferred_split_page",
 	[I(THP_UNDERUSED_SPLIT_PAGE)]		= "thp_underused_split_page",
 	[I(THP_SPLIT_PMD)]			= "thp_split_pmd",
+	[I(THP_SPLIT_PMD_FAILED)]		= "thp_split_pmd_failed",
 	[I(THP_SCAN_EXCEED_NONE_PTE)]		= "thp_scan_exceed_none_pte",
 	[I(THP_SCAN_EXCEED_SWAP_PTE)]		= "thp_scan_exceed_swap_pte",
 	[I(THP_SCAN_EXCEED_SHARED_PTE)]		= "thp_scan_exceed_share_pte",
-- 
2.47.3




* [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (15 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test infrastructure for verifying THP PMD split behavior with lazy
PTE allocation. This includes:

- Test fixture with PMD-aligned memory allocation
- Helper functions for reading vmstat counters
- log_and_check_pmd_split() helper for logging counters and checking
  that thp_split_pmd has incremented and thp_split_pmd_failed hasn't.
- THP allocation helper with verification

Also add a test that checks whether a partial unmap of a THP splits the
PMD. This exercises the zap_pmd_range() part of the split path.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 tools/testing/selftests/mm/Makefile           |   1 +
 .../testing/selftests/mm/thp_pmd_split_test.c | 149 ++++++++++++++++++
 2 files changed, 150 insertions(+)
 create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 7a5de4e9bf520..e80551e76013a 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -95,6 +95,7 @@ TEST_GEN_FILES += uffd-stress
 TEST_GEN_FILES += uffd-unit-tests
 TEST_GEN_FILES += uffd-wp-mremap
 TEST_GEN_FILES += split_huge_page_test
+TEST_GEN_FILES += thp_pmd_split_test
 TEST_GEN_FILES += ksm_tests
 TEST_GEN_FILES += ksm_functional_tests
 TEST_GEN_FILES += mdwe_test
diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
new file mode 100644
index 0000000000000..0f54ac04760d5
--- /dev/null
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Tests various kernel code paths that handle THP PMD splitting.
+ *
+ * Prerequisites:
+ * - THP enabled (always or madvise mode):
+ *   echo always > /sys/kernel/mm/transparent_hugepage/enabled
+ *   or
+ *   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdint.h>
+
+#include "kselftest_harness.h"
+#include "thp_settings.h"
+#include "vm_util.h"
+
+/* Read vmstat counter */
+static unsigned long read_vmstat(const char *name)
+{
+	FILE *fp;
+	char line[256];
+	unsigned long value = 0;
+
+	fp = fopen("/proc/vmstat", "r");
+	if (!fp)
+		return 0;
+
+	while (fgets(line, sizeof(line), fp)) {
+		if (strncmp(line, name, strlen(name)) == 0 &&
+		    line[strlen(name)] == ' ') {
+			sscanf(line + strlen(name), " %lu", &value);
+			break;
+		}
+	}
+	fclose(fp);
+	return value;
+}
+
+/*
+ * Log the thp_split_pmd and thp_split_pmd_failed vmstat counters and
+ * check that thp_split_pmd has incremented since @split_pmd_before
+ * while thp_split_pmd_failed has stayed at @split_pmd_failed_before.
+ */
+static void log_and_check_pmd_split(struct __test_metadata *const _metadata,
+	unsigned long split_pmd_before, unsigned long split_pmd_failed_before)
+{
+	unsigned long split_pmd_after = read_vmstat("thp_split_pmd");
+	unsigned long split_pmd_failed_after = read_vmstat("thp_split_pmd_failed");
+
+	TH_LOG("thp_split_pmd: %lu -> %lu",
+	       split_pmd_before, split_pmd_after);
+	TH_LOG("thp_split_pmd_failed: %lu -> %lu",
+	       split_pmd_failed_before, split_pmd_failed_after);
+	ASSERT_GT(split_pmd_after, split_pmd_before);
+	ASSERT_EQ(split_pmd_failed_after, split_pmd_failed_before);
+}
+
+/* Allocate a THP at the given aligned address */
+static int allocate_thp(void *aligned, size_t pmdsize)
+{
+	int ret;
+
+	ret = madvise(aligned, pmdsize, MADV_HUGEPAGE);
+	if (ret)
+		return -1;
+
+	/* Touch all pages to allocate the THP */
+	memset(aligned, 0xAA, pmdsize);
+
+	/* Verify we got a THP */
+	if (!check_huge_anon(aligned, 1, pmdsize))
+		return -1;
+
+	return 0;
+}
+
+FIXTURE(thp_pmd_split)
+{
+	void *mem;		/* Base mmap allocation */
+	void *aligned;		/* PMD-aligned pointer within mem */
+	size_t pmdsize;		/* PMD size from sysfs */
+	size_t pagesize;	/* Base page size */
+	size_t mmap_size;	/* Total mmap size for alignment */
+	unsigned long split_pmd_before;
+	unsigned long split_pmd_failed_before;
+};
+
+FIXTURE_SETUP(thp_pmd_split)
+{
+	if (!thp_available())
+		SKIP(return, "THP not available");
+
+	self->pmdsize = read_pmd_pagesize();
+	if (!self->pmdsize)
+		SKIP(return, "Unable to read PMD size");
+
+	self->pagesize = getpagesize();
+	self->mmap_size = 4 * self->pmdsize;
+
+	self->split_pmd_before = read_vmstat("thp_split_pmd");
+	self->split_pmd_failed_before = read_vmstat("thp_split_pmd_failed");
+
+	self->mem = mmap(NULL, self->mmap_size, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(self->mem, MAP_FAILED);
+
+	/* Align to PMD boundary */
+	self->aligned = (void *)(((unsigned long)self->mem + self->pmdsize - 1) &
+				 ~(self->pmdsize - 1));
+}
+
+FIXTURE_TEARDOWN(thp_pmd_split)
+{
+	if (self->mem && self->mem != MAP_FAILED)
+		munmap(self->mem, self->mmap_size);
+}
+
+/*
+ * Partial munmap on THP (zap_pmd_range)
+ *
+ * Tests that a partial munmap of a THP correctly splits the PMD.
+ * This exercises the zap_pmd_range() part of the split path.
+ */
+TEST_F(thp_pmd_split, partial_munmap)
+{
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	ret = munmap((char *)self->aligned + self->pagesize, self->pagesize);
+	ASSERT_EQ(ret, 0);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
+TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (16 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial mprotect on THP which exercises change_pmd_range().
This verifies that partial mprotect correctly splits the PMD, applies
protection only to the requested portion, and leaves the rest of the
mapping writable.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 31 +++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 0f54ac04760d5..4944a5a516da9 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -146,4 +146,35 @@ TEST_F(thp_pmd_split, partial_munmap)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mprotect on THP (change_pmd_range)
+ *
+ * Tests that partial mprotect of a THP correctly splits the PMD and
+ * applies protection only to the requested portion. This exercises
+ * the mprotect path which now handles split failures.
+ */
+TEST_F(thp_pmd_split, partial_mprotect)
+{
+	volatile unsigned char *ptr = (volatile unsigned char *)self->aligned;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Partial mprotect - make middle page read-only */
+	ret = mprotect((char *)self->aligned + self->pagesize, self->pagesize, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Verify we can still write to non-protected pages */
+	ptr[0] = 0xDD;
+	ptr[self->pmdsize - 1] = 0xEE;
+
+	ASSERT_EQ(ptr[0], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[self->pmdsize - 1], (unsigned char)0xEE);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3




* [RFC v2 19/21] selftests/mm: add partial_mlock test
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (17 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial mlock on THP which exercises walk_page_range()
with a subset of the THP. This should trigger a PMD split since
mlock operates at page granularity.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 4944a5a516da9..3c9f05457efec 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -177,4 +177,30 @@ TEST_F(thp_pmd_split, partial_mprotect)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mlock triggering split (walk_page_range)
+ *
+ * Tests mlock on a partial THP region which should trigger a PMD split.
+ */
+TEST_F(thp_pmd_split, partial_mlock)
+{
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Partial mlock - should trigger PMD split */
+	ret = mlock((char *)self->aligned + self->pagesize, self->pagesize);
+	if (ret && errno == ENOMEM)
+		SKIP(return, "mlock failed with ENOMEM (resource limit)");
+	ASSERT_EQ(ret, 0);
+
+	/* Cleanup */
+	munlock((char *)self->aligned + self->pagesize, self->pagesize);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3




* [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (18 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
  2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial mremap on THP which exercises move_page_tables().
This verifies that partial mremap correctly splits the PMD, moves
only the requested page, and preserves data integrity in both the
moved region and the original mapping.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 50 +++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 3c9f05457efec..1f29296759a5b 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -203,4 +203,54 @@ TEST_F(thp_pmd_split, partial_mlock)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mremap (move_page_tables)
+ *
+ * Tests that partial mremap of a THP correctly splits the PMD and
+ * moves only the requested portion. This exercises move_page_tables()
+ * which now handles split failures.
+ */
+TEST_F(thp_pmd_split, partial_mremap)
+{
+	void *new_addr;
+	unsigned long *ptr = (unsigned long *)self->aligned;
+	unsigned long *new_ptr;
+	unsigned long pattern = 0xABCDUL;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Write pattern to the page we'll move */
+	ptr[self->pagesize / sizeof(unsigned long)] = pattern;
+
+	/* Also write to first and last page to verify they stay intact */
+	ptr[0] = 0x1234UL;
+	ptr[(self->pmdsize - self->pagesize) / sizeof(unsigned long)] = 0x4567UL;
+
+	/* Partial mremap - move one base page from the THP */
+	new_addr = mremap((char *)self->aligned + self->pagesize, self->pagesize,
+			  self->pagesize, MREMAP_MAYMOVE);
+	if (new_addr == MAP_FAILED) {
+		if (errno == ENOMEM)
+			SKIP(return, "mremap failed with ENOMEM");
+		ASSERT_NE(new_addr, MAP_FAILED);
+	}
+
+	/* Verify data was moved correctly */
+	new_ptr = (unsigned long *)new_addr;
+	ASSERT_EQ(new_ptr[0], pattern);
+
+	/* Verify surrounding data is intact */
+	ASSERT_EQ(ptr[0], 0x1234UL);
+	ASSERT_EQ(ptr[(self->pmdsize - self->pagesize) / sizeof(unsigned long)], 0x4567UL);
+
+	/* Cleanup the moved page */
+	munmap(new_addr, self->pagesize);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3




* [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (19 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache
  21 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial MADV_DONTNEED on THP. This verifies that
MADV_DONTNEED correctly triggers a PMD split, discards only the
requested page (which becomes zero-filled), and preserves data
in the surrounding pages.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 1f29296759a5b..060ca1e341b75 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -253,4 +253,38 @@ TEST_F(thp_pmd_split, partial_mremap)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * MADV_DONTNEED on THP
+ *
+ * Tests that MADV_DONTNEED on a partial THP correctly handles
+ * the PMD split and discards only the requested pages.
+ */
+TEST_F(thp_pmd_split, partial_madv_dontneed)
+{
+	volatile unsigned char *ptr = (volatile unsigned char *)self->aligned;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Write pattern */
+	memset(self->aligned, 0xDD, self->pmdsize);
+
+	/* Partial MADV_DONTNEED - discard middle page */
+	ret = madvise((char *)self->aligned + self->pagesize, self->pagesize, MADV_DONTNEED);
+	ASSERT_EQ(ret, 0);
+
+	/* Verify non-discarded pages still have data */
+	ASSERT_EQ(ptr[0], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[2 * self->pagesize], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[self->pmdsize - 1], (unsigned char)0xDD);
+
+	/* Discarded page should be zero */
+	ASSERT_EQ(ptr[self->pagesize], (unsigned char)0x00);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3




* Re: [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter
  2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
@ 2026-02-26 13:56   ` kernel test robot
  2026-02-26 14:22   ` Usama Arif
  2026-02-26 15:10   ` kernel test robot
  2 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2026-02-26 13:56 UTC (permalink / raw)
  To: Usama Arif; +Cc: oe-kbuild-all

Hi Usama,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-make-split_huge_pmd-functions-return-int-for-error-propagation/20260226-193910
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260226113233.3987674-17-usama.arif%40linux.dev
patch subject: [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260226/202602262157.PBMOf3wm-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260226/202602262157.PBMOf3wm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602262157.PBMOf3wm-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:2073:56: error: 'THP_SPLIT_PMD_FAILED' undeclared (first use in this function); did you mean 'MTHP_STAT_SPLIT_FAILED'?
    2073 |                                         count_vm_event(THP_SPLIT_PMD_FAILED);
         |                                                        ^~~~~~~~~~~~~~~~~~~~
         |                                                        MTHP_STAT_SPLIT_FAILED
   mm/rmap.c:2073:56: note: each undeclared identifier is reported only once for each function it appears in
   mm/rmap.c: In function 'try_to_migrate_one':
   mm/rmap.c:2476:56: error: 'THP_SPLIT_PMD_FAILED' undeclared (first use in this function); did you mean 'MTHP_STAT_SPLIT_FAILED'?
    2476 |                                         count_vm_event(THP_SPLIT_PMD_FAILED);
         |                                                        ^~~~~~~~~~~~~~~~~~~~
         |                                                        MTHP_STAT_SPLIT_FAILED


vim +2073 mm/rmap.c

  1961	
  1962	/*
  1963	 * @arg: enum ttu_flags will be passed to this argument
  1964	 */
  1965	static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  1966			     unsigned long address, void *arg)
  1967	{
  1968		struct mm_struct *mm = vma->vm_mm;
  1969		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
  1970		bool anon_exclusive, ret = true;
  1971		pte_t pteval;
  1972		struct page *subpage;
  1973		struct mmu_notifier_range range;
  1974		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  1975		unsigned long nr_pages = 1, end_addr;
  1976		unsigned long pfn;
  1977		unsigned long hsz = 0;
  1978		int ptes = 0;
  1979		pgtable_t prealloc_pte = NULL;
  1980	
  1981		/*
  1982		 * When racing against e.g. zap_pte_range() on another cpu,
  1983		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  1984		 * try_to_unmap() may return before page_mapped() has become false,
  1985		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  1986		 */
  1987		if (flags & TTU_SYNC)
  1988			pvmw.flags = PVMW_SYNC;
  1989	
  1990		/*
  1991	 * For THP, we have to assume the worst case, i.e. pmd, for invalidation.
  1992		 * For hugetlb, it could be much worse if we need to do pud
  1993		 * invalidation in the case of pmd sharing.
  1994		 *
  1995		 * Note that the folio can not be freed in this function as call of
  1996		 * try_to_unmap() must hold a reference on the folio.
  1997		 */
  1998		range.end = vma_address_end(&pvmw);
  1999		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  2000					address, range.end);
  2001		if (folio_test_hugetlb(folio)) {
  2002			/*
  2003			 * If sharing is possible, start and end will be adjusted
  2004			 * accordingly.
  2005			 */
  2006			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  2007							     &range.end);
  2008	
  2009			/* We need the huge page size for set_huge_pte_at() */
  2010			hsz = huge_page_size(hstate_vma(vma));
  2011		}
  2012		mmu_notifier_invalidate_range_start(&range);
  2013	
  2014		if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
  2015		    !arch_needs_pgtable_deposit())
  2016			prealloc_pte = pte_alloc_one(mm);
  2017	
  2018		while (page_vma_mapped_walk(&pvmw)) {
  2019			/*
  2020			 * If the folio is in an mlock()d vma, we must not swap it out.
  2021			 */
  2022			if (!(flags & TTU_IGNORE_MLOCK) &&
  2023			    (vma->vm_flags & VM_LOCKED)) {
  2024				ptes++;
  2025	
  2026				/*
  2027				 * Set 'ret' to indicate the page cannot be unmapped.
  2028				 *
  2029				 * Do not jump to walk_abort immediately as additional
  2030				 * iteration might be required to detect fully mapped
  2031				 * folio and mlock it.
  2032				 */
  2033				ret = false;
  2034	
  2035				/* Only mlock fully mapped pages */
  2036				if (pvmw.pte && ptes != pvmw.nr_pages)
  2037					continue;
  2038	
  2039				/*
  2040				 * All PTEs must be protected by page table lock in
  2041				 * order to mlock the page.
  2042				 *
  2043			 * If a page table boundary has been crossed, the
  2044			 * current ptl only protects part of the ptes.
  2045				 */
  2046				if (pvmw.flags & PVMW_PGTABLE_CROSSED)
  2047					goto walk_done;
  2048	
  2049				/* Restore the mlock which got missed */
  2050				mlock_vma_folio(folio, vma);
  2051				goto walk_done;
  2052			}
  2053	
  2054			if (!pvmw.pte) {
  2055				if (folio_test_lazyfree(folio)) {
  2056					if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
  2057						goto walk_done;
  2058					/*
  2059					 * unmap_huge_pmd_locked has either already marked
  2060					 * the folio as swap-backed or decided to retain it
  2061					 * due to GUP or speculative references.
  2062					 */
  2063					goto walk_abort;
  2064				}
  2065	
  2066				if (flags & TTU_SPLIT_HUGE_PMD) {
  2067					pgtable_t pgtable = prealloc_pte;
  2068	
  2069					prealloc_pte = NULL;
  2070	
  2071					if (!arch_needs_pgtable_deposit() && !pgtable &&
  2072					    vma_is_anonymous(vma)) {
> 2073						count_vm_event(THP_SPLIT_PMD_FAILED);
  2074						page_vma_mapped_walk_done(&pvmw);
  2075						ret = false;
  2076						break;
  2077					}
  2078					/*
  2079					 * We temporarily have to drop the PTL and
  2080					 * restart so we can process the PTE-mapped THP.
  2081					 */
  2082					split_huge_pmd_locked(vma, pvmw.address,
  2083							      pvmw.pmd, false, pgtable);
  2084					flags &= ~TTU_SPLIT_HUGE_PMD;
  2085					page_vma_mapped_walk_restart(&pvmw);
  2086					continue;
  2087				}
  2088			}
  2089	
  2090			/* Unexpected PMD-mapped THP? */
  2091			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2092	
  2093			/*
  2094			 * Handle PFN swap PTEs, such as device-exclusive ones, that
  2095			 * actually map pages.
  2096			 */
  2097			pteval = ptep_get(pvmw.pte);
  2098			if (likely(pte_present(pteval))) {
  2099				pfn = pte_pfn(pteval);
  2100			} else {
  2101				const softleaf_t entry = softleaf_from_pte(pteval);
  2102	
  2103				pfn = softleaf_to_pfn(entry);
  2104				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  2105			}
  2106	
  2107			subpage = folio_page(folio, pfn - folio_pfn(folio));
  2108			address = pvmw.address;
  2109			anon_exclusive = folio_test_anon(folio) &&
  2110					 PageAnonExclusive(subpage);
  2111	
  2112			if (folio_test_hugetlb(folio)) {
  2113				bool anon = folio_test_anon(folio);
  2114	
  2115				/*
  2116				 * The try_to_unmap() is only passed a hugetlb page
  2117				 * in the case where the hugetlb page is poisoned.
  2118				 */
  2119				VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
  2120				/*
  2121				 * huge_pmd_unshare may unmap an entire PMD page.
  2122				 * There is no way of knowing exactly which PMDs may
  2123				 * be cached for this mm, so we must flush them all.
  2124				 * start/end were already adjusted above to cover this
  2125				 * range.
  2126				 */
  2127				flush_cache_range(vma, range.start, range.end);
  2128	
  2129				/*
  2130				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2131				 * held in write mode.  Caller needs to explicitly
  2132				 * do this outside rmap routines.
  2133				 *
  2134				 * We also must hold hugetlb vma_lock in write mode.
  2135				 * Lock order dictates acquiring vma_lock BEFORE
  2136				 * i_mmap_rwsem.  We can only try lock here and fail
  2137				 * if unsuccessful.
  2138				 */
  2139				if (!anon) {
  2140					struct mmu_gather tlb;
  2141	
  2142					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2143					if (!hugetlb_vma_trylock_write(vma))
  2144						goto walk_abort;
  2145	
  2146					tlb_gather_mmu_vma(&tlb, vma);
  2147					if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
  2148						hugetlb_vma_unlock_write(vma);
  2149						huge_pmd_unshare_flush(&tlb, vma);
  2150						tlb_finish_mmu(&tlb);
  2151						/*
  2152						 * The PMD table was unmapped,
  2153						 * consequently unmapping the folio.
  2154						 */
  2155						goto walk_done;
  2156					}
  2157					hugetlb_vma_unlock_write(vma);
  2158					tlb_finish_mmu(&tlb);
  2159				}
  2160				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2161				if (pte_dirty(pteval))
  2162					folio_mark_dirty(folio);
  2163			} else if (likely(pte_present(pteval))) {
  2164				nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
  2165				end_addr = address + nr_pages * PAGE_SIZE;
  2166				flush_cache_range(vma, address, end_addr);
  2167	
  2168				/* Nuke the page table entry. */
  2169				pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
  2170				/*
  2171				 * We clear the PTE but do not flush so potentially
  2172				 * a remote CPU could still be writing to the folio.
  2173				 * If the entry was previously clean then the
  2174				 * architecture must guarantee that a clear->dirty
  2175				 * transition on a cached TLB entry is written through
  2176				 * and traps if the PTE is unmapped.
  2177				 */
  2178				if (should_defer_flush(mm, flags))
  2179					set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
  2180				else
  2181					flush_tlb_range(vma, address, end_addr);
  2182				if (pte_dirty(pteval))
  2183					folio_mark_dirty(folio);
  2184			} else {
  2185				pte_clear(mm, address, pvmw.pte);
  2186			}
  2187	
  2188			/*
  2189			 * Now the pte is cleared. If this pte was uffd-wp armed,
  2190			 * we may want to replace a none pte with a marker pte if
  2191			 * it's file-backed, so we don't lose the tracking info.
  2192			 */
  2193			pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
  2194	
  2195			/* Update high watermark before we lower rss */
  2196			update_hiwater_rss(mm);
  2197	
  2198			if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
  2199				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2200				if (folio_test_hugetlb(folio)) {
  2201					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2202					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2203							hsz);
  2204				} else {
  2205					dec_mm_counter(mm, mm_counter(folio));
  2206					set_pte_at(mm, address, pvmw.pte, pteval);
  2207				}
  2208			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2209				   !userfaultfd_armed(vma)) {
  2210				/*
  2211				 * The guest indicated that the page content is of no
  2212				 * interest anymore. Simply discard the pte, vmscan
  2213				 * will take care of the rest.
  2214				 * A future reference will then fault in a new zero
  2215				 * page. When userfaultfd is active, we must not drop
  2216				 * this page though, as its main user (postcopy
  2217				 * migration) will not expect userfaults on already
  2218				 * copied pages.
  2219				 */
  2220				dec_mm_counter(mm, mm_counter(folio));
  2221			} else if (folio_test_anon(folio)) {
  2222				swp_entry_t entry = page_swap_entry(subpage);
  2223				pte_t swp_pte;
  2224				/*
  2225				 * Store the swap location in the pte.
  2226				 * See handle_pte_fault() ...
  2227				 */
  2228				if (unlikely(folio_test_swapbacked(folio) !=
  2229						folio_test_swapcache(folio))) {
  2230					WARN_ON_ONCE(1);
  2231					goto walk_abort;
  2232				}
  2233	
  2234				/* MADV_FREE page check */
  2235				if (!folio_test_swapbacked(folio)) {
  2236					int ref_count, map_count;
  2237	
  2238					/*
  2239					 * Synchronize with gup_pte_range():
  2240					 * - clear PTE; barrier; read refcount
  2241					 * - inc refcount; barrier; read PTE
  2242					 */
  2243					smp_mb();
  2244	
  2245					ref_count = folio_ref_count(folio);
  2246					map_count = folio_mapcount(folio);
  2247	
  2248					/*
  2249					 * Order reads for page refcount and dirty flag
  2250					 * (see comments in __remove_mapping()).
  2251					 */
  2252					smp_rmb();
  2253	
  2254					if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
  2255						/*
  2256						 * redirtied either using the page table or a previously
  2257						 * obtained GUP reference.
  2258						 */
  2259						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2260						folio_set_swapbacked(folio);
  2261						goto walk_abort;
  2262					} else if (ref_count != 1 + map_count) {
  2263						/*
  2264						 * Additional reference. Could be a GUP reference or any
  2265						 * speculative reference. GUP users must mark the folio
  2266						 * dirty if there was a modification. This folio cannot be
  2267						 * reclaimed right now either way, so act just like nothing
  2268						 * happened.
  2269						 * We'll come back here later and detect if the folio was
  2270						 * dirtied when the additional reference is gone.
  2271						 */
  2272						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2273						goto walk_abort;
  2274					}
  2275					add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
  2276					goto discard;
  2277				}
  2278	
  2279				if (folio_dup_swap(folio, subpage) < 0) {
  2280					set_pte_at(mm, address, pvmw.pte, pteval);
  2281					goto walk_abort;
  2282				}
  2283	
  2284				/*
  2285				 * arch_unmap_one() is expected to be a NOP on
  2286				 * architectures where we could have PFN swap PTEs,
  2287				 * so we'll not check/care.
  2288				 */
  2289				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2290					folio_put_swap(folio, subpage);
  2291					set_pte_at(mm, address, pvmw.pte, pteval);
  2292					goto walk_abort;
  2293				}
  2294	
  2295				/* See folio_try_share_anon_rmap(): clear PTE first. */
  2296				if (anon_exclusive &&
  2297				    folio_try_share_anon_rmap_pte(folio, subpage)) {
  2298					folio_put_swap(folio, subpage);
  2299					set_pte_at(mm, address, pvmw.pte, pteval);
  2300					goto walk_abort;
  2301				}
  2302				if (list_empty(&mm->mmlist)) {
  2303					spin_lock(&mmlist_lock);
  2304					if (list_empty(&mm->mmlist))
  2305						list_add(&mm->mmlist, &init_mm.mmlist);
  2306					spin_unlock(&mmlist_lock);
  2307				}
  2308				dec_mm_counter(mm, MM_ANONPAGES);
  2309				inc_mm_counter(mm, MM_SWAPENTS);
  2310				swp_pte = swp_entry_to_pte(entry);
  2311				if (anon_exclusive)
  2312					swp_pte = pte_swp_mkexclusive(swp_pte);
  2313				if (likely(pte_present(pteval))) {
  2314					if (pte_soft_dirty(pteval))
  2315						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2316					if (pte_uffd_wp(pteval))
  2317						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2318				} else {
  2319					if (pte_swp_soft_dirty(pteval))
  2320						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2321					if (pte_swp_uffd_wp(pteval))
  2322						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2323				}
  2324				set_pte_at(mm, address, pvmw.pte, swp_pte);
  2325			} else {
  2326				/*
  2327				 * This is a locked file-backed folio,
  2328				 * so it cannot be removed from the page
  2329				 * cache and replaced by a new folio before
  2330				 * mmu_notifier_invalidate_range_end, so no
  2331				 * concurrent thread might update its page table
  2332				 * to point at a new folio while a device is
  2333				 * still using this folio.
  2334				 *
  2335				 * See Documentation/mm/mmu_notifier.rst
  2336				 */
  2337				add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
  2338			}
  2339	discard:
  2340			if (unlikely(folio_test_hugetlb(folio))) {
  2341				hugetlb_remove_rmap(folio);
  2342			} else {
  2343				folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
  2344			}
  2345			if (vma->vm_flags & VM_LOCKED)
  2346				mlock_drain_local();
  2347			folio_put_refs(folio, nr_pages);
  2348	
  2349			/*
  2350			 * If we are sure that we batched the entire folio and cleared
  2351			 * all PTEs, we can just optimize and stop right here.
  2352			 */
  2353			if (nr_pages == folio_nr_pages(folio))
  2354				goto walk_done;
  2355			continue;
  2356	walk_abort:
  2357			ret = false;
  2358	walk_done:
  2359			page_vma_mapped_walk_done(&pvmw);
  2360			break;
  2361		}
  2362	
  2363		if (prealloc_pte)
  2364			pte_free(mm, prealloc_pte);
  2365	
  2366		mmu_notifier_invalidate_range_end(&range);
  2367	
  2368		return ret;
  2369	}
  2370	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter
  2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
  2026-02-26 13:56   ` kernel test robot
@ 2026-02-26 14:22   ` Usama Arif
  2026-02-26 15:10   ` kernel test robot
  2 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-26 14:22 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390



On 26/02/2026 11:23, Usama Arif wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2519d579bc1d8..2dae46fff08ae 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2067,8 +2067,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				pgtable_t pgtable = prealloc_pte;
>  
>  				prealloc_pte = NULL;
> +
>  				if (!arch_needs_pgtable_deposit() && !pgtable &&
>  				    vma_is_anonymous(vma)) {
> +					count_vm_event(THP_SPLIT_PMD_FAILED);
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
>  					break;
> @@ -2471,6 +2473,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  				prealloc_pte = NULL;
>  				if (!arch_needs_pgtable_deposit() && !pgtable &&
>  				    vma_is_anonymous(vma)) {
> +					count_vm_event(THP_SPLIT_PMD_FAILED);
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
>  					break;
This will need to be guarded by CONFIG_TRANSPARENT_HUGEPAGE. Will need the below diff in the next version of the series.

diff --git a/mm/rmap.c b/mm/rmap.c
index 2dae46fff08ae..9d74600951cf6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2070,7 +2070,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
                                if (!arch_needs_pgtable_deposit() && !pgtable &&
                                    vma_is_anonymous(vma)) {
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
                                        count_vm_event(THP_SPLIT_PMD_FAILED);
+#endif
                                        page_vma_mapped_walk_done(&pvmw);
                                        ret = false;
                                        break;
@@ -2473,7 +2475,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
                                prealloc_pte = NULL;
                                if (!arch_needs_pgtable_deposit() && !pgtable &&
                                    vma_is_anonymous(vma)) {
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
                                        count_vm_event(THP_SPLIT_PMD_FAILED);
+#endif
                                        page_vma_mapped_walk_done(&pvmw);
                                        ret = false;
                                        break;



* Re: [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter
  2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
  2026-02-26 13:56   ` kernel test robot
  2026-02-26 14:22   ` Usama Arif
@ 2026-02-26 15:10   ` kernel test robot
  2 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2026-02-26 15:10 UTC (permalink / raw)
  To: Usama Arif; +Cc: llvm, oe-kbuild-all

Hi Usama,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-make-split_huge_pmd-functions-return-int-for-error-propagation/20260226-193910
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260226113233.3987674-17-usama.arif%40linux.dev
patch subject: [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20260226/202602262304.dAJZ9uRD-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260226/202602262304.dAJZ9uRD-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602262304.dAJZ9uRD-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/rmap.c:2073:21: error: use of undeclared identifier 'THP_SPLIT_PMD_FAILED'
    2073 |                                         count_vm_event(THP_SPLIT_PMD_FAILED);
         |                                                        ^
   mm/rmap.c:2476:21: error: use of undeclared identifier 'THP_SPLIT_PMD_FAILED'
    2476 |                                         count_vm_event(THP_SPLIT_PMD_FAILED);
         |                                                        ^
   2 errors generated.


vim +/THP_SPLIT_PMD_FAILED +2073 mm/rmap.c

  1961	
  1962	/*
  1963	 * @arg: enum ttu_flags will be passed to this argument
  1964	 */
  1965	static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  1966			     unsigned long address, void *arg)
  1967	{
  1968		struct mm_struct *mm = vma->vm_mm;
  1969		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
  1970		bool anon_exclusive, ret = true;
  1971		pte_t pteval;
  1972		struct page *subpage;
  1973		struct mmu_notifier_range range;
  1974		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  1975		unsigned long nr_pages = 1, end_addr;
  1976		unsigned long pfn;
  1977		unsigned long hsz = 0;
  1978		int ptes = 0;
  1979		pgtable_t prealloc_pte = NULL;
  1980	
  1981		/*
  1982		 * When racing against e.g. zap_pte_range() on another cpu,
  1983		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  1984		 * try_to_unmap() may return before page_mapped() has become false,
  1985		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  1986		 */
  1987		if (flags & TTU_SYNC)
  1988			pvmw.flags = PVMW_SYNC;
  1989	
  1990		/*
  1991	 * For THP, we have to assume the worst case, i.e. pmd, for invalidation.
  1992		 * For hugetlb, it could be much worse if we need to do pud
  1993		 * invalidation in the case of pmd sharing.
  1994		 *
  1995		 * Note that the folio can not be freed in this function as call of
  1996		 * try_to_unmap() must hold a reference on the folio.
  1997		 */
  1998		range.end = vma_address_end(&pvmw);
  1999		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  2000					address, range.end);
  2001		if (folio_test_hugetlb(folio)) {
  2002			/*
  2003			 * If sharing is possible, start and end will be adjusted
  2004			 * accordingly.
  2005			 */
  2006			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  2007							     &range.end);
  2008	
  2009			/* We need the huge page size for set_huge_pte_at() */
  2010			hsz = huge_page_size(hstate_vma(vma));
  2011		}
  2012		mmu_notifier_invalidate_range_start(&range);
  2013	
  2014		if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
  2015		    !arch_needs_pgtable_deposit())
  2016			prealloc_pte = pte_alloc_one(mm);
  2017	
  2018		while (page_vma_mapped_walk(&pvmw)) {
  2019			/*
  2020			 * If the folio is in an mlock()d vma, we must not swap it out.
  2021			 */
  2022			if (!(flags & TTU_IGNORE_MLOCK) &&
  2023			    (vma->vm_flags & VM_LOCKED)) {
  2024				ptes++;
  2025	
  2026				/*
  2027				 * Set 'ret' to indicate the page cannot be unmapped.
  2028				 *
  2029				 * Do not jump to walk_abort immediately as additional
  2030				 * iterations might be required to detect a fully mapped
  2031				 * folio and mlock it.
  2032				 */
  2033				ret = false;
  2034	
  2035				/* Only mlock fully mapped pages */
  2036				if (pvmw.pte && ptes != pvmw.nr_pages)
  2037					continue;
  2038	
  2039				/*
  2040				 * All PTEs must be protected by page table lock in
  2041				 * order to mlock the page.
  2042				 *
  2043				 * If a page table boundary has been crossed, the current
  2044				 * ptl only protects part of the ptes.
  2045				 */
  2046				if (pvmw.flags & PVMW_PGTABLE_CROSSED)
  2047					goto walk_done;
  2048	
  2049				/* Restore the mlock which got missed */
  2050				mlock_vma_folio(folio, vma);
  2051				goto walk_done;
  2052			}
  2053	
  2054			if (!pvmw.pte) {
  2055				if (folio_test_lazyfree(folio)) {
  2056					if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
  2057						goto walk_done;
  2058					/*
  2059					 * unmap_huge_pmd_locked has either already marked
  2060					 * the folio as swap-backed or decided to retain it
  2061					 * due to GUP or speculative references.
  2062					 */
  2063					goto walk_abort;
  2064				}
  2065	
  2066				if (flags & TTU_SPLIT_HUGE_PMD) {
  2067					pgtable_t pgtable = prealloc_pte;
  2068	
  2069					prealloc_pte = NULL;
  2070	
  2071					if (!arch_needs_pgtable_deposit() && !pgtable &&
  2072					    vma_is_anonymous(vma)) {
> 2073						count_vm_event(THP_SPLIT_PMD_FAILED);
  2074						page_vma_mapped_walk_done(&pvmw);
  2075						ret = false;
  2076						break;
  2077					}
  2078					/*
  2079					 * We temporarily have to drop the PTL and
  2080					 * restart so we can process the PTE-mapped THP.
  2081					 */
  2082					split_huge_pmd_locked(vma, pvmw.address,
  2083							      pvmw.pmd, false, pgtable);
  2084					flags &= ~TTU_SPLIT_HUGE_PMD;
  2085					page_vma_mapped_walk_restart(&pvmw);
  2086					continue;
  2087				}
  2088			}
  2089	
  2090			/* Unexpected PMD-mapped THP? */
  2091			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2092	
  2093			/*
  2094			 * Handle PFN swap PTEs, such as device-exclusive ones, that
  2095			 * actually map pages.
  2096			 */
  2097			pteval = ptep_get(pvmw.pte);
  2098			if (likely(pte_present(pteval))) {
  2099				pfn = pte_pfn(pteval);
  2100			} else {
  2101				const softleaf_t entry = softleaf_from_pte(pteval);
  2102	
  2103				pfn = softleaf_to_pfn(entry);
  2104				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  2105			}
  2106	
  2107			subpage = folio_page(folio, pfn - folio_pfn(folio));
  2108			address = pvmw.address;
  2109			anon_exclusive = folio_test_anon(folio) &&
  2110					 PageAnonExclusive(subpage);
  2111	
  2112			if (folio_test_hugetlb(folio)) {
  2113				bool anon = folio_test_anon(folio);
  2114	
  2115				/*
  2116				 * The try_to_unmap() is only passed a hugetlb page
  2117				 * in the case where the hugetlb page is poisoned.
  2118				 */
  2119				VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
  2120				/*
  2121				 * huge_pmd_unshare may unmap an entire PMD page.
  2122				 * There is no way of knowing exactly which PMDs may
  2123				 * be cached for this mm, so we must flush them all.
  2124				 * start/end were already adjusted above to cover this
  2125				 * range.
  2126				 */
  2127				flush_cache_range(vma, range.start, range.end);
  2128	
  2129				/*
  2130				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2131				 * held in write mode.  Caller needs to explicitly
  2132				 * do this outside rmap routines.
  2133				 *
  2134				 * We also must hold hugetlb vma_lock in write mode.
  2135				 * Lock order dictates acquiring vma_lock BEFORE
  2136				 * i_mmap_rwsem.  We can only try lock here and fail
  2137				 * if unsuccessful.
  2138				 */
  2139				if (!anon) {
  2140					struct mmu_gather tlb;
  2141	
  2142					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2143					if (!hugetlb_vma_trylock_write(vma))
  2144						goto walk_abort;
  2145	
  2146					tlb_gather_mmu_vma(&tlb, vma);
  2147					if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
  2148						hugetlb_vma_unlock_write(vma);
  2149						huge_pmd_unshare_flush(&tlb, vma);
  2150						tlb_finish_mmu(&tlb);
  2151						/*
  2152						 * The PMD table was unmapped,
  2153						 * consequently unmapping the folio.
  2154						 */
  2155						goto walk_done;
  2156					}
  2157					hugetlb_vma_unlock_write(vma);
  2158					tlb_finish_mmu(&tlb);
  2159				}
  2160				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2161				if (pte_dirty(pteval))
  2162					folio_mark_dirty(folio);
  2163			} else if (likely(pte_present(pteval))) {
  2164				nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
  2165				end_addr = address + nr_pages * PAGE_SIZE;
  2166				flush_cache_range(vma, address, end_addr);
  2167	
  2168				/* Nuke the page table entry. */
  2169				pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
  2170				/*
  2171				 * We clear the PTE but do not flush so potentially
  2172				 * a remote CPU could still be writing to the folio.
  2173				 * If the entry was previously clean then the
  2174				 * architecture must guarantee that a clear->dirty
  2175				 * transition on a cached TLB entry is written through
  2176				 * and traps if the PTE is unmapped.
  2177				 */
  2178				if (should_defer_flush(mm, flags))
  2179					set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
  2180				else
  2181					flush_tlb_range(vma, address, end_addr);
  2182				if (pte_dirty(pteval))
  2183					folio_mark_dirty(folio);
  2184			} else {
  2185				pte_clear(mm, address, pvmw.pte);
  2186			}
  2187	
  2188			/*
  2189			 * Now the pte is cleared. If this pte was uffd-wp armed,
  2190			 * we may want to replace a none pte with a marker pte if
  2191			 * it's file-backed, so we don't lose the tracking info.
  2192			 */
  2193			pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
  2194	
  2195			/* Update high watermark before we lower rss */
  2196			update_hiwater_rss(mm);
  2197	
  2198			if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
  2199				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2200				if (folio_test_hugetlb(folio)) {
  2201					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2202					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2203							hsz);
  2204				} else {
  2205					dec_mm_counter(mm, mm_counter(folio));
  2206					set_pte_at(mm, address, pvmw.pte, pteval);
  2207				}
  2208			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2209				   !userfaultfd_armed(vma)) {
  2210				/*
  2211				 * The guest indicated that the page content is of no
  2212				 * interest anymore. Simply discard the pte, vmscan
  2213				 * will take care of the rest.
  2214				 * A future reference will then fault in a new zero
  2215				 * page. When userfaultfd is active, we must not drop
  2216				 * this page though, as its main user (postcopy
  2217				 * migration) will not expect userfaults on already
  2218				 * copied pages.
  2219				 */
  2220				dec_mm_counter(mm, mm_counter(folio));
  2221			} else if (folio_test_anon(folio)) {
  2222				swp_entry_t entry = page_swap_entry(subpage);
  2223				pte_t swp_pte;
  2224				/*
  2225				 * Store the swap location in the pte.
  2226				 * See handle_pte_fault() ...
  2227				 */
  2228				if (unlikely(folio_test_swapbacked(folio) !=
  2229						folio_test_swapcache(folio))) {
  2230					WARN_ON_ONCE(1);
  2231					goto walk_abort;
  2232				}
  2233	
  2234				/* MADV_FREE page check */
  2235				if (!folio_test_swapbacked(folio)) {
  2236					int ref_count, map_count;
  2237	
  2238					/*
  2239					 * Synchronize with gup_pte_range():
  2240					 * - clear PTE; barrier; read refcount
  2241					 * - inc refcount; barrier; read PTE
  2242					 */
  2243					smp_mb();
  2244	
  2245					ref_count = folio_ref_count(folio);
  2246					map_count = folio_mapcount(folio);
  2247	
  2248					/*
  2249					 * Order reads for page refcount and dirty flag
  2250					 * (see comments in __remove_mapping()).
  2251					 */
  2252					smp_rmb();
  2253	
  2254					if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
  2255						/*
  2256						 * redirtied either using the page table or a previously
  2257						 * obtained GUP reference.
  2258						 */
  2259						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2260						folio_set_swapbacked(folio);
  2261						goto walk_abort;
  2262					} else if (ref_count != 1 + map_count) {
  2263						/*
  2264						 * Additional reference. Could be a GUP reference or any
  2265						 * speculative reference. GUP users must mark the folio
  2266						 * dirty if there was a modification. This folio cannot be
  2267						 * reclaimed right now either way, so act just like nothing
  2268						 * happened.
  2269						 * We'll come back here later and detect if the folio was
  2270						 * dirtied when the additional reference is gone.
  2271						 */
  2272						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2273						goto walk_abort;
  2274					}
  2275					add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
  2276					goto discard;
  2277				}
  2278	
  2279				if (folio_dup_swap(folio, subpage) < 0) {
  2280					set_pte_at(mm, address, pvmw.pte, pteval);
  2281					goto walk_abort;
  2282				}
  2283	
  2284				/*
  2285				 * arch_unmap_one() is expected to be a NOP on
  2286				 * architectures where we could have PFN swap PTEs,
  2287				 * so we'll not check/care.
  2288				 */
  2289				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2290					folio_put_swap(folio, subpage);
  2291					set_pte_at(mm, address, pvmw.pte, pteval);
  2292					goto walk_abort;
  2293				}
  2294	
  2295				/* See folio_try_share_anon_rmap(): clear PTE first. */
  2296				if (anon_exclusive &&
  2297				    folio_try_share_anon_rmap_pte(folio, subpage)) {
  2298					folio_put_swap(folio, subpage);
  2299					set_pte_at(mm, address, pvmw.pte, pteval);
  2300					goto walk_abort;
  2301				}
  2302				if (list_empty(&mm->mmlist)) {
  2303					spin_lock(&mmlist_lock);
  2304					if (list_empty(&mm->mmlist))
  2305						list_add(&mm->mmlist, &init_mm.mmlist);
  2306					spin_unlock(&mmlist_lock);
  2307				}
  2308				dec_mm_counter(mm, MM_ANONPAGES);
  2309				inc_mm_counter(mm, MM_SWAPENTS);
  2310				swp_pte = swp_entry_to_pte(entry);
  2311				if (anon_exclusive)
  2312					swp_pte = pte_swp_mkexclusive(swp_pte);
  2313				if (likely(pte_present(pteval))) {
  2314					if (pte_soft_dirty(pteval))
  2315						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2316					if (pte_uffd_wp(pteval))
  2317						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2318				} else {
  2319					if (pte_swp_soft_dirty(pteval))
  2320						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2321					if (pte_swp_uffd_wp(pteval))
  2322						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2323				}
  2324				set_pte_at(mm, address, pvmw.pte, swp_pte);
  2325			} else {
  2326				/*
  2327				 * This is a locked file-backed folio,
  2328				 * so it cannot be removed from the page
  2329				 * cache and replaced by a new folio before
  2330				 * mmu_notifier_invalidate_range_end, so no
  2331				 * concurrent thread might update its page table
  2332				 * to point at a new folio while a device is
  2333				 * still using this folio.
  2334				 *
  2335				 * See Documentation/mm/mmu_notifier.rst
  2336				 */
  2337				add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
  2338			}
  2339	discard:
  2340			if (unlikely(folio_test_hugetlb(folio))) {
  2341				hugetlb_remove_rmap(folio);
  2342			} else {
  2343				folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
  2344			}
  2345			if (vma->vm_flags & VM_LOCKED)
  2346				mlock_drain_local();
  2347			folio_put_refs(folio, nr_pages);
  2348	
  2349			/*
  2350			 * If we are sure that we batched the entire folio and cleared
  2351			 * all PTEs, we can just optimize and stop right here.
  2352			 */
  2353			if (nr_pages == folio_nr_pages(folio))
  2354				goto walk_done;
  2355			continue;
  2356	walk_abort:
  2357			ret = false;
  2358	walk_done:
  2359			page_vma_mapped_walk_done(&pvmw);
  2360			break;
  2361		}
  2362	
  2363		if (prealloc_pte)
  2364			pte_free(mm, prealloc_pte);
  2365	
  2366		mmu_notifier_invalidate_range_end(&range);
  2367	
  2368		return ret;
  2369	}
  2370	



* Re: [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked
  2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
@ 2026-02-26 16:32   ` kernel test robot
  2026-02-27 12:11   ` Usama Arif
  1 sibling, 0 replies; 36+ messages in thread
From: kernel test robot @ 2026-02-26 16:32 UTC (permalink / raw)
  To: Usama Arif; +Cc: oe-kbuild-all

Hi Usama,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-make-split_huge_pmd-functions-return-int-for-error-propagation/20260226-193910
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260226113233.3987674-14-usama.arif%40linux.dev
patch subject: [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked
config: arc-allmodconfig (https://download.01.org/0day-ci/archive/20260227/202602270049.Hcnrxeen-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260227/202602270049.Hcnrxeen-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602270049.Hcnrxeen-lkp@intel.com/

All warnings (new ones prefixed by >>):

   fs/proc/task_mmu.c: In function 'pagemap_scan_thp_entry':
>> fs/proc/task_mmu.c:2718:17: warning: ignoring return value of 'split_huge_pmd' declared with attribute 'warn_unused_result' [-Wunused-result]
    2718 |                 split_huge_pmd(vma, pmd, start);
         |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +2718 fs/proc/task_mmu.c

52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2682  
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2683  static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2684  				  unsigned long end, struct mm_walk *walk)
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2685  {
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2686  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2687  	struct pagemap_scan_private *p = walk->private;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2688  	struct vm_area_struct *vma = walk->vma;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2689  	unsigned long categories;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2690  	spinlock_t *ptl;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2691  	int ret = 0;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2692  
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2693  	ptl = pmd_trans_huge_lock(pmd, vma);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2694  	if (!ptl)
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2695  		return -ENOENT;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2696  
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2697  	categories = p->cur_vma_category |
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2698  		     pagemap_thp_category(p, vma, start, *pmd);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2699  
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2700  	if (!pagemap_scan_is_interesting_page(categories, p))
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2701  		goto out_unlock;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2702  
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2703  	ret = pagemap_scan_output(categories, p, start, &end);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2704  	if (start == end)
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2705  		goto out_unlock;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2706  
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2707  	if (~p->arg.flags & PM_SCAN_WP_MATCHING)
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2708  		goto out_unlock;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2709  	if (~categories & PAGE_IS_WRITTEN)
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2710  		goto out_unlock;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2711  
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2712  	/*
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2713  	 * Break huge page into small pages if the WP operation
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2714  	 * needs to be performed on a portion of the huge page.
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2715  	 */
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2716  	if (end != start + HPAGE_SIZE) {
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2717  		spin_unlock(ptl);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21 @2718  		split_huge_pmd(vma, pmd, start);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2719  		pagemap_scan_backout_range(p, start, end);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2720  		/* Report as if there was no THP */
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2721  		return -ENOENT;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2722  	}
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2723  
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2724  	make_uffd_wp_pmd(vma, start, pmd);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2725  	flush_tlb_range(vma, start, end);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2726  out_unlock:
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2727  	spin_unlock(ptl);
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2728  	return ret;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2729  #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2730  	return -ENOENT;
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2731  #endif
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2732  }
52526ca7fdb905 Muhammad Usama Anjum 2023-08-21  2733  



* Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (20 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
@ 2026-02-26 21:01 ` Nico Pache
  2026-02-27 11:13   ` Usama Arif
  21 siblings, 1 reply; 36+ messages in thread
From: Nico Pache @ 2026-02-26 21:01 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390

On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@linux.dev> wrote:
>
> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> page table sits unused in a deposit list for the lifetime of the THP
> mapping, only to be withdrawn when the PMD is split or zapped. Every
> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> servers where hundreds of gigabytes of memory are mapped as THPs, this
> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> could otherwise satisfy other allocations, including the very PTE page
> table allocations needed when splits eventually occur.
>
> This series removes the pre-deposit and allocates the PTE page table
> lazily — only when a PMD split actually happens. Since a large number
> of THPs are never split (they are zapped wholesale when processes exit or
> munmap the full range), the allocation is avoided entirely in the common
> case.
>
> The pre-deposit pattern exists because split_huge_pmd was designed as an
> operation that must never fail: if the kernel decides to split, it needs
> a PTE page table, so one is deposited in advance. But "must never fail"
> is an unnecessarily strong requirement. A PMD split is typically triggered
> by a partial operation on a sub-PMD range — partial munmap, partial
> mprotect, partial mremap and so on.
> Most of these operations already have well-defined error handling for
> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> fail and propagating the error through these existing paths is the natural
> thing to do. Furthermore, split failing requires an order-0 allocation for
> a page table to fail, which is extremely unlikely.
>
> Designing functions like split_huge_pmd as operations that cannot fail
> has a subtle but real cost to code quality. It forces a pre-allocation
> pattern - every THP creation path must deposit a page table, and every
> split or zap path must withdraw one, creating a hidden coupling between
> widely separated code paths.
>
> This also serves as a code cleanup. On every architecture except powerpc
> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> series removes the generic implementations in pgtable-generic.c and the
> s390/sparc overrides, replacing them with no-op stubs guarded by
> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> on all non-powerpc architectures.

Hi Usama,

Thanks for tackling this, it seems like an interesting problem. I'm
trying to get more into reviewing, so bear with me; I may have some
stupid comments or questions. Where I can really help out is with
testing. I will build this for all RH-supported architectures and run
some automated test suites and performance metrics. I'll report back
if I spot anything.

Cheers!
-- Nico

>
> The series is structured as follows:
>
> Patches 1-2:    Error infrastructure — make split functions return int
>                 and propagate errors from vma_adjust_trans_huge()
>                 through __split_vma, vma_shrink, and commit_merge.
>
> Patches 3-12:   Handle split failure at every call site — copy_huge_pmd,
>                 do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
>                 change_pmd_range (mprotect), follow_pmd_mask (GUP),
>                 walk_pmd_range (pagewalk), move_page_tables (mremap),
>                 move_pages (userfaultfd), and device migration.
>                 The code will become effective in Patch 14 when split
>                 functions start returning -ENOMEM.
>
> Patch 13:       Add __must_check to __split_huge_pmd(), split_huge_pmd()
>                 and split_huge_pmd_address() so the compiler warns on
>                 unchecked return values.
>
> Patch 14:       The actual change — allocate PTE page tables lazily at
>                 split time instead of pre-depositing at THP creation.
>                 This is when split functions will actually start returning
>                 -ENOMEM.
>
> Patch 15:       Remove the now-dead deposit/withdraw code on
>                 non-powerpc architectures.
>
> Patch 16:       Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
>                 split failures.
>
> Patches 17-21:  Selftests covering partial munmap, mprotect, mlock,
>                 mremap, and MADV_DONTNEED on THPs to exercise the
>                 split paths.
>
> The error handling patches are placed before the lazy allocation patch so
> that every call site is already prepared to handle split failures before
> the failure mode is introduced. This makes each patch independently safe
> to apply and bisect through.
>
> The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
> enabled. The test results are below:
>
> TAP version 13
> 1..5
> # Starting 5 tests from 1 test cases.
> #  RUN           thp_pmd_split.partial_munmap ...
> # thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
> # thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_munmap
> ok 1 thp_pmd_split.partial_munmap
> #  RUN           thp_pmd_split.partial_mprotect ...
> # thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
> # thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mprotect
> ok 2 thp_pmd_split.partial_mprotect
> #  RUN           thp_pmd_split.partial_mlock ...
> # thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
> # thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mlock
> ok 3 thp_pmd_split.partial_mlock
> #  RUN           thp_pmd_split.partial_mremap ...
> # thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
> # thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mremap
> ok 4 thp_pmd_split.partial_mremap
> #  RUN           thp_pmd_split.partial_madv_dontneed ...
> # thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
> # thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_madv_dontneed
> ok 5 thp_pmd_split.partial_madv_dontneed
> # PASSED: 5 / 5 tests passed.
> # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> The patches are based off of 957a3fab8811b455420128ea5f41c51fd23eb6c7 from
> mm-unstable as of 25 Feb (7.0.0-rc1).
>
>
> RFC v1 -> v2: https://lore.kernel.org/all/20260211125507.4175026-1-usama.arif@linux.dev/
> - Change counter name to THP_SPLIT_PMD_FAILED (David)
> - remove pgtable_trans_huge_{deposit/withdraw} when not needed and
>   make them arch specific (David)
> - make split functions return error code and have callers handle them
>   (David and Kiryl)
> - Add test cases for splitting
>
> Usama Arif (21):
>   mm: thp: make split_huge_pmd functions return int for error
>     propagation
>   mm: thp: propagate split failure from vma_adjust_trans_huge()
>   mm: thp: handle split failure in copy_huge_pmd()
>   mm: thp: handle split failure in do_huge_pmd_wp_page()
>   mm: thp: handle split failure in zap_pmd_range()
>   mm: thp: handle split failure in wp_huge_pmd()
>   mm: thp: retry on split failure in change_pmd_range()
>   mm: thp: handle split failure in follow_pmd_mask()
>   mm: handle walk_page_range() failure from THP split
>   mm: thp: handle split failure in mremap move_page_tables()
>   mm: thp: handle split failure in userfaultfd move_pages()
>   mm: thp: handle split failure in device migration
>   mm: huge_mm: Make sure all split_huge_pmd calls are checked
>   mm: thp: allocate PTE page tables lazily at split time
>   mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
>   mm: thp: add THP_SPLIT_PMD_FAILED counter
>   selftests/mm: add THP PMD split test infrastructure
>   selftests/mm: add partial_mprotect test for change_pmd_range
>   selftests/mm: add partial_mlock test
>   selftests/mm: add partial_mremap test for move_page_tables
>   selftests/mm: add madv_dontneed_partial test
>
>  arch/powerpc/include/asm/book3s/64/pgtable.h  |  12 +-
>  arch/s390/include/asm/pgtable.h               |   6 -
>  arch/s390/mm/pgtable.c                        |  41 ---
>  arch/sparc/include/asm/pgtable_64.h           |   6 -
>  arch/sparc/mm/tlb.c                           |  36 ---
>  include/linux/huge_mm.h                       |  51 +--
>  include/linux/pgtable.h                       |  16 +-
>  include/linux/vm_event_item.h                 |   1 +
>  mm/debug_vm_pgtable.c                         |   4 +-
>  mm/gup.c                                      |  10 +-
>  mm/huge_memory.c                              | 208 +++++++++----
>  mm/khugepaged.c                               |   7 +-
>  mm/memory.c                                   |  26 +-
>  mm/migrate_device.c                           |  33 +-
>  mm/mprotect.c                                 |  11 +-
>  mm/mremap.c                                   |   8 +-
>  mm/pagewalk.c                                 |   8 +-
>  mm/pgtable-generic.c                          |  32 --
>  mm/rmap.c                                     |  42 ++-
>  mm/userfaultfd.c                              |   8 +-
>  mm/vma.c                                      |  37 ++-
>  mm/vmstat.c                                   |   1 +
>  tools/testing/selftests/mm/Makefile           |   1 +
>  .../testing/selftests/mm/thp_pmd_split_test.c | 290 ++++++++++++++++++
>  tools/testing/vma/include/stubs.h             |   9 +-
>  25 files changed, 645 insertions(+), 259 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c
>
> --
> 2.47.3
>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
  2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache
@ 2026-02-27 11:13   ` Usama Arif
  2026-02-28  0:06     ` Nico Pache
  0 siblings, 1 reply; 36+ messages in thread
From: Usama Arif @ 2026-02-27 11:13 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390



On 26/02/2026 21:01, Nico Pache wrote:
> On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@linux.dev> wrote:
>>
>> When the kernel creates a PMD-level THP mapping for anonymous pages, it
>> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
>> page table sits unused in a deposit list for the lifetime of the THP
>> mapping, only to be withdrawn when the PMD is split or zapped. Every
>> anonymous THP therefore wastes 4KB of memory unconditionally. On large
>> servers where hundreds of gigabytes of memory are mapped as THPs, this
>> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
>> could otherwise satisfy other allocations, including the very PTE page
>> table allocations needed when splits eventually occur.
>>
>> This series removes the pre-deposit and allocates the PTE page table
>> lazily — only when a PMD split actually happens. Since a large number
>> of THPs are never split (they are zapped wholesale when processes exit or
>> munmap the full range), the allocation is avoided entirely in the common
>> case.
>>
>> The pre-deposit pattern exists because split_huge_pmd was designed as an
>> operation that must never fail: if the kernel decides to split, it needs
>> a PTE page table, so one is deposited in advance. But "must never fail"
>> is an unnecessarily strong requirement. A PMD split is typically triggered
>> by a partial operation on a sub-PMD range — partial munmap, partial
>> mprotect, partial mremap and so on.
>> Most of these operations already have well-defined error handling for
>> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
>> fail and propagating the error through these existing paths is the natural
>> thing to do. Furthermore, split failing requires an order-0 allocation for
>> a page table to fail, which is extremely unlikely.
>>
>> Designing functions like split_huge_pmd as operations that cannot fail
>> has a subtle but real cost to code quality. It forces a pre-allocation
>> pattern - every THP creation path must deposit a page table, and every
>> split or zap path must withdraw one, creating a hidden coupling between
>> widely separated code paths.
>>
>> This also serves as a code cleanup. On every architecture except powerpc
>> with hash MMU, the deposit/withdraw machinery becomes dead code. The
>> series removes the generic implementations in pgtable-generic.c and the
>> s390/sparc overrides, replacing them with no-op stubs guarded by
>> arch_needs_pgtable_deposit(), which evaluates to false at compile time
>> on all non-powerpc architectures.
> 
> Hi Usama,
> 
> Thanks for tackling this, it seems like an interesting problem. I'm
> trying to get more into reviewing, so bear with me; I may have some
> stupid comments or questions. Where I can really help out is with
> testing. I will build this for all RH-supported architectures and run
> some automated test suites and performance metrics. I'll report back
> if I spot anything.
> 
> Cheers!
> -- Nico
> 

Thanks for the build and for looking into reviewing this. All comments
and questions are welcome! I had only tested on x86, and I had a look
at the link you shared, so it's great to know that powerpc and s390 are fine.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked
  2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
  2026-02-26 16:32   ` kernel test robot
@ 2026-02-27 12:11   ` Usama Arif
  1 sibling, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-02-27 12:11 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390



On 26/02/2026 11:23, Usama Arif wrote:
> Mark __split_huge_pmd(), split_huge_pmd() and split_huge_pmd_address()
> with __must_check so the compiler warns if any caller ignores the return
> value. Not checking return value and operating on the basis that the pmd
> is split could result in a kernel bug. The possibility of an order-0
> allocation failing for page table allocation is very low, but it should
> be handled correctly.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>


The kernel test robot reported that I missed one split_huge_pmd() call.
I will include the patch below in the next revision.


commit 9e1bb250ea8ef0a39c738cd4137ed6c98131ebb0 (HEAD)
Author: Usama Arif <usama.arif@linux.dev>
Date:   Thu Feb 26 10:45:35 2026 -0800

    mm: proc: handle split_huge_pmd failure in pagemap_scan
    
    pagemap_scan_thp_entry() splits a huge PMD when the PAGEMAP_SCAN ioctl
    needs to write-protect only a portion of a THP. It then returns -ENOENT
    so pagemap_scan_pmd_entry() falls through to PTE-level handling.
    
    Check the split_huge_pmd() return value and propagate the error on
    failure. Returning -ENOMEM instead of -ENOENT prevents the fallthrough
    to PTE handling, and the error propagates through walk_page_range() to
    do_pagemap_scan() where it becomes the ioctl return value.
    pagemap_scan_backout_range() already undoes the buffered output, and
    walk_end is written back to userspace so the caller knows where the
    scan stopped.
    
    If the split fails, the PMD remains huge. An alternative to the approach
    in the patch is to return -ENOENT, causing the caller to proceed to
    pte_offset_map_lock(). ___pte_offset_map() detects the trans_huge PMD
    and returns NULL, which sets ACTION_AGAIN — restarting the walker on the
    same PMD by which time the system might have enough memory to satisfy
    the split from succeeding.
    
    Signed-off-by: Usama Arif <usama.arif@linux.dev>

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e091931d7ca19..f5f459140b5c0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2714,9 +2714,13 @@ static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
         * needs to be performed on a portion of the huge page.
         */
        if (end != start + HPAGE_SIZE) {
+               int err;
+
                spin_unlock(ptl);
-               split_huge_pmd(vma, pmd, start);
+               err = split_huge_pmd(vma, pmd, start);
                pagemap_scan_backout_range(p, start, end);
+               if (err)
+                       return err;
                /* Report as if there was no THP */
                return -ENOENT;
        }


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
  2026-02-27 11:13   ` Usama Arif
@ 2026-02-28  0:06     ` Nico Pache
  2026-03-02 11:08       ` Usama Arif
  0 siblings, 1 reply; 36+ messages in thread
From: Nico Pache @ 2026-02-28  0:06 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390

On Fri, Feb 27, 2026 at 4:14 AM Usama Arif <usama.arif@linux.dev> wrote:
>
>
>
> On 26/02/2026 21:01, Nico Pache wrote:
> > On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@linux.dev> wrote:
> >>
> >> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> >> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> >> page table sits unused in a deposit list for the lifetime of the THP
> >> mapping, only to be withdrawn when the PMD is split or zapped. Every
> >> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> >> servers where hundreds of gigabytes of memory are mapped as THPs, this
> >> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> >> could otherwise satisfy other allocations, including the very PTE page
> >> table allocations needed when splits eventually occur.
> >>
> >> This series removes the pre-deposit and allocates the PTE page table
> >> lazily — only when a PMD split actually happens. Since a large number
> >> of THPs are never split (they are zapped wholesale when processes exit or
> >> munmap the full range), the allocation is avoided entirely in the common
> >> case.
> >>
> >> The pre-deposit pattern exists because split_huge_pmd was designed as an
> >> operation that must never fail: if the kernel decides to split, it needs
> >> a PTE page table, so one is deposited in advance. But "must never fail"
> >> is an unnecessarily strong requirement. A PMD split is typically triggered
> >> by a partial operation on a sub-PMD range — partial munmap, partial
> >> mprotect, partial mremap and so on.
> >> Most of these operations already have well-defined error handling for
> >> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> >> fail and propagating the error through these existing paths is the natural
> >> thing to do. Furthermore, split failing requires an order-0 allocation for
> >> a page table to fail, which is extremely unlikely.
> >>
> >> Designing functions like split_huge_pmd as operations that cannot fail
> >> has a subtle but real cost to code quality. It forces a pre-allocation
> >> pattern - every THP creation path must deposit a page table, and every
> >> split or zap path must withdraw one, creating a hidden coupling between
> >> widely separated code paths.
> >>
> >> This also serves as a code cleanup. On every architecture except powerpc
> >> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> >> series removes the generic implementations in pgtable-generic.c and the
> >> s390/sparc overrides, replacing them with no-op stubs guarded by
> >> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> >> on all non-powerpc architectures.
> >
> > Hi Usama,
> >
> > Thanks for tackling this, it seems like an interesting problem. I'm
> > trying to get more into reviewing, so bear with me; I may have some
> > stupid comments or questions. Where I can really help out is with
> > testing. I will build this for all RH-supported architectures and run
> > some automated test suites and performance metrics. I'll report back
> > if I spot anything.
> >
> > Cheers!
> > -- Nico
> >
>
> Thanks for the build and for looking into reviewing this. All comments
> and questions are welcome! I had only tested on x86, and I had a look
> at the link you shared, so it's great to know that powerpc and s390 are fine.

Good news: as you noted, all the builds succeeded, and the sanity tests
don't show any signs of an immediate issue across the architectures.
I'll proceed to debug kernels, and then performance testing. I will
try to start reviewing the actual code changes in depth next week :)

Cheers,
-- Nico

>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
  2026-02-28  0:06     ` Nico Pache
@ 2026-03-02 11:08       ` Usama Arif
  0 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-03-02 11:08 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390


>> Thanks for the build and for looking into reviewing this. All comments
>> and questions are welcome! I had only tested on x86, and I had a look
>> at the link you shared, so it's great to know that powerpc and s390 are fine.
> 
> Good news: as you noted, all the builds succeeded, and the sanity tests
> don't show any signs of an immediate issue across the architectures.
> I'll proceed to debug kernels, and then performance testing. I will
> try to start reviewing the actual code changes in depth next week :)

Thank you!!


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
  2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
@ 2026-03-02 21:20   ` Nico Pache
  2026-03-04 11:55     ` Usama Arif
  2026-03-05 16:55     ` Usama Arif
  0 siblings, 2 replies; 36+ messages in thread
From: Nico Pache @ 2026-03-02 21:20 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390

On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
>
> Device memory migration has two call sites that split huge PMDs:
>
> migrate_vma_split_unmapped_folio():
>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
>   destination that doesn't support compound pages.  It splits the PMD
>   then splits the folio via folio_split_unmapped().
>
>   If the PMD split fails, folio_split_unmapped() would operate on an
>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
>   to skip this page's migration. This is safe as folio_split_unmapped
>   failure would be propagated in a similar way.
>
> migrate_vma_insert_page():
>   Called from migrate_vma_pages() when inserting a page into a VMA
>   during migration back from device memory.  If a huge zero PMD exists
>   at the target address, it must be split before PTE insertion.
>
>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
>   operate on a PMD slot still occupied by the huge zero entry.  Use
>   goto abort, consistent with other allocation failures in this function.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  mm/migrate_device.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 78c7acf024615..bc53e06fd9735 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>         int ret = 0;
>
>         folio_get(folio);

Should we be concerned about this folio_get? Are we incrementing a
reference that was already held if we back out of the split?

-- Nico

> -       split_huge_pmd_address(migrate->vma, addr, true);
> +       /*
> +        * If PMD split fails, folio_split_unmapped would operate on an
> +        * unsplit folio with inconsistent page table state.
> +        */
> +       ret = split_huge_pmd_address(migrate->vma, addr, true);
> +       if (ret)
> +               return ret;
>         ret = folio_split_unmapped(folio, 0);
>         if (ret)
>                 return ret;
> @@ -1005,7 +1011,13 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>                 if (pmd_trans_huge(*pmdp)) {
>                         if (!is_huge_zero_pmd(*pmdp))
>                                 goto abort;
> -                       split_huge_pmd(vma, pmdp, addr);
> +                       /*
> +                        * If split fails, the huge zero PMD remains and
> +                        * pte_alloc/PTE insertion that follows would be
> +                        * incorrect.
> +                        */
> +                       if (split_huge_pmd(vma, pmdp, addr))
> +                               goto abort;
>                 } else if (pmd_leaf(*pmdp))
>                         goto abort;
>         }
> --
> 2.47.3
>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
  2026-03-02 21:20   ` Nico Pache
@ 2026-03-04 11:55     ` Usama Arif
  2026-03-05 16:55     ` Usama Arif
  1 sibling, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-03-04 11:55 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390



On 02/03/2026 21:20, Nico Pache wrote:
> On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
>>
>> Device memory migration has two call sites that split huge PMDs:
>>
>> migrate_vma_split_unmapped_folio():
>>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
>>   destination that doesn't support compound pages.  It splits the PMD
>>   then splits the folio via folio_split_unmapped().
>>
>>   If the PMD split fails, folio_split_unmapped() would operate on an
>>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
>>   to skip this page's migration. This is safe as folio_split_unmapped
>>   failure would be propagated in a similar way.
>>
>> migrate_vma_insert_page():
>>   Called from migrate_vma_pages() when inserting a page into a VMA
>>   during migration back from device memory.  If a huge zero PMD exists
>>   at the target address, it must be split before PTE insertion.
>>
>>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
>>   operate on a PMD slot still occupied by the huge zero entry.  Use
>>   goto abort, consistent with other allocation failures in this function.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>  mm/migrate_device.c | 16 ++++++++++++++--
>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 78c7acf024615..bc53e06fd9735 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>         int ret = 0;
>>
>>         folio_get(folio);
> 
> Should we be concerned about this folio_get? Are we incrementing a
> reference that was already held if we back out of the split?
> 

Good catch! I think this bug existed even before this patch: if
folio_split_unmapped fails, the reference is still held. Let me
send an independent fix for this.

> -- Nico
> 
>> -       split_huge_pmd_address(migrate->vma, addr, true);
>> +       /*
>> +        * If PMD split fails, folio_split_unmapped would operate on an
>> +        * unsplit folio with inconsistent page table state.
>> +        */
>> +       ret = split_huge_pmd_address(migrate->vma, addr, true);
>> +       if (ret)
>> +               return ret;
>>         ret = folio_split_unmapped(folio, 0);
>>         if (ret)
>>                 return ret;


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
  2026-03-02 21:20   ` Nico Pache
  2026-03-04 11:55     ` Usama Arif
@ 2026-03-05 16:55     ` Usama Arif
  2026-03-09 15:09       ` Nico Pache
  1 sibling, 1 reply; 36+ messages in thread
From: Usama Arif @ 2026-03-05 16:55 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390



On 02/03/2026 21:20, Nico Pache wrote:
> On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
>>
>> Device memory migration has two call sites that split huge PMDs:
>>
>> migrate_vma_split_unmapped_folio():
>>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
>>   destination that doesn't support compound pages.  It splits the PMD
>>   then splits the folio via folio_split_unmapped().
>>
>>   If the PMD split fails, folio_split_unmapped() would operate on an
>>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
>>   to skip this page's migration. This is safe as folio_split_unmapped
>>   failure would be propagated in a similar way.
>>
>> migrate_vma_insert_page():
>>   Called from migrate_vma_pages() when inserting a page into a VMA
>>   during migration back from device memory.  If a huge zero PMD exists
>>   at the target address, it must be split before PTE insertion.
>>
>>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
>>   operate on a PMD slot still occupied by the huge zero entry.  Use
>>   goto abort, consistent with other allocation failures in this function.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>  mm/migrate_device.c | 16 ++++++++++++++--
>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 78c7acf024615..bc53e06fd9735 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>         int ret = 0;
>>
>>         folio_get(folio);
> 
> Should we be concerned about this folio_get? Are we incrementing a
> reference that was already held if we back out of the split?
> 
> -- Nico



Hi Nico,

Thanks for pointing this out. It spun out into an entire investigation for me [1].

Similar to [1], I inserted trace prints [2] and created a new __split_huge_pmd2
that always returns -ENOMEM. Without folio_put on error [3], we get a refcount of 2.

       hmm-tests-129     [000] .l...     1.485514: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 dst=ffb48827440e8000 src==dst=1 refcount_src=2 mapcount_src=0 order_src=9 migrate=0 BEFORE remove_migration_ptes
       hmm-tests-129     [000] .l...     1.485517: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 refcount=3 mapcount=1 AFTER remove_migration_ptes
       hmm-tests-129     [000] .l...     1.485518: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 refcount=2 AFTER folio_put(src)


With folio_put on error [4], we get a refcount of 1.

       hmm-tests-129     [001] .....     1.492216: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 dst=fff7b8be840f0000 src==dst=1 refcount_src=1 mapcount_src=0 order_src=9 migrate=0 BEFORE remove_migration_ptes
       hmm-tests-129     [001] .....     1.492219: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 refcount=2 mapcount=1 AFTER remove_migration_ptes
       hmm-tests-129     [001] .....     1.492220: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 refcount=1 AFTER folio_put(src)


So we need a folio_put() for split_huge_pmd_address() failure, but NOT
for folio_split_unmapped() failure.


[1] https://lore.kernel.org/all/332c9e16-46c3-4e1c-898e-2cb0a87ba1fc@linux.dev/
[2] https://gist.github.com/uarif1/6abe4bedb85814e9be8d48a4fe742b41
[3] https://gist.github.com/uarif1/f718af2113bc1a33484674b61b9dafcc
[4] https://gist.github.com/uarif1/03c42f2549eaf2bc555e8b03e07a63c8


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
  2026-03-05 16:55     ` Usama Arif
@ 2026-03-09 15:09       ` Nico Pache
  2026-03-09 21:34         ` Usama Arif
  0 siblings, 1 reply; 36+ messages in thread
From: Nico Pache @ 2026-03-09 15:09 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390

On Thu, Mar 5, 2026 at 9:55 AM Usama Arif <usama.arif@linux.dev> wrote:
>
>
>
> On 02/03/2026 21:20, Nico Pache wrote:
> > On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
> >>
> >> Device memory migration has two call sites that split huge PMDs:
> >>
> >> migrate_vma_split_unmapped_folio():
> >>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
> >>   destination that doesn't support compound pages.  It splits the PMD
> >>   then splits the folio via folio_split_unmapped().
> >>
> >>   If the PMD split fails, folio_split_unmapped() would operate on an
> >>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
> >>   to skip this page's migration. This is safe as folio_split_unmapped
> >>   failure would be propagated in a similar way.
> >>
> >> migrate_vma_insert_page():
> >>   Called from migrate_vma_pages() when inserting a page into a VMA
> >>   during migration back from device memory.  If a huge zero PMD exists
> >>   at the target address, it must be split before PTE insertion.
> >>
> >>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
> >>   operate on a PMD slot still occupied by the huge zero entry.  Use
> >>   goto abort, consistent with other allocation failures in this function.
> >>
> >> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> >> ---
> >>  mm/migrate_device.c | 16 ++++++++++++++--
> >>  1 file changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index 78c7acf024615..bc53e06fd9735 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
> >>         int ret = 0;
> >>
> >>         folio_get(folio);
> >
> > Should we be concerned about this folio_get? Are we incrementing a
> > reference that was already held if we back out of the split?
> >
> > -- Nico
>
>
>
> Hi Nico,
>
> Thanks for pointing this out. It spun out into an entire investigation for me [1].

Hey Usama,

I'm sorry my question led you down a rabbit hole, but I'm glad you did
the proper investigation and found the correct answer :) Thanks for
looking into it and for clearing that up via a comment!

Cheers,
-- Nico

>
> Similar to [1], I inserted trace prints [2] and created a new __split_huge_pmd2
> that always returns -ENOMEM. Without folio_put on error [3], we get a refcount of 2.
>
>        hmm-tests-129     [000] .l...     1.485514: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 dst=ffb48827440e8000 src==dst=1 refcount_src=2 mapcount_src=0 order_src=9 migrate=0 BEFORE remove_migration_ptes
>        hmm-tests-129     [000] .l...     1.485517: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 refcount=3 mapcount=1 AFTER remove_migration_ptes
>        hmm-tests-129     [000] .l...     1.485518: __migrate_device_finalize: FINALIZE[0]: src=ffb48827440e8000 refcount=2 AFTER folio_put(src)
>
>
> With folio_put on error [4], we get a refcount of 1.
>
>        hmm-tests-129     [001] .....     1.492216: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 dst=fff7b8be840f0000 src==dst=1 refcount_src=1 mapcount_src=0 order_src=9 migrate=0 BEFORE remove_migration_ptes
>        hmm-tests-129     [001] .....     1.492219: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 refcount=2 mapcount=1 AFTER remove_migration_ptes
>        hmm-tests-129     [001] .....     1.492220: __migrate_device_finalize: FINALIZE[0]: src=fff7b8be840f0000 refcount=1 AFTER folio_put(src)
>
>
> So we need folio_put for split_huge_pmd_address failure, but NOT for
> folio_split_unmapped.
>
>
> [1] https://lore.kernel.org/all/332c9e16-46c3-4e1c-898e-2cb0a87ba1fc@linux.dev/
> [2] https://gist.github.com/uarif1/6abe4bedb85814e9be8d48a4fe742b41
> [3] https://gist.github.com/uarif1/f718af2113bc1a33484674b61b9dafcc
> [4] https://gist.github.com/uarif1/03c42f2549eaf2bc555e8b03e07a63c8
>




* Re: [RFC v2 12/21] mm: thp: handle split failure in device migration
  2026-03-09 15:09       ` Nico Pache
@ 2026-03-09 21:34         ` Usama Arif
  0 siblings, 0 replies; 36+ messages in thread
From: Usama Arif @ 2026-03-09 21:34 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390



On 09/03/2026 18:09, Nico Pache wrote:
> On Thu, Mar 5, 2026 at 9:55 AM Usama Arif <usama.arif@linux.dev> wrote:
>>
>>
>>
>> On 02/03/2026 21:20, Nico Pache wrote:
>>> On Thu, Feb 26, 2026 at 4:34 AM Usama Arif <usama.arif@linux.dev> wrote:
>>>>
>>>> Device memory migration has two call sites that split huge PMDs:
>>>>
>>>> migrate_vma_split_unmapped_folio():
>>>>   Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
>>>>   destination that doesn't support compound pages.  It splits the PMD
>>>>   then splits the folio via folio_split_unmapped().
>>>>
>>>>   If the PMD split fails, folio_split_unmapped() would operate on an
>>>>   unsplit folio with inconsistent page table state.  Propagate -ENOMEM
>>>>   to skip this page's migration. This is safe as folio_split_unmapped
>>>>   failure would be propagated in a similar way.
>>>>
>>>> migrate_vma_insert_page():
>>>>   Called from migrate_vma_pages() when inserting a page into a VMA
>>>>   during migration back from device memory.  If a huge zero PMD exists
>>>>   at the target address, it must be split before PTE insertion.
>>>>
>>>>   If the split fails, the subsequent pte_alloc() and set_pte_at() would
>>>>   operate on a PMD slot still occupied by the huge zero entry.  Use
>>>>   goto abort, consistent with other allocation failures in this function.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>>  mm/migrate_device.c | 16 ++++++++++++++--
>>>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 78c7acf024615..bc53e06fd9735 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>>>         int ret = 0;
>>>>
>>>>         folio_get(folio);
>>>
>>> Should we be concerned about this folio_get? Are we incrementing a
>>> reference that was already held if we back out of the split?
>>>
>>> -- Nico
>>
>>
>>
>> Hi Nico,
>>
>> Thanks for pointing this out. It spun out into an entire investigation for me [1].
> 
> Hey Usama,
> 
> I'm sorry my question led you down a rabbit hole, but I'm glad you did
> the proper investigation and found the correct answer :) Thanks for
> looking into it and for clearing that up via a comment!
> 
> Cheers,
> -- Nico
>

Thanks for the review! There is a need for folio_put in this patch,
specifically for the split_huge_pmd_address failure path, which I wouldn't
have figured out without your comment, so I really appreciate it!

Usama



end of thread, other threads:[~2026-03-09 21:35 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
2026-03-02 21:20   ` Nico Pache
2026-03-04 11:55     ` Usama Arif
2026-03-05 16:55     ` Usama Arif
2026-03-09 15:09       ` Nico Pache
2026-03-09 21:34         ` Usama Arif
2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
2026-02-26 16:32   ` kernel test robot
2026-02-27 12:11   ` Usama Arif
2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
2026-02-26 13:56   ` kernel test robot
2026-02-26 14:22   ` Usama Arif
2026-02-26 15:10   ` kernel test robot
2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache
2026-02-27 11:13   ` Usama Arif
2026-02-28  0:06     ` Nico Pache
2026-03-02 11:08       ` Usama Arif
