public inbox for linux-mm@kvack.org
* [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time
@ 2026-03-27  2:08 Usama Arif
  2026-03-27  2:08 ` [v3 01/24] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
                   ` (24 more replies)
  0 siblings, 25 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages, it
pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
page table sits unused in a deposit list for the lifetime of the THP
mapping, only to be withdrawn when the PMD is split or zapped. Every
anonymous THP therefore wastes 4KB of memory unconditionally. On large
servers where hundreds of gigabytes of memory are mapped as THPs, this
adds up: roughly 200MB wasted per 100GB of THP memory. This memory
could otherwise satisfy other allocations, including the very PTE page
table allocations needed when splits eventually occur.

This series removes the pre-deposit and allocates the PTE page table
lazily — only when a PMD split actually happens. Since a large number
of THPs are never split (they are zapped wholesale when processes exit or
munmap the full range), the allocation is avoided entirely in the common
case.

The pre-deposit pattern exists because split_huge_pmd was designed as an
operation that must never fail: if the kernel decides to split, it needs
a PTE page table, so one is deposited in advance. But "must never fail"
is an unnecessarily strong requirement. A PMD split is typically triggered
by a partial operation on a sub-PMD range — partial munmap, partial
mprotect, COW on a pinned folio, GUP with FOLL_SPLIT_PMD, and similar.
All of these operations already have well-defined error handling for
allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
fail and propagating the error through these existing paths is the natural
thing to do. Furthermore, if the system cannot satisfy a single order-0
allocation for a page table, it is under extreme memory pressure and
failing the operation is the correct response.

Designing functions like split_huge_pmd as operations that cannot fail
has a subtle but real cost to code quality: it forces a pre-allocation
pattern in which every THP creation path must deposit a page table and
every split or zap path must withdraw one, creating a hidden coupling
between widely separated code paths.

This also serves as a code cleanup. On every architecture except powerpc
with hash MMU, the deposit/withdraw machinery becomes dead code. The
series removes the generic implementations in pgtable-generic.c and the
s390/sparc overrides, replacing them with no-op stubs guarded by
arch_needs_pgtable_deposit(), which evaluates to false at compile time
on all non-powerpc architectures.

The series is structured as follows:

Patches 1-2:    Infrastructure — make split functions return int and
                propagate errors from vma_adjust_trans_huge() through
                __split_vma, vma_shrink, and commit_merge.

Patches 3-15:   Handle split failure at every call site — copy_huge_pmd,
                do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
                change_pmd_range (mprotect), follow_pmd_mask (GUP),
                walk_pmd_range (pagewalk), move_page_tables (mremap),
                move_pages (userfaultfd), device migration,
                pagemap_scan_thp_entry (proc), powerpc subpage_prot,
                and dax_iomap_pmd_fault (DAX).  These changes take
                effect in Patch 17, when the split functions start
                returning -ENOMEM.

Patch 16:       Add __must_check to __split_huge_pmd(), split_huge_pmd()
                and split_huge_pmd_address() so the compiler warns on
                unchecked return values.

Patch 17:       The actual change — allocate PTE page tables lazily at
                split time instead of pre-depositing at THP creation.
                This is when split functions will actually start returning
                -ENOMEM.

Patch 18:       Remove the now-dead deposit/withdraw code on
                non-powerpc architectures.

Patch 19:       Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
                split failures.

Patches 20-24:  Selftests covering partial munmap, mprotect, mlock,
                mremap, and MADV_DONTNEED on THPs to exercise the
                split paths.

The error handling patches are placed before the lazy allocation patch so
that every call site is already prepared to handle split failures before
the failure mode is introduced. This makes each patch independently safe
to apply and bisect through.

The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
enabled. The test results are below:

TAP version 13
1..5
# Starting 5 tests from 1 test cases.
#  RUN           thp_pmd_split.partial_munmap ...
# thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
# thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_munmap
ok 1 thp_pmd_split.partial_munmap
#  RUN           thp_pmd_split.partial_mprotect ...
# thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
# thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mprotect
ok 2 thp_pmd_split.partial_mprotect
#  RUN           thp_pmd_split.partial_mlock ...
# thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
# thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mlock
ok 3 thp_pmd_split.partial_mlock
#  RUN           thp_pmd_split.partial_mremap ...
# thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
# thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mremap
ok 4 thp_pmd_split.partial_mremap
#  RUN           thp_pmd_split.partial_madv_dontneed ...
# thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
# thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_madv_dontneed
ok 5 thp_pmd_split.partial_madv_dontneed
# PASSED: 5 / 5 tests passed.
# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0

The series is based on mm-unstable as of 25 Mar,
git hash: d6f51e38433489eb22cb65d1bf72ac7993c5bdec

RFC v2 -> v3: https://lore.kernel.org/all/de0dc7ec-7a8d-4b1a-a419-1d97d2e4d510@linux.dev/
- Rebased on top of mm-unstable as of 25 Mar.
- Handle split_huge_pmd failure in pagemap_scan.
- Handle split_huge_pmd failure in subpage_prot.
- Handle split_huge_pmd failure in dax_iomap_pmd_fault (cannot
  actually fail for file-backed DAX, but needed for __must_check
  compliance).
- Added a folio_put(folio) on the split failure path in
  migrate_vma_split_unmapped_folio() (Nico Pache).
- Added #if defined(CONFIG_TRANSPARENT_HUGEPAGE) guards
  around the THP_SPLIT_PMD_FAILED vmstat counter.

Usama Arif (24):
  mm: thp: make split_huge_pmd functions return int for error
    propagation
  mm: thp: propagate split failure from vma_adjust_trans_huge()
  mm: thp: handle split failure in copy_huge_pmd()
  mm: thp: handle split failure in do_huge_pmd_wp_page()
  mm: thp: handle split failure in zap_pmd_range()
  mm: thp: handle split failure in wp_huge_pmd()
  mm: thp: retry on split failure in change_pmd_range()
  mm: thp: handle split failure in follow_pmd_mask()
  mm: handle walk_page_range() failure from THP split
  mm: thp: handle split failure in mremap move_page_tables()
  mm: thp: handle split failure in userfaultfd move_pages()
  mm: thp: handle split failure in device migration
  mm: proc: handle split_huge_pmd failure in pagemap_scan
  powerpc/mm: handle split_huge_pmd failure in subpage_prot
  fs/dax: handle split_huge_pmd failure in dax_iomap_pmd_fault
  mm: huge_mm: Make sure all split_huge_pmd calls are checked
  mm: thp: allocate PTE page tables lazily at split time
  mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
  mm: thp: add THP_SPLIT_PMD_FAILED counter
  selftests/mm: add THP PMD split test infrastructure
  selftests/mm: add partial_mprotect test for change_pmd_range
  selftests/mm: add partial_mlock test
  selftests/mm: add partial_mremap test for move_page_tables
  selftests/mm: add madv_dontneed_partial test

 arch/powerpc/include/asm/book3s/64/pgtable.h  |  12 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |  24 +-
 arch/s390/include/asm/pgtable.h               |   6 -
 arch/s390/mm/pgtable.c                        |  41 ---
 arch/sparc/include/asm/pgtable_64.h           |   6 -
 arch/sparc/mm/tlb.c                           |  36 ---
 fs/dax.c                                      |   9 +-
 fs/proc/task_mmu.c                            |   6 +-
 include/linux/huge_mm.h                       |  51 +--
 include/linux/pgtable.h                       |  16 +-
 include/linux/vm_event_item.h                 |   1 +
 mm/debug_vm_pgtable.c                         |   4 +-
 mm/gup.c                                      |  10 +-
 mm/huge_memory.c                              | 222 +++++++++-----
 mm/khugepaged.c                               |   7 +-
 mm/memory.c                                   |  26 +-
 mm/migrate_device.c                           |  39 ++-
 mm/mprotect.c                                 |  11 +-
 mm/mremap.c                                   |   8 +-
 mm/pagewalk.c                                 |   8 +-
 mm/pgtable-generic.c                          |  32 --
 mm/rmap.c                                     |  46 ++-
 mm/userfaultfd.c                              |   8 +-
 mm/vma.c                                      |  37 ++-
 mm/vmstat.c                                   |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../testing/selftests/mm/thp_pmd_split_test.c | 290 ++++++++++++++++++
 tools/testing/vma/include/stubs.h             |   9 +-
 28 files changed, 685 insertions(+), 282 deletions(-)
 create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c

-- 
2.52.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [v3 01/24] mm: thp: make split_huge_pmd functions return int for error propagation
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 02/24] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Currently a PMD split cannot fail, but future patches will add lazy PTE
page table allocation: at split time, __split_huge_pmd() will call
pte_alloc_one(), which can fail if an order-0 allocation cannot be
satisfied.

The split functions currently return void, so callers have no way to
detect this failure.  The PMD would remain huge, but callers would
assume the split succeeded and proceed on that basis — interpreting a
huge PMD entry as a page table pointer could result in a kernel bug.

Change __split_huge_pmd(), split_huge_pmd(), split_huge_pmd_if_needed()
and split_huge_pmd_address() to return 0 on success (and -ENOMEM on
allocation failure in a later patch).  Convert the split_huge_pmd macro
to a static inline function that propagates the return value.  The
return values will be handled by the callers in future commits.

The CONFIG_TRANSPARENT_HUGEPAGE=n stubs are changed to return 0.

No behaviour change is expected with this patch.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h | 34 ++++++++++++++++++----------------
 mm/huge_memory.c        | 16 ++++++++++------
 2 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1258fa37e85b5..b081ce044c735 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -418,7 +418,7 @@ static inline int split_huge_page(struct page *page)
 extern struct list_lru deferred_split_lru;
 void deferred_split_folio(struct folio *folio, bool partially_mapped);
 
-void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
 
 /**
@@ -447,15 +447,15 @@ static inline bool pmd_is_huge(pmd_t pmd)
 	return false;
 }
 
-#define split_huge_pmd(__vma, __pmd, __address)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		if (pmd_is_huge(*____pmd))				\
-			__split_huge_pmd(__vma, __pmd, __address,	\
-					 false);			\
-	}  while (0)
+static inline int split_huge_pmd(struct vm_area_struct *vma,
+					     pmd_t *pmd, unsigned long address)
+{
+	if (pmd_is_huge(*pmd))
+		return __split_huge_pmd(vma, pmd, address, false);
+	return 0;
+}
 
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze);
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
@@ -649,13 +649,15 @@ static inline int try_folio_split_to_order(struct folio *folio,
 }
 
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
-#define split_huge_pmd(__vma, __pmd, __address)	\
-	do { } while (0)
-
-static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long address, bool freeze) {}
-static inline void split_huge_pmd_address(struct vm_area_struct *vma,
-		unsigned long address, bool freeze) {}
+static inline int split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+				unsigned long address)
+{
+	return 0;
+}
+static inline int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address, bool freeze) { return 0; }
+static inline int split_huge_pmd_address(struct vm_area_struct *vma,
+		unsigned long address, bool freeze) { return 0; }
 static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
 					 bool freeze) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b2a6060b3c202..976a1c74c0870 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3283,7 +3283,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 }
 
-void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze)
 {
 	spinlock_t *ptl;
@@ -3297,20 +3297,22 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	split_huge_pmd_locked(vma, range.start, pmd, freeze);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
+
+	return 0;
 }
 
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze)
 {
 	pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
 
 	if (!pmd)
-		return;
+		return 0;
 
-	__split_huge_pmd(vma, pmd, address, freeze);
+	return __split_huge_pmd(vma, pmd, address, freeze);
 }
 
-static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned long address)
+static inline int split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned long address)
 {
 	/*
 	 * If the new address isn't hpage aligned and it could previously
@@ -3319,7 +3321,9 @@ static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned
 	if (!IS_ALIGNED(address, HPAGE_PMD_SIZE) &&
 	    range_in_vma(vma, ALIGN_DOWN(address, HPAGE_PMD_SIZE),
 			 ALIGN(address, HPAGE_PMD_SIZE)))
-		split_huge_pmd_address(vma, address, false);
+		return split_huge_pmd_address(vma, address, false);
+
+	return 0;
 }
 
 void vma_adjust_trans_huge(struct vm_area_struct *vma,
-- 
2.52.0




* [v3 02/24] mm: thp: propagate split failure from vma_adjust_trans_huge()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
  2026-03-27  2:08 ` [v3 01/24] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 03/24] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

With lazy PTE page table allocation, split_huge_pmd_if_needed() and
thus vma_adjust_trans_huge() can now fail if the order-0 allocation
for the page table fails during the split.  This failure must be
checked to prevent leaving a huge PMD straddling a VMA boundary.

The vma_adjust_trans_huge() call is moved before vma_prepare() in all
three callers (__split_vma, vma_shrink, commit_merge). Previously it sat
between vma_prepare() and vma_complete(), where there is no mechanism to
abort - once vma_prepare() has been called, we must reach vma_complete().
By moving the call earlier, a split failure can return -ENOMEM cleanly
without needing to undo VMA preparation.

This move is safe because vma_adjust_trans_huge() acquires its own
pmd_lock() internally and does not depend on any locks or state changes
from vma_prepare(). The VMA boundaries are also unchanged at the new
call site, satisfying __split_huge_pmd_locked()'s requirement that the
VMA covers the full PMD range.

All three callers (__split_vma, vma_shrink, commit_merge) already
return -ENOMEM on allocation failures for other reasons (e.g. failure
in vma_iter_prealloc), so this follows the same pattern.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h           | 13 ++++++-----
 mm/huge_memory.c                  | 21 +++++++++++++-----
 mm/vma.c                          | 37 +++++++++++++++++++++----------
 tools/testing/vma/include/stubs.h |  9 ++++----
 4 files changed, 53 insertions(+), 27 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b081ce044c735..224965fce4e66 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -483,8 +483,8 @@ int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
 		     int advice);
 int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		     unsigned long end, bool *lock_dropped);
-void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
-			   unsigned long end, struct vm_area_struct *next);
+int vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
+			  unsigned long end, struct vm_area_struct *next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
 spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma);
 
@@ -685,11 +685,12 @@ static inline int madvise_collapse(struct vm_area_struct *vma,
 	return -EINVAL;
 }
 
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
-					 unsigned long start,
-					 unsigned long end,
-					 struct vm_area_struct *next)
+static inline int vma_adjust_trans_huge(struct vm_area_struct *vma,
+					unsigned long start,
+					unsigned long end,
+					struct vm_area_struct *next)
 {
+	return 0;
 }
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 976a1c74c0870..99f3b8b24c682 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3326,20 +3326,31 @@ static inline int split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned
 	return 0;
 }
 
-void vma_adjust_trans_huge(struct vm_area_struct *vma,
+int vma_adjust_trans_huge(struct vm_area_struct *vma,
 			   unsigned long start,
 			   unsigned long end,
 			   struct vm_area_struct *next)
 {
+	int err;
+
 	/* Check if we need to split start first. */
-	split_huge_pmd_if_needed(vma, start);
+	err = split_huge_pmd_if_needed(vma, start);
+	if (err)
+		return err;
 
 	/* Check if we need to split end next. */
-	split_huge_pmd_if_needed(vma, end);
+	err = split_huge_pmd_if_needed(vma, end);
+	if (err)
+		return err;
 
 	/* If we're incrementing next->vm_start, we might need to split it. */
-	if (next)
-		split_huge_pmd_if_needed(next, end);
+	if (next) {
+		err = split_huge_pmd_if_needed(next, end);
+		if (err)
+			return err;
+	}
+
+	return 0;
 }
 
 static void unmap_folio(struct folio *folio)
diff --git a/mm/vma.c b/mm/vma.c
index a43f3c5d4b3dd..b4a3839a8036e 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -513,6 +513,15 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			return err;
 	}
 
+	/*
+	 * Split any THP straddling the split boundary before splitting
+	 * the VMA itself. Do this before vma_prepare() so we can
+	 * cleanly fail without undoing VMA preparation.
+	 */
+	err = vma_adjust_trans_huge(vma, vma->vm_start, addr, NULL);
+	if (err)
+		return err;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
@@ -550,11 +559,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vp.insert = new;
 	vma_prepare(&vp);
 
-	/*
-	 * Get rid of huge pages and shared page tables straddling the split
-	 * boundary.
-	 */
-	vma_adjust_trans_huge(vma, vma->vm_start, addr, NULL);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_split(vma, addr);
 
@@ -732,6 +736,7 @@ static int commit_merge(struct vma_merge_struct *vmg)
 {
 	struct vm_area_struct *vma;
 	struct vma_prepare vp;
+	int err;
 
 	if (vmg->__adjust_next_start) {
 		/* We manipulate middle and adjust next, which is the target. */
@@ -743,6 +748,16 @@ static int commit_merge(struct vma_merge_struct *vmg)
 		vma_iter_config(vmg->vmi, vmg->start, vmg->end);
 	}
 
+	/*
+	 * THP pages may need to do additional splits if we increase
+	 * middle->vm_start. Do this before vma_prepare() so we can
+	 * cleanly fail without undoing VMA preparation.
+	 */
+	err = vma_adjust_trans_huge(vma, vmg->start, vmg->end,
+				  vmg->__adjust_middle_start ? vmg->middle : NULL);
+	if (err)
+		return err;
+
 	init_multi_vma_prep(&vp, vma, vmg);
 
 	/*
@@ -755,12 +770,6 @@ static int commit_merge(struct vma_merge_struct *vmg)
 		return -ENOMEM;
 
 	vma_prepare(&vp);
-	/*
-	 * THP pages may need to do additional splits if we increase
-	 * middle->vm_start.
-	 */
-	vma_adjust_trans_huge(vma, vmg->start, vmg->end,
-			      vmg->__adjust_middle_start ? vmg->middle : NULL);
 	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
 	vmg_adjust_set_range(vmg);
 	vma_iter_store_overwrite(vmg->vmi, vmg->target);
@@ -1248,9 +1257,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	       unsigned long start, unsigned long end, pgoff_t pgoff)
 {
 	struct vma_prepare vp;
+	int err;
 
 	WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
 
+	err = vma_adjust_trans_huge(vma, start, end, NULL);
+	if (err)
+		return err;
+
 	if (vma->vm_start < start)
 		vma_iter_config(vmi, vma->vm_start, start);
 	else
@@ -1263,7 +1277,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	init_vma_prep(&vp, vma);
 	vma_prepare(&vp);
-	vma_adjust_trans_huge(vma, start, end, NULL);
 
 	vma_iter_clear(vmi);
 	vma_set_range(vma, start, end, pgoff);
diff --git a/tools/testing/vma/include/stubs.h b/tools/testing/vma/include/stubs.h
index a30b8bc849557..952e3cc88ef10 100644
--- a/tools/testing/vma/include/stubs.h
+++ b/tools/testing/vma/include/stubs.h
@@ -419,11 +419,12 @@ static inline int vma_dup_policy(struct vm_area_struct *src, struct vm_area_stru
 	return 0;
 }
 
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
-					 unsigned long start,
-					 unsigned long end,
-					 struct vm_area_struct *next)
+static inline int vma_adjust_trans_huge(struct vm_area_struct *vma,
+					unsigned long start,
+					unsigned long end,
+					struct vm_area_struct *next)
 {
+	return 0;
 }
 
 static inline void hugetlb_split(struct vm_area_struct *, unsigned long) {}
-- 
2.52.0




* [v3 03/24] mm: thp: handle split failure in copy_huge_pmd()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
  2026-03-27  2:08 ` [v3 01/24] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
  2026-03-27  2:08 ` [v3 02/24] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 04/24] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

copy_huge_pmd() splits the source PMD when a folio is pinned and can't
be COW-shared at PMD granularity.  It then returns -EAGAIN so
copy_pmd_range() falls through to copy_pte_range().

If the split fails, the PMD is still huge.  Returning -EAGAIN would cause
copy_pmd_range() to call copy_pte_range(), which would dereference the
huge PMD entry as if it were a pointer to a PTE page table.
Return -ENOMEM on split failure instead (copy_huge_pmd() already does
this if pte_alloc_one() fails), which causes copy_page_range() to
abort the fork with -ENOMEM, just as copy_pmd_range() would be
aborted if pmd_alloc() or copy_pte_range() failed.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 99f3b8b24c682..8ad43897bdf80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1913,7 +1913,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
-		__split_huge_pmd(src_vma, src_pmd, addr, false);
+		/*
+		 * If split fails, the PMD is still huge so copy_pte_range
+		 * (via -EAGAIN) would misinterpret it as a page table
+		 * pointer.  Return -ENOMEM directly to copy_pmd_range.
+		 */
+		if (__split_huge_pmd(src_vma, src_pmd, addr, false))
+			return -ENOMEM;
 		return -EAGAIN;
 	}
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-- 
2.52.0




* [v3 04/24] mm: thp: handle split failure in do_huge_pmd_wp_page()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (2 preceding siblings ...)
  2026-03-27  2:08 ` [v3 03/24] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 05/24] mm: thp: handle split failure in zap_pmd_range() Usama Arif
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

do_huge_pmd_wp_page() splits the PMD when a COW of the entire huge page
fails (e.g., can't allocate a new folio or the folio is pinned).  It then
returns VM_FAULT_FALLBACK so the fault can be retried at PTE granularity.

If the split fails, the PMD is still huge.  Returning VM_FAULT_FALLBACK
would re-enter the PTE fault path, which expects a PTE page table at the
PMD entry — not a huge PMD.

Return VM_FAULT_OOM on split failure, which signals the fault handler to
invoke the OOM killer or return -ENOMEM to userspace.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8ad43897bdf80..9f4be707c8cb0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2137,7 +2137,13 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	folio_unlock(folio);
 	spin_unlock(vmf->ptl);
 fallback:
-	__split_huge_pmd(vma, vmf->pmd, vmf->address, false);
+	/*
+	 * Split failure means the PMD is still huge; returning
+	 * VM_FAULT_FALLBACK would re-enter the PTE path with a
+	 * huge PMD, causing incorrect behavior.
+	 */
+	if (__split_huge_pmd(vma, vmf->pmd, vmf->address, false))
+		return VM_FAULT_OOM;
 	return VM_FAULT_FALLBACK;
 }
 
-- 
2.52.0




* [v3 05/24] mm: thp: handle split failure in zap_pmd_range()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (3 preceding siblings ...)
  2026-03-27  2:08 ` [v3 04/24] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 06/24] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

zap_pmd_range() splits a huge PMD when the zap range doesn't cover the
full PMD (partial unmap).  If the split fails, the PMD stays huge.
Falling through to zap_pte_range() would dereference the huge PMD entry
as a PTE page table pointer.

Skip the range covered by the PMD on split failure instead.

The skip is safe across all call paths into zap_pmd_range():

- exit_mmap() and OOM reaper: the zap range covers entire VMAs, so
  every PMD is fully covered (next - addr == HPAGE_PMD_SIZE).  The
  zap_huge_pmd() branch handles these without splitting.  The split
  failure path is unreachable.

- munmap / mmap overlay: vma_adjust_trans_huge() (called from
  __split_vma) splits any PMD straddling the VMA boundary before the
  VMA is split.  If that PMD split fails, __split_vma() returns
  -ENOMEM and the munmap is aborted before reaching zap_pmd_range().
  The split failure path is unreachable.

- MADV_DONTNEED: advisory hint, the kernel is allowed to ignore it.
  The pages remain valid and accessible.  A subsequent access returns
  existing data without faulting.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/memory.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e44469f9cf659..caf97c48cb166 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1985,9 +1985,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_is_huge(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE)
-				__split_huge_pmd(vma, pmd, addr, false);
-			else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
+			if (next - addr != HPAGE_PMD_SIZE) {
+				/*
+				 * If split fails, the PMD stays huge.
+				 * Skip the range to avoid falling through
+				 * to zap_pte_range, which would treat the
+				 * huge PMD entry as a page table pointer.
+				 */
+				if (__split_huge_pmd(vma, pmd, addr, false)) {
+					addr = next;
+					continue;
+				}
+			} else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
 				addr = next;
 				continue;
 			}
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 06/24] mm: thp: handle split failure in wp_huge_pmd()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (4 preceding siblings ...)
  2026-03-27  2:08 ` [v3 05/24] mm: thp: handle split failure in zap_pmd_range() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 07/24] mm: thp: retry on split failure in change_pmd_range() Usama Arif
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

wp_huge_pmd() splits the PMD when COW or write-notify must be handled at
PTE level (e.g., shared/file VMAs, userfaultfd).  It then returns
VM_FAULT_FALLBACK so the fault handler retries at PTE granularity.
If the split fails, the PMD is still huge and the PTE fault path
cannot handle a huge PMD entry.

Return VM_FAULT_OOM on split failure, which signals the fault handler to
invoke the OOM killer or return -ENOMEM to userspace. This is similar to
what __handle_mm_fault() would do if p4d_alloc() or pud_alloc() fails.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/memory.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index caf97c48cb166..b99ec3ffc18d1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6328,8 +6328,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 	}
 
 split:
-	/* COW or write-notify handled on pte level: split pmd. */
-	__split_huge_pmd(vma, vmf->pmd, vmf->address, false);
+	/*
+	 * COW or write-notify handled on pte level: split pmd.
+	 * If split fails, the PMD is still huge so falling back
+	 * to PTE handling would be incorrect.
+	 */
+	if (__split_huge_pmd(vma, vmf->pmd, vmf->address, false))
+		return VM_FAULT_OOM;
 
 	return VM_FAULT_FALLBACK;
 }
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 07/24] mm: thp: retry on split failure in change_pmd_range()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (5 preceding siblings ...)
  2026-03-27  2:08 ` [v3 06/24] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 08/24] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

change_pmd_range() splits a huge PMD when mprotect() targets a sub-PMD
range or when VMA flags require per-PTE protection bits that can't be
represented at PMD granularity.

If pte_alloc_one() fails inside __split_huge_pmd(), the huge PMD remains
intact. Without this change, change_pte_range() would return -EAGAIN
because pte_offset_map_lock() returns NULL for a huge PMD, sending the
code back to the 'again' label to retry the split—without ever calling
cond_resched().

Now that __split_huge_pmd() returns an error code, handle it explicitly:
yield the CPU with cond_resched() and retry via goto again, giving other
tasks a chance to free memory.

Returning an error all the way up to change_protection_range() would
not work: it would leave part of the range with the new protections and
the rest unchanged, with no easy way to roll back the already-modified
entries (and previous splits). __split_huge_pmd() only requires an
order-0 allocation and is extremely unlikely to fail.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/mprotect.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 110d47a36d4bb..e39e96963da8b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -477,7 +477,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_is_huge(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    pgtable_split_needed(vma, cp_flags)) {
-				__split_huge_pmd(vma, pmd, addr, false);
+				ret = __split_huge_pmd(vma, pmd, addr, false);
+				if (ret) {
+					/*
+					 * Yield and retry. Other tasks
+					 * may free memory while we
+					 * reschedule.
+					 */
+					cond_resched();
+					goto again;
+				}
 				/*
 				 * For file-backed, the pmd could have been
 				 * cleared; make sure pmd populated if
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 08/24] mm: thp: handle split failure in follow_pmd_mask()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (6 preceding siblings ...)
  2026-03-27  2:08 ` [v3 07/24] mm: thp: retry on split failure in change_pmd_range() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 09/24] mm: handle walk_page_range() failure from THP split Usama Arif
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

follow_pmd_mask() splits a huge PMD when FOLL_SPLIT_PMD is set, so GUP
can pin individual pages at PTE granularity.

If the split fails, the PMD is still huge and follow_page_pte() cannot
process it. Return an ERR_PTR() on split failure, which causes the GUP
caller to get -ENOMEM. follow_pmd_mask() already returns -ENOMEM when
pte_alloc() fails (the same order-0 page table allocation that can make
split_huge_pmd() fail), hence this is a safe change.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/gup.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index ad9ded39609cb..07c6b0483c322 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -928,8 +928,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		return follow_page_pte(vma, address, pmd, flags);
 	}
 	if (pmd_trans_huge(pmdval) && (flags & FOLL_SPLIT_PMD)) {
+		int ret;
+
 		spin_unlock(ptl);
-		split_huge_pmd(vma, pmd, address);
+		/*
+		 * If split fails, the PMD is still huge and
+		 * we cannot proceed to follow_page_pte.
+		 */
+		ret = split_huge_pmd(vma, pmd, address);
+		if (ret)
+			return ERR_PTR(ret);
 		/* If pmd was left empty, stuff a page table in there quickly */
 		return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
 			follow_page_pte(vma, address, pmd, flags);
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 09/24] mm: handle walk_page_range() failure from THP split
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (7 preceding siblings ...)
  2026-03-27  2:08 ` [v3 08/24] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 10/24] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

walk_pmd_range() splits a huge PMD when a page table walker with
pte_entry or install_pte callbacks needs PTE-level granularity. If
the split fails due to memory allocation failure in pte_alloc_one(),
walk_pte_range() would encounter a huge PMD instead of a PTE page
table.

Break out of the loop on split failure and return -ENOMEM to the
walker's caller. Callers that reach this path (those with pte_entry
or install_pte set) such as mincore, hmm_range_fault and
queue_pages_range already handle negative return values from
walk_page_range(). A similar approach is taken when __pte_alloc()
fails in walk_pmd_range().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/pagewalk.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b4..c5850de71b8cb 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -165,9 +165,11 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 				continue;
 		}
 
-		if (walk->vma)
-			split_huge_pmd(walk->vma, pmd, addr);
-		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
+		if (walk->vma) {
+			err = split_huge_pmd(walk->vma, pmd, addr);
+			if (err)
+				break;
+		} else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
 			continue; /* Nothing to do. */
 
 		err = walk_pte_range(pmd, addr, next, walk);
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 10/24] mm: thp: handle split failure in mremap move_page_tables()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (8 preceding siblings ...)
  2026-03-27  2:08 ` [v3 09/24] mm: handle walk_page_range() failure from THP split Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 11/24] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

move_page_tables() splits a huge PMD when the extent is smaller than
HPAGE_PMD_SIZE and the PMD can't be moved at PMD granularity.

If the split fails, the PMD stays huge and move_ptes() can't operate on
individual PTEs.

Break out of the loop on split failure, which causes mremap() to return
however much was moved so far (partial move).  This is consistent with
other allocation failures in the same loop (e.g., alloc_new_pmd(),
pte_alloc()).

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/mremap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index e9c8b1d05832b..2f70cb48f6061 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -855,7 +855,13 @@ unsigned long move_page_tables(struct pagetable_move_control *pmc)
 			if (extent == HPAGE_PMD_SIZE &&
 			    move_pgt_entry(pmc, HPAGE_PMD, old_pmd, new_pmd))
 				continue;
-			split_huge_pmd(pmc->old, old_pmd, pmc->old_addr);
+			/*
+			 * If split fails, the PMD stays huge and move_ptes
+			 * can't operate on it.  Break out so the caller
+			 * can handle the partial move.
+			 */
+			if (split_huge_pmd(pmc->old, old_pmd, pmc->old_addr))
+				break;
 		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
 			   extent == PMD_SIZE) {
 			/*
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 11/24] mm: thp: handle split failure in userfaultfd move_pages()
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (9 preceding siblings ...)
  2026-03-27  2:08 ` [v3 10/24] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 12/24] mm: thp: handle split failure in device migration Usama Arif
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

The UFFDIO_MOVE ioctl's move_pages() loop splits a huge PMD when the
folio is pinned and can't be moved at PMD granularity.

If the split fails, the PMD stays huge and move_pages_pte() can't
process individual pages. Break out of the loop on split failure
and return -ENOMEM to the caller. This is similar to how other
allocation failures (__pte_alloc, mm_alloc_pmd) are handled in
move_pages().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/userfaultfd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 481ec7eb44420..a04d62dd1e065 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1946,7 +1946,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 				}
 
 				spin_unlock(ptl);
-				split_huge_pmd(src_vma, src_pmd, src_addr);
+				/*
+				 * If split fails, the PMD stays huge and
+				 * move_pages_pte can't process it.
+				 */
+				err = split_huge_pmd(src_vma, src_pmd, src_addr);
+				if (err)
+					break;
 				/* The folio will be split by move_pages_pte() */
 				continue;
 			}
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 12/24] mm: thp: handle split failure in device migration
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (10 preceding siblings ...)
  2026-03-27  2:08 ` [v3 11/24] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 13/24] mm: proc: handle split_huge_pmd failure in pagemap_scan Usama Arif
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Device memory migration has two call sites that split huge PMDs:

migrate_vma_split_unmapped_folio():
  Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
  destination that doesn't support compound pages.  It splits the PMD
  then splits the folio via folio_split_unmapped().

  If the PMD split fails, folio_split_unmapped() would operate on an
  unsplit folio with inconsistent page table state.  Propagate -ENOMEM
  to skip this page's migration. This is safe as folio_split_unmapped
  failure would be propagated in a similar way.

migrate_vma_insert_page():
  Called from migrate_vma_pages() when inserting a page into a VMA
  during migration back from device memory.  If a huge zero PMD exists
  at the target address, it must be split before PTE insertion.

  If the split fails, the subsequent pte_alloc() and set_pte_at() would
  operate on a PMD slot still occupied by the huge zero entry.  Use
  goto abort, consistent with other allocation failures in this function.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/migrate_device.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 2912eba575d5e..00003fbe803df 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -919,7 +919,19 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
 	 * drops a reference at the end.
 	 */
 	folio_get(folio);
-	split_huge_pmd_address(migrate->vma, addr, true);
+	/*
+	 * If PMD split fails, folio_split_unmapped would operate on an
+	 * unsplit folio with inconsistent page table state.
+	 */
+	ret = split_huge_pmd_address(migrate->vma, addr, true);
+	if (ret) {
+		/*
+		 * folio_get above was not consumed by split_huge_pmd_address.
+		 * put back that reference.
+		 */
+		folio_put(folio);
+		return ret;
+	}
 	ret = folio_split_unmapped(folio, 0);
 	if (ret)
 		return ret;
@@ -1015,7 +1027,13 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		if (pmd_trans_huge(*pmdp)) {
 			if (!is_huge_zero_pmd(*pmdp))
 				goto abort;
-			split_huge_pmd(vma, pmdp, addr);
+			/*
+			 * If split fails, the huge zero PMD remains and
+			 * pte_alloc/PTE insertion that follows would be
+			 * incorrect.
+			 */
+			if (split_huge_pmd(vma, pmdp, addr))
+				goto abort;
 		} else if (pmd_leaf(*pmdp))
 			goto abort;
 	}
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 13/24] mm: proc: handle split_huge_pmd failure in pagemap_scan
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (11 preceding siblings ...)
  2026-03-27  2:08 ` [v3 12/24] mm: thp: handle split failure in device migration Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 14/24] powerpc/mm: handle split_huge_pmd failure in subpage_prot Usama Arif
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

pagemap_scan_thp_entry() splits a huge PMD when the PAGEMAP_SCAN ioctl
needs to write-protect only a portion of a THP. It then returns -ENOENT
so pagemap_scan_pmd_entry() falls through to PTE-level handling.

Check the split_huge_pmd() return value and propagate the error on
failure. Returning -ENOMEM instead of -ENOENT prevents the fallthrough
to PTE handling, and the error propagates through walk_page_range() to
do_pagemap_scan() where it becomes the ioctl return value.
pagemap_scan_backout_range() already undoes the buffered output, and
walk_end is written back to userspace so the caller knows where the
scan stopped.

If the split fails, the PMD remains huge. An alternative to the
approach taken in this patch is to return -ENOENT, causing the caller
to proceed to pte_offset_map_lock(). ___pte_offset_map() detects the
trans_huge PMD and returns NULL, which sets ACTION_AGAIN, restarting
the walker on the same PMD; by then the system might have enough free
memory for the split to succeed.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 fs/proc/task_mmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e091931d7ca19..f5f459140b5c0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2714,9 +2714,13 @@ static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
 	 * needs to be performed on a portion of the huge page.
 	 */
 	if (end != start + HPAGE_SIZE) {
+		int err;
+
 		spin_unlock(ptl);
-		split_huge_pmd(vma, pmd, start);
+		err = split_huge_pmd(vma, pmd, start);
 		pagemap_scan_backout_range(p, start, end);
+		if (err)
+			return err;
 		/* Report as if there was no THP */
 		return -ENOENT;
 	}
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 14/24] powerpc/mm: handle split_huge_pmd failure in subpage_prot
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (12 preceding siblings ...)
  2026-03-27  2:08 ` [v3 13/24] mm: proc: handle split_huge_pmd failure in pagemap_scan Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 15/24] fs/dax: handle split_huge_pmd failure in dax_iomap_pmd_fault Usama Arif
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

subpage_walk_pmd_entry() splits huge PMDs so that the subpage_prot
syscall can control the access permissions on individual 4kB pages.

In practice this cannot fail today: sys_subpage_prot() returns -ENOENT
early when radix is enabled, and on hash powerpc
arch_needs_pgtable_deposit() is true so split uses the pre-deposited
page table and always succeeds. The change is for compliance with the
__must_check annotation introduced in a later patch, and for correctness
should the call chain ever become reachable on architectures with lazy
PTE allocation.

Propagate the error through the full call chain up to the syscall.
The syscall already returns -ENOMEM in other places when it runs out
of memory.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 arch/powerpc/mm/book3s64/subpage_prot.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index 37d47282c3686..b3635a11ff433 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -139,8 +139,8 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
 				  unsigned long end, struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
-	split_huge_pmd(vma, pmd, addr);
-	return 0;
+
+	return split_huge_pmd(vma, pmd, addr);
 }
 
 static const struct mm_walk_ops subpage_walk_ops = {
@@ -148,11 +148,12 @@ static const struct mm_walk_ops subpage_walk_ops = {
 	.walk_lock	= PGWALK_WRLOCK_VERIFY,
 };
 
-static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
-				    unsigned long len)
+static int subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
+				   unsigned long len)
 {
 	struct vm_area_struct *vma;
 	VMA_ITERATOR(vmi, mm, addr);
+	int err;
 
 	/*
 	 * We don't try too hard, we just mark all the vma in that range
@@ -160,14 +161,17 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 	 */
 	for_each_vma_range(vmi, vma, addr + len) {
 		vm_flags_set(vma, VM_NOHUGEPAGE);
-		walk_page_vma(vma, &subpage_walk_ops, NULL);
+		err = walk_page_vma(vma, &subpage_walk_ops, NULL);
+		if (err)
+			return err;
 	}
+	return 0;
 }
 #else
-static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
-				    unsigned long len)
+static int subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
+				   unsigned long len)
 {
-	return;
+	return 0;
 }
 #endif
 
@@ -229,7 +233,9 @@ SYSCALL_DEFINE3(subpage_prot, unsigned long, addr,
 		mm->context.hash_context->spt = spt;
 	}
 
-	subpage_mark_vma_nohuge(mm, addr, len);
+	err = subpage_mark_vma_nohuge(mm, addr, len);
+	if (err)
+		goto out;
 	for (limit = addr + len; addr < limit; addr = next) {
 		next = pmd_addr_end(addr, limit);
 		err = -ENOMEM;
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 15/24] fs/dax: handle split_huge_pmd failure in dax_iomap_pmd_fault
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (13 preceding siblings ...)
  2026-03-27  2:08 ` [v3 14/24] powerpc/mm: handle split_huge_pmd failure in subpage_prot Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 16/24] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

dax_iomap_pmd_fault() splits a huge PMD when the PMD fault falls back
to PTE-level handling. The split is necessary so that the subsequent
PTE fault does not misinterpret the huge PMD entry as a page table
pointer.

In practice this cannot fail today: DAX VMAs are always file-backed,
and __split_huge_pmd() only allocates a PTE page table (the operation
that can return -ENOMEM) for anonymous VMAs. For file-backed VMAs the
split path simply zaps the PMD and returns 0.

Use WARN_ON_ONCE to document this invariant and check the return value
for __must_check compliance introduced in the next patch.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 fs/dax.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index a5237169b4679..ed1859e8a916f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -2039,7 +2039,14 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, unsigned long *pfnp,
 	dax_unlock_entry(&xas, entry);
 fallback:
 	if (ret == VM_FAULT_FALLBACK) {
-		split_huge_pmd(vmf->vma, vmf->pmd, vmf->address);
+		/*
+		 * split_huge_pmd() cannot fail for file-backed (DAX) VMAs
+		 * since splitting only zaps the PMD without allocating a
+		 * PTE page table.
+		 */
+		if (WARN_ON_ONCE(split_huge_pmd(vmf->vma, vmf->pmd,
+						vmf->address)))
+			ret = VM_FAULT_OOM;
 		count_vm_event(THP_FAULT_FALLBACK);
 	}
 out:
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 16/24] mm: huge_mm: Make sure all split_huge_pmd calls are checked
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (14 preceding siblings ...)
  2026-03-27  2:08 ` [v3 15/24] fs/dax: handle split_huge_pmd failure in dax_iomap_pmd_fault Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:08 ` [v3 17/24] mm: thp: allocate PTE page tables lazily at split time Usama Arif
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Mark __split_huge_pmd(), split_huge_pmd() and split_huge_pmd_address()
with __must_check so the compiler warns if any caller ignores the return
value. Ignoring a failure and operating on the assumption that the PMD
was split could result in a kernel bug. The possibility of an order-0
allocation failing for a page table is very low, but it should be
handled correctly.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 224965fce4e66..c4d0badc4ce27 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -418,7 +418,7 @@ static inline int split_huge_page(struct page *page)
 extern struct list_lru deferred_split_lru;
 void deferred_split_folio(struct folio *folio, bool partially_mapped);
 
-int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __must_check __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
 
 /**
@@ -447,7 +447,7 @@ static inline bool pmd_is_huge(pmd_t pmd)
 	return false;
 }
 
-static inline int split_huge_pmd(struct vm_area_struct *vma,
+static inline int __must_check split_huge_pmd(struct vm_area_struct *vma,
 					     pmd_t *pmd, unsigned long address)
 {
 	if (pmd_is_huge(*pmd))
@@ -455,7 +455,7 @@ static inline int split_huge_pmd(struct vm_area_struct *vma,
 	return 0;
 }
 
-int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int __must_check split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze);
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 17/24] mm: thp: allocate PTE page tables lazily at split time
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (15 preceding siblings ...)
  2026-03-27  2:08 ` [v3 16/24] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
@ 2026-03-27  2:08 ` Usama Arif
  2026-03-27  2:09 ` [v3 18/24] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:08 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages,
it pre-allocates a PTE page table and deposits it via
pgtable_trans_huge_deposit(). This deposited table is withdrawn during
PMD split or zap. The rationale was that split must not fail—if the
kernel decides to split a THP, it needs a PTE table to populate.

However, every anon THP wastes 4KB (one page table page) that sits
unused in the deposit list for the lifetime of the mapping. On large
servers that can easily have 100s of GBs of THPs, these tables cost
200MB per 100GB of THP memory. The original rationale is also not a
real constraint: it is ok for a split to fail, and if the kernel cannot
satisfy an order-0 allocation for the split, there are much bigger
problems. The saved memory can serve any other use, including the page
table allocations required during split.

This patch removes the pre-deposit for anonymous pages on architectures
where arch_needs_pgtable_deposit() returns false (every arch apart from
powerpc with the hash MMU; radix does not need the deposit) and
allocates the PTE table lazily, only when a split actually occurs. The
split path is modified to accept a caller-provided page table.

PowerPC exception:

Ideally the page table deposit code could be removed entirely, making
this commit mostly a cleanup. Unfortunately, powerpc's hash MMU stores
hash slot information in the deposited page table, so the pre-deposit
remains necessary there. All deposit/withdraw paths are guarded by
arch_needs_pgtable_deposit(), so powerpc behavior is unchanged by this
patch. On the bright side, arch_needs_pgtable_deposit() always
evaluates to false at compile time on non-powerpc architectures, so the
pre-deposit code is not compiled in there.

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h |   4 +-
 mm/huge_memory.c        | 158 ++++++++++++++++++++++++++--------------
 mm/khugepaged.c         |   7 +-
 mm/migrate_device.c     |  15 ++--
 mm/rmap.c               |  39 +++++++++-
 5 files changed, 158 insertions(+), 65 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c4d0badc4ce27..c02ba9c4b8d5b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -561,7 +561,7 @@ static inline bool thp_migration_supported(void)
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze);
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
 bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 			   pmd_t *pmdp, struct folio *folio);
 void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
@@ -660,7 +660,7 @@ static inline int split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze) { return 0; }
 static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
-					 bool freeze) {}
+					 bool freeze, pgtable_t pgtable) {}
 
 static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long addr, pmd_t *pmdp,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f4be707c8cb0..2acedb1de7404 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1309,17 +1309,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
-	pgtable_t pgtable;
+	pgtable_t pgtable = NULL;
 	vm_fault_t ret = 0;
 
 	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
 	if (unlikely(!folio))
 		return VM_FAULT_FALLBACK;
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable)) {
-		ret = VM_FAULT_OOM;
-		goto release;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable)) {
+			ret = VM_FAULT_OOM;
+			goto release;
+		}
 	}
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1334,14 +1336,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		if (userfaultfd_missing(vma)) {
 			spin_unlock(vmf->ptl);
 			folio_put(folio);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			ret = handle_userfault(vmf, VM_UFFD_MISSING);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
 		}
-		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		if (pgtable) {
+			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+						   pgtable);
+			mm_inc_nr_ptes(vma->vm_mm);
+		}
 		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
-		mm_inc_nr_ptes(vma->vm_mm);
 		spin_unlock(vmf->ptl);
 	}
 
@@ -1437,9 +1443,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
 	entry = pmd_mkspecial(entry);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (pgtable) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		mm_inc_nr_ptes(mm);
+	}
 	set_pmd_at(mm, haddr, pmd, entry);
-	mm_inc_nr_ptes(mm);
 }
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
@@ -1458,16 +1466,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
-		pgtable_t pgtable;
+		pgtable_t pgtable = NULL;
 		struct folio *zero_folio;
 		vm_fault_t ret;
 
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (unlikely(!pgtable))
-			return VM_FAULT_OOM;
+		if (arch_needs_pgtable_deposit()) {
+			pgtable = pte_alloc_one(vma->vm_mm);
+			if (unlikely(!pgtable))
+				return VM_FAULT_OOM;
+		}
 		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
 		if (unlikely(!zero_folio)) {
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
 			return VM_FAULT_FALLBACK;
 		}
@@ -1477,10 +1488,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			ret = check_stable_address_space(vma->vm_mm);
 			if (ret) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
@@ -1491,7 +1504,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			}
 		} else {
 			spin_unlock(vmf->ptl);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 		}
 		return ret;
 	}
@@ -1823,8 +1837,10 @@ static void copy_huge_non_present_pmd(
 	}
 
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1864,9 +1880,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
 
-	pgtable = pte_alloc_one(dst_mm);
-	if (unlikely(!pgtable))
-		goto out;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(dst_mm);
+		if (unlikely(!pgtable))
+			goto out;
+	}
 
 	dst_ptl = pmd_lock(dst_mm, dst_pmd);
 	src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1884,7 +1902,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	if (unlikely(!pmd_trans_huge(pmd))) {
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
 	/*
@@ -1910,7 +1929,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
 		/* Page maybe pinned: split and retry the fault on PTEs. */
 		folio_put(src_folio);
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		/*
@@ -1924,8 +1944,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
@@ -2376,22 +2398,13 @@ static struct folio *normal_or_softleaf_folio_pmd(struct vm_area_struct *vma,
 static bool has_deposited_pgtable(struct vm_area_struct *vma, pmd_t pmdval,
 		struct folio *folio)
 {
-	/* Some architectures require unconditional depositing. */
-	if (arch_needs_pgtable_deposit())
-		return true;
-
-	/*
-	 * Huge zero always deposited except for DAX which handles itself, see
-	 * set_huge_zero_folio().
-	 */
-	if (is_huge_zero_pmd(pmdval))
-		return !vma_is_dax(vma);
-
 	/*
-	 * Otherwise, only anonymous folios are deposited, see
-	 * __do_huge_pmd_anonymous_page().
+	 * With lazy PTE page table allocation, only architectures that
+	 * require unconditional depositing (powerpc hash MMU) will have
+	 * deposited page tables. All other architectures allocate PTE
+	 * page tables lazily at split time.
 	 */
-	return folio && folio_test_anon(folio);
+	return arch_needs_pgtable_deposit();
 }
 
 /**
@@ -2514,7 +2527,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			force_flush = true;
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
-		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
+		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
+		    arch_needs_pgtable_deposit()) {
 			pgtable_t pgtable;
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
@@ -2823,8 +2837,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	}
 	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
 
-	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
-	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	}
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
 	/* unblock rmap walks */
@@ -2966,10 +2982,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
+		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
 	pmd_t _pmd, old_pmd;
 	unsigned long addr;
 	pte_t *pte;
@@ -2985,7 +3000,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	 */
 	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -3007,12 +3031,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr, bool freeze)
+		unsigned long haddr, bool freeze, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct folio *folio;
 	struct page *page;
-	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool soft_dirty, uffd_wp = false, young = false, write = false;
 	bool anon_exclusive = false, dirty = false;
@@ -3036,6 +3059,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
+		if (pgtable)
+			pte_free(mm, pgtable);
 		if (vma_is_special_huge(vma))
 			return;
 		if (unlikely(pmd_is_migration_entry(old_pmd))) {
@@ -3068,7 +3093,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * small page also write protected so it does not seems useful
 		 * to invalidate secondary mmu at this time.
 		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
 	}
 
 	if (pmd_is_migration_entry(*pmd)) {
@@ -3192,7 +3217,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Withdraw the table only after we mark the pmd entry invalid.
 	 * This's critical for some architectures (Power).
 	 */
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -3288,11 +3322,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze)
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
 {
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
-		__split_huge_pmd_locked(vma, pmd, address, freeze);
+		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
+	else if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
 }
 
 int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3300,13 +3336,24 @@ int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
+	pgtable_t pgtable = NULL;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address & HPAGE_PMD_MASK,
 				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
+
+	/* allocate pagetable before acquiring pmd lock */
+	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable) {
+			mmu_notifier_invalidate_range_end(&range);
+			return -ENOMEM;
+		}
+	}
+
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	split_huge_pmd_locked(vma, range.start, pmd, freeze);
+	split_huge_pmd_locked(vma, range.start, pmd, freeze, pgtable);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 
@@ -3442,7 +3489,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	}
 
 	folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma);
-	zap_deposited_table(mm, pmdp);
+	if (arch_needs_pgtable_deposit())
+		zap_deposited_table(mm, pmdp);
 	add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 	if (vma->vm_flags & VM_LOCKED)
 		mlock_drain_local();
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d06d84219e1b8..40b33263f6135 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1239,7 +1239,12 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	} else {
+		mm_dec_nr_ptes(mm);
+		pte_free(mm, pgtable);
+	}
 	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
 	spin_unlock(pmd_ptl);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 00003fbe803df..b9242217a81b6 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -829,9 +829,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 
 	__folio_mark_uptodate(folio);
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable))
-		goto abort;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable))
+			goto abort;
+	} else {
+		pgtable = NULL;
+	}
 
 	if (folio_is_device_private(folio)) {
 		swp_entry_t swp_entry;
@@ -879,10 +883,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 	folio_get(folio);
 
 	if (flush) {
-		pte_free(vma->vm_mm, pgtable);
+		if (pgtable)
+			pte_free(vma->vm_mm, pgtable);
 		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
 		pmdp_invalidate(vma, addr, pmdp);
-	} else {
+	} else if (pgtable) {
 		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
 		mm_inc_nr_ptes(vma->vm_mm);
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index 78b7fb5f367ce..efbcdd3b32632 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -76,6 +76,7 @@
 #include <linux/mm_inline.h>
 #include <linux/oom.h>
 
+#include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 #define CREATE_TRACE_POINTS
@@ -1995,6 +1996,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	unsigned long pfn;
 	unsigned long hsz = 0;
 	int ptes = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2029,6 +2031,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/*
 		 * If the folio is in an mlock()d vma, we must not swap it out.
@@ -2078,12 +2084,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			}
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				/*
 				 * We temporarily have to drop the PTL and
 				 * restart so we can process the PTE-mapped THP.
 				 */
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, false);
+						      pvmw.pmd, false, pgtable);
 				flags &= ~TTU_SPLIT_HUGE_PMD;
 				page_vma_mapped_walk_restart(&pvmw);
 				continue;
@@ -2363,6 +2378,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		break;
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
@@ -2422,6 +2440,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
 	unsigned long hsz = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2456,6 +2475,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
@@ -2463,6 +2486,15 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			__maybe_unused pmd_t pmdval;
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				/*
 				 * split_huge_pmd_locked() might leave the
 				 * folio mapped through PTEs. Retry the walk
@@ -2470,7 +2502,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				 * abort the walk.
 				 */
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, true);
+						      pvmw.pmd, true, pgtable);
 				flags &= ~TTU_SPLIT_HUGE_PMD;
 				page_vma_mapped_walk_restart(&pvmw);
 				continue;
@@ -2721,6 +2753,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		folio_put(folio);
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
-- 
2.52.0


* [v3 18/24] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (16 preceding siblings ...)
  2026-03-27  2:08 ` [v3 17/24] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-03-27  2:09 ` Usama Arif
  2026-03-27  2:09 ` [v3 19/24] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:09 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

The previous commit made deposit/withdraw necessary only on
architectures where arch_needs_pgtable_deposit() returns true
(currently only the powerpc hash MMU). The generic implementation in
pgtable-generic.c and the s390/sparc overrides are therefore dead code:
all call sites are guarded by arch_needs_pgtable_deposit(), which is
compile-time false on those architectures. Remove them entirely and
replace the extern declarations with static inline no-op stubs for the
default case.

pgtable_trans_huge_{deposit,withdraw}() are renamed to
arch_pgtable_trans_huge_{deposit,withdraw}().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 12 +++---
 arch/s390/include/asm/pgtable.h              |  6 ---
 arch/s390/mm/pgtable.c                       | 41 --------------------
 arch/sparc/include/asm/pgtable_64.h          |  6 ---
 arch/sparc/mm/tlb.c                          | 36 -----------------
 include/linux/pgtable.h                      | 16 +++++---
 mm/debug_vm_pgtable.c                        |  4 +-
 mm/huge_memory.c                             | 26 ++++++-------
 mm/khugepaged.c                              |  2 +-
 mm/memory.c                                  |  2 +-
 mm/migrate_device.c                          |  2 +-
 mm/pgtable-generic.c                         | 32 ---------------
 12 files changed, 35 insertions(+), 150 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 60e283cf22be1..f1f36a4ed2bc8 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1360,18 +1360,18 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pud_t *pudp, int full);
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-static inline void pgtable_trans_huge_deposit(struct mm_struct *mm,
-					      pmd_t *pmdp, pgtable_t pgtable)
+#define arch_pgtable_trans_huge_deposit arch_pgtable_trans_huge_deposit
+static inline void arch_pgtable_trans_huge_deposit(struct mm_struct *mm,
+						   pmd_t *pmdp, pgtable_t pgtable)
 {
 	if (radix_enabled())
 		return radix__pgtable_trans_huge_deposit(mm, pmdp, pgtable);
 	return hash__pgtable_trans_huge_deposit(mm, pmdp, pgtable);
 }
 
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
-						    pmd_t *pmdp)
+#define arch_pgtable_trans_huge_withdraw arch_pgtable_trans_huge_withdraw
+static inline pgtable_t arch_pgtable_trans_huge_withdraw(struct mm_struct *mm,
+							 pmd_t *pmdp)
 {
 	if (radix_enabled())
 		return radix__pgtable_trans_huge_withdraw(mm, pmdp);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 40a6fb19dd1dc..9394aabe0442b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1659,12 +1659,6 @@ pud_t pudp_xchg_direct(struct mm_struct *, unsigned long, pud_t *, pud_t);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable);
-
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 
 #define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
 static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 4acd8b140c4bd..c9a9ab2c7d937 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -312,44 +312,3 @@ pud_t pudp_xchg_direct(struct mm_struct *mm, unsigned long addr,
 	return old;
 }
 EXPORT_SYMBOL(pudp_xchg_direct);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	struct list_head *lh = (struct list_head *) pgtable;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(lh);
-	else
-		list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	struct list_head *lh;
-	pgtable_t pgtable;
-	pte_t *ptep;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	lh = (struct list_head *) pgtable;
-	if (list_empty(lh))
-		pmd_huge_pte(mm, pmdp) = NULL;
-	else {
-		pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
-		list_del(lh);
-	}
-	ptep = (pte_t *) pgtable;
-	set_pte(ptep, __pte(_PAGE_INVALID));
-	ptep++;
-	set_pte(ptep, __pte(_PAGE_INVALID));
-	return pgtable;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 74ede706fb325..60861560f8c40 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -987,12 +987,6 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable);
-
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #endif
 
 /*
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 6d9dd5eb13287..9049d54e6e2cb 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -275,40 +275,4 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 	return old;
 }
 
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	struct list_head *lh = (struct list_head *) pgtable;
-
-	assert_spin_locked(&mm->page_table_lock);
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(lh);
-	else
-		list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	struct list_head *lh;
-	pgtable_t pgtable;
-
-	assert_spin_locked(&mm->page_table_lock);
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	lh = (struct list_head *) pgtable;
-	if (list_empty(lh))
-		pmd_huge_pte(mm, pmdp) = NULL;
-	else {
-		pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
-		list_del(lh);
-	}
-	pte_val(pgtable[0]) = 0;
-	pte_val(pgtable[1]) = 0;
-
-	return pgtable;
-}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdd68ed3ae1a9..f646414e801b7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1207,13 +1207,19 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				       pgtable_t pgtable);
+#ifndef arch_pgtable_trans_huge_deposit
+static inline void arch_pgtable_trans_huge_deposit(struct mm_struct *mm,
+						   pmd_t *pmdp, pgtable_t pgtable)
+{
+}
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+#ifndef arch_pgtable_trans_huge_withdraw
+static inline pgtable_t arch_pgtable_trans_huge_withdraw(struct mm_struct *mm,
+							 pmd_t *pmdp)
+{
+	return NULL;
+}
 #endif
 
 #ifndef arch_needs_pgtable_deposit
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 23dc3ee095619..db58c5a1f4f48 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -240,7 +240,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
 	/* Align the address wrt HPAGE_PMD_SIZE */
 	vaddr &= HPAGE_PMD_MASK;
 
-	pgtable_trans_huge_deposit(args->mm, args->pmdp, args->start_ptep);
+	arch_pgtable_trans_huge_deposit(args->mm, args->pmdp, args->start_ptep);
 
 	pmd = pfn_pmd(args->pmd_pfn, args->page_prot);
 	set_pmd_at(args->mm, vaddr, args->pmdp, pmd);
@@ -276,7 +276,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
 
 	/*  Clear the pte entries  */
 	pmdp_huge_get_and_clear(args->mm, vaddr, args->pmdp);
-	pgtable_trans_huge_withdraw(args->mm, args->pmdp);
+	arch_pgtable_trans_huge_withdraw(args->mm, args->pmdp);
 }
 
 static void __init pmd_leaf_tests(struct pgtable_debug_args *args)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2acedb1de7404..48c4884a6f386 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1343,7 +1343,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			return ret;
 		}
 		if (pgtable) {
-			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+			arch_pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
 						   pgtable);
 			mm_inc_nr_ptes(vma->vm_mm);
 		}
@@ -1444,7 +1444,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
 	entry = pmd_mkspecial(entry);
 	if (pgtable) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		mm_inc_nr_ptes(mm);
 	}
 	set_pmd_at(mm, haddr, pmd, entry);
@@ -1577,7 +1577,7 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	if (pgtable) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		mm_inc_nr_ptes(mm);
 		pgtable = NULL;
 	}
@@ -1839,7 +1839,7 @@ static void copy_huge_non_present_pmd(
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 	if (pgtable) {
 		mm_inc_nr_ptes(dst_mm);
-		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	}
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
@@ -1946,7 +1946,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 out_zero_page:
 	if (pgtable) {
 		mm_inc_nr_ptes(dst_mm);
-		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	}
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
@@ -2354,7 +2354,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t pgtable;
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	pte_free(mm, pgtable);
 	mm_dec_nr_ptes(mm);
 }
@@ -2434,7 +2434,7 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	/*
 	 * For architectures like ppc64 we look at deposited pgtable
 	 * when calling pmdp_huge_get_and_clear. So do the
-	 * pgtable_trans_huge_withdraw after finishing pmdp related
+	 * arch_pgtable_trans_huge_withdraw after finishing pmdp related
 	 * operations.
 	 */
 	orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
@@ -2530,8 +2530,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
 		    arch_needs_pgtable_deposit()) {
 			pgtable_t pgtable;
-			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
-			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
+			pgtable = arch_pgtable_trans_huge_withdraw(mm, old_pmd);
+			arch_pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
 		}
 		pmd = move_soft_dirty_pmd(pmd);
 		if (vma_has_uffd_without_event_remap(vma))
@@ -2838,8 +2838,8 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
 
 	if (arch_needs_pgtable_deposit()) {
-		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
-		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+		src_pgtable = arch_pgtable_trans_huge_withdraw(mm, src_pmd);
+		arch_pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
 	}
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
@@ -3001,7 +3001,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 
 	if (arch_needs_pgtable_deposit()) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	} else {
 		VM_BUG_ON(!pgtable);
 		/*
@@ -3218,7 +3218,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * This's critical for some architectures (Power).
 	 */
 	if (arch_needs_pgtable_deposit()) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	} else {
 		VM_BUG_ON(!pgtable);
 		/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40b33263f6135..b6d5da4567fe0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1240,7 +1240,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	if (arch_needs_pgtable_deposit()) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	} else {
 		mm_dec_nr_ptes(mm);
 		pte_free(mm, pgtable);
diff --git a/mm/memory.c b/mm/memory.c
index b99ec3ffc18d1..583ca340cef43 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5551,7 +5551,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+	arch_pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
 	/*
 	 * We are going to consume the prealloc table,
 	 * count that as nr_ptes.
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index b9242217a81b6..fb0f29a7c73bc 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -888,7 +888,7 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
 		pmdp_invalidate(vma, addr, pmdp);
 	} else if (pgtable) {
-		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+		arch_pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
 		mm_inc_nr_ptes(vma->vm_mm);
 	}
 	set_pmd_at(vma->vm_mm, addr, pmdp, entry);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b91b1a98029c7..5dfdbe6488062 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -164,38 +164,6 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(&pgtable->lru);
-	else
-		list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-#endif
-
-#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-/* no "address" argument so destroys page coloring of some arch */
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	pgtable_t pgtable;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	pmd_huge_pte(mm, pmdp) = list_first_entry_or_null(&pgtable->lru,
-							  struct page, lru);
-	if (pmd_huge_pte(mm, pmdp))
-		list_del(&pgtable->lru);
-	return pgtable;
-}
-#endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 19/24] mm: thp: add THP_SPLIT_PMD_FAILED counter
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (17 preceding siblings ...)
  2026-03-27  2:09 ` [v3 18/24] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
@ 2026-03-27  2:09 ` Usama Arif
  2026-03-27  2:09 ` [v3 20/24] selftests/mm: add THP PMD split test infrastructure Usama Arif
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:09 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add a vmstat counter to track PTE allocation failures during PMD split.
This enables monitoring of split failures due to memory pressure after
the lazy PTE page table allocation change.

The counter is incremented in three places:
- __split_huge_pmd(): Main entry point for splitting a PMD
- try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
- try_to_migrate_one(): When migration needs to split a PMD-mapped THP

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/vm_event_item.h | 1 +
 mm/huge_memory.c              | 1 +
 mm/rmap.c                     | 7 +++++++
 mm/vmstat.c                   | 1 +
 4 files changed, 10 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a0201..ce696cf7d6321 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -98,6 +98,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_UNDERUSED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
+		THP_SPLIT_PMD_FAILED,
 		THP_SCAN_EXCEED_NONE_PTE,
 		THP_SCAN_EXCEED_SWAP_PTE,
 		THP_SCAN_EXCEED_SHARED_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 48c4884a6f386..b93718931e729 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3347,6 +3347,7 @@ int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
 		if (!pgtable) {
+			count_vm_event(THP_SPLIT_PMD_FAILED);
 			mmu_notifier_invalidate_range_end(&range);
 			return -ENOMEM;
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index efbcdd3b32632..a0180f62d9f69 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2087,8 +2087,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				pgtable_t pgtable = prealloc_pte;
 
 				prealloc_pte = NULL;
+
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
+					count_vm_event(THP_SPLIT_PMD_FAILED);
+#endif
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
@@ -2491,6 +2495,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				prealloc_pte = NULL;
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
+					count_vm_event(THP_SPLIT_PMD_FAILED);
+#endif
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2370c6fb1fcd6..b8df9c7296d8a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1409,6 +1409,7 @@ const char * const vmstat_text[] = {
 	[I(THP_DEFERRED_SPLIT_PAGE)]		= "thp_deferred_split_page",
 	[I(THP_UNDERUSED_SPLIT_PAGE)]		= "thp_underused_split_page",
 	[I(THP_SPLIT_PMD)]			= "thp_split_pmd",
+	[I(THP_SPLIT_PMD_FAILED)]		= "thp_split_pmd_failed",
 	[I(THP_SCAN_EXCEED_NONE_PTE)]		= "thp_scan_exceed_none_pte",
 	[I(THP_SCAN_EXCEED_SWAP_PTE)]		= "thp_scan_exceed_swap_pte",
 	[I(THP_SCAN_EXCEED_SHARED_PTE)]		= "thp_scan_exceed_share_pte",
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 20/24] selftests/mm: add THP PMD split test infrastructure
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (18 preceding siblings ...)
  2026-03-27  2:09 ` [v3 19/24] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
@ 2026-03-27  2:09 ` Usama Arif
  2026-03-27  2:09 ` [v3 21/24] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:09 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test infrastructure for verifying THP PMD split behavior with lazy
PTE allocation. This includes:

- Test fixture with PMD-aligned memory allocation
- Helper functions for reading vmstat counters
- log_and_check_pmd_split() helper for logging the counters and checking
  that thp_split_pmd has incremented and thp_split_pmd_failed has not.
- THP allocation helper with verification

Also add a test checking that a partial munmap of a THP splits the PMD.
This exercises the zap_pmd_range() part of the split path.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 tools/testing/selftests/mm/Makefile           |   1 +
 .../testing/selftests/mm/thp_pmd_split_test.c | 149 ++++++++++++++++++
 2 files changed, 150 insertions(+)
 create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27e..4b4610c9b693d 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -95,6 +95,7 @@ TEST_GEN_FILES += uffd-stress
 TEST_GEN_FILES += uffd-unit-tests
 TEST_GEN_FILES += uffd-wp-mremap
 TEST_GEN_FILES += split_huge_page_test
+TEST_GEN_FILES += thp_pmd_split_test
 TEST_GEN_FILES += ksm_tests
 TEST_GEN_FILES += ksm_functional_tests
 TEST_GEN_FILES += mdwe_test
diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
new file mode 100644
index 0000000000000..0f54ac04760d5
--- /dev/null
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Tests various kernel code paths that handle THP PMD splitting.
+ *
+ * Prerequisites:
+ * - THP enabled (always or madvise mode):
+ *   echo always > /sys/kernel/mm/transparent_hugepage/enabled
+ *   or
+ *   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdint.h>
+
+#include "kselftest_harness.h"
+#include "thp_settings.h"
+#include "vm_util.h"
+
+/* Read vmstat counter */
+static unsigned long read_vmstat(const char *name)
+{
+	FILE *fp;
+	char line[256];
+	unsigned long value = 0;
+
+	fp = fopen("/proc/vmstat", "r");
+	if (!fp)
+		return 0;
+
+	while (fgets(line, sizeof(line), fp)) {
+		if (strncmp(line, name, strlen(name)) == 0 &&
+		    line[strlen(name)] == ' ') {
+			sscanf(line + strlen(name), " %lu", &value);
+			break;
+		}
+	}
+	fclose(fp);
+	return value;
+}
+
+/*
+ * Log the thp_split_pmd/thp_split_pmd_failed vmstat counters and check
+ * that thp_split_pmd has incremented since the "before" snapshot while
+ * thp_split_pmd_failed has not.
+ */
+static void log_and_check_pmd_split(struct __test_metadata *const _metadata,
+	unsigned long split_pmd_before, unsigned long split_pmd_failed_before)
+{
+	unsigned long split_pmd_after = read_vmstat("thp_split_pmd");
+	unsigned long split_pmd_failed_after = read_vmstat("thp_split_pmd_failed");
+
+	TH_LOG("thp_split_pmd: %lu -> %lu",
+	       split_pmd_before, split_pmd_after);
+	TH_LOG("thp_split_pmd_failed: %lu -> %lu",
+	       split_pmd_failed_before, split_pmd_failed_after);
+	ASSERT_GT(split_pmd_after, split_pmd_before);
+	ASSERT_EQ(split_pmd_failed_after, split_pmd_failed_before);
+}
+
+/* Allocate a THP at the given aligned address */
+static int allocate_thp(void *aligned, size_t pmdsize)
+{
+	int ret;
+
+	ret = madvise(aligned, pmdsize, MADV_HUGEPAGE);
+	if (ret)
+		return -1;
+
+	/* Touch all pages to allocate the THP */
+	memset(aligned, 0xAA, pmdsize);
+
+	/* Verify we got a THP */
+	if (!check_huge_anon(aligned, 1, pmdsize))
+		return -1;
+
+	return 0;
+}
+
+FIXTURE(thp_pmd_split)
+{
+	void *mem;		/* Base mmap allocation */
+	void *aligned;		/* PMD-aligned pointer within mem */
+	size_t pmdsize;		/* PMD size from sysfs */
+	size_t pagesize;	/* Base page size */
+	size_t mmap_size;	/* Total mmap size for alignment */
+	unsigned long split_pmd_before;
+	unsigned long split_pmd_failed_before;
+};
+
+FIXTURE_SETUP(thp_pmd_split)
+{
+	if (!thp_available())
+		SKIP(return, "THP not available");
+
+	self->pmdsize = read_pmd_pagesize();
+	if (!self->pmdsize)
+		SKIP(return, "Unable to read PMD size");
+
+	self->pagesize = getpagesize();
+	self->mmap_size = 4 * self->pmdsize;
+
+	self->split_pmd_before = read_vmstat("thp_split_pmd");
+	self->split_pmd_failed_before = read_vmstat("thp_split_pmd_failed");
+
+	self->mem = mmap(NULL, self->mmap_size, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(self->mem, MAP_FAILED);
+
+	/* Align to PMD boundary */
+	self->aligned = (void *)(((unsigned long)self->mem + self->pmdsize - 1) &
+				 ~(self->pmdsize - 1));
+}
+
+FIXTURE_TEARDOWN(thp_pmd_split)
+{
+	if (self->mem && self->mem != MAP_FAILED)
+		munmap(self->mem, self->mmap_size);
+}
+
+/*
+ * Partial munmap on THP (zap_pmd_range)
+ *
+ * Tests that partial munmap of a THP correctly splits the PMD.
+ * This exercises zap_pmd_range part of split.
+ */
+TEST_F(thp_pmd_split, partial_munmap)
+{
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	ret = munmap((char *)self->aligned + self->pagesize, self->pagesize);
+	ASSERT_EQ(ret, 0);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
+TEST_HARNESS_MAIN
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 21/24] selftests/mm: add partial_mprotect test for change_pmd_range
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (19 preceding siblings ...)
  2026-03-27  2:09 ` [v3 20/24] selftests/mm: add THP PMD split test infrastructure Usama Arif
@ 2026-03-27  2:09 ` Usama Arif
  2026-03-27  2:09 ` [v3 22/24] selftests/mm: add partial_mlock test Usama Arif
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:09 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial mprotect on THP which exercises change_pmd_range().
This verifies that partial mprotect correctly splits the PMD, applies
protection only to the requested portion, and leaves the rest of the
mapping writable.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 31 +++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 0f54ac04760d5..4944a5a516da9 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -146,4 +146,35 @@ TEST_F(thp_pmd_split, partial_munmap)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mprotect on THP (change_pmd_range)
+ *
+ * Tests that partial mprotect of a THP correctly splits the PMD and
+ * applies protection only to the requested portion. This exercises
+ * the mprotect path which now handles split failures.
+ */
+TEST_F(thp_pmd_split, partial_mprotect)
+{
+	volatile unsigned char *ptr = (volatile unsigned char *)self->aligned;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Partial mprotect - make middle page read-only */
+	ret = mprotect((char *)self->aligned + self->pagesize, self->pagesize, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Verify we can still write to non-protected pages */
+	ptr[0] = 0xDD;
+	ptr[self->pmdsize - 1] = 0xEE;
+
+	ASSERT_EQ(ptr[0], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[self->pmdsize - 1], (unsigned char)0xEE);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 22/24] selftests/mm: add partial_mlock test
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (20 preceding siblings ...)
  2026-03-27  2:09 ` [v3 21/24] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
@ 2026-03-27  2:09 ` Usama Arif
  2026-03-27  2:09 ` [v3 23/24] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:09 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add a test for partial mlock on a THP, which exercises walk_page_range()
on a sub-PMD range. This triggers a PMD split, since mlock operates at
page granularity.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 4944a5a516da9..3c9f05457efec 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -177,4 +177,30 @@ TEST_F(thp_pmd_split, partial_mprotect)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mlock triggering split (walk_page_range)
+ *
+ * Tests mlock on a partial THP region which should trigger a PMD split.
+ */
+TEST_F(thp_pmd_split, partial_mlock)
+{
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Partial mlock - should trigger PMD split */
+	ret = mlock((char *)self->aligned + self->pagesize, self->pagesize);
+	if (ret && errno == ENOMEM)
+		SKIP(return, "mlock failed with ENOMEM (resource limit)");
+	ASSERT_EQ(ret, 0);
+
+	/* Cleanup */
+	munlock((char *)self->aligned + self->pagesize, self->pagesize);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 23/24] selftests/mm: add partial_mremap test for move_page_tables
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (21 preceding siblings ...)
  2026-03-27  2:09 ` [v3 22/24] selftests/mm: add partial_mlock test Usama Arif
@ 2026-03-27  2:09 ` Usama Arif
  2026-03-27  2:09 ` [v3 24/24] selftests/mm: add madv_dontneed_partial test Usama Arif
  2026-03-27  8:51 ` [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time David Hildenbrand (Arm)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:09 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial mremap on THP which exercises move_page_tables().
This verifies that partial mremap correctly splits the PMD, moves
only the requested page, and preserves data integrity in both the
moved region and the original mapping.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 50 +++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 3c9f05457efec..1f29296759a5b 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -203,4 +203,54 @@ TEST_F(thp_pmd_split, partial_mlock)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mremap (move_page_tables)
+ *
+ * Tests that partial mremap of a THP correctly splits the PMD and
+ * moves only the requested portion. This exercises move_page_tables()
+ * which now handles split failures.
+ */
+TEST_F(thp_pmd_split, partial_mremap)
+{
+	void *new_addr;
+	unsigned long *ptr = (unsigned long *)self->aligned;
+	unsigned long *new_ptr;
+	unsigned long pattern = 0xABCDUL;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Write pattern to the page we'll move */
+	ptr[self->pagesize / sizeof(unsigned long)] = pattern;
+
+	/* Also write to first and last page to verify they stay intact */
+	ptr[0] = 0x1234UL;
+	ptr[(self->pmdsize - self->pagesize) / sizeof(unsigned long)] = 0x4567UL;
+
+	/* Partial mremap - move one base page from the THP */
+	new_addr = mremap((char *)self->aligned + self->pagesize, self->pagesize,
+			  self->pagesize, MREMAP_MAYMOVE);
+	if (new_addr == MAP_FAILED) {
+		if (errno == ENOMEM)
+			SKIP(return, "mremap failed with ENOMEM");
+		ASSERT_NE(new_addr, MAP_FAILED);
+	}
+
+	/* Verify data was moved correctly */
+	new_ptr = (unsigned long *)new_addr;
+	ASSERT_EQ(new_ptr[0], pattern);
+
+	/* Verify surrounding data is intact */
+	ASSERT_EQ(ptr[0], 0x1234UL);
+	ASSERT_EQ(ptr[(self->pmdsize - self->pagesize) / sizeof(unsigned long)], 0x4567UL);
+
+	/* Cleanup the moved page */
+	munmap(new_addr, self->pagesize);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [v3 24/24] selftests/mm: add madv_dontneed_partial test
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (22 preceding siblings ...)
  2026-03-27  2:09 ` [v3 23/24] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
@ 2026-03-27  2:09 ` Usama Arif
  2026-03-27  8:51 ` [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time David Hildenbrand (Arm)
  24 siblings, 0 replies; 27+ messages in thread
From: Usama Arif @ 2026-03-27  2:09 UTC (permalink / raw)
  To: Andrew Morton, david, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial MADV_DONTNEED on THP. This verifies that
MADV_DONTNEED correctly triggers a PMD split, discards only the
requested page (which becomes zero-filled), and preserves data
in the surrounding pages.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 1f29296759a5b..060ca1e341b75 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -253,4 +253,38 @@ TEST_F(thp_pmd_split, partial_mremap)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * MADV_DONTNEED on THP
+ *
+ * Tests that MADV_DONTNEED on a partial THP correctly handles
+ * the PMD split and discards only the requested pages.
+ */
+TEST_F(thp_pmd_split, partial_madv_dontneed)
+{
+	volatile unsigned char *ptr = (volatile unsigned char *)self->aligned;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Write pattern */
+	memset(self->aligned, 0xDD, self->pmdsize);
+
+	/* Partial MADV_DONTNEED - discard middle page */
+	ret = madvise((char *)self->aligned + self->pagesize, self->pagesize, MADV_DONTNEED);
+	ASSERT_EQ(ret, 0);
+
+	/* Verify non-discarded pages still have data */
+	ASSERT_EQ(ptr[0], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[2 * self->pagesize], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[self->pmdsize - 1], (unsigned char)0xDD);
+
+	/* Discarded page should be zero */
+	ASSERT_EQ(ptr[self->pagesize], (unsigned char)0x00);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time
  2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
                   ` (23 preceding siblings ...)
  2026-03-27  2:09 ` [v3 24/24] selftests/mm: add madv_dontneed_partial test Usama Arif
@ 2026-03-27  8:51 ` David Hildenbrand (Arm)
  2026-03-27  9:25   ` Lorenzo Stoakes (Oracle)
  24 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-27  8:51 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, Lorenzo Stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390

On 3/27/26 03:08, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> page table sits unused in a deposit list for the lifetime of the THP
> mapping, only to be withdrawn when the PMD is split or zapped. Every
> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> servers where hundreds of gigabytes of memory are mapped as THPs, this
> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> could otherwise satisfy other allocations, including the very PTE page
> table allocations needed when splits eventually occur.
> 
> This series removes the pre-deposit and allocates the PTE page table
> lazily — only when a PMD split actually happens. Since a large number
> of THPs are never split (they are zapped wholesale when processes exit or
> munmap the full range), the allocation is avoided entirely in the common
> case.
> 
> The pre-deposit pattern exists because split_huge_pmd was designed as an
> operation that must never fail: if the kernel decides to split, it needs
> a PTE page table, so one is deposited in advance. But "must never fail"
> is an unnecessarily strong requirement. A PMD split is typically triggered
> by a partial operation on a sub-PMD range — partial munmap, partial
> mprotect, COW on a pinned folio, GUP with FOLL_SPLIT_PMD, and similar.
> All of these operations already have well-defined error handling for
> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> fail and propagating the error through these existing paths is the natural
> thing to do. Furthermore, if the system cannot satisfy a single order-0
> allocation for a page table, it is under extreme memory pressure and
> failing the operation is the correct response.
> 
> Designing functions like split_huge_pmd as operations that cannot fail
> has a subtle but real cost to code quality. It forces a pre-allocation
> pattern - every THP creation path must deposit a page table, and every
> split or zap path must withdraw one, creating a hidden coupling between
> widely separated code paths.
> 
> This also serves as a code cleanup. On every architecture except powerpc
> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> series removes the generic implementations in pgtable-generic.c and the
> s390/sparc overrides, replacing them with no-op stubs guarded by
> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> on all non-powerpc architectures.
> 
> The series is structured as follows:
> 
> Patches 1-2:    Infrastructure — make split functions return int and
>                 propagate errors from vma_adjust_trans_huge() through
>                 __split_vma, vma_shrink, and commit_merge.
> 
> Patches 3-15:   Handle split failure at every call site — copy_huge_pmd,
>                 do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
>                 change_pmd_range (mprotect), follow_pmd_mask (GUP),
>                 walk_pmd_range (pagewalk), move_page_tables (mremap),
>                 move_pages (userfaultfd), device migration,
>                 pagemap_scan_thp_entry (proc), powerpc subpage_prot,
>                 and dax_iomap_pmd_fault (DAX). The code will become
>                 effective in Patch 17 when split functions start
>                 returning -ENOMEM.
> 
> Patch 16:       Add __must_check to __split_huge_pmd(), split_huge_pmd()
>                 and split_huge_pmd_address() so the compiler warns on
>                 unchecked return values.
> 
> Patch 17:       The actual change — allocate PTE page tables lazily at
>                 split time instead of pre-depositing at THP creation.
>                 This is when split functions will actually start returning
>                 -ENOMEM.
> 
> Patch 18:       Remove the now-dead deposit/withdraw code on
>                 non-powerpc architectures.
> 
> Patch 19:       Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
>                 split failures.
> 
> Patches 20-24:  Selftests covering partial munmap, mprotect, mlock,
>                 mremap, and MADV_DONTNEED on THPs to exercise the
>                 split paths.
> 
> The error handling patches are placed before the lazy allocation patch so
> that every call site is already prepared to handle split failures before
> the failure mode is introduced. This makes each patch independently safe
> to apply and bisect through.
> 
> The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
> enabled. The test results are below:
> 
> TAP version 13
> 1..5
> # Starting 5 tests from 1 test cases.
> #  RUN           thp_pmd_split.partial_munmap ...
> # thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
> # thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_munmap
> ok 1 thp_pmd_split.partial_munmap
> #  RUN           thp_pmd_split.partial_mprotect ...
> # thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
> # thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mprotect
> ok 2 thp_pmd_split.partial_mprotect
> #  RUN           thp_pmd_split.partial_mlock ...
> # thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
> # thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mlock
> ok 3 thp_pmd_split.partial_mlock
> #  RUN           thp_pmd_split.partial_mremap ...
> # thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
> # thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mremap
> ok 4 thp_pmd_split.partial_mremap
> #  RUN           thp_pmd_split.partial_madv_dontneed ...
> # thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
> # thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_madv_dontneed
> ok 5 thp_pmd_split.partial_madv_dontneed
> # PASSED: 5 / 5 tests passed.
> # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> The patches are based on mm-unstable as of 25 Mar
> git hash: d6f51e38433489eb22cb65d1bf72ac7993c5bdec
> 
> RFC v2 -> v3: https://lore.kernel.org/all/de0dc7ec-7a8d-4b1a-a419-1d97d2e4d510@linux.dev/

Note that we usually go from RFC to v1.

I'll put this series on my review backlog, but it will take some time
until I get to it (it won't make the next release either way :) ).

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time
  2026-03-27  8:51 ` [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time David Hildenbrand (Arm)
@ 2026-03-27  9:25   ` Lorenzo Stoakes (Oracle)
  0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27  9:25 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Usama Arif, Andrew Morton, willy, linux-mm, fvdl, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390

On Fri, Mar 27, 2026 at 09:51:31AM +0100, David Hildenbrand (Arm) wrote:
> >
> > RFC v2 -> v3: https://lore.kernel.org/all/de0dc7ec-7a8d-4b1a-a419-1d97d2e4d510@linux.dev/
>
> Note that we usually go from RFC to v1.
>
> I'll put this series on my review backlog, but it will take some time
> until I get to it (it won't make the next release either way :) ).

Yeah, please update to v1 from RFC because I'm looking at this and wondering
where v1, v2 was and why I didn't see them...

Generally I'd also advise that un-RFC'ing a biiiig series IDEALLY be done early
in a merge window :)

We've pretty much shut the door to new series this cycle, but being so late in
the window at -rc5 would mean no way for this one anyway.

But in general this is going to be a rebase pain, and I'd rather not see it
land in mm-unstable at this point, because that's supposed to be 'what's in the
next release' and it's stuff like this that leads to 'I am not sure what
mm-unstable represents any more' being a thing.

I think in an ideal world we'd ONLY see this in mm-new.

I wonder if we need some process for un-RFC'ing really, where somebody kinda
asks rather than it being a vibes thing as it is now (or a 'people don't reply
to my RFC' which yes I'm guilty of :)

Anyway these are more general points and not about you Usama, because - hey - all
this stuff is pretty unclear generally.

>
> --
> Cheers,
>
> David

Cheers, Lorenzo



end of thread, other threads:[~2026-03-27  9:25 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-27  2:08 [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time Usama Arif
2026-03-27  2:08 ` [v3 01/24] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
2026-03-27  2:08 ` [v3 02/24] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
2026-03-27  2:08 ` [v3 03/24] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
2026-03-27  2:08 ` [v3 04/24] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
2026-03-27  2:08 ` [v3 05/24] mm: thp: handle split failure in zap_pmd_range() Usama Arif
2026-03-27  2:08 ` [v3 06/24] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
2026-03-27  2:08 ` [v3 07/24] mm: thp: retry on split failure in change_pmd_range() Usama Arif
2026-03-27  2:08 ` [v3 08/24] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
2026-03-27  2:08 ` [v3 09/24] mm: handle walk_page_range() failure from THP split Usama Arif
2026-03-27  2:08 ` [v3 10/24] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
2026-03-27  2:08 ` [v3 11/24] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
2026-03-27  2:08 ` [v3 12/24] mm: thp: handle split failure in device migration Usama Arif
2026-03-27  2:08 ` [v3 13/24] mm: proc: handle split_huge_pmd failure in pagemap_scan Usama Arif
2026-03-27  2:08 ` [v3 14/24] powerpc/mm: handle split_huge_pmd failure in subpage_prot Usama Arif
2026-03-27  2:08 ` [v3 15/24] fs/dax: handle split_huge_pmd failure in dax_iomap_pmd_fault Usama Arif
2026-03-27  2:08 ` [v3 16/24] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
2026-03-27  2:08 ` [v3 17/24] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-03-27  2:09 ` [v3 18/24] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
2026-03-27  2:09 ` [v3 19/24] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
2026-03-27  2:09 ` [v3 20/24] selftests/mm: add THP PMD split test infrastructure Usama Arif
2026-03-27  2:09 ` [v3 21/24] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
2026-03-27  2:09 ` [v3 22/24] selftests/mm: add partial_mlock test Usama Arif
2026-03-27  2:09 ` [v3 23/24] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
2026-03-27  2:09 ` [v3 24/24] selftests/mm: add madv_dontneed_partial test Usama Arif
2026-03-27  8:51 ` [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time David Hildenbrand (Arm)
2026-03-27  9:25   ` Lorenzo Stoakes (Oracle)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox