[RFC 0/2] mm: thp: split time allocation of page table for THPs


* [RFC 0/2] mm: thp: split time allocation of page table for THPs
@ 2026-02-11 12:49 Usama Arif
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
  0 siblings, 2 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Usama Arif

This RFC allocates the PTE page table only at split time instead of
pre-depositing one for every THP, as suggested by David [1].
The core patch is the first one. The second one is not strictly needed:
it only adds the vmstat counter I used to show that split doesn't fail.
I expect it to stay at 0 all the time, and I won't include it in future
revisions.

Ideally all of the pre-deposit code would have been removed, but that is
not possible because of PowerPC. The rationale and further details are
covered in the commit message of the first patch, including why the patch
is safe.

[1] https://lore.kernel.org/all/ee5bd77f-87ad-4640-a974-304b488e4c64@kernel.org/
 
Usama Arif (2):
  mm: thp: allocate PTE page tables lazily at split time
  mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter

 include/linux/huge_mm.h       |   4 +-
 include/linux/vm_event_item.h |   1 +
 mm/huge_memory.c              | 145 ++++++++++++++++++++++++----------
 mm/khugepaged.c               |   7 +-
 mm/migrate_device.c           |  15 ++--
 mm/rmap.c                     |  42 +++++++++-
 mm/vmstat.c                   |   1 +
 7 files changed, 162 insertions(+), 53 deletions(-)

-- 
2.47.3




* [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
@ 2026-02-11 12:49 ` Usama Arif
  2026-02-11 13:25   ` David Hildenbrand (Arm)
                     ` (2 more replies)
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
  1 sibling, 3 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages,
it pre-allocates a PTE page table and deposits it via
pgtable_trans_huge_deposit(). This deposited table is withdrawn during
PMD split or zap. The rationale was that split must not fail: if the
kernel decides to split a THP, it needs a PTE table to populate.

However, every anon THP wastes 4KB (one page table page) that sits
unused in the deposit list for the lifetime of the mapping. On systems
with many THPs, this adds up to significant memory waste. The original
rationale also does not hold up. It is ok for split to fail, and if the
kernel can't satisfy an order-0 allocation for split, there are much
bigger problems. On large servers, where you can easily have 100s of GBs
of THPs, these tables use 200M per 100G. This memory could be used for
anything else, including allocating the page tables required during
split.
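
As a rough check of the 200M-per-100G figure, here is a minimal sketch of
the arithmetic. It assumes 2MB PMD-sized THPs and 4KB page table pages
(the common x86-64 configuration); the helper name and macros are made up
for illustration and are not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: 2 MiB PMD-mapped THP, 4 KiB PTE page table page. */
#define THP_SIZE	(2ULL << 20)
#define PTE_TABLE_SIZE	(4ULL << 10)

/*
 * Hypothetical helper: bytes held in deposited PTE tables when
 * thp_bytes of anonymous memory is PMD-mapped (one table per THP).
 */
static uint64_t deposit_overhead(uint64_t thp_bytes)
{
	/* Only whole THPs carry a deposited table. */
	assert(thp_bytes % THP_SIZE == 0);

	return (thp_bytes / THP_SIZE) * PTE_TABLE_SIZE;
}
```

With these assumptions, 100 GiB of THPs is 51200 huge pages, and 51200
tables of 4 KiB each come to 200 MiB.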

This patch removes the pre-deposit for anonymous pages on architectures
where arch_needs_pgtable_deposit() returns false (that is, everywhere
except powerpc running with the hash MMU rather than radix) and allocates
the PTE table lazily, only when a split actually occurs. The split path
is modified to accept a caller-provided page table.

PowerPC exception:

It would have been great to remove the pagetable deposit code
completely, which would have made this commit mostly a code cleanup.
Unfortunately, PowerPC's hash MMU stores hash slot information in the
deposited page table, so pre-deposit is still necessary there. All
deposit/withdraw paths are guarded by arch_needs_pgtable_deposit(), so
PowerPC behavior is unchanged with this patch. On a better note,
arch_needs_pgtable_deposit() always evaluates to false at compile time
on non-PowerPC architectures, so the pre-deposit code will not be
compiled in.
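
To illustrate the compile-time point, here is a small userspace sketch
(not kernel code). The function below is only a stand-in for the generic
definition, and struct toy_thp_map and toy_map_thp() are invented for the
example. Because the guard is a constant false, the compiler is free to
drop the guarded allocation entirely:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the non-PowerPC case: a compile-time constant false. */
static inline bool arch_needs_pgtable_deposit(void)
{
	return false;
}

/* Toy model of a PMD mapping and its optional deposited table. */
struct toy_thp_map {
	void *deposited;
};

/* Hypothetical mapping path: deposit a table only if the arch needs it. */
static int toy_map_thp(struct toy_thp_map *m)
{
	assert(m != NULL);

	m->deposited = NULL;
	if (arch_needs_pgtable_deposit()) {
		/* Dead code here: the condition folds to false. */
		m->deposited = malloc(4096);
		if (!m->deposited)
			return -1;
	}
	return 0;
}
```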

Why Split Failures Are Safe:

If a system is under such severe memory pressure that even a 4K
allocation for a PTE table fails, there are far greater problems than a
THP split being delayed. The OOM killer will likely intervene before this
becomes an issue.
When pte_alloc_one() fails because it cannot allocate a 4K page, the PMD
split is aborted and the THP remains intact. I could not get split to
fail, as it is very difficult to make an order-0 allocation fail.
Code analysis of what would happen if it does:

- mprotect(): If split fails in change_pmd_range(), it falls back to
change_pte_range(), which returns an error that causes the whole
function to be retried.

- munmap() (partial THP range): zap_pte_range() returns early when
pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
For a full THP range, zap_huge_pmd() unmaps the entire PMD without a
split.

- Memory reclaim (try_to_unmap()): Returns false, the folio is rotated
back onto the LRU and retried in the next reclaim cycle.

- Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
skips this folio and retries it later.

- CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, and the fault is
retried.

- madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
try_to_migrate() returns false, split_folio() returns -EAGAIN,
and madvise returns 0 (success), silently skipping the region. This
should be fine: madvise is only advice and can fail for other
reasons as well.
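
The allocation pattern these paths rely on can be modelled in userspace
roughly as follows. This is only a sketch: struct toy_pmd, toy_split()
and the injectable allocator are invented for illustration, the lock is
reduced to a comment, and the real __split_huge_pmd() returns void rather
than a status:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model of a PMD entry. */
struct toy_pmd {
	bool huge;		/* still mapped as one huge entry */
	void *pte_table;	/* populated only after a split */
};

/*
 * Hypothetical split: returns false only when the table allocation
 * failed, in which case the huge mapping is left untouched.
 */
static bool toy_split(struct toy_pmd *pmd, void *(*alloc)(size_t))
{
	void *table;

	assert(pmd != NULL && alloc != NULL);

	/* Allocate before the (omitted) lock is taken. */
	table = alloc(4096);
	if (!table)
		return false;	/* split aborted, caller may retry later */

	/* The PMD lock would be held from here on. */
	if (!pmd->huge) {
		free(table);	/* nothing to split, drop the unused table */
		return true;
	}
	pmd->pte_table = table;
	pmd->huge = false;
	return true;
}

/* Allocator that always fails, to exercise the abort path. */
static void *toy_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Passing toy_failing_alloc leaves the entry huge, which mirrors the "THP
remains intact" behaviour described above; passing malloc performs the
split.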

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h |   4 +-
 mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
 mm/khugepaged.c         |   7 +-
 mm/migrate_device.c     |  15 +++--
 mm/rmap.c               |  39 ++++++++++-
 5 files changed, 156 insertions(+), 53 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfdea..b21bb72a298c9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze);
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
 bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 			   pmd_t *pmdp, struct folio *folio);
 void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
@@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze) {}
 static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
-					 bool freeze) {}
+					 bool freeze, pgtable_t pgtable) {}
 
 static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long addr, pmd_t *pmdp,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 44ff8a648afd5..4c9a8d89fc8aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
-	pgtable_t pgtable;
+	pgtable_t pgtable = NULL;
 	vm_fault_t ret = 0;
 
 	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
 	if (unlikely(!folio))
 		return VM_FAULT_FALLBACK;
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable)) {
-		ret = VM_FAULT_OOM;
-		goto release;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable)) {
+			ret = VM_FAULT_OOM;
+			goto release;
+		}
 	}
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		if (userfaultfd_missing(vma)) {
 			spin_unlock(vmf->ptl);
 			folio_put(folio);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			ret = handle_userfault(vmf, VM_UFFD_MISSING);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
 		}
-		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		if (pgtable) {
+			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+						   pgtable);
+			mm_inc_nr_ptes(vma->vm_mm);
+		}
 		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
-		mm_inc_nr_ptes(vma->vm_mm);
 		spin_unlock(vmf->ptl);
 	}
 
@@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
 	entry = pmd_mkspecial(entry);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (pgtable) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		mm_inc_nr_ptes(mm);
+	}
 	set_pmd_at(mm, haddr, pmd, entry);
-	mm_inc_nr_ptes(mm);
 }
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
@@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
-		pgtable_t pgtable;
+		pgtable_t pgtable = NULL;
 		struct folio *zero_folio;
 		vm_fault_t ret;
 
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (unlikely(!pgtable))
-			return VM_FAULT_OOM;
+		if (arch_needs_pgtable_deposit()) {
+			pgtable = pte_alloc_one(vma->vm_mm);
+			if (unlikely(!pgtable))
+				return VM_FAULT_OOM;
+		}
 		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
 		if (unlikely(!zero_folio)) {
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
 			return VM_FAULT_FALLBACK;
 		}
@@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			ret = check_stable_address_space(vma->vm_mm);
 			if (ret) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
@@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			}
 		} else {
 			spin_unlock(vmf->ptl);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 		}
 		return ret;
 	}
@@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
 	}
 
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
 
-	pgtable = pte_alloc_one(dst_mm);
-	if (unlikely(!pgtable))
-		goto out;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(dst_mm);
+		if (unlikely(!pgtable))
+			goto out;
+	}
 
 	dst_ptl = pmd_lock(dst_mm, dst_pmd);
 	src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	if (unlikely(!pmd_trans_huge(pmd))) {
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
 	/*
@@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
 		/* Page maybe pinned: split and retry the fault on PTEs. */
 		folio_put(src_folio);
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		__split_huge_pmd(src_vma, src_pmd, addr, false);
@@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
@@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
-		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
+		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else {
@@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		}
 
 		if (folio_test_anon(folio)) {
-			zap_deposited_table(tlb->mm, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		} else {
 			if (arch_needs_pgtable_deposit())
@@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			force_flush = true;
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
-		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
+		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
+		    arch_needs_pgtable_deposit()) {
 			pgtable_t pgtable;
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
@@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	}
 	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
 
-	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
-	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	}
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
 	/* unblock rmap walks */
@@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
+		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
 	pmd_t _pmd, old_pmd;
 	unsigned long addr;
 	pte_t *pte;
@@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	 */
 	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr, bool freeze)
+		unsigned long haddr, bool freeze, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct folio *folio;
 	struct page *page;
-	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool soft_dirty, uffd_wp = false, young = false, write = false;
 	bool anon_exclusive = false, dirty = false;
@@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
+		if (pgtable)
+			pte_free(mm, pgtable);
 		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
 			return;
 		if (unlikely(pmd_is_migration_entry(old_pmd))) {
@@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * small page also write protected so it does not seems useful
 		 * to invalidate secondary mmu at this time.
 		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
 	}
 
 	if (pmd_is_migration_entry(*pmd)) {
@@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Withdraw the table only after we mark the pmd entry invalid.
 	 * This's critical for some architectures (Power).
 	 */
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze)
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
 {
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
-		__split_huge_pmd_locked(vma, pmd, address, freeze);
+		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
+	else if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
+	pgtable_t pgtable = NULL;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address & HPAGE_PMD_MASK,
 				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
+
+	/* allocate pagetable before acquiring pmd lock */
+	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable) {
+			mmu_notifier_invalidate_range_end(&range);
+			return;
+		}
+	}
+
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	split_huge_pmd_locked(vma, range.start, pmd, freeze);
+	split_huge_pmd_locked(vma, range.start, pmd, freeze, pgtable);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 }
@@ -3402,7 +3459,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	}
 
 	folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma);
-	zap_deposited_table(mm, pmdp);
+	if (arch_needs_pgtable_deposit())
+		zap_deposited_table(mm, pmdp);
 	add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 	if (vma->vm_flags & VM_LOCKED)
 		mlock_drain_local();
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa1e57fd2c469..0e976e4c975ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1223,7 +1223,12 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	} else {
+		mm_dec_nr_ptes(mm);
+		pte_free(mm, pgtable);
+	}
 	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
 	spin_unlock(pmd_ptl);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 0a8b31939640f..053db74303e36 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -829,9 +829,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 
 	__folio_mark_uptodate(folio);
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable))
-		goto abort;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable))
+			goto abort;
+	} else {
+		pgtable = NULL;
+	}
 
 	if (folio_is_device_private(folio)) {
 		swp_entry_t swp_entry;
@@ -879,10 +883,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 	folio_get(folio);
 
 	if (flush) {
-		pte_free(vma->vm_mm, pgtable);
+		if (pgtable)
+			pte_free(vma->vm_mm, pgtable);
 		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
 		pmdp_invalidate(vma, addr, pmdp);
-	} else {
+	} else if (pgtable) {
 		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
 		mm_inc_nr_ptes(vma->vm_mm);
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index edf5d32f46042..c6ff23fc12944 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -76,6 +76,7 @@
 #include <linux/mm_inline.h>
 #include <linux/oom.h>
 
+#include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 #define CREATE_TRACE_POINTS
@@ -1978,6 +1979,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	unsigned long pfn;
 	unsigned long hsz = 0;
 	int ptes = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2012,6 +2014,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/*
 		 * If the folio is in an mlock()d vma, we must not swap it out.
@@ -2061,12 +2067,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			}
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				/*
 				 * We temporarily have to drop the PTL and
 				 * restart so we can process the PTE-mapped THP.
 				 */
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, false);
+						      pvmw.pmd, false, pgtable);
 				flags &= ~TTU_SPLIT_HUGE_PMD;
 				page_vma_mapped_walk_restart(&pvmw);
 				continue;
@@ -2346,6 +2361,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		break;
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
@@ -2405,6 +2423,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
 	unsigned long hsz = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2439,6 +2458,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
@@ -2446,8 +2469,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			__maybe_unused pmd_t pmdval;
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, true);
+						      pvmw.pmd, true, pgtable);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
@@ -2698,6 +2730,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		folio_put(folio);
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
-- 
2.47.3




* [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-02-11 12:49 ` Usama Arif
  2026-02-11 13:27   ` David Hildenbrand (Arm)
                     ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Usama Arif

Add a vmstat counter to track PTE allocation failures during PMD split.
This enables monitoring of split failures due to memory pressure after
the lazy PTE page table allocation change.

The counter is incremented in three places:
- __split_huge_pmd(): Main entry point for splitting a PMD
- try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
- try_to_migrate_one(): When migration needs to split a PMD-mapped THP

Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
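
For monitoring, a minimal userspace sketch of reading the counter could
look like this. The helper name vmstat_lookup() is made up; it only
assumes the usual "name value" per-line layout of /proc/vmstat, and the
counter will only be present on a kernel carrying this RFC:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a counter in vmstat-formatted text ("name value" per line).
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
static int vmstat_lookup(FILE *f, const char *name, unsigned long long *val)
{
	char key[128];
	unsigned long long v;

	assert(f != NULL && name != NULL && val != NULL);

	while (fscanf(f, "%127s %llu", key, &v) == 2) {
		if (strcmp(key, name) == 0) {
			*val = v;
			return 0;
		}
	}
	return -1;
}
```

A caller would pass the stream returned by fopen("/proc/vmstat", "r") and
the name "thp_split_pmd_pte_alloc_failed", treating -1 as "counter not
available on this kernel".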

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/vm_event_item.h | 1 +
 mm/huge_memory.c              | 1 +
 mm/rmap.c                     | 3 +++
 mm/vmstat.c                   | 1 +
 4 files changed, 6 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 22a139f82d75f..827c9a8c251de 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_UNDERUSED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
+		THP_SPLIT_PMD_PTE_ALLOC_FAILED,
 		THP_SCAN_EXCEED_NONE_PTE,
 		THP_SCAN_EXCEED_SWAP_PTE,
 		THP_SCAN_EXCEED_SHARED_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4c9a8d89fc8aa..8d7c9f67f8a1d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3332,6 +3332,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
 		if (!pgtable) {
+			count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
 			mmu_notifier_invalidate_range_end(&range);
 			return;
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index c6ff23fc12944..5c4afedb29d5a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2070,8 +2070,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				pgtable_t pgtable = prealloc_pte;
 
 				prealloc_pte = NULL;
+
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+					count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
@@ -2474,6 +2476,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				prealloc_pte = NULL;
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+					count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 99270713e0c13..473edfa624a41 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1408,6 +1408,7 @@ const char * const vmstat_text[] = {
 	[I(THP_DEFERRED_SPLIT_PAGE)]		= "thp_deferred_split_page",
 	[I(THP_UNDERUSED_SPLIT_PAGE)]		= "thp_underused_split_page",
 	[I(THP_SPLIT_PMD)]			= "thp_split_pmd",
+	[I(THP_SPLIT_PMD_PTE_ALLOC_FAILED)]	= "thp_split_pmd_pte_alloc_failed",
 	[I(THP_SCAN_EXCEED_NONE_PTE)]		= "thp_scan_exceed_none_pte",
 	[I(THP_SCAN_EXCEED_SWAP_PTE)]		= "thp_scan_exceed_swap_pte",
 	[I(THP_SCAN_EXCEED_SHARED_PTE)]		= "thp_scan_exceed_share_pte",
-- 
2.47.3




* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-02-11 13:25   ` David Hildenbrand (Arm)
  2026-02-11 13:38     ` Usama Arif
  2026-02-12 12:13     ` Ritesh Harjani
  2026-02-11 13:35   ` David Hildenbrand (Arm)
  2026-02-11 19:28   ` Matthew Wilcox
  2 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:25 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev

CCing ppc folks

On 2/11/26 13:49, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages,
> it pre-allocates a PTE page table and deposits it via
> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> PMD split or zap. The rationale was that split must not fail—if the
> kernel decides to split a THP, it needs a PTE table to populate.
> 
> However, every anon THP wastes 4KB (one page table page) that sits
> unused in the deposit list for the lifetime of the mapping. On systems
> with many THPs, this adds up to significant memory waste. The original
> rationale is also not an issue. It is ok for split to fail, and if the
> kernel can't find an order 0 allocation for split, there are much bigger
> problems. On large servers where you can easily have 100s of GBs of THPs,
> the memory usage for these tables is 200M per 100G. This memory could be
> used for any other usecase, which include allocating the pagetables
> required during split.
> 
> This patch removes the pre-deposit for anonymous pages on architectures
> where arch_needs_pgtable_deposit() returns false (every arch apart from
> powerpc, and only when radix hash tables are not enabled) and allocates
> the PTE table lazily—only when a split actually occurs. The split path
> is modified to accept a caller-provided page table.
> 
> PowerPC exception:
> 
> It would have been great if we can completely remove the pagetable
> deposit code and this commit would mostly have been a code cleanup patch,
> unfortunately PowerPC has hash MMU, it stores hash slot information in
> the deposited page table and pre-deposit is necessary. All deposit/
> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> behavior is unchanged with this patch. On a better note,
> arch_needs_pgtable_deposit will always evaluate to false at compile time
> on non PowerPC architectures and the pre-deposit code will not be
> compiled in.

Is there a way to remove this? It's always been a confusing hack, now 
it's unpleasant to have around :)

In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 
copied generic pgtable_trans_huge_deposit() hurts my belly.


IIUC, hash is mostly used on legacy power systems, radix on newer ones.

So one obvious solution: remove PMD THP support for hash MMUs along with 
all this hacky deposit code.


The "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar 
checks need to be wrapped in a reasonable helper and likely this all 
needs to get cleaned up further.

The implementation of the generic pgtable_trans_huge_deposit() and the 
radix handlers etc must be removed. If any code would trigger them it 
would be a bug.

If we have to keep this around, pgtable_trans_huge_deposit() should 
likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there 
will not be generic support for it.

-- 
Cheers,

David



* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
@ 2026-02-11 13:27   ` David Hildenbrand (Arm)
  2026-02-11 13:31     ` Usama Arif
  2026-02-12 21:40   ` kernel test robot
  2026-02-12 21:40   ` kernel test robot
  2 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:27 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 2/11/26 13:49, Usama Arif wrote:
> Add a vmstat counter to track PTE allocation failures during PMD split.
> This enables monitoring of split failures due to memory pressure after
> the lazy PTE page table allocation change.
> 
> The counter is incremented in three places:
> - __split_huge_pmd(): Main entry point for splitting a PMD
> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
> 
> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>   include/linux/vm_event_item.h | 1 +
>   mm/huge_memory.c              | 1 +
>   mm/rmap.c                     | 3 +++
>   mm/vmstat.c                   | 1 +
>   4 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 22a139f82d75f..827c9a8c251de 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>   		THP_DEFERRED_SPLIT_PAGE,
>   		THP_UNDERUSED_SPLIT_PAGE,
>   		THP_SPLIT_PMD,
> +		THP_SPLIT_PMD_PTE_ALLOC_FAILED,

Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any 
(future) failures (if any) as well.

It's a shame that we called a remapping a "split" and keep causing 
confusion.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:27   ` David Hildenbrand (Arm)
@ 2026-02-11 13:31     ` Usama Arif
  2026-02-11 13:36       ` David Hildenbrand (Arm)
  2026-02-11 13:38       ` David Hildenbrand (Arm)
  0 siblings, 2 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:31 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
	linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
>> Add a vmstat counter to track PTE allocation failures during PMD split.
>> This enables monitoring of split failures due to memory pressure after
>> the lazy PTE page table allocation change.
>>
>> The counter is incremented in three places:
>> - __split_huge_pmd(): Main entry point for splitting a PMD
>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>
>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>   include/linux/vm_event_item.h | 1 +
>>   mm/huge_memory.c              | 1 +
>>   mm/rmap.c                     | 3 +++
>>   mm/vmstat.c                   | 1 +
>>   4 files changed, 6 insertions(+)
>>
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 22a139f82d75f..827c9a8c251de 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>           THP_DEFERRED_SPLIT_PAGE,
>>           THP_UNDERUSED_SPLIT_PAGE,
>>           THP_SPLIT_PMD,
>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
> 
> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
> 

Makes sense. This was just a patch I was using for testing and I wanted to share it.
It was always 0 as I couldn't get split to fail :) But I can rename it to THP_SPLIT_PMD_FAILED
as suggested, and we can use it for future split failures (hopefully none).

> It's a shame that we called a remapping a "split" and keep causing confusion.
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
  2026-02-11 13:25   ` David Hildenbrand (Arm)
@ 2026-02-11 13:35   ` David Hildenbrand (Arm)
  2026-02-11 13:46     ` Kiryl Shutsemau
  2026-02-11 13:47     ` Usama Arif
  2026-02-11 19:28   ` Matthew Wilcox
  2 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:35 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 2/11/26 13:49, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages,
> it pre-allocates a PTE page table and deposits it via
> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> PMD split or zap. The rationale was that split must not fail—if the
> kernel decides to split a THP, it needs a PTE table to populate.
> 
> However, every anon THP wastes 4KB (one page table page) that sits
> unused in the deposit list for the lifetime of the mapping. On systems
> with many THPs, this adds up to significant memory waste. The original
> rationale is also not an issue. It is ok for split to fail, and if the
> kernel can't find an order 0 allocation for split, there are much bigger
> problems. On large servers where you can easily have 100s of GBs of THPs,
> the memory usage for these tables is 200M per 100G. This memory could be
> used for any other usecase, which include allocating the pagetables
> required during split.
> 
> This patch removes the pre-deposit for anonymous pages on architectures
> where arch_needs_pgtable_deposit() returns false (every arch apart from
> powerpc, and only when radix hash tables are not enabled) and allocates
> the PTE table lazily—only when a split actually occurs. The split path
> is modified to accept a caller-provided page table.
> 
> PowerPC exception:
> 
> It would have been great if we can completely remove the pagetable
> deposit code and this commit would mostly have been a code cleanup patch,
> unfortunately PowerPC has hash MMU, it stores hash slot information in
> the deposited page table and pre-deposit is necessary. All deposit/
> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> behavior is unchanged with this patch. On a better note,
> arch_needs_pgtable_deposit will always evaluate to false at compile time
> on non PowerPC architectures and the pre-deposit code will not be
> compiled in.
> 
> Why Split Failures Are Safe:
> 
> If a system is under severe memory pressure that even a 4K allocation
> fails for a PTE table, there are far greater problems than a THP split
> being delayed. The OOM killer will likely intervene before this becomes an
> issue.
> When pte_alloc_one() fails due to not being able to allocate a 4K page,
> the PMD split is aborted and the THP remains intact. I could not get split
> to fail, as its very difficult to make order-0 allocation to fail.
> Code analysis of what would happen if it does:
> 
> - mprotect(): If split fails in change_pmd_range, it will fallback
> to change_pte_range, which will return an error which will cause the
> whole function to be retried again.
> 
> - munmap() (partial THP range): zap_pte_range() returns early when
> pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
> For full THP range, zap_huge_pmd() unmaps the entire PMD without
> split.
> 
> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> LRU, retried in next reclaim cycle.
> 
> - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
> skips this folio, retried later.
> 
> - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
> 
> -  madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
> try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
> try_to_migrate() returns false, split_folio() returns -EAGAIN,
> and madvise returns 0 (success) silently skipping the region. This
> should be fine. madvise is just an advice and can fail for other
> reasons as well.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>   include/linux/huge_mm.h |   4 +-
>   mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
>   mm/khugepaged.c         |   7 +-
>   mm/migrate_device.c     |  15 +++--
>   mm/rmap.c               |  39 ++++++++++-
>   5 files changed, 156 insertions(+), 53 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a4d9f964dfdea..b21bb72a298c9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
>   }
>   
>   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> -			   pmd_t *pmd, bool freeze);
> +			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
>   bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>   			   pmd_t *pmdp, struct folio *folio);
>   void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
> @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>   		unsigned long address, bool freeze) {}
>   static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
>   					 unsigned long address, pmd_t *pmd,
> -					 bool freeze) {}
> +					 bool freeze, pgtable_t pgtable) {}
>   
>   static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>   					 unsigned long addr, pmd_t *pmdp,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 44ff8a648afd5..4c9a8d89fc8aa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>   	struct vm_area_struct *vma = vmf->vma;
>   	struct folio *folio;
> -	pgtable_t pgtable;
> +	pgtable_t pgtable = NULL;
>   	vm_fault_t ret = 0;
>   
>   	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
>   	if (unlikely(!folio))
>   		return VM_FAULT_FALLBACK;
>   
> -	pgtable = pte_alloc_one(vma->vm_mm);
> -	if (unlikely(!pgtable)) {
> -		ret = VM_FAULT_OOM;
> -		goto release;
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(vma->vm_mm);
> +		if (unlikely(!pgtable)) {
> +			ret = VM_FAULT_OOM;
> +			goto release;
> +		}
>   	}
>   
>   	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   		if (userfaultfd_missing(vma)) {
>   			spin_unlock(vmf->ptl);
>   			folio_put(folio);
> -			pte_free(vma->vm_mm, pgtable);
> +			if (pgtable)
> +				pte_free(vma->vm_mm, pgtable);
>   			ret = handle_userfault(vmf, VM_UFFD_MISSING);
>   			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>   			return ret;
>   		}
> -		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> +		if (pgtable) {
> +			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
> +						   pgtable);
> +			mm_inc_nr_ptes(vma->vm_mm);
> +		}
>   		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
> -		mm_inc_nr_ptes(vma->vm_mm);
>   		spin_unlock(vmf->ptl);
>   	}
>   
> @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>   	pmd_t entry;
>   	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
>   	entry = pmd_mkspecial(entry);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +	if (pgtable) {
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		mm_inc_nr_ptes(mm);
> +	}
>   	set_pmd_at(mm, haddr, pmd, entry);
> -	mm_inc_nr_ptes(mm);
>   }
>   
>   vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>   			!mm_forbids_zeropage(vma->vm_mm) &&
>   			transparent_hugepage_use_zero_page()) {
> -		pgtable_t pgtable;
> +		pgtable_t pgtable = NULL;
>   		struct folio *zero_folio;
>   		vm_fault_t ret;
>   
> -		pgtable = pte_alloc_one(vma->vm_mm);
> -		if (unlikely(!pgtable))
> -			return VM_FAULT_OOM;
> +		if (arch_needs_pgtable_deposit()) {
> +			pgtable = pte_alloc_one(vma->vm_mm);
> +			if (unlikely(!pgtable))
> +				return VM_FAULT_OOM;
> +		}
>   		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
>   		if (unlikely(!zero_folio)) {
> -			pte_free(vma->vm_mm, pgtable);
> +			if (pgtable)
> +				pte_free(vma->vm_mm, pgtable);
>   			count_vm_event(THP_FAULT_FALLBACK);
>   			return VM_FAULT_FALLBACK;
>   		}
> @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   			ret = check_stable_address_space(vma->vm_mm);
>   			if (ret) {
>   				spin_unlock(vmf->ptl);
> -				pte_free(vma->vm_mm, pgtable);
> +				if (pgtable)
> +					pte_free(vma->vm_mm, pgtable);
>   			} else if (userfaultfd_missing(vma)) {
>   				spin_unlock(vmf->ptl);
> -				pte_free(vma->vm_mm, pgtable);
> +				if (pgtable)
> +					pte_free(vma->vm_mm, pgtable);
>   				ret = handle_userfault(vmf, VM_UFFD_MISSING);
>   				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>   			} else {
> @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   			}
>   		} else {
>   			spin_unlock(vmf->ptl);
> -			pte_free(vma->vm_mm, pgtable);
> +			if (pgtable)
> +				pte_free(vma->vm_mm, pgtable);
>   		}
>   		return ret;
>   	}
> @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
>   	}
>   
>   	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> -	mm_inc_nr_ptes(dst_mm);
> -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	if (pgtable) {
> +		mm_inc_nr_ptes(dst_mm);
> +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	}
>   	if (!userfaultfd_wp(dst_vma))
>   		pmd = pmd_swp_clear_uffd_wp(pmd);
>   	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	if (!vma_is_anonymous(dst_vma))
>   		return 0;
>   
> -	pgtable = pte_alloc_one(dst_mm);
> -	if (unlikely(!pgtable))
> -		goto out;
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(dst_mm);
> +		if (unlikely(!pgtable))
> +			goto out;
> +	}
>   
>   	dst_ptl = pmd_lock(dst_mm, dst_pmd);
>   	src_ptl = pmd_lockptr(src_mm, src_pmd);
> @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	}
>   
>   	if (unlikely(!pmd_trans_huge(pmd))) {
> -		pte_free(dst_mm, pgtable);
> +		if (pgtable)
> +			pte_free(dst_mm, pgtable);
>   		goto out_unlock;
>   	}
>   	/*
> @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
>   		/* Page maybe pinned: split and retry the fault on PTEs. */
>   		folio_put(src_folio);
> -		pte_free(dst_mm, pgtable);
> +		if (pgtable)
> +			pte_free(dst_mm, pgtable);
>   		spin_unlock(src_ptl);
>   		spin_unlock(dst_ptl);
>   		__split_huge_pmd(src_vma, src_pmd, addr, false);
> @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	}
>   	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>   out_zero_page:
> -	mm_inc_nr_ptes(dst_mm);
> -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	if (pgtable) {
> +		mm_inc_nr_ptes(dst_mm);
> +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	}
>   	pmdp_set_wrprotect(src_mm, addr, src_pmd);
>   	if (!userfaultfd_wp(dst_vma))
>   		pmd = pmd_clear_uffd_wp(pmd);
> @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   			zap_deposited_table(tlb->mm, pmd);
>   		spin_unlock(ptl);
>   	} else if (is_huge_zero_pmd(orig_pmd)) {
> -		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
> +		if (arch_needs_pgtable_deposit())
>   			zap_deposited_table(tlb->mm, pmd);
>   		spin_unlock(ptl);
>   	} else {
> @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   		}
>   
>   		if (folio_test_anon(folio)) {
> -			zap_deposited_table(tlb->mm, pmd);
> +			if (arch_needs_pgtable_deposit())
> +				zap_deposited_table(tlb->mm, pmd);
>   			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>   		} else {
>   			if (arch_needs_pgtable_deposit())
> @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>   			force_flush = true;
>   		VM_BUG_ON(!pmd_none(*new_pmd));
>   
> -		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
> +		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
> +		    arch_needs_pgtable_deposit()) {
>   			pgtable_t pgtable;
>   			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
>   			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>   	}
>   	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
>   
> -	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> -	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> +	if (arch_needs_pgtable_deposit()) {
> +		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> +		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> +	}
>   unlock_ptls:
>   	double_pt_unlock(src_ptl, dst_ptl);
>   	/* unblock rmap walks */
> @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>   #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>   
>   static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> -		unsigned long haddr, pmd_t *pmd)
> +		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
>   {
>   	struct mm_struct *mm = vma->vm_mm;
> -	pgtable_t pgtable;
>   	pmd_t _pmd, old_pmd;
>   	unsigned long addr;
>   	pte_t *pte;
> @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>   	 */
>   	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
>   
> -	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	} else {
> +		VM_BUG_ON(!pgtable);
> +		/*
> +		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
> +		 * being used in mm.
> +		 */
> +		mm_inc_nr_ptes(mm);
> +	}
>   	pmd_populate(mm, &_pmd, pgtable);
>   
>   	pte = pte_offset_map(&_pmd, haddr);
> @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>   }
>   
>   static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> -		unsigned long haddr, bool freeze)
> +		unsigned long haddr, bool freeze, pgtable_t pgtable)
>   {
>   	struct mm_struct *mm = vma->vm_mm;
>   	struct folio *folio;
>   	struct page *page;
> -	pgtable_t pgtable;
>   	pmd_t old_pmd, _pmd;
>   	bool soft_dirty, uffd_wp = false, young = false, write = false;
>   	bool anon_exclusive = false, dirty = false;
> @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   		 */
>   		if (arch_needs_pgtable_deposit())
>   			zap_deposited_table(mm, pmd);
> +		if (pgtable)
> +			pte_free(mm, pgtable);
>   		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
>   			return;
>   		if (unlikely(pmd_is_migration_entry(old_pmd))) {
> @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   		 * small page also write protected so it does not seems useful
>   		 * to invalidate secondary mmu at this time.
>   		 */
> -		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> +		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
>   	}
>   
>   	if (pmd_is_migration_entry(*pmd)) {
> @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	 * Withdraw the table only after we mark the pmd entry invalid.
>   	 * This's critical for some architectures (Power).
>   	 */
> -	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	} else {
> +		VM_BUG_ON(!pgtable);
> +		/*
> +		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
> +		 * being used in mm.
> +		 */
> +		mm_inc_nr_ptes(mm);
> +	}
>   	pmd_populate(mm, &_pmd, pgtable);
>   
>   	pte = pte_offset_map(&_pmd, haddr);
> @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   }
>   
>   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> -			   pmd_t *pmd, bool freeze)
> +			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
>   {
>   	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>   	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
> -		__split_huge_pmd_locked(vma, pmd, address, freeze);
> +		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
> +	else if (pgtable)
> +		pte_free(vma->vm_mm, pgtable);
>   }
>   
>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   {
>   	spinlock_t *ptl;
>   	struct mmu_notifier_range range;
> +	pgtable_t pgtable = NULL;
>   
>   	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>   				address & HPAGE_PMD_MASK,
>   				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
>   	mmu_notifier_invalidate_range_start(&range);
> +
> +	/* allocate pagetable before acquiring pmd lock */
> +	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(vma->vm_mm);
> +		if (!pgtable) {
> +			mmu_notifier_invalidate_range_end(&range);

When I last looked at this, I thought the clean thing to do is to let 
__split_huge_pmd() and friends return an error.

Let's take a look at walk_pmd_range() as one example:

if (walk->vma)
	split_huge_pmd(walk->vma, pmd, addr);
else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
	continue;

err = walk_pte_range(pmd, addr, next, walk);

Where walk_pte_range() just does a pte_offset_map_lock.

	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);

But if that fails (as the remapping failed), we will silently skip this 
range.

I don't think silently skipping is the right thing to do.

So I would think that all splitting functions have to be taught to 
return an error and handle it accordingly. Then we can actually start 
returning errors.
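
The error-propagation idea can be sketched in userspace as follows. This
is only a mock under stated assumptions: struct mock_pmd,
mock_pte_alloc_one(), mock_split_huge_pmd(), mock_walk_pmd() and the
mock_alloc_fails switch are all hypothetical names invented for this
example, and the real split and walker paths involve locking and
notifier handling that is omitted here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace mock: a "PMD" either maps a huge page or points at a PTE table. */
struct mock_pmd {
	bool huge;
	void *pte_table;
};

/* Hypothetical switch to simulate an order-0 allocation failure. */
static bool mock_alloc_fails;

static void *mock_pte_alloc_one(void)
{
	return mock_alloc_fails ? NULL : calloc(512, sizeof(unsigned long));
}

/* Hypothetical split that reports failure instead of returning void. */
static int mock_split_huge_pmd(struct mock_pmd *pmd)
{
	void *pgtable;

	if (!pmd->huge)
		return 0;		/* nothing to split */

	pgtable = mock_pte_alloc_one();
	if (!pgtable)
		return -ENOMEM;		/* PMD stays huge; the caller must handle it */

	pmd->pte_table = pgtable;
	pmd->huge = false;
	return 0;
}

/* Walker in the style of walk_pmd_range(): propagate the error, do not skip. */
static int mock_walk_pmd(struct mock_pmd *pmd)
{
	int err = mock_split_huge_pmd(pmd);

	if (err)
		return err;
	/* A real walker would now descend into the PTE table. */
	return 0;
}
```

The point of the sketch is only that the caller sees -ENOMEM and the
mapping is left intact, rather than the range being skipped without any
indication.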

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:31     ` Usama Arif
@ 2026-02-11 13:36       ` David Hildenbrand (Arm)
  2026-02-11 13:42         ` Usama Arif
  2026-02-11 13:38       ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:36 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 2/11/26 14:31, Usama Arif wrote:
> 
> 
> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>> On 2/11/26 13:49, Usama Arif wrote:
>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>> This enables monitoring of split failures due to memory pressure after
>>> the lazy PTE page table allocation change.
>>>
>>> The counter is incremented in three places:
>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>
>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>
>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>> ---
>>>    include/linux/vm_event_item.h | 1 +
>>>    mm/huge_memory.c              | 1 +
>>>    mm/rmap.c                     | 3 +++
>>>    mm/vmstat.c                   | 1 +
>>>    4 files changed, 6 insertions(+)
>>>
>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>> index 22a139f82d75f..827c9a8c251de 100644
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>            THP_DEFERRED_SPLIT_PAGE,
>>>            THP_UNDERUSED_SPLIT_PAGE,
>>>            THP_SPLIT_PMD,
>>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>
>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>
> 
> Makes sense. This was just a patch I was using for testing and I wanted to share.
> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED
> as suggested and we can use for future split failures (hopefully none).

I guess it might be reasonable to have because I am sure it will fail at 
some point and maybe provoke weird issues we didn't think of. In that 
case, having an indication that splitting failed at some point might be 
reasonable.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 13:25   ` David Hildenbrand (Arm)
@ 2026-02-11 13:38     ` Usama Arif
  2026-02-12 12:13     ` Ritesh Harjani
  1 sibling, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:38 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
	linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev



On 11/02/2026 13:25, David Hildenbrand (Arm) wrote:
> CCing ppc folks
> 
> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other usecase, which include allocating the pagetables
>> required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, and only when radix hash tables are not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we can completely remove the pagetable
>> deposit code and this commit would mostly have been a code cleanup patch,
>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>> the deposited page table and pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
> 
> Is there a way to remove this? It's always been a confusing hack, now it's unpleasant to have around :)


I spent some time researching this (I haven't worked with PowerPC before)
as I really wanted to get rid of all the pre-deposit code. I can't really see a
way without removing PMD THP support. I was going to CC the PowerPC maintainers
but I see that you already did!

> 
> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 copied generic pgtable_trans_huge_deposit() hurts my belly.
> 
> 
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
> 

Yes that is what I found as well.

> So one obvious solution: remove PMD THP support for hash MMUs along with all this hacky deposit code.
> 

I would be happy with that!

> 
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar checks need to be wrapped in a reasonable helper and likely this all needs to get cleaned up further.

Ack. The code will definitely look a lot cleaner and won't have much of this if we decide to remove
PMD THP support for hash MMU.

> 
> The implementation if the generic pgtable_trans_huge_deposit and the radix handlers etc must be removed. If any code would trigger them it would be a bug.
> 
> If we have to keep this around, pgtable_trans_huge_deposit() should likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there will not be generic support for it.
> 

Ack.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:31     ` Usama Arif
  2026-02-11 13:36       ` David Hildenbrand (Arm)
@ 2026-02-11 13:38       ` David Hildenbrand (Arm)
  2026-02-11 13:43         ` Usama Arif
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:38 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 2/11/26 14:31, Usama Arif wrote:
> 
> 
> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>> On 2/11/26 13:49, Usama Arif wrote:
>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>> This enables monitoring of split failures due to memory pressure after
>>> the lazy PTE page table allocation change.
>>>
>>> The counter is incremented in three places:
>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>
>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>
>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>> ---
>>>    include/linux/vm_event_item.h | 1 +
>>>    mm/huge_memory.c              | 1 +
>>>    mm/rmap.c                     | 3 +++
>>>    mm/vmstat.c                   | 1 +
>>>    4 files changed, 6 insertions(+)
>>>
>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>> index 22a139f82d75f..827c9a8c251de 100644
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>            THP_DEFERRED_SPLIT_PAGE,
>>>            THP_UNDERUSED_SPLIT_PAGE,
>>>            THP_SPLIT_PMD,
>>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>
>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>
> 
> Makes sense. This was just a patch I was using for testing and I wanted to share.
> It was always 0 as I couldn't get split to fail :) But I can rename it to THP_SPLIT_PMD_FAILED
> as suggested and we can use it for future split failures (hopefully none).

Btw, you can use the allocation fault injection framework to find weird 
issues, if you haven't heard of that yet.
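
To illustrate the idea only: the kernel framework itself is configured at
runtime and is not reproduced here. The following is a minimal userspace C
sketch of what injecting allocation failures buys you, namely forcing the
rarely taken failure path to run. All names (inject_state, alloc_table_model,
split_model) are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical injection state; not the kernel's fault injection type. */
struct inject_state {
	unsigned int interval;	/* fail every interval-th call, 0 disables */
	unsigned int calls;	/* allocation attempts seen so far */
};

/* Return true when this allocation attempt should be forced to fail. */
static bool should_fail_model(struct inject_state *st)
{
	st->calls++;
	return st->interval && (st->calls % st->interval) == 0;
}

/* Stand-in for pte_alloc_one(): returns NULL on an injected failure. */
static void *alloc_table_model(struct inject_state *st)
{
	if (should_fail_model(st))
		return NULL;
	return calloc(512, sizeof(unsigned long));
}

/* Stand-in for one split attempt; counts failures like a vmstat counter. */
static bool split_model(struct inject_state *st, unsigned long *failed)
{
	void *table = alloc_table_model(st);

	if (!table) {
		(*failed)++;
		return false;
	}
	free(table);	/* a real split would install the table instead */
	return true;
}
```

Calling split_model() in a loop with interval set to 3 makes every third
attempt fail, so the failure counter becomes non-zero without needing real
memory pressure.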

-- 
Cheers,

David



* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:36       ` David Hildenbrand (Arm)
@ 2026-02-11 13:42         ` Usama Arif
  0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:42 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
	linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 11/02/2026 13:36, David Hildenbrand (Arm) wrote:
> On 2/11/26 14:31, Usama Arif wrote:
>>
>>
>> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>>> On 2/11/26 13:49, Usama Arif wrote:
>>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>>> This enables monitoring of split failures due to memory pressure after
>>>> the lazy PTE page table allocation change.
>>>>
>>>> The counter is incremented in three places:
>>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>>
>>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>>    include/linux/vm_event_item.h | 1 +
>>>>    mm/huge_memory.c              | 1 +
>>>>    mm/rmap.c                     | 3 +++
>>>>    mm/vmstat.c                   | 1 +
>>>>    4 files changed, 6 insertions(+)
>>>>
>>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>>> index 22a139f82d75f..827c9a8c251de 100644
>>>> --- a/include/linux/vm_event_item.h
>>>> +++ b/include/linux/vm_event_item.h
>>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>>            THP_DEFERRED_SPLIT_PAGE,
>>>>            THP_UNDERUSED_SPLIT_PAGE,
>>>>            THP_SPLIT_PMD,
>>>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>>
>>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>>
>>
>> Makes sense. This was just a patch I was using for testing and I wanted to share.
>> It was always 0 as I couldn't get split to fail :) But I can rename it to THP_SPLIT_PMD_FAILED
>> as suggested and we can use it for future split failures (hopefully none).
> 
> I guess it might be reasonable to have because I am sure it will fail at some point and maybe provoke weird issues we didn't think of. In that case, having an indication that splitting failed at some point might be reasonable.
> 
ack




* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:38       ` David Hildenbrand (Arm)
@ 2026-02-11 13:43         ` Usama Arif
  0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:43 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
	linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 11/02/2026 13:38, David Hildenbrand (Arm) wrote:
> On 2/11/26 14:31, Usama Arif wrote:
>>
>>
>> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>>> On 2/11/26 13:49, Usama Arif wrote:
>>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>>> This enables monitoring of split failures due to memory pressure after
>>>> the lazy PTE page table allocation change.
>>>>
>>>> The counter is incremented in three places:
>>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>>
>>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>>    include/linux/vm_event_item.h | 1 +
>>>>    mm/huge_memory.c              | 1 +
>>>>    mm/rmap.c                     | 3 +++
>>>>    mm/vmstat.c                   | 1 +
>>>>    4 files changed, 6 insertions(+)
>>>>
>>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>>> index 22a139f82d75f..827c9a8c251de 100644
>>>> --- a/include/linux/vm_event_item.h
>>>> +++ b/include/linux/vm_event_item.h
>>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>>            THP_DEFERRED_SPLIT_PAGE,
>>>>            THP_UNDERUSED_SPLIT_PAGE,
>>>>            THP_SPLIT_PMD,
>>>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>>
>>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>>
>>
>> Makes sense. This was just a patch I was using for testing and I wanted to share.
>> It was always 0 as I couldn't get split to fail :) But I can rename it to THP_SPLIT_PMD_FAILED
>> as suggested and we can use it for future split failures (hopefully none).
> 
> Btw, you can use the allocation fault injection framework to find weird issues, if you haven't heard of that yet.
> 

This looks very interesting, thanks! Let me have a look.



* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 13:35   ` David Hildenbrand (Arm)
@ 2026-02-11 13:46     ` Kiryl Shutsemau
  2026-02-11 13:47     ` Usama Arif
  1 sibling, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2026-02-11 13:46 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On Wed, Feb 11, 2026 at 02:35:07PM +0100, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
> > When the kernel creates a PMD-level THP mapping for anonymous pages,
> > it pre-allocates a PTE page table and deposits it via
> > pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> > PMD split or zap. The rationale was that split must not fail—if the
> > kernel decides to split a THP, it needs a PTE table to populate.
> > 
> > However, every anon THP wastes 4KB (one page table page) that sits
> > unused in the deposit list for the lifetime of the mapping. On systems
> > with many THPs, this adds up to significant memory waste. The original
> > rationale is also not an issue. It is ok for split to fail, and if the
> > kernel can't find an order 0 allocation for split, there are much bigger
> > problems. On large servers where you can easily have 100s of GBs of THPs,
> > the memory usage for these tables is 200M per 100G. This memory could be
> > used for any other use case, including allocating the page tables
> > required during split.
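
As a quick check of the 200M-per-100G figure quoted above, here is a small
arithmetic sketch. It assumes 2M PMD-sized THPs and 4K page table pages (the
common x86-64 configuration; other page sizes give different numbers). The
macro and function names are made up for illustration.

```c
#include <stdint.h>

#define PTE_TABLE_BYTES	(4ULL << 10)	/* assumed: one 4K page table page */
#define THP_BYTES	(2ULL << 20)	/* assumed: 2M PMD-sized THP */

/* Bytes held in deposited PTE tables for thp_bytes of PMD-mapped anon THPs. */
static uint64_t deposited_table_bytes(uint64_t thp_bytes)
{
	/* one deposited table per THP */
	return (thp_bytes / THP_BYTES) * PTE_TABLE_BYTES;
}
```

With 100G of THPs this is 51200 THPs times 4K, which is 200M.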
> > 
> > This patch removes the pre-deposit for anonymous pages on architectures
> > where arch_needs_pgtable_deposit() returns false (every arch apart from
> > powerpc with the hash MMU, i.e. when radix is not enabled) and allocates
> > the PTE table lazily—only when a split actually occurs. The split path
> > is modified to accept a caller-provided page table.
> > 
> > PowerPC exception:
> > 
> > It would have been great if we could completely remove the pagetable
> > deposit code, making this commit mostly a code cleanup patch.
> > Unfortunately, the PowerPC hash MMU stores hash slot information in
> > the deposited page table, so pre-deposit is necessary. All deposit/
> > withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> > behavior is unchanged with this patch. On a better note,
> > arch_needs_pgtable_deposit will always evaluate to false at compile time
> > on non PowerPC architectures and the pre-deposit code will not be
> > compiled in.
> > 
> > Why Split Failures Are Safe:
> > 
> > If a system is under such severe memory pressure that even a 4K allocation
> > for a PTE table fails, there are far greater problems than a THP split
> > being delayed. The OOM killer will likely intervene before this becomes an
> > issue.
> > When pte_alloc_one() fails because it cannot allocate a 4K page,
> > the PMD split is aborted and the THP remains intact. I could not get split
> > to fail, as it is very difficult to make an order-0 allocation fail.
> > Code analysis of what would happen if it does:
> > 
> > - mprotect(): If split fails in change_pmd_range(), it will fall back
> > to change_pte_range(), which will return an error that causes the
> > whole function to be retried.
> > 
> > - munmap() (partial THP range): zap_pte_range() returns early when
> > pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
> > For full THP range, zap_huge_pmd() unmaps the entire PMD without
> > split.
> > 
> > - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> > onto the LRU, retried in next reclaim cycle.
> > 
> > - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
> > skips this folio, retried later.
> > 
> > - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
> > 
> > -  madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
> > try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
> > try_to_migrate() returns false, split_folio() returns -EAGAIN,
> > and madvise returns 0 (success), silently skipping the region. This
> > should be fine. madvise is only advisory and can fail for other
> > reasons as well.
> > 
> > Suggested-by: David Hildenbrand <david@kernel.org>
> > Signed-off-by: Usama Arif <usama.arif@linux.dev>
> > ---
> >   include/linux/huge_mm.h |   4 +-
> >   mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
> >   mm/khugepaged.c         |   7 +-
> >   mm/migrate_device.c     |  15 +++--
> >   mm/rmap.c               |  39 ++++++++++-
> >   5 files changed, 156 insertions(+), 53 deletions(-)
> > 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index a4d9f964dfdea..b21bb72a298c9 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
> >   }
> >   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > -			   pmd_t *pmd, bool freeze);
> > +			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
> >   bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> >   			   pmd_t *pmdp, struct folio *folio);
> >   void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
> > @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> >   		unsigned long address, bool freeze) {}
> >   static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
> >   					 unsigned long address, pmd_t *pmd,
> > -					 bool freeze) {}
> > +					 bool freeze, pgtable_t pgtable) {}
> >   static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
> >   					 unsigned long addr, pmd_t *pmdp,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 44ff8a648afd5..4c9a8d89fc8aa 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> >   	struct vm_area_struct *vma = vmf->vma;
> >   	struct folio *folio;
> > -	pgtable_t pgtable;
> > +	pgtable_t pgtable = NULL;
> >   	vm_fault_t ret = 0;
> >   	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
> >   	if (unlikely(!folio))
> >   		return VM_FAULT_FALLBACK;
> > -	pgtable = pte_alloc_one(vma->vm_mm);
> > -	if (unlikely(!pgtable)) {
> > -		ret = VM_FAULT_OOM;
> > -		goto release;
> > +	if (arch_needs_pgtable_deposit()) {
> > +		pgtable = pte_alloc_one(vma->vm_mm);
> > +		if (unlikely(!pgtable)) {
> > +			ret = VM_FAULT_OOM;
> > +			goto release;
> > +		}
> >   	}
> >   	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   		if (userfaultfd_missing(vma)) {
> >   			spin_unlock(vmf->ptl);
> >   			folio_put(folio);
> > -			pte_free(vma->vm_mm, pgtable);
> > +			if (pgtable)
> > +				pte_free(vma->vm_mm, pgtable);
> >   			ret = handle_userfault(vmf, VM_UFFD_MISSING);
> >   			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
> >   			return ret;
> >   		}
> > -		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> > +		if (pgtable) {
> > +			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
> > +						   pgtable);
> > +			mm_inc_nr_ptes(vma->vm_mm);
> > +		}
> >   		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
> > -		mm_inc_nr_ptes(vma->vm_mm);
> >   		spin_unlock(vmf->ptl);
> >   	}
> > @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
> >   	pmd_t entry;
> >   	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> >   	entry = pmd_mkspecial(entry);
> > -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +	if (pgtable) {
> > +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +		mm_inc_nr_ptes(mm);
> > +	}
> >   	set_pmd_at(mm, haddr, pmd, entry);
> > -	mm_inc_nr_ptes(mm);
> >   }
> >   vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> >   			!mm_forbids_zeropage(vma->vm_mm) &&
> >   			transparent_hugepage_use_zero_page()) {
> > -		pgtable_t pgtable;
> > +		pgtable_t pgtable = NULL;
> >   		struct folio *zero_folio;
> >   		vm_fault_t ret;
> > -		pgtable = pte_alloc_one(vma->vm_mm);
> > -		if (unlikely(!pgtable))
> > -			return VM_FAULT_OOM;
> > +		if (arch_needs_pgtable_deposit()) {
> > +			pgtable = pte_alloc_one(vma->vm_mm);
> > +			if (unlikely(!pgtable))
> > +				return VM_FAULT_OOM;
> > +		}
> >   		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
> >   		if (unlikely(!zero_folio)) {
> > -			pte_free(vma->vm_mm, pgtable);
> > +			if (pgtable)
> > +				pte_free(vma->vm_mm, pgtable);
> >   			count_vm_event(THP_FAULT_FALLBACK);
> >   			return VM_FAULT_FALLBACK;
> >   		}
> > @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   			ret = check_stable_address_space(vma->vm_mm);
> >   			if (ret) {
> >   				spin_unlock(vmf->ptl);
> > -				pte_free(vma->vm_mm, pgtable);
> > +				if (pgtable)
> > +					pte_free(vma->vm_mm, pgtable);
> >   			} else if (userfaultfd_missing(vma)) {
> >   				spin_unlock(vmf->ptl);
> > -				pte_free(vma->vm_mm, pgtable);
> > +				if (pgtable)
> > +					pte_free(vma->vm_mm, pgtable);
> >   				ret = handle_userfault(vmf, VM_UFFD_MISSING);
> >   				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
> >   			} else {
> > @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   			}
> >   		} else {
> >   			spin_unlock(vmf->ptl);
> > -			pte_free(vma->vm_mm, pgtable);
> > +			if (pgtable)
> > +				pte_free(vma->vm_mm, pgtable);
> >   		}
> >   		return ret;
> >   	}
> > @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
> >   	}
> >   	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> > -	mm_inc_nr_ptes(dst_mm);
> > -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > +	if (pgtable) {
> > +		mm_inc_nr_ptes(dst_mm);
> > +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > +	}
> >   	if (!userfaultfd_wp(dst_vma))
> >   		pmd = pmd_swp_clear_uffd_wp(pmd);
> >   	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	if (!vma_is_anonymous(dst_vma))
> >   		return 0;
> > -	pgtable = pte_alloc_one(dst_mm);
> > -	if (unlikely(!pgtable))
> > -		goto out;
> > +	if (arch_needs_pgtable_deposit()) {
> > +		pgtable = pte_alloc_one(dst_mm);
> > +		if (unlikely(!pgtable))
> > +			goto out;
> > +	}
> >   	dst_ptl = pmd_lock(dst_mm, dst_pmd);
> >   	src_ptl = pmd_lockptr(src_mm, src_pmd);
> > @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	}
> >   	if (unlikely(!pmd_trans_huge(pmd))) {
> > -		pte_free(dst_mm, pgtable);
> > +		if (pgtable)
> > +			pte_free(dst_mm, pgtable);
> >   		goto out_unlock;
> >   	}
> >   	/*
> > @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
> >   		/* Page maybe pinned: split and retry the fault on PTEs. */
> >   		folio_put(src_folio);
> > -		pte_free(dst_mm, pgtable);
> > +		if (pgtable)
> > +			pte_free(dst_mm, pgtable);
> >   		spin_unlock(src_ptl);
> >   		spin_unlock(dst_ptl);
> >   		__split_huge_pmd(src_vma, src_pmd, addr, false);
> > @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	}
> >   	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> >   out_zero_page:
> > -	mm_inc_nr_ptes(dst_mm);
> > -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > +	if (pgtable) {
> > +		mm_inc_nr_ptes(dst_mm);
> > +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > +	}
> >   	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> >   	if (!userfaultfd_wp(dst_vma))
> >   		pmd = pmd_clear_uffd_wp(pmd);
> > @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >   			zap_deposited_table(tlb->mm, pmd);
> >   		spin_unlock(ptl);
> >   	} else if (is_huge_zero_pmd(orig_pmd)) {
> > -		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
> > +		if (arch_needs_pgtable_deposit())
> >   			zap_deposited_table(tlb->mm, pmd);
> >   		spin_unlock(ptl);
> >   	} else {
> > @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >   		}
> >   		if (folio_test_anon(folio)) {
> > -			zap_deposited_table(tlb->mm, pmd);
> > +			if (arch_needs_pgtable_deposit())
> > +				zap_deposited_table(tlb->mm, pmd);
> >   			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> >   		} else {
> >   			if (arch_needs_pgtable_deposit())
> > @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> >   			force_flush = true;
> >   		VM_BUG_ON(!pmd_none(*new_pmd));
> > -		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
> > +		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
> > +		    arch_needs_pgtable_deposit()) {
> >   			pgtable_t pgtable;
> >   			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> >   			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> > @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> >   	}
> >   	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
> > -	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> > -	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> > +	if (arch_needs_pgtable_deposit()) {
> > +		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> > +		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> > +	}
> >   unlock_ptls:
> >   	double_pt_unlock(src_ptl, dst_ptl);
> >   	/* unblock rmap walks */
> > @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >   #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> >   static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > -		unsigned long haddr, pmd_t *pmd)
> > +		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
> >   {
> >   	struct mm_struct *mm = vma->vm_mm;
> > -	pgtable_t pgtable;
> >   	pmd_t _pmd, old_pmd;
> >   	unsigned long addr;
> >   	pte_t *pte;
> > @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >   	 */
> >   	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
> > -	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	if (arch_needs_pgtable_deposit()) {
> > +		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	} else {
> > +		VM_BUG_ON(!pgtable);
> > +		/*
> > +		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
> > +		 * being used in mm.
> > +		 */
> > +		mm_inc_nr_ptes(mm);
> > +	}
> >   	pmd_populate(mm, &_pmd, pgtable);
> >   	pte = pte_offset_map(&_pmd, haddr);
> > @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >   }
> >   static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > -		unsigned long haddr, bool freeze)
> > +		unsigned long haddr, bool freeze, pgtable_t pgtable)
> >   {
> >   	struct mm_struct *mm = vma->vm_mm;
> >   	struct folio *folio;
> >   	struct page *page;
> > -	pgtable_t pgtable;
> >   	pmd_t old_pmd, _pmd;
> >   	bool soft_dirty, uffd_wp = false, young = false, write = false;
> >   	bool anon_exclusive = false, dirty = false;
> > @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   		 */
> >   		if (arch_needs_pgtable_deposit())
> >   			zap_deposited_table(mm, pmd);
> > +		if (pgtable)
> > +			pte_free(mm, pgtable);
> >   		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
> >   			return;
> >   		if (unlikely(pmd_is_migration_entry(old_pmd))) {
> > @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   		 * small page also write protected so it does not seems useful
> >   		 * to invalidate secondary mmu at this time.
> >   		 */
> > -		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> > +		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
> >   	}
> >   	if (pmd_is_migration_entry(*pmd)) {
> > @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   	 * Withdraw the table only after we mark the pmd entry invalid.
> >   	 * This's critical for some architectures (Power).
> >   	 */
> > -	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	if (arch_needs_pgtable_deposit()) {
> > +		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	} else {
> > +		VM_BUG_ON(!pgtable);
> > +		/*
> > +		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
> > +		 * being used in mm.
> > +		 */
> > +		mm_inc_nr_ptes(mm);
> > +	}
> >   	pmd_populate(mm, &_pmd, pgtable);
> >   	pte = pte_offset_map(&_pmd, haddr);
> > @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   }
> >   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > -			   pmd_t *pmd, bool freeze)
> > +			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
> >   {
> >   	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> >   	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
> > -		__split_huge_pmd_locked(vma, pmd, address, freeze);
> > +		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
> > +	else if (pgtable)
> > +		pte_free(vma->vm_mm, pgtable);
> >   }
> >   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >   {
> >   	spinlock_t *ptl;
> >   	struct mmu_notifier_range range;
> > +	pgtable_t pgtable = NULL;
> >   	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> >   				address & HPAGE_PMD_MASK,
> >   				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
> >   	mmu_notifier_invalidate_range_start(&range);
> > +
> > +	/* allocate pagetable before acquiring pmd lock */
> > +	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
> > +		pgtable = pte_alloc_one(vma->vm_mm);
> > +		if (!pgtable) {
> > +			mmu_notifier_invalidate_range_end(&range);
> 
> When I last looked at this, I thought the clean thing to do is to let
> __split_huge_pmd() and friends return an error.
> 
> Let's take a look at walk_pmd_range() as one example:
> 
> if (walk->vma)
> 	split_huge_pmd(walk->vma, pmd, addr);
> else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> 	continue;
> 
> err = walk_pte_range(pmd, addr, next, walk);
> 
> Where walk_pte_range() just does a pte_offset_map_lock.
> 
> 	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> 
> But if that fails (as the remapping failed), we will silently skip this
> range.
> 
> I don't think silently skipping is the right thing to do.
> 
> So I would think that all splitting functions have to be taught to return an
> error and handle it accordingly. Then we can actually start returning
> errors.

Yeah, the silent PMD split failure confuses me too. It has to be
communicated to the caller cleanly.

It is also an opportunity to audit all callers and check if they can
deal with the failure.
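
A rough userspace C model of that idea, loosely following the
walk_pmd_range() example quoted above: the split returns an error and the
walker propagates it instead of silently skipping the range. The types and
function names (pmd_model, split_pmd_model, walk_range_model) are
hypothetical and this is not the proposed kernel change.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one PMD slot; not a kernel type. */
struct pmd_model {
	bool huge;		/* still mapped as one huge entry */
	bool alloc_fails;	/* simulate the PTE table allocation failing */
	bool visited;		/* set once the PTE-level walk has run */
};

/* Split that reports failure to the caller instead of returning void. */
static int split_pmd_model(struct pmd_model *pmd)
{
	if (!pmd->huge)
		return 0;	/* nothing to split */
	if (pmd->alloc_fails)
		return -ENOMEM;	/* huge mapping is left intact */
	pmd->huge = false;
	return 0;
}

/* Walk all slots; stop and propagate the first split error. */
static int walk_range_model(struct pmd_model *pmds, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = split_pmd_model(&pmds[i]);

		if (err)
			return err;	/* no silent skip of this range */
		pmds[i].visited = true;	/* stands in for the PTE-level walk */
	}
	return 0;
}
```

Here the caller sees -ENOMEM and can decide whether to retry or fail, which
is the decision each real call site would need to be audited for.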

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 13:35   ` David Hildenbrand (Arm)
  2026-02-11 13:46     ` Kiryl Shutsemau
@ 2026-02-11 13:47     ` Usama Arif
  1 sibling, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:47 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
	linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 11/02/2026 13:35, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other use case, including allocating the page tables
>> required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc with the hash MMU, i.e. when radix is not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we could completely remove the pagetable
>> deposit code, making this commit mostly a code cleanup patch.
>> Unfortunately, the PowerPC hash MMU stores hash slot information in
>> the deposited page table, so pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
>>
>> Why Split Failures Are Safe:
>>
>> If a system is under such severe memory pressure that even a 4K allocation
>> for a PTE table fails, there are far greater problems than a THP split
>> being delayed. The OOM killer will likely intervene before this becomes an
>> issue.
>> When pte_alloc_one() fails because it cannot allocate a 4K page,
>> the PMD split is aborted and the THP remains intact. I could not get split
>> to fail, as it is very difficult to make an order-0 allocation fail.
>> Code analysis of what would happen if it does:
>>
>> - mprotect(): If split fails in change_pmd_range(), it will fall back
>> to change_pte_range(), which will return an error that causes the
>> whole function to be retried.
>>
>> - munmap() (partial THP range): zap_pte_range() returns early when
>> pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
>> For full THP range, zap_huge_pmd() unmaps the entire PMD without
>> split.
>>
>> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
>> onto the LRU, retried in next reclaim cycle.
>>
>> - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
>> skips this folio, retried later.
>>
>> - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
>>
>> -  madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
>> try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
>> try_to_migrate() returns false, split_folio() returns -EAGAIN,
>> and madvise returns 0 (success), silently skipping the region. This
>> should be fine. madvise is only advisory and can fail for other
>> reasons as well.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>   include/linux/huge_mm.h |   4 +-
>>   mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
>>   mm/khugepaged.c         |   7 +-
>>   mm/migrate_device.c     |  15 +++--
>>   mm/rmap.c               |  39 ++++++++++-
>>   5 files changed, 156 insertions(+), 53 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index a4d9f964dfdea..b21bb72a298c9 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
>>   }
>>     void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> -               pmd_t *pmd, bool freeze);
>> +               pmd_t *pmd, bool freeze, pgtable_t pgtable);
>>   bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>>                  pmd_t *pmdp, struct folio *folio);
>>   void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
>> @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>>           unsigned long address, bool freeze) {}
>>   static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
>>                        unsigned long address, pmd_t *pmd,
>> -                     bool freeze) {}
>> +                     bool freeze, pgtable_t pgtable) {}
>>     static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>>                        unsigned long addr, pmd_t *pmdp,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 44ff8a648afd5..4c9a8d89fc8aa 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>       unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>       struct vm_area_struct *vma = vmf->vma;
>>       struct folio *folio;
>> -    pgtable_t pgtable;
>> +    pgtable_t pgtable = NULL;
>>       vm_fault_t ret = 0;
>>         folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
>>       if (unlikely(!folio))
>>           return VM_FAULT_FALLBACK;
>>   -    pgtable = pte_alloc_one(vma->vm_mm);
>> -    if (unlikely(!pgtable)) {
>> -        ret = VM_FAULT_OOM;
>> -        goto release;
>> +    if (arch_needs_pgtable_deposit()) {
>> +        pgtable = pte_alloc_one(vma->vm_mm);
>> +        if (unlikely(!pgtable)) {
>> +            ret = VM_FAULT_OOM;
>> +            goto release;
>> +        }
>>       }
>>         vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>> @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>           if (userfaultfd_missing(vma)) {
>>               spin_unlock(vmf->ptl);
>>               folio_put(folio);
>> -            pte_free(vma->vm_mm, pgtable);
>> +            if (pgtable)
>> +                pte_free(vma->vm_mm, pgtable);
>>               ret = handle_userfault(vmf, VM_UFFD_MISSING);
>>               VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>>               return ret;
>>           }
>> -        pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
>> +        if (pgtable) {
>> +            pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
>> +                           pgtable);
>> +            mm_inc_nr_ptes(vma->vm_mm);
>> +        }
>>           map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
>> -        mm_inc_nr_ptes(vma->vm_mm);
>>           spin_unlock(vmf->ptl);
>>       }
>>   @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>>       pmd_t entry;
>>       entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
>>       entry = pmd_mkspecial(entry);
>> -    pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +    if (pgtable) {
>> +        pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +        mm_inc_nr_ptes(mm);
>> +    }
>>       set_pmd_at(mm, haddr, pmd, entry);
>> -    mm_inc_nr_ptes(mm);
>>   }
>>     vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>       if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>>               !mm_forbids_zeropage(vma->vm_mm) &&
>>               transparent_hugepage_use_zero_page()) {
>> -        pgtable_t pgtable;
>> +        pgtable_t pgtable = NULL;
>>           struct folio *zero_folio;
>>           vm_fault_t ret;
>>   -        pgtable = pte_alloc_one(vma->vm_mm);
>> -        if (unlikely(!pgtable))
>> -            return VM_FAULT_OOM;
>> +        if (arch_needs_pgtable_deposit()) {
>> +            pgtable = pte_alloc_one(vma->vm_mm);
>> +            if (unlikely(!pgtable))
>> +                return VM_FAULT_OOM;
>> +        }
>>           zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
>>           if (unlikely(!zero_folio)) {
>> -            pte_free(vma->vm_mm, pgtable);
>> +            if (pgtable)
>> +                pte_free(vma->vm_mm, pgtable);
>>               count_vm_event(THP_FAULT_FALLBACK);
>>               return VM_FAULT_FALLBACK;
>>           }
>> @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>               ret = check_stable_address_space(vma->vm_mm);
>>               if (ret) {
>>                   spin_unlock(vmf->ptl);
>> -                pte_free(vma->vm_mm, pgtable);
>> +                if (pgtable)
>> +                    pte_free(vma->vm_mm, pgtable);
>>               } else if (userfaultfd_missing(vma)) {
>>                   spin_unlock(vmf->ptl);
>> -                pte_free(vma->vm_mm, pgtable);
>> +                if (pgtable)
>> +                    pte_free(vma->vm_mm, pgtable);
>>                   ret = handle_userfault(vmf, VM_UFFD_MISSING);
>>                   VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>>               } else {
>> @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>               }
>>           } else {
>>               spin_unlock(vmf->ptl);
>> -            pte_free(vma->vm_mm, pgtable);
>> +            if (pgtable)
>> +                pte_free(vma->vm_mm, pgtable);
>>           }
>>           return ret;
>>       }
>> @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
>>       }
>>         add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>> -    mm_inc_nr_ptes(dst_mm);
>> -    pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> +    if (pgtable) {
>> +        mm_inc_nr_ptes(dst_mm);
>> +        pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> +    }
>>       if (!userfaultfd_wp(dst_vma))
>>           pmd = pmd_swp_clear_uffd_wp(pmd);
>>       set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>> @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       if (!vma_is_anonymous(dst_vma))
>>           return 0;
>>   -    pgtable = pte_alloc_one(dst_mm);
>> -    if (unlikely(!pgtable))
>> -        goto out;
>> +    if (arch_needs_pgtable_deposit()) {
>> +        pgtable = pte_alloc_one(dst_mm);
>> +        if (unlikely(!pgtable))
>> +            goto out;
>> +    }
>>         dst_ptl = pmd_lock(dst_mm, dst_pmd);
>>       src_ptl = pmd_lockptr(src_mm, src_pmd);
>> @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       }
>>         if (unlikely(!pmd_trans_huge(pmd))) {
>> -        pte_free(dst_mm, pgtable);
>> +        if (pgtable)
>> +            pte_free(dst_mm, pgtable);
>>           goto out_unlock;
>>       }
>>       /*
>> @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
>>           /* Page maybe pinned: split and retry the fault on PTEs. */
>>           folio_put(src_folio);
>> -        pte_free(dst_mm, pgtable);
>> +        if (pgtable)
>> +            pte_free(dst_mm, pgtable);
>>           spin_unlock(src_ptl);
>>           spin_unlock(dst_ptl);
>>           __split_huge_pmd(src_vma, src_pmd, addr, false);
>> @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       }
>>       add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>>   out_zero_page:
>> -    mm_inc_nr_ptes(dst_mm);
>> -    pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> +    if (pgtable) {
>> +        mm_inc_nr_ptes(dst_mm);
>> +        pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> +    }
>>       pmdp_set_wrprotect(src_mm, addr, src_pmd);
>>       if (!userfaultfd_wp(dst_vma))
>>           pmd = pmd_clear_uffd_wp(pmd);
>> @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>               zap_deposited_table(tlb->mm, pmd);
>>           spin_unlock(ptl);
>>       } else if (is_huge_zero_pmd(orig_pmd)) {
>> -        if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
>> +        if (arch_needs_pgtable_deposit())
>>               zap_deposited_table(tlb->mm, pmd);
>>           spin_unlock(ptl);
>>       } else {
>> @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>           }
>>             if (folio_test_anon(folio)) {
>> -            zap_deposited_table(tlb->mm, pmd);
>> +            if (arch_needs_pgtable_deposit())
>> +                zap_deposited_table(tlb->mm, pmd);
>>               add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>>           } else {
>>               if (arch_needs_pgtable_deposit())
>> @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>>               force_flush = true;
>>           VM_BUG_ON(!pmd_none(*new_pmd));
>>   -        if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
>> +        if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
>> +            arch_needs_pgtable_deposit()) {
>>               pgtable_t pgtable;
>>               pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
>>               pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
>> @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>>       }
>>       set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
>>   -    src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
>> -    pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
>> +    if (arch_needs_pgtable_deposit()) {
>> +        src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
>> +        pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
>> +    }
>>   unlock_ptls:
>>       double_pt_unlock(src_ptl, dst_ptl);
>>       /* unblock rmap walks */
>> @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>   #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>     static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> -        unsigned long haddr, pmd_t *pmd)
>> +        unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
>>   {
>>       struct mm_struct *mm = vma->vm_mm;
>> -    pgtable_t pgtable;
>>       pmd_t _pmd, old_pmd;
>>       unsigned long addr;
>>       pte_t *pte;
>> @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>        */
>>       old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
>>   -    pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +    if (arch_needs_pgtable_deposit()) {
>> +        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +    } else {
>> +        VM_BUG_ON(!pgtable);
>> +        /*
>> +         * Account for the freshly allocated (in __split_huge_pmd) pgtable
>> +         * being used in mm.
>> +         */
>> +        mm_inc_nr_ptes(mm);
>> +    }
>>       pmd_populate(mm, &_pmd, pgtable);
>>         pte = pte_offset_map(&_pmd, haddr);
>> @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>   }
>>     static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> -        unsigned long haddr, bool freeze)
>> +        unsigned long haddr, bool freeze, pgtable_t pgtable)
>>   {
>>       struct mm_struct *mm = vma->vm_mm;
>>       struct folio *folio;
>>       struct page *page;
>> -    pgtable_t pgtable;
>>       pmd_t old_pmd, _pmd;
>>       bool soft_dirty, uffd_wp = false, young = false, write = false;
>>       bool anon_exclusive = false, dirty = false;
>> @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>            */
>>           if (arch_needs_pgtable_deposit())
>>               zap_deposited_table(mm, pmd);
>> +        if (pgtable)
>> +            pte_free(mm, pgtable);
>>           if (!vma_is_dax(vma) && vma_is_special_huge(vma))
>>               return;
>>           if (unlikely(pmd_is_migration_entry(old_pmd))) {
>> @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>            * small page also write protected so it does not seems useful
>>            * to invalidate secondary mmu at this time.
>>            */
>> -        return __split_huge_zero_page_pmd(vma, haddr, pmd);
>> +        return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
>>       }
>>         if (pmd_is_migration_entry(*pmd)) {
>> @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>        * Withdraw the table only after we mark the pmd entry invalid.
>>        * This's critical for some architectures (Power).
>>        */
>> -    pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +    if (arch_needs_pgtable_deposit()) {
>> +        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +    } else {
>> +        VM_BUG_ON(!pgtable);
>> +        /*
>> +         * Account for the freshly allocated (in __split_huge_pmd) pgtable
>> +         * being used in mm.
>> +         */
>> +        mm_inc_nr_ptes(mm);
>> +    }
>>       pmd_populate(mm, &_pmd, pgtable);
>>         pte = pte_offset_map(&_pmd, haddr);
>> @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>   }
>>     void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> -               pmd_t *pmd, bool freeze)
>> +               pmd_t *pmd, bool freeze, pgtable_t pgtable)
>>   {
>>       VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>>       if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
>> -        __split_huge_pmd_locked(vma, pmd, address, freeze);
>> +        __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
>> +    else if (pgtable)
>> +        pte_free(vma->vm_mm, pgtable);
>>   }
>>     void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>   {
>>       spinlock_t *ptl;
>>       struct mmu_notifier_range range;
>> +    pgtable_t pgtable = NULL;
>>         mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>>                   address & HPAGE_PMD_MASK,
>>                   (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
>>       mmu_notifier_invalidate_range_start(&range);
>> +
>> +    /* allocate pagetable before acquiring pmd lock */
>> +    if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
>> +        pgtable = pte_alloc_one(vma->vm_mm);
>> +        if (!pgtable) {
>> +            mmu_notifier_invalidate_range_end(&range);
> 
> When I last looked at this, I thought the clean thing to do is to let __split_huge_pmd() and friends return an error.
> 
> Let's take a look at walk_pmd_range() as one example:
> 
> if (walk->vma)
>     split_huge_pmd(walk->vma, pmd, addr);
> else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
>     continue;
> 
> err = walk_pte_range(pmd, addr, next, walk);
> 
> Where walk_pte_range() just does a pte_offset_map_lock.
> 
>     pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> 
> But if that fails (as the remapping failed), we will silently skip this range.
> 
> I don't think silently skipping is the right thing to do.
> 
> So I would think that all splitting functions have to be taught to return an error and handle it accordingly. Then we can actually start returning errors.
> 

Ack. This was one of the cases where we would try again if needed.
I did a manual code analysis, which I included at the end of the commit
message, but agreed, it's best to return an error and handle it accordingly.
I will look into doing this for the next revision.
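
A minimal userspace sketch of that direction, assuming the split helper is
changed to return an int. Every mock_* name below is a hypothetical stand-in
for illustration, not an existing kernel symbol, and locking and mmu
notifiers are left out:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PMD entry; not the real kernel type. */
struct mock_pmd {
	bool huge;        /* true while the range is mapped by one PMD leaf */
	void *pte_table;  /* PTE table installed by a successful split */
};

/* Hook to simulate an order-0 allocation failure. */
static bool mock_fail_alloc;

static void *mock_pte_alloc_one(void)
{
	if (mock_fail_alloc)
		return NULL;
	return calloc(512, sizeof(unsigned long));
}

/* Split that reports failure instead of leaving the PMD silently intact. */
static int mock_split_huge_pmd(struct mock_pmd *pmd)
{
	void *table;

	if (!pmd->huge)
		return 0;                 /* nothing to split */

	table = mock_pte_alloc_one();     /* allocate before taking any lock */
	if (!table)
		return -ENOMEM;

	pmd->pte_table = table;
	pmd->huge = false;
	return 0;
}

/* Walker that propagates the error rather than skipping the range. */
static int mock_walk_pmd(struct mock_pmd *pmd, int *visited)
{
	int err;

	assert(pmd && visited);
	err = mock_split_huge_pmd(pmd);
	if (err)
		return err;
	(*visited)++;                     /* stands in for walk_pte_range() */
	return 0;
}
```

With the allocation forced to fail, mock_walk_pmd() returns -ENOMEM and the
PMD stays huge, so the caller sees the failure instead of a skipped range.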


* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
  2026-02-11 13:25   ` David Hildenbrand (Arm)
  2026-02-11 13:35   ` David Hildenbrand (Arm)
@ 2026-02-11 19:28   ` Matthew Wilcox
  2026-02-11 19:55     ` David Hildenbrand (Arm)
  2 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2026-02-11 19:28 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, lorenzo.stoakes, linux-mm, fvdl, hannes,
	riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On Wed, Feb 11, 2026 at 04:49:45AM -0800, Usama Arif wrote:
> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> to the LRU, retried in next reclaim cycle.

I was advised to ask my stupid question ...

Why do we still try to split the PMD in reclaim?  I understand we're
about to swap the folio out and we'll need to put a swap entry in the page
table so we can find it again.  But can't we now store swap entries at the
PMD level, or are we still forced to store 512 entries at the PTE level?


* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 19:28   ` Matthew Wilcox
@ 2026-02-11 19:55     ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 19:55 UTC (permalink / raw)
  To: Matthew Wilcox, Usama Arif
  Cc: Andrew Morton, lorenzo.stoakes, linux-mm, fvdl, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On 2/11/26 20:28, Matthew Wilcox wrote:
> On Wed, Feb 11, 2026 at 04:49:45AM -0800, Usama Arif wrote:
>> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
>> to the LRU, retried in next reclaim cycle.
> 
> I was advised to ask my stupid question ...
> 
> Why do we still try to split the PMD in reclaim?  I understand we're
> about to swap the folio out and we'll need to put a swap entry in the page
> table so we can find it again.  But can't we now store swap entries at the
> PMD level, or are we still forced to store 512 entries at the PTE level?

Yes. We don't support PMD swap entries yet.

I don't know all historical details. I suspect there are some rough 
edges around swapin (assume we cannot swapin a 2M THP), and maybe it was 
just easier to not deal with splitting of PMD swap entries (which we 
would similarly have to support).

For sure an interesting project to look into.

-- 
Cheers,

David


* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 13:25   ` David Hildenbrand (Arm)
  2026-02-11 13:38     ` Usama Arif
@ 2026-02-12 12:13     ` Ritesh Harjani
  2026-02-12 15:25       ` Usama Arif
  2026-02-12 15:39       ` David Hildenbrand (Arm)
  1 sibling, 2 replies; 22+ messages in thread
From: Ritesh Harjani @ 2026-02-12 12:13 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Usama Arif, Andrew Morton,
	lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev

"David Hildenbrand (Arm)" <david@kernel.org> writes:

> CCing ppc folks
>

Thanks David!

> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>> 
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other use case, which includes allocating the pagetables
>> required during split.
>> 
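
As a quick sanity check of the 200M-per-100G figure, here is a small sketch
of the arithmetic. It assumes 4 KiB base pages and 2 MiB PMD-sized THPs (the
x86-64 geometry); the macro and function names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 4 KiB base pages, 2 MiB PMD-sized THPs. */
#define BASE_PAGE_SIZE  (4ULL << 10)
#define THP_PMD_SIZE    (2ULL << 20)

/* Bytes of pre-deposited PTE tables held for thp_bytes of anon THPs. */
static uint64_t deposited_table_bytes(uint64_t thp_bytes)
{
	/* Each PMD-mapped THP carries exactly one deposited PTE table page. */
	assert(thp_bytes % THP_PMD_SIZE == 0);
	return (thp_bytes / THP_PMD_SIZE) * BASE_PAGE_SIZE;
}
```

100 GiB of THPs is 51200 PMD mappings, and 51200 * 4 KiB = 200 MiB, which
matches the number quoted above.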
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc with the hash MMU, i.e. when radix is not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>> 
>> PowerPC exception:
>> 
>> It would have been great if we could completely remove the pagetable
>> deposit code, making this commit mostly a code cleanup patch.
>> Unfortunately, the PowerPC hash MMU stores hash slot information in
>> the deposited page table, so pre-deposit is necessary there. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit() will always evaluate to false at compile time
>> on non-PowerPC architectures and the pre-deposit code will not be
>> compiled in.
>
> Is there a way to remove this? It's always been a confusing hack, now 
> it's unpleasant to have around :)
>

Hash MMU on PowerPC works fundamentally differently from other MMUs
(unlike Radix MMU on PowerPC). So yes, it requires a few tricks to fit
into Linux's multi-level SW page table model. ;)


> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 
> copied generic pgtable_trans_huge_deposit() hurts my belly.
>

On PowerPC, pgtable_t can be a pte fragment. 

typedef pte_t *pgtable_t;

That means a single page can be shared among other PTE page tables. So, we
cannot use page->lru which the generic implementation uses. I guess due
to this, there is a slight change in implementation of
radix__pgtable_trans_huge_deposit(). 
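
To illustrate the idea (a userspace sketch only, not the actual arch code):
when a table may be a fragment of a page, the list linkage can live in the
first words of the still-unused table memory instead of in page->lru. The
union and the mock_* names are assumptions made for this example; the union
is used here just to keep the type punning well-defined in plain C.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked list, modelled on the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

/* A deposited table is unused, so its first words can hold the linkage. */
union mock_table {
	struct list_head lh;          /* valid while deposited */
	unsigned long entries[512];   /* PTE slots once installed */
};

static void mock_list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

/* Deposit: link through the table memory itself, not through struct page. */
static void mock_deposit(union mock_table **deposited, union mock_table *t)
{
	if (!*deposited)
		t->lh.next = t->lh.prev = &t->lh;
	else
		mock_list_add(&t->lh, &(*deposited)->lh);
	*deposited = t;
}

/* Withdraw: unlink the head table and clear the words used for linkage. */
static union mock_table *mock_withdraw(union mock_table **deposited)
{
	union mock_table *t = *deposited;
	struct list_head *lh;

	assert(t);
	lh = &t->lh;
	if (lh->next == lh) {
		*deposited = NULL;
	} else {
		/* lh sits at offset 0, so the cast recovers the table. */
		*deposited = (union mock_table *)lh->next;
		lh->prev->next = lh->next;
		lh->next->prev = lh->prev;
	}
	t->entries[0] = 0;
	t->entries[1] = 0;
	return t;
}
```

Several such tables can share one page without touching struct page at all,
which is the property the fragment-based architectures need.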

Doing a grep search, I think that's the same for sparc and s390 as well.

>
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>
> So one obvious solution: remove PMD THP support for hash MMUs along with 
> all this hacky deposit code.
>

Unfortunately, please no. There are real customers using Hash MMU on
Power9 and even on older generations and this would mean breaking Hash
PMD THP support for them. 


>
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar 
> checks need to be wrapped in a reasonable helper and likely this all 
> needs to get cleaned up further.
>
> The implementation of the generic pgtable_trans_huge_deposit and the
> radix handlers etc must be removed. If any code would trigger them it 
> would be a bug.
>

Sure, I think after this patch series, radix__pgtable_trans_huge_deposit()
will mostly be dead code anyway. I will spend some time going
through this series and will also give it a test on powerpc HW (with
both Hash and Radix MMU).

I guess, we should also look at removing pgtable_trans_huge_deposit() and
pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
those too will be dead code after this.


> If we have to keep this around, pgtable_trans_huge_deposit() should 
> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there 
> will not be generic support for it.
>

Sure. That makes sense since PowerPC Hash MMU will still need this.

-ritesh


* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-12 12:13     ` Ritesh Harjani
@ 2026-02-12 15:25       ` Usama Arif
  2026-02-12 15:39       ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-12 15:25 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), David Hildenbrand (Arm), Andrew Morton,
	lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev



On 12/02/2026 12:13, Ritesh Harjani (IBM) wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
> 
>> CCing ppc folks
>>
> 
> Thanks David!
> 
>> On 2/11/26 13:49, Usama Arif wrote:
>>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>>> it pre-allocates a PTE page table and deposits it via
>>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>>> PMD split or zap. The rationale was that split must not fail—if the
>>> kernel decides to split a THP, it needs a PTE table to populate.
>>>
>>> However, every anon THP wastes 4KB (one page table page) that sits
>>> unused in the deposit list for the lifetime of the mapping. On systems
>>> with many THPs, this adds up to significant memory waste. The original
>>> rationale is also not an issue. It is ok for split to fail, and if the
>>> kernel can't find an order 0 allocation for split, there are much bigger
>>> problems. On large servers where you can easily have 100s of GBs of THPs,
>>> the memory usage for these tables is 200M per 100G. This memory could be
>>> used for any other use case, which includes allocating the pagetables
>>> required during split.
>>>
>>> This patch removes the pre-deposit for anonymous pages on architectures
>>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>>> powerpc with the hash MMU, i.e. when radix is not enabled) and allocates
>>> the PTE table lazily—only when a split actually occurs. The split path
>>> is modified to accept a caller-provided page table.
>>>
>>> PowerPC exception:
>>>
>>> It would have been great if we could completely remove the pagetable
>>> deposit code, making this commit mostly a code cleanup patch.
>>> Unfortunately, the PowerPC hash MMU stores hash slot information in
>>> the deposited page table, so pre-deposit is necessary there. All deposit/
>>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>>> behavior is unchanged with this patch. On a better note,
>>> arch_needs_pgtable_deposit() will always evaluate to false at compile time
>>> on non-PowerPC architectures and the pre-deposit code will not be
>>> compiled in.
>>
>> Is there a way to remove this? It's always been a confusing hack, now 
>> it's unpleasant to have around :)
>>
> 
> Hash MMU on PowerPC works fundamentally differently from other MMUs
> (unlike Radix MMU on PowerPC). So yes, it requires a few tricks to fit
> into Linux's multi-level SW page table model. ;)
> 
> 
>> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 
>> copied generic pgtable_trans_huge_deposit() hurts my belly.
>>
> 
> On PowerPC, pgtable_t can be a pte fragment. 
> 
> typedef pte_t *pgtable_t;
> 
> That means a single page can be shared among other PTE page tables. So, we
> cannot use page->lru which the generic implementation uses. I guess due
> to this, there is a slight change in implementation of
> radix__pgtable_trans_huge_deposit(). 
> 
> Doing a grep search, I think that's the same for sparc and s390 as well.
> 
>>
>> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>>
>> So one obvious solution: remove PMD THP support for hash MMUs along with 
>> all this hacky deposit code.
>>
> 
> Unfortunately, please no. There are real customers using Hash MMU on
> Power9 and even on older generations and this would mean breaking Hash
> PMD THP support for them. 
> 
> 

Thanks for confirming! I will keep the pagetable deposit for powerpc
in the next revision.
I will rename pgtable_trans_huge_deposit() to arch_pgtable_trans_huge_deposit()
and move it to arch/powerpc. It will be an empty function for the rest of the
architectures.

>>
>> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar 
>> checks need to be wrapped in a reasonable helper and likely this all 
>> needs to get cleaned up further.
>>
>> The implementation of the generic pgtable_trans_huge_deposit and the
>> radix handlers etc must be removed. If any code would trigger them it 
>> would be a bug.
>>
> 
> Sure, I think after this patch series, radix__pgtable_trans_huge_deposit()
> will mostly be dead code anyway. I will spend some time going
> through this series and will also give it a test on powerpc HW (with
> both Hash and Radix MMU).
> 
> I guess, we should also look at removing pgtable_trans_huge_deposit() and
> pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
> those too will be dead code after this.
> 
> 
>> If we have to keep this around, pgtable_trans_huge_deposit() should 
>> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there 
>> will not be generic support for it.
>>
> 
> Sure. That makes sense since PowerPC Hash MMU will still need this.
> 
> -ritesh



* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-12 12:13     ` Ritesh Harjani
  2026-02-12 15:25       ` Usama Arif
@ 2026-02-12 15:39       ` David Hildenbrand (Arm)
  2026-02-12 16:46         ` Ritesh Harjani
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-12 15:39 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), Usama Arif, Andrew Morton, lorenzo.stoakes,
	willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev

>>
>> Is there a way to remove this? It's always been a confusing hack, now
>> it's unpleasant to have around :)
>>
> 
> Hash MMU on PowerPC works fundamentally differently from other MMUs
> (unlike Radix MMU on PowerPC). So yes, it requires a few tricks to fit
> into Linux's multi-level SW page table model. ;)

:)

> 
>> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1
>> copied generic pgtable_trans_huge_deposit() hurts my belly.
>>
> 
> On PowerPC, pgtable_t can be a pte fragment.
> 
> typedef pte_t *pgtable_t;
> 
> That means a single page can be shared among other PTE page tables. So, we
> cannot use page->lru which the generic implementation uses. I guess due
> to this, there is a slight change in implementation of
> radix__pgtable_trans_huge_deposit().

Ah, did not spot this difference, but makes sense. Still ugly, but makes
sense. Fortunately it would go away with this RFC.

> 
> Doing a grep search, I think that's the same for sparc and s390 as well.

... and I also did not realize that s390x+sparc have separate 
implementations we can now get rid of as well.

> 
>>
>> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>>
>> So one obvious solution: remove PMD THP support for hash MMUs along with
>> all this hacky deposit code.
>>
> 
> Unfortunately, please no. There are real customers using Hash MMU on
> Power9 and even on older generations and this would mean breaking Hash
> PMD THP support for them.
> 

I was expecting this answer :)

> 
>>
>> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar
>> checks need to be wrapped in a reasonable helper and likely this all
>> needs to get cleaned up further.
>>
>> The implementation of the generic pgtable_trans_huge_deposit and the
>> radix handlers etc must be removed. If any code would trigger them it
>> would be a bug.
>>
> 
> Sure, I think after this patch series, radix__pgtable_trans_huge_deposit()
> will mostly be dead code anyway. I will spend some time going
> through this series and will also give it a test on powerpc HW (with
> both Hash and Radix MMU).

Thanks! The series will grow quite a bit I think, so retesting new
revisions will be much appreciated!

> 
> I guess, we should also look at removing pgtable_trans_huge_deposit() and
> pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
> those too will be dead code after this.

Exactly.


-- 
Cheers,

David


* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-12 15:39       ` David Hildenbrand (Arm)
@ 2026-02-12 16:46         ` Ritesh Harjani
  0 siblings, 0 replies; 22+ messages in thread
From: Ritesh Harjani @ 2026-02-12 16:46 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Usama Arif, Andrew Morton,
	lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev

"David Hildenbrand (Arm)" <david@kernel.org> writes:

>
> Thanks! The series will grow quite a bit I think, so retesting new
> revisions will be much appreciated!
>

Definitely. Thanks!

-ritesh


* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
  2026-02-11 13:27   ` David Hildenbrand (Arm)
@ 2026-02-12 21:40   ` kernel test robot
  2026-02-12 21:40   ` kernel test robot
  2 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-02-12 21:40 UTC (permalink / raw)
  To: Usama Arif; +Cc: llvm, oe-kbuild-all

Hi Usama,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.19 next-20260212]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-allocate-PTE-page-tables-lazily-at-split-time/20260211-205726
base:   linus/master
patch link:    https://lore.kernel.org/r/20260211125507.4175026-3-usama.arif%40linux.dev
patch subject: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20260213/202602130506.7Tm8CJkW-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260213/202602130506.7Tm8CJkW-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602130506.7Tm8CJkW-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/rmap.c:1961:21: error: use of undeclared identifier 'THP_SPLIT_PMD_PTE_ALLOC_FAILED'
    1961 |                                         count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
         |                                                        ^
   mm/rmap.c:2364:21: error: use of undeclared identifier 'THP_SPLIT_PMD_PTE_ALLOC_FAILED'
    2364 |                                         count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
         |                                                        ^
   2 errors generated.
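
Both errors come from an allnoconfig build, where CONFIG_TRANSPARENT_HUGEPAGE
is off. A likely cause, which the report itself does not state, is that the new
enumerator is declared only under that option while mm/rmap.c uses it
unconditionally. The userspace sketch below reproduces that shape and shows one
possible guard. SKETCH_CONFIG_THP and every sketch_* name are hypothetical
stand-ins for illustration, not kernel API; only the enumerator name is taken
from the report.

```c
#include <assert.h>

/*
 * Userspace sketch, not kernel code. Define SKETCH_CONFIG_THP to mimic a
 * THP-enabled config; leave it undefined to mimic allnoconfig.
 */
/* #define SKETCH_CONFIG_THP 1 */

enum sketch_vm_event {
	SKETCH_PGFAULT,
#ifdef SKETCH_CONFIG_THP
	/* Only exists when the (hypothetical) THP option is enabled. */
	THP_SPLIT_PMD_PTE_ALLOC_FAILED,
#endif
	SKETCH_NR_EVENTS,
};

static unsigned long sketch_events[SKETCH_NR_EVENTS];

static inline void sketch_count_event(enum sketch_vm_event item)
{
	assert(item < SKETCH_NR_EVENTS);
	sketch_events[item]++;
}

/*
 * Guarded wrapper: callers compile in both configurations. Naming the
 * enumerator directly at the call site would fail with "undeclared
 * identifier" whenever SKETCH_CONFIG_THP is not defined.
 */
#ifdef SKETCH_CONFIG_THP
#define sketch_count_split_alloc_failed() \
	sketch_count_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED)
#else
#define sketch_count_split_alloc_failed() do { } while (0)
#endif

/* Returns 1 if the split can proceed, 0 if no page table was available. */
static int sketch_split_path(int have_pgtable)
{
	if (!have_pgtable) {
		sketch_count_split_alloc_failed();
		return 0;
	}
	return 1;
}
```

Calling sketch_split_path(0) compiles and returns 0 in either configuration;
the counter is only bumped when SKETCH_CONFIG_THP is defined. Moving the
declaration out of the conditional block would be another way to get the same
result.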


vim +/THP_SPLIT_PMD_PTE_ALLOC_FAILED +1961 mm/rmap.c

  1849	
  1850	/*
  1851	 * @arg: enum ttu_flags will be passed to this argument
  1852	 */
  1853	static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  1854			     unsigned long address, void *arg)
  1855	{
  1856		struct mm_struct *mm = vma->vm_mm;
  1857		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
  1858		bool anon_exclusive, ret = true;
  1859		pte_t pteval;
  1860		struct page *subpage;
  1861		struct mmu_notifier_range range;
  1862		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  1863		unsigned long nr_pages = 1, end_addr;
  1864		unsigned long pfn;
  1865		unsigned long hsz = 0;
  1866		int ptes = 0;
  1867		pgtable_t prealloc_pte = NULL;
  1868	
  1869		/*
  1870		 * When racing against e.g. zap_pte_range() on another cpu,
  1871		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  1872		 * try_to_unmap() may return before page_mapped() has become false,
  1873		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  1874		 */
  1875		if (flags & TTU_SYNC)
  1876			pvmw.flags = PVMW_SYNC;
  1877	
  1878		/*
  1879		 * For THP, we have to assume the worst case, i.e. pmd, for invalidation.
  1880		 * For hugetlb, it could be much worse if we need to do pud
  1881		 * invalidation in the case of pmd sharing.
  1882		 *
  1883		 * Note that the folio can not be freed in this function as call of
  1884		 * try_to_unmap() must hold a reference on the folio.
  1885		 */
  1886		range.end = vma_address_end(&pvmw);
  1887		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  1888					address, range.end);
  1889		if (folio_test_hugetlb(folio)) {
  1890			/*
  1891			 * If sharing is possible, start and end will be adjusted
  1892			 * accordingly.
  1893			 */
  1894			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  1895							     &range.end);
  1896	
  1897			/* We need the huge page size for set_huge_pte_at() */
  1898			hsz = huge_page_size(hstate_vma(vma));
  1899		}
  1900		mmu_notifier_invalidate_range_start(&range);
  1901	
  1902		if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
  1903		    !arch_needs_pgtable_deposit())
  1904			prealloc_pte = pte_alloc_one(mm);
  1905	
  1906		while (page_vma_mapped_walk(&pvmw)) {
  1907			/*
  1908			 * If the folio is in an mlock()d vma, we must not swap it out.
  1909			 */
  1910			if (!(flags & TTU_IGNORE_MLOCK) &&
  1911			    (vma->vm_flags & VM_LOCKED)) {
  1912				ptes++;
  1913	
  1914				/*
  1915				 * Set 'ret' to indicate the page cannot be unmapped.
  1916				 *
  1917				 * Do not jump to walk_abort immediately as additional
  1918				 * iteration might be required to detect fully mapped
  1919				 * folio and mlock it.
  1920				 */
  1921				ret = false;
  1922	
  1923				/* Only mlock fully mapped pages */
  1924				if (pvmw.pte && ptes != pvmw.nr_pages)
  1925					continue;
  1926	
  1927				/*
  1928				 * All PTEs must be protected by page table lock in
  1929				 * order to mlock the page.
  1930				 *
  1931				 * If the page table boundary has been crossed, the current
  1932				 * ptl only protects part of the ptes.
  1933				 */
  1934				if (pvmw.flags & PVMW_PGTABLE_CROSSED)
  1935					goto walk_done;
  1936	
  1937				/* Restore the mlock which got missed */
  1938				mlock_vma_folio(folio, vma);
  1939				goto walk_done;
  1940			}
  1941	
  1942			if (!pvmw.pte) {
  1943				if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
  1944					if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
  1945						goto walk_done;
  1946					/*
  1947					 * unmap_huge_pmd_locked has either already marked
  1948					 * the folio as swap-backed or decided to retain it
  1949					 * due to GUP or speculative references.
  1950					 */
  1951					goto walk_abort;
  1952				}
  1953	
  1954				if (flags & TTU_SPLIT_HUGE_PMD) {
  1955					pgtable_t pgtable = prealloc_pte;
  1956	
  1957					prealloc_pte = NULL;
  1958	
  1959					if (!arch_needs_pgtable_deposit() && !pgtable &&
  1960					    vma_is_anonymous(vma)) {
> 1961						count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
  1962						page_vma_mapped_walk_done(&pvmw);
  1963						ret = false;
  1964						break;
  1965					}
  1966					/*
  1967					 * We temporarily have to drop the PTL and
  1968					 * restart so we can process the PTE-mapped THP.
  1969					 */
  1970					split_huge_pmd_locked(vma, pvmw.address,
  1971							      pvmw.pmd, false, pgtable);
  1972					flags &= ~TTU_SPLIT_HUGE_PMD;
  1973					page_vma_mapped_walk_restart(&pvmw);
  1974					continue;
  1975				}
  1976			}
  1977	
  1978			/* Unexpected PMD-mapped THP? */
  1979			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  1980	
  1981			/*
  1982			 * Handle PFN swap PTEs, such as device-exclusive ones, that
  1983			 * actually map pages.
  1984			 */
  1985			pteval = ptep_get(pvmw.pte);
  1986			if (likely(pte_present(pteval))) {
  1987				pfn = pte_pfn(pteval);
  1988			} else {
  1989				const softleaf_t entry = softleaf_from_pte(pteval);
  1990	
  1991				pfn = softleaf_to_pfn(entry);
  1992				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  1993			}
  1994	
  1995			subpage = folio_page(folio, pfn - folio_pfn(folio));
  1996			address = pvmw.address;
  1997			anon_exclusive = folio_test_anon(folio) &&
  1998					 PageAnonExclusive(subpage);
  1999	
  2000			if (folio_test_hugetlb(folio)) {
  2001				bool anon = folio_test_anon(folio);
  2002	
  2003				/*
  2004				 * The try_to_unmap() is only passed a hugetlb page
  2005				 * in the case where the hugetlb page is poisoned.
  2006				 */
  2007				VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
  2008				/*
  2009				 * huge_pmd_unshare may unmap an entire PMD page.
  2010				 * There is no way of knowing exactly which PMDs may
  2011				 * be cached for this mm, so we must flush them all.
  2012				 * start/end were already adjusted above to cover this
  2013				 * range.
  2014				 */
  2015				flush_cache_range(vma, range.start, range.end);
  2016	
  2017				/*
  2018				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2019				 * held in write mode.  Caller needs to explicitly
  2020				 * do this outside rmap routines.
  2021				 *
  2022				 * We also must hold hugetlb vma_lock in write mode.
  2023				 * Lock order dictates acquiring vma_lock BEFORE
  2024				 * i_mmap_rwsem.  We can only try lock here and fail
  2025				 * if unsuccessful.
  2026				 */
  2027				if (!anon) {
  2028					struct mmu_gather tlb;
  2029	
  2030					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2031					if (!hugetlb_vma_trylock_write(vma))
  2032						goto walk_abort;
  2033	
  2034					tlb_gather_mmu_vma(&tlb, vma);
  2035					if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
  2036						hugetlb_vma_unlock_write(vma);
  2037						huge_pmd_unshare_flush(&tlb, vma);
  2038						tlb_finish_mmu(&tlb);
  2039						/*
  2040						 * The PMD table was unmapped,
  2041						 * consequently unmapping the folio.
  2042						 */
  2043						goto walk_done;
  2044					}
  2045					hugetlb_vma_unlock_write(vma);
  2046					tlb_finish_mmu(&tlb);
  2047				}
  2048				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2049				if (pte_dirty(pteval))
  2050					folio_mark_dirty(folio);
  2051			} else if (likely(pte_present(pteval))) {
  2052				nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
  2053				end_addr = address + nr_pages * PAGE_SIZE;
  2054				flush_cache_range(vma, address, end_addr);
  2055	
  2056				/* Nuke the page table entry. */
  2057				pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
  2058				/*
  2059				 * We clear the PTE but do not flush so potentially
  2060				 * a remote CPU could still be writing to the folio.
  2061				 * If the entry was previously clean then the
  2062				 * architecture must guarantee that a clear->dirty
  2063				 * transition on a cached TLB entry is written through
  2064				 * and traps if the PTE is unmapped.
  2065				 */
  2066				if (should_defer_flush(mm, flags))
  2067					set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
  2068				else
  2069					flush_tlb_range(vma, address, end_addr);
  2070				if (pte_dirty(pteval))
  2071					folio_mark_dirty(folio);
  2072			} else {
  2073				pte_clear(mm, address, pvmw.pte);
  2074			}
  2075	
  2076			/*
  2077			 * Now the pte is cleared. If this pte was uffd-wp armed,
  2078			 * we may want to replace a none pte with a marker pte if
  2079			 * it's file-backed, so we don't lose the tracking info.
  2080			 */
  2081			pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
  2082	
  2083			/* Update high watermark before we lower rss */
  2084			update_hiwater_rss(mm);
  2085	
  2086			if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
  2087				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2088				if (folio_test_hugetlb(folio)) {
  2089					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2090					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2091							hsz);
  2092				} else {
  2093					dec_mm_counter(mm, mm_counter(folio));
  2094					set_pte_at(mm, address, pvmw.pte, pteval);
  2095				}
  2096			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2097				   !userfaultfd_armed(vma)) {
  2098				/*
  2099				 * The guest indicated that the page content is of no
  2100				 * interest anymore. Simply discard the pte, vmscan
  2101				 * will take care of the rest.
  2102				 * A future reference will then fault in a new zero
  2103				 * page. When userfaultfd is active, we must not drop
  2104				 * this page though, as its main user (postcopy
  2105				 * migration) will not expect userfaults on already
  2106				 * copied pages.
  2107				 */
  2108				dec_mm_counter(mm, mm_counter(folio));
  2109			} else if (folio_test_anon(folio)) {
  2110				swp_entry_t entry = page_swap_entry(subpage);
  2111				pte_t swp_pte;
  2112				/*
  2113				 * Store the swap location in the pte.
  2114				 * See handle_pte_fault() ...
  2115				 */
  2116				if (unlikely(folio_test_swapbacked(folio) !=
  2117						folio_test_swapcache(folio))) {
  2118					WARN_ON_ONCE(1);
  2119					goto walk_abort;
  2120				}
  2121	
  2122				/* MADV_FREE page check */
  2123				if (!folio_test_swapbacked(folio)) {
  2124					int ref_count, map_count;
  2125	
  2126					/*
  2127					 * Synchronize with gup_pte_range():
  2128					 * - clear PTE; barrier; read refcount
  2129					 * - inc refcount; barrier; read PTE
  2130					 */
  2131					smp_mb();
  2132	
  2133					ref_count = folio_ref_count(folio);
  2134					map_count = folio_mapcount(folio);
  2135	
  2136					/*
  2137					 * Order reads for page refcount and dirty flag
  2138					 * (see comments in __remove_mapping()).
  2139					 */
  2140					smp_rmb();
  2141	
  2142					if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
  2143						/*
  2144						 * redirtied either using the page table or a previously
  2145						 * obtained GUP reference.
  2146						 */
  2147						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2148						folio_set_swapbacked(folio);
  2149						goto walk_abort;
  2150					} else if (ref_count != 1 + map_count) {
  2151						/*
  2152						 * Additional reference. Could be a GUP reference or any
  2153						 * speculative reference. GUP users must mark the folio
  2154						 * dirty if there was a modification. This folio cannot be
  2155						 * reclaimed right now either way, so act just like nothing
  2156						 * happened.
  2157						 * We'll come back here later and detect if the folio was
  2158						 * dirtied when the additional reference is gone.
  2159						 */
  2160						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2161						goto walk_abort;
  2162					}
  2163					add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
  2164					goto discard;
  2165				}
  2166	
  2167				if (swap_duplicate(entry) < 0) {
  2168					set_pte_at(mm, address, pvmw.pte, pteval);
  2169					goto walk_abort;
  2170				}
  2171	
  2172				/*
  2173				 * arch_unmap_one() is expected to be a NOP on
  2174				 * architectures where we could have PFN swap PTEs,
  2175				 * so we'll not check/care.
  2176				 */
  2177				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2178					swap_free(entry);
  2179					set_pte_at(mm, address, pvmw.pte, pteval);
  2180					goto walk_abort;
  2181				}
  2182	
  2183				/* See folio_try_share_anon_rmap(): clear PTE first. */
  2184				if (anon_exclusive &&
  2185				    folio_try_share_anon_rmap_pte(folio, subpage)) {
  2186					swap_free(entry);
  2187					set_pte_at(mm, address, pvmw.pte, pteval);
  2188					goto walk_abort;
  2189				}
  2190				if (list_empty(&mm->mmlist)) {
  2191					spin_lock(&mmlist_lock);
  2192					if (list_empty(&mm->mmlist))
  2193						list_add(&mm->mmlist, &init_mm.mmlist);
  2194					spin_unlock(&mmlist_lock);
  2195				}
  2196				dec_mm_counter(mm, MM_ANONPAGES);
  2197				inc_mm_counter(mm, MM_SWAPENTS);
  2198				swp_pte = swp_entry_to_pte(entry);
  2199				if (anon_exclusive)
  2200					swp_pte = pte_swp_mkexclusive(swp_pte);
  2201				if (likely(pte_present(pteval))) {
  2202					if (pte_soft_dirty(pteval))
  2203						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2204					if (pte_uffd_wp(pteval))
  2205						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2206				} else {
  2207					if (pte_swp_soft_dirty(pteval))
  2208						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2209					if (pte_swp_uffd_wp(pteval))
  2210						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2211				}
  2212				set_pte_at(mm, address, pvmw.pte, swp_pte);
  2213			} else {
  2214				/*
  2215				 * This is a locked file-backed folio,
  2216				 * so it cannot be removed from the page
  2217				 * cache and replaced by a new folio before
  2218				 * mmu_notifier_invalidate_range_end, so no
  2219				 * concurrent thread might update its page table
  2220				 * to point at a new folio while a device is
  2221				 * still using this folio.
  2222				 *
  2223				 * See Documentation/mm/mmu_notifier.rst
  2224				 */
  2225				dec_mm_counter(mm, mm_counter_file(folio));
  2226			}
  2227	discard:
  2228			if (unlikely(folio_test_hugetlb(folio))) {
  2229				hugetlb_remove_rmap(folio);
  2230			} else {
  2231				folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
  2232			}
  2233			if (vma->vm_flags & VM_LOCKED)
  2234				mlock_drain_local();
  2235			folio_put_refs(folio, nr_pages);
  2236	
  2237			/*
  2238			 * If we are sure that we batched the entire folio and cleared
  2239			 * all PTEs, we can just optimize and stop right here.
  2240			 */
  2241			if (nr_pages == folio_nr_pages(folio))
  2242				goto walk_done;
  2243			continue;
  2244	walk_abort:
  2245			ret = false;
  2246	walk_done:
  2247			page_vma_mapped_walk_done(&pvmw);
  2248			break;
  2249		}
  2250	
  2251		if (prealloc_pte)
  2252			pte_free(mm, prealloc_pte);
  2253	
  2254		mmu_notifier_invalidate_range_end(&range);
  2255	
  2256		return ret;
  2257	}
  2258	
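
The listing above allocates prealloc_pte before the walk, hands it to
split_huge_pmd_locked() when a split is needed, bails out if the allocation
failed, and frees whatever is left at the end. A minimal userspace sketch of
that ownership pattern follows. All sketch_* names are hypothetical, and
calloc()/free() stand in for pte_alloc_one()/pte_free(); it illustrates the
hand-off only, not the locking or the real split.

```c
#include <stdbool.h>
#include <stdlib.h>

struct sketch_table {
	int entries[4];
};

struct sketch_mapping {
	bool huge;                  /* still mapped by one large entry */
	struct sketch_table *table; /* installed at split time */
};

/*
 * Takes ownership of *prealloc. Fails without touching the mapping if the
 * earlier allocation did not succeed.
 */
static bool sketch_split(struct sketch_mapping *m,
			 struct sketch_table **prealloc)
{
	struct sketch_table *t = *prealloc;

	*prealloc = NULL;	/* the caller must not free it twice */
	if (!t)
		return false;

	m->table = t;
	m->huge = false;
	return true;
}

static bool sketch_unmap(struct sketch_mapping *m, bool want_split)
{
	struct sketch_table *prealloc = NULL;
	bool ret = true;

	/* Allocate up front, before any lock would be taken. */
	if (want_split)
		prealloc = calloc(1, sizeof(*prealloc));

	if (m->huge && want_split)
		ret = sketch_split(m, &prealloc);

	/* Release the table if the split path never consumed it. */
	free(prealloc);
	return ret;
}
```

With a mapping that starts out huge, sketch_unmap(&m, true) installs the table
and clears m.huge; with one that is already split, the unused preallocation is
simply freed.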

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
  2026-02-11 13:27   ` David Hildenbrand (Arm)
  2026-02-12 21:40   ` kernel test robot
@ 2026-02-12 21:40   ` kernel test robot
  2 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-02-12 21:40 UTC (permalink / raw)
  To: Usama Arif; +Cc: oe-kbuild-all

Hi Usama,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.19 next-20260212]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-allocate-PTE-page-tables-lazily-at-split-time/20260211-205726
base:   linus/master
patch link:    https://lore.kernel.org/r/20260211125507.4175026-3-usama.arif%40linux.dev
patch subject: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260213/202602130520.mofTmHuk-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260213/202602130520.mofTmHuk-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602130520.mofTmHuk-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1961:56: error: 'THP_SPLIT_PMD_PTE_ALLOC_FAILED' undeclared (first use in this function)
    1961 |                                         count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
         |                                                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/rmap.c:1961:56: note: each undeclared identifier is reported only once for each function it appears in
   mm/rmap.c: In function 'try_to_migrate_one':
   mm/rmap.c:2364:56: error: 'THP_SPLIT_PMD_PTE_ALLOC_FAILED' undeclared (first use in this function)
    2364 |                                         count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
         |                                                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/THP_SPLIT_PMD_PTE_ALLOC_FAILED +1961 mm/rmap.c

  1849	
  1850	/*
  1851	 * @arg: enum ttu_flags will be passed to this argument
  1852	 */
  1853	static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  1854			     unsigned long address, void *arg)
  1855	{
  1856		struct mm_struct *mm = vma->vm_mm;
  1857		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
  1858		bool anon_exclusive, ret = true;
  1859		pte_t pteval;
  1860		struct page *subpage;
  1861		struct mmu_notifier_range range;
  1862		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  1863		unsigned long nr_pages = 1, end_addr;
  1864		unsigned long pfn;
  1865		unsigned long hsz = 0;
  1866		int ptes = 0;
  1867		pgtable_t prealloc_pte = NULL;
  1868	
  1869		/*
  1870		 * When racing against e.g. zap_pte_range() on another cpu,
  1871		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  1872		 * try_to_unmap() may return before page_mapped() has become false,
  1873		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  1874		 */
  1875		if (flags & TTU_SYNC)
  1876			pvmw.flags = PVMW_SYNC;
  1877	
  1878		/*
  1879		 * For THP, we have to assume the worst case, i.e. pmd, for invalidation.
  1880		 * For hugetlb, it could be much worse if we need to do pud
  1881		 * invalidation in the case of pmd sharing.
  1882		 *
  1883		 * Note that the folio can not be freed in this function as call of
  1884		 * try_to_unmap() must hold a reference on the folio.
  1885		 */
  1886		range.end = vma_address_end(&pvmw);
  1887		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  1888					address, range.end);
  1889		if (folio_test_hugetlb(folio)) {
  1890			/*
  1891			 * If sharing is possible, start and end will be adjusted
  1892			 * accordingly.
  1893			 */
  1894			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  1895							     &range.end);
  1896	
  1897			/* We need the huge page size for set_huge_pte_at() */
  1898			hsz = huge_page_size(hstate_vma(vma));
  1899		}
  1900		mmu_notifier_invalidate_range_start(&range);
  1901	
  1902		if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
  1903		    !arch_needs_pgtable_deposit())
  1904			prealloc_pte = pte_alloc_one(mm);
  1905	
  1906		while (page_vma_mapped_walk(&pvmw)) {
  1907			/*
  1908			 * If the folio is in an mlock()d vma, we must not swap it out.
  1909			 */
  1910			if (!(flags & TTU_IGNORE_MLOCK) &&
  1911			    (vma->vm_flags & VM_LOCKED)) {
  1912				ptes++;
  1913	
  1914				/*
  1915				 * Set 'ret' to indicate the page cannot be unmapped.
  1916				 *
  1917				 * Do not jump to walk_abort immediately as additional
  1918				 * iteration might be required to detect fully mapped
  1919				 * folio and mlock it.
  1920				 */
  1921				ret = false;
  1922	
  1923				/* Only mlock fully mapped pages */
  1924				if (pvmw.pte && ptes != pvmw.nr_pages)
  1925					continue;
  1926	
  1927				/*
  1928				 * All PTEs must be protected by page table lock in
  1929				 * order to mlock the page.
  1930				 *
  1931				 * If the page table boundary has been crossed, the current
  1932				 * ptl only protects part of the ptes.
  1933				 */
  1934				if (pvmw.flags & PVMW_PGTABLE_CROSSED)
  1935					goto walk_done;
  1936	
  1937				/* Restore the mlock which got missed */
  1938				mlock_vma_folio(folio, vma);
  1939				goto walk_done;
  1940			}
  1941	
  1942			if (!pvmw.pte) {
  1943				if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
  1944					if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
  1945						goto walk_done;
  1946					/*
  1947					 * unmap_huge_pmd_locked has either already marked
  1948					 * the folio as swap-backed or decided to retain it
  1949					 * due to GUP or speculative references.
  1950					 */
  1951					goto walk_abort;
  1952				}
  1953	
  1954				if (flags & TTU_SPLIT_HUGE_PMD) {
  1955					pgtable_t pgtable = prealloc_pte;
  1956	
  1957					prealloc_pte = NULL;
  1958	
  1959					if (!arch_needs_pgtable_deposit() && !pgtable &&
  1960					    vma_is_anonymous(vma)) {
> 1961						count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
  1962						page_vma_mapped_walk_done(&pvmw);
  1963						ret = false;
  1964						break;
  1965					}
  1966					/*
  1967					 * We temporarily have to drop the PTL and
  1968					 * restart so we can process the PTE-mapped THP.
  1969					 */
  1970					split_huge_pmd_locked(vma, pvmw.address,
  1971							      pvmw.pmd, false, pgtable);
  1972					flags &= ~TTU_SPLIT_HUGE_PMD;
  1973					page_vma_mapped_walk_restart(&pvmw);
  1974					continue;
  1975				}
  1976			}
  1977	
  1978			/* Unexpected PMD-mapped THP? */
  1979			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  1980	
  1981			/*
  1982			 * Handle PFN swap PTEs, such as device-exclusive ones, that
  1983			 * actually map pages.
  1984			 */
  1985			pteval = ptep_get(pvmw.pte);
  1986			if (likely(pte_present(pteval))) {
  1987				pfn = pte_pfn(pteval);
  1988			} else {
  1989				const softleaf_t entry = softleaf_from_pte(pteval);
  1990	
  1991				pfn = softleaf_to_pfn(entry);
  1992				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  1993			}
  1994	
  1995			subpage = folio_page(folio, pfn - folio_pfn(folio));
  1996			address = pvmw.address;
  1997			anon_exclusive = folio_test_anon(folio) &&
  1998					 PageAnonExclusive(subpage);
  1999	
  2000			if (folio_test_hugetlb(folio)) {
  2001				bool anon = folio_test_anon(folio);
  2002	
  2003				/*
  2004				 * The try_to_unmap() is only passed a hugetlb page
  2005				 * in the case where the hugetlb page is poisoned.
  2006				 */
  2007				VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
  2008				/*
  2009				 * huge_pmd_unshare may unmap an entire PMD page.
  2010				 * There is no way of knowing exactly which PMDs may
  2011				 * be cached for this mm, so we must flush them all.
  2012				 * start/end were already adjusted above to cover this
  2013				 * range.
  2014				 */
  2015				flush_cache_range(vma, range.start, range.end);
  2016	
  2017				/*
  2018				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2019				 * held in write mode.  Caller needs to explicitly
  2020				 * do this outside rmap routines.
  2021				 *
  2022				 * We also must hold hugetlb vma_lock in write mode.
  2023				 * Lock order dictates acquiring vma_lock BEFORE
  2024				 * i_mmap_rwsem.  We can only try lock here and fail
  2025				 * if unsuccessful.
  2026				 */
  2027				if (!anon) {
  2028					struct mmu_gather tlb;
  2029	
  2030					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2031					if (!hugetlb_vma_trylock_write(vma))
  2032						goto walk_abort;
  2033	
  2034					tlb_gather_mmu_vma(&tlb, vma);
  2035					if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
  2036						hugetlb_vma_unlock_write(vma);
  2037						huge_pmd_unshare_flush(&tlb, vma);
  2038						tlb_finish_mmu(&tlb);
  2039						/*
  2040						 * The PMD table was unmapped,
  2041						 * consequently unmapping the folio.
  2042						 */
  2043						goto walk_done;
  2044					}
  2045					hugetlb_vma_unlock_write(vma);
  2046					tlb_finish_mmu(&tlb);
  2047				}
  2048				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2049				if (pte_dirty(pteval))
  2050					folio_mark_dirty(folio);
  2051			} else if (likely(pte_present(pteval))) {
  2052				nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
  2053				end_addr = address + nr_pages * PAGE_SIZE;
  2054				flush_cache_range(vma, address, end_addr);
  2055	
  2056				/* Nuke the page table entry. */
  2057				pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
  2058				/*
  2059				 * We clear the PTE but do not flush so potentially
  2060				 * a remote CPU could still be writing to the folio.
  2061				 * If the entry was previously clean then the
  2062				 * architecture must guarantee that a clear->dirty
  2063				 * transition on a cached TLB entry is written through
  2064				 * and traps if the PTE is unmapped.
  2065				 */
  2066				if (should_defer_flush(mm, flags))
  2067					set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
  2068				else
  2069					flush_tlb_range(vma, address, end_addr);
  2070				if (pte_dirty(pteval))
  2071					folio_mark_dirty(folio);
  2072			} else {
  2073				pte_clear(mm, address, pvmw.pte);
  2074			}
  2075	
  2076			/*
  2077			 * Now the pte is cleared. If this pte was uffd-wp armed,
  2078			 * we may want to replace a none pte with a marker pte if
  2079			 * it's file-backed, so we don't lose the tracking info.
  2080			 */
  2081			pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
  2082	
  2083			/* Update high watermark before we lower rss */
  2084			update_hiwater_rss(mm);
  2085	
  2086			if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
  2087				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2088				if (folio_test_hugetlb(folio)) {
  2089					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2090					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2091							hsz);
  2092				} else {
  2093					dec_mm_counter(mm, mm_counter(folio));
  2094					set_pte_at(mm, address, pvmw.pte, pteval);
  2095				}
  2096			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2097				   !userfaultfd_armed(vma)) {
  2098				/*
  2099				 * The guest indicated that the page content is of no
  2100				 * interest anymore. Simply discard the pte, vmscan
  2101				 * will take care of the rest.
  2102				 * A future reference will then fault in a new zero
  2103				 * page. When userfaultfd is active, we must not drop
  2104				 * this page though, as its main user (postcopy
  2105				 * migration) will not expect userfaults on already
  2106				 * copied pages.
  2107				 */
  2108				dec_mm_counter(mm, mm_counter(folio));
  2109			} else if (folio_test_anon(folio)) {
  2110				swp_entry_t entry = page_swap_entry(subpage);
  2111				pte_t swp_pte;
  2112				/*
  2113				 * Store the swap location in the pte.
  2114				 * See handle_pte_fault() ...
  2115				 */
  2116				if (unlikely(folio_test_swapbacked(folio) !=
  2117						folio_test_swapcache(folio))) {
  2118					WARN_ON_ONCE(1);
  2119					goto walk_abort;
  2120				}
  2121	
  2122				/* MADV_FREE page check */
  2123				if (!folio_test_swapbacked(folio)) {
  2124					int ref_count, map_count;
  2125	
  2126					/*
  2127					 * Synchronize with gup_pte_range():
  2128					 * - clear PTE; barrier; read refcount
  2129					 * - inc refcount; barrier; read PTE
  2130					 */
  2131					smp_mb();
  2132	
  2133					ref_count = folio_ref_count(folio);
  2134					map_count = folio_mapcount(folio);
  2135	
  2136					/*
  2137					 * Order reads for page refcount and dirty flag
  2138					 * (see comments in __remove_mapping()).
  2139					 */
  2140					smp_rmb();
  2141	
  2142					if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
  2143						/*
  2144						 * redirtied either using the page table or a previously
  2145						 * obtained GUP reference.
  2146						 */
  2147						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2148						folio_set_swapbacked(folio);
  2149						goto walk_abort;
  2150					} else if (ref_count != 1 + map_count) {
  2151						/*
  2152						 * Additional reference. Could be a GUP reference or any
  2153						 * speculative reference. GUP users must mark the folio
  2154						 * dirty if there was a modification. This folio cannot be
  2155						 * reclaimed right now either way, so act just like nothing
  2156						 * happened.
  2157						 * We'll come back here later and detect if the folio was
  2158						 * dirtied when the additional reference is gone.
  2159						 */
  2160						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2161						goto walk_abort;
  2162					}
  2163					add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
  2164					goto discard;
  2165				}
  2166	
  2167				if (swap_duplicate(entry) < 0) {
  2168					set_pte_at(mm, address, pvmw.pte, pteval);
  2169					goto walk_abort;
  2170				}
  2171	
  2172				/*
  2173				 * arch_unmap_one() is expected to be a NOP on
  2174				 * architectures where we could have PFN swap PTEs,
  2175				 * so we'll not check/care.
  2176				 */
  2177				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2178					swap_free(entry);
  2179					set_pte_at(mm, address, pvmw.pte, pteval);
  2180					goto walk_abort;
  2181				}
  2182	
  2183				/* See folio_try_share_anon_rmap(): clear PTE first. */
  2184				if (anon_exclusive &&
  2185				    folio_try_share_anon_rmap_pte(folio, subpage)) {
  2186					swap_free(entry);
  2187					set_pte_at(mm, address, pvmw.pte, pteval);
  2188					goto walk_abort;
  2189				}
  2190				if (list_empty(&mm->mmlist)) {
  2191					spin_lock(&mmlist_lock);
  2192					if (list_empty(&mm->mmlist))
  2193						list_add(&mm->mmlist, &init_mm.mmlist);
  2194					spin_unlock(&mmlist_lock);
  2195				}
  2196				dec_mm_counter(mm, MM_ANONPAGES);
  2197				inc_mm_counter(mm, MM_SWAPENTS);
  2198				swp_pte = swp_entry_to_pte(entry);
  2199				if (anon_exclusive)
  2200					swp_pte = pte_swp_mkexclusive(swp_pte);
  2201				if (likely(pte_present(pteval))) {
  2202					if (pte_soft_dirty(pteval))
  2203						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2204					if (pte_uffd_wp(pteval))
  2205						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2206				} else {
  2207					if (pte_swp_soft_dirty(pteval))
  2208						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2209					if (pte_swp_uffd_wp(pteval))
  2210						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2211				}
  2212				set_pte_at(mm, address, pvmw.pte, swp_pte);
  2213			} else {
  2214				/*
  2215				 * This is a locked file-backed folio,
  2216				 * so it cannot be removed from the page
  2217				 * cache and replaced by a new folio before
  2218				 * mmu_notifier_invalidate_range_end, so no
  2219				 * concurrent thread might update its page table
  2220				 * to point at a new folio while a device is
  2221				 * still using this folio.
  2222				 *
  2223				 * See Documentation/mm/mmu_notifier.rst
  2224				 */
  2225				dec_mm_counter(mm, mm_counter_file(folio));
  2226			}
  2227	discard:
  2228			if (unlikely(folio_test_hugetlb(folio))) {
  2229				hugetlb_remove_rmap(folio);
  2230			} else {
  2231				folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
  2232			}
  2233			if (vma->vm_flags & VM_LOCKED)
  2234				mlock_drain_local();
  2235			folio_put_refs(folio, nr_pages);
  2236	
  2237			/*
  2238			 * If we are sure that we batched the entire folio and cleared
  2239			 * all PTEs, we can just optimize and stop right here.
  2240			 */
  2241			if (nr_pages == folio_nr_pages(folio))
  2242				goto walk_done;
  2243			continue;
  2244	walk_abort:
  2245			ret = false;
  2246	walk_done:
  2247			page_vma_mapped_walk_done(&pvmw);
  2248			break;
  2249		}
  2250	
  2251		if (prealloc_pte)
  2252			pte_free(mm, prealloc_pte);
  2253	
  2254		mmu_notifier_invalidate_range_end(&range);
  2255	
  2256		return ret;
  2257	}
  2258	
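The tail of the quoted function frees prealloc_pte if it is still set once the walk has finished. A minimal user-space sketch of that "allocate before locking, hand over if needed, free the leftover" pattern follows. It assumes prealloc_pte is allocated before the walk so that a PMD split never has to allocate while the page table lock is held; the quoted lines show only the final free. The names struct table, struct region and walk_region() are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PTE page table; not a kernel type. */
struct table {
	int entries[8];
};

/* Hypothetical stand-in for a PMD-mapped range. */
struct region {
	bool locked;		/* models the page table lock */
	struct table *tbl;	/* NULL while the range is still "huge" */
};

/*
 * Walk the region and split it if asked to. Returns false only when
 * the up-front allocation fails. *nr_freed counts unused preallocations.
 */
static bool walk_region(struct region *r, bool need_split, int *nr_freed)
{
	struct table *prealloc;

	/* Allocate first: nothing may be allocated once the lock is held. */
	assert(!r->locked);
	prealloc = calloc(1, sizeof(*prealloc));
	if (!prealloc)
		return false;

	r->locked = true;
	if (need_split && !r->tbl) {
		r->tbl = prealloc;	/* ownership moves to the region */
		prealloc = NULL;
	}
	r->locked = false;

	/* Like the end of the quoted function: free what was not consumed. */
	if (prealloc) {
		free(prealloc);
		(*nr_freed)++;
	}
	return true;
}
```

In this sketch the cost of the pattern is one wasted allocation whenever no split turns out to be needed. The benefit is that the locked section itself can no longer fail for lack of memory.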

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Thread overview: 22+ messages
2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-11 13:25   ` David Hildenbrand (Arm)
2026-02-11 13:38     ` Usama Arif
2026-02-12 12:13     ` Ritesh Harjani
2026-02-12 15:25       ` Usama Arif
2026-02-12 15:39       ` David Hildenbrand (Arm)
2026-02-12 16:46         ` Ritesh Harjani
2026-02-11 13:35   ` David Hildenbrand (Arm)
2026-02-11 13:46     ` Kiryl Shutsemau
2026-02-11 13:47     ` Usama Arif
2026-02-11 19:28   ` Matthew Wilcox
2026-02-11 19:55     ` David Hildenbrand (Arm)
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
2026-02-11 13:27   ` David Hildenbrand (Arm)
2026-02-11 13:31     ` Usama Arif
2026-02-11 13:36       ` David Hildenbrand (Arm)
2026-02-11 13:42         ` Usama Arif
2026-02-11 13:38       ` David Hildenbrand (Arm)
2026-02-11 13:43         ` Usama Arif
2026-02-12 21:40   ` kernel test robot
2026-02-12 21:40   ` kernel test robot