* [RFC 0/2] mm: thp: split time allocation of page table for THPs
@ 2026-02-11 12:49 Usama Arif
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
0 siblings, 2 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Usama Arif
This is an RFC patch to allocate the PTE page table at split time only
and not do pre-deposit for THPs as suggested by David [1].
The core patch is the first one. The second one is not needed; it only adds
the vmstat counter I used to show that split doesn't fail. It is going to be
0 all the time, and I won't include it in future revisions.
It would have been ideal to remove all of the pre-deposit code, but that is
not possible due to PowerPC. The rationale and further details are covered
in the commit message of the first patch, including why the patch is safe.
[1] https://lore.kernel.org/all/ee5bd77f-87ad-4640-a974-304b488e4c64@kernel.org/
Usama Arif (2):
mm: thp: allocate PTE page tables lazily at split time
mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
include/linux/huge_mm.h | 4 +-
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 145 ++++++++++++++++++++++++----------
mm/khugepaged.c | 7 +-
mm/migrate_device.c | 15 ++--
mm/rmap.c | 42 +++++++++-
mm/vmstat.c | 1 +
7 files changed, 162 insertions(+), 53 deletions(-)
--
2.47.3
* [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
@ 2026-02-11 12:49 ` Usama Arif
2026-02-11 13:25 ` David Hildenbrand (Arm)
` (2 more replies)
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
1 sibling, 3 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Usama Arif
When the kernel creates a PMD-level THP mapping for anonymous pages,
it pre-allocates a PTE page table and deposits it via
pgtable_trans_huge_deposit(). This deposited table is withdrawn during
PMD split or zap. The rationale was that split must not fail—if the
kernel decides to split a THP, it needs a PTE table to populate.
However, every anon THP wastes 4KB (one page table page) that sits
unused in the deposit list for the lifetime of the mapping. On systems
with many THPs, this adds up to significant memory waste. The original
rationale also does not hold up: it is OK for split to fail, and if the
kernel can't find an order-0 page for the split, there are much bigger
problems. On large servers, where you can easily have 100s of GBs of THPs,
these tables use 200M per 100G (one 4K table per 2M THP). This memory could
be used for anything else, including allocating the page tables
required during split.
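The 200M-per-100G figure can be checked with a small userspace calculation. This is only a sketch: it assumes 4 KiB base pages and 2 MiB PMD-sized THPs (as on x86-64), and the helper name deposited_pgtable_bytes() is made up for illustration.

```c
#include <stdint.h>

/* Assumed sizes (x86-64 with 4 KiB base pages); other configs differ. */
#define BASE_PAGE_SIZE	(4ULL << 10)	/* one PTE page table = one base page */
#define PMD_THP_SIZE	(2ULL << 20)	/* one PMD-mapped THP */

/*
 * Hypothetical helper: bytes sitting in deposited PTE tables for
 * thp_bytes worth of PMD-mapped anonymous THPs, one table per THP.
 */
static uint64_t deposited_pgtable_bytes(uint64_t thp_bytes)
{
	uint64_t nr_thps = thp_bytes / PMD_THP_SIZE;

	return nr_thps * BASE_PAGE_SIZE;
}
```

With these sizes, 100 GiB of THPs is 51200 huge pages, so 51200 deposited 4 KiB tables, i.e. 200 MiB.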
This patch removes the pre-deposit for anonymous pages on architectures
where arch_needs_pgtable_deposit() returns false (every arch apart from
powerpc, which needs the deposit only when the radix MMU is not enabled)
and allocates the PTE table lazily, only when a split actually occurs. The
split path is modified to accept a caller-provided page table.
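The ownership rule for the caller-provided table can be illustrated with a userspace toy model. This is a sketch, not kernel code: struct toy_pmd, toy_split(), toy_split_locked() and toy_alloc_fail() are invented names, locking is only indicated by comments, and toy_split() returns a bool purely so the outcome can be observed (the real __split_huge_pmd() returns void).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for one PMD entry; not a kernel structure. */
struct toy_pmd {
	bool huge;		/* still mapped as a single huge entry */
	void *pte_table;	/* installed once the entry has been split */
};

/* Allocator that always fails, to exercise the failure path. */
static void *toy_alloc_fail(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Models split_huge_pmd_locked(): takes ownership of @table. The table
 * is installed if the entry is still huge and freed otherwise.
 */
static void toy_split_locked(struct toy_pmd *pmd, void *table)
{
	if (pmd->huge) {
		pmd->pte_table = table;
		pmd->huge = false;
	} else {
		free(table);
	}
}

/*
 * Models __split_huge_pmd(): allocate first, then take the lock.
 * Returns false when the allocation fails, in which case the huge
 * mapping is left untouched.
 */
static bool toy_split(struct toy_pmd *pmd, void *(*alloc)(size_t))
{
	void *table = alloc(4096);

	if (!table)
		return false;
	/* pmd_lock() would be taken here in the kernel */
	toy_split_locked(pmd, table);
	/* spin_unlock() would follow here */
	return true;
}
```

Calling toy_split() with toy_alloc_fail leaves the entry huge; calling it with malloc splits it, and a second call on the already-split entry simply frees the unused table.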
PowerPC exception:
It would have been great if we could completely remove the pagetable
deposit code, making this commit mostly a code cleanup patch.
Unfortunately, PowerPC has the hash MMU, which stores hash slot information
in the deposited page table, so pre-deposit is necessary there. All deposit/
withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
behavior is unchanged with this patch. On a better note,
arch_needs_pgtable_deposit() will always evaluate to false at compile time
on non-PowerPC architectures, and the pre-deposit code will not be
compiled in.
Why Split Failures Are Safe:
If a system is under such severe memory pressure that even a 4K allocation
fails for a PTE table, there are far greater problems than a THP split
being delayed. The OOM killer will likely intervene before this becomes an
issue.
When pte_alloc_one() fails because it cannot allocate a 4K page,
the PMD split is aborted and the THP remains intact. I could not get split
to fail, as it is very difficult to make an order-0 allocation fail.
Code analysis of what would happen if it does:
- mprotect(): If split fails in change_pmd_range(), it will fall back
to change_pte_range(), which will return an error, causing the
whole function to be retried.
- munmap() (partial THP range): zap_pte_range() returns early when
pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
For full THP range, zap_huge_pmd() unmaps the entire PMD without
split.
- Memory reclaim (try_to_unmap()): Returns false, folio is rotated back
onto the LRU and retried in the next reclaim cycle.
- Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
skips this folio, retried later.
- CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
- madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
try_to_migrate() returns false, split_folio() returns -EAGAIN,
and madvise returns 0 (success), silently skipping the region. This
should be fine: madvise is only advisory and can fail for other
reasons as well.
Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/huge_mm.h | 4 +-
mm/huge_memory.c | 144 ++++++++++++++++++++++++++++------------
mm/khugepaged.c | 7 +-
mm/migrate_device.c | 15 +++--
mm/rmap.c | 39 ++++++++++-
5 files changed, 156 insertions(+), 53 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfdea..b21bb72a298c9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
}
void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, bool freeze);
+ pmd_t *pmd, bool freeze, pgtable_t pgtable);
bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct folio *folio);
void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
@@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
unsigned long address, bool freeze) {}
static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- bool freeze) {}
+ bool freeze, pgtable_t pgtable) {}
static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 44ff8a648afd5..4c9a8d89fc8aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
struct vm_area_struct *vma = vmf->vma;
struct folio *folio;
- pgtable_t pgtable;
+ pgtable_t pgtable = NULL;
vm_fault_t ret = 0;
folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
if (unlikely(!folio))
return VM_FAULT_FALLBACK;
- pgtable = pte_alloc_one(vma->vm_mm);
- if (unlikely(!pgtable)) {
- ret = VM_FAULT_OOM;
- goto release;
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (unlikely(!pgtable)) {
+ ret = VM_FAULT_OOM;
+ goto release;
+ }
}
vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
if (userfaultfd_missing(vma)) {
spin_unlock(vmf->ptl);
folio_put(folio);
- pte_free(vma->vm_mm, pgtable);
+ if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
ret = handle_userfault(vmf, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
return ret;
}
- pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+ if (pgtable) {
+ pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+ pgtable);
+ mm_inc_nr_ptes(vma->vm_mm);
+ }
map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
- mm_inc_nr_ptes(vma->vm_mm);
spin_unlock(vmf->ptl);
}
@@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
pmd_t entry;
entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
entry = pmd_mkspecial(entry);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ if (pgtable) {
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ mm_inc_nr_ptes(mm);
+ }
set_pmd_at(mm, haddr, pmd, entry);
- mm_inc_nr_ptes(mm);
}
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
@@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm) &&
transparent_hugepage_use_zero_page()) {
- pgtable_t pgtable;
+ pgtable_t pgtable = NULL;
struct folio *zero_folio;
vm_fault_t ret;
- pgtable = pte_alloc_one(vma->vm_mm);
- if (unlikely(!pgtable))
- return VM_FAULT_OOM;
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (unlikely(!pgtable))
+ return VM_FAULT_OOM;
+ }
zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
if (unlikely(!zero_folio)) {
- pte_free(vma->vm_mm, pgtable);
+ if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
@@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
ret = check_stable_address_space(vma->vm_mm);
if (ret) {
spin_unlock(vmf->ptl);
- pte_free(vma->vm_mm, pgtable);
+ if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
} else if (userfaultfd_missing(vma)) {
spin_unlock(vmf->ptl);
- pte_free(vma->vm_mm, pgtable);
+ if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
ret = handle_userfault(vmf, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
} else {
@@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
}
} else {
spin_unlock(vmf->ptl);
- pte_free(vma->vm_mm, pgtable);
+ if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
}
return ret;
}
@@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
}
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
- mm_inc_nr_ptes(dst_mm);
- pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+ if (pgtable) {
+ mm_inc_nr_ptes(dst_mm);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+ }
if (!userfaultfd_wp(dst_vma))
pmd = pmd_swp_clear_uffd_wp(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (!vma_is_anonymous(dst_vma))
return 0;
- pgtable = pte_alloc_one(dst_mm);
- if (unlikely(!pgtable))
- goto out;
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(dst_mm);
+ if (unlikely(!pgtable))
+ goto out;
+ }
dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}
if (unlikely(!pmd_trans_huge(pmd))) {
- pte_free(dst_mm, pgtable);
+ if (pgtable)
+ pte_free(dst_mm, pgtable);
goto out_unlock;
}
/*
@@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
/* Page maybe pinned: split and retry the fault on PTEs. */
folio_put(src_folio);
- pte_free(dst_mm, pgtable);
+ if (pgtable)
+ pte_free(dst_mm, pgtable);
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
__split_huge_pmd(src_vma, src_pmd, addr, false);
@@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
out_zero_page:
- mm_inc_nr_ptes(dst_mm);
- pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+ if (pgtable) {
+ mm_inc_nr_ptes(dst_mm);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+ }
pmdp_set_wrprotect(src_mm, addr, src_pmd);
if (!userfaultfd_wp(dst_vma))
pmd = pmd_clear_uffd_wp(pmd);
@@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
} else if (is_huge_zero_pmd(orig_pmd)) {
- if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
+ if (arch_needs_pgtable_deposit())
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
} else {
@@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
}
if (folio_test_anon(folio)) {
- zap_deposited_table(tlb->mm, pmd);
+ if (arch_needs_pgtable_deposit())
+ zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
} else {
if (arch_needs_pgtable_deposit())
@@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
force_flush = true;
VM_BUG_ON(!pmd_none(*new_pmd));
- if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
+ if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
+ arch_needs_pgtable_deposit()) {
pgtable_t pgtable;
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
@@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
}
set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
- src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
- pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+ if (arch_needs_pgtable_deposit()) {
+ src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+ pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+ }
unlock_ptls:
double_pt_unlock(src_ptl, dst_ptl);
/* unblock rmap walks */
@@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
- unsigned long haddr, pmd_t *pmd)
+ unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
{
struct mm_struct *mm = vma->vm_mm;
- pgtable_t pgtable;
pmd_t _pmd, old_pmd;
unsigned long addr;
pte_t *pte;
@@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
*/
old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ } else {
+ VM_BUG_ON(!pgtable);
+ /*
+ * Account for the freshly allocated (in __split_huge_pmd) pgtable
+ * being used in mm.
+ */
+ mm_inc_nr_ptes(mm);
+ }
pmd_populate(mm, &_pmd, pgtable);
pte = pte_offset_map(&_pmd, haddr);
@@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
}
static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long haddr, bool freeze)
+ unsigned long haddr, bool freeze, pgtable_t pgtable)
{
struct mm_struct *mm = vma->vm_mm;
struct folio *folio;
struct page *page;
- pgtable_t pgtable;
pmd_t old_pmd, _pmd;
bool soft_dirty, uffd_wp = false, young = false, write = false;
bool anon_exclusive = false, dirty = false;
@@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*/
if (arch_needs_pgtable_deposit())
zap_deposited_table(mm, pmd);
+ if (pgtable)
+ pte_free(mm, pgtable);
if (!vma_is_dax(vma) && vma_is_special_huge(vma))
return;
if (unlikely(pmd_is_migration_entry(old_pmd))) {
@@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* small page also write protected so it does not seems useful
* to invalidate secondary mmu at this time.
*/
- return __split_huge_zero_page_pmd(vma, haddr, pmd);
+ return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
}
if (pmd_is_migration_entry(*pmd)) {
@@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* Withdraw the table only after we mark the pmd entry invalid.
* This's critical for some architectures (Power).
*/
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ } else {
+ VM_BUG_ON(!pgtable);
+ /*
+ * Account for the freshly allocated (in __split_huge_pmd) pgtable
+ * being used in mm.
+ */
+ mm_inc_nr_ptes(mm);
+ }
pmd_populate(mm, &_pmd, pgtable);
pte = pte_offset_map(&_pmd, haddr);
@@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, bool freeze)
+ pmd_t *pmd, bool freeze, pgtable_t pgtable)
{
VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
- __split_huge_pmd_locked(vma, pmd, address, freeze);
+ __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
+ else if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
}
void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
{
spinlock_t *ptl;
struct mmu_notifier_range range;
+ pgtable_t pgtable = NULL;
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
address & HPAGE_PMD_MASK,
(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
+
+ /* allocate pagetable before acquiring pmd lock */
+ if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (!pgtable) {
+ mmu_notifier_invalidate_range_end(&range);
+ return;
+ }
+ }
+
ptl = pmd_lock(vma->vm_mm, pmd);
- split_huge_pmd_locked(vma, range.start, pmd, freeze);
+ split_huge_pmd_locked(vma, range.start, pmd, freeze, pgtable);
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(&range);
}
@@ -3402,7 +3459,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
}
folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma);
- zap_deposited_table(mm, pmdp);
+ if (arch_needs_pgtable_deposit())
+ zap_deposited_table(mm, pmdp);
add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa1e57fd2c469..0e976e4c975ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1223,7 +1223,12 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ if (arch_needs_pgtable_deposit()) {
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ } else {
+ mm_dec_nr_ptes(mm);
+ pte_free(mm, pgtable);
+ }
map_anon_folio_pmd_nopf(folio, pmd, vma, address);
spin_unlock(pmd_ptl);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 0a8b31939640f..053db74303e36 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -829,9 +829,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
__folio_mark_uptodate(folio);
- pgtable = pte_alloc_one(vma->vm_mm);
- if (unlikely(!pgtable))
- goto abort;
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (unlikely(!pgtable))
+ goto abort;
+ } else {
+ pgtable = NULL;
+ }
if (folio_is_device_private(folio)) {
swp_entry_t swp_entry;
@@ -879,10 +883,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
folio_get(folio);
if (flush) {
- pte_free(vma->vm_mm, pgtable);
+ if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
pmdp_invalidate(vma, addr, pmdp);
- } else {
+ } else if (pgtable) {
pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
mm_inc_nr_ptes(vma->vm_mm);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index edf5d32f46042..c6ff23fc12944 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -76,6 +76,7 @@
#include <linux/mm_inline.h>
#include <linux/oom.h>
+#include <asm/pgalloc.h>
#include <asm/tlb.h>
#define CREATE_TRACE_POINTS
@@ -1978,6 +1979,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long pfn;
unsigned long hsz = 0;
int ptes = 0;
+ pgtable_t prealloc_pte = NULL;
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -2012,6 +2014,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
}
mmu_notifier_invalidate_range_start(&range);
+ if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+ !arch_needs_pgtable_deposit())
+ prealloc_pte = pte_alloc_one(mm);
+
while (page_vma_mapped_walk(&pvmw)) {
/*
* If the folio is in an mlock()d vma, we must not swap it out.
@@ -2061,12 +2067,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
}
if (flags & TTU_SPLIT_HUGE_PMD) {
+ pgtable_t pgtable = prealloc_pte;
+
+ prealloc_pte = NULL;
+ if (!arch_needs_pgtable_deposit() && !pgtable &&
+ vma_is_anonymous(vma)) {
+ page_vma_mapped_walk_done(&pvmw);
+ ret = false;
+ break;
+ }
/*
* We temporarily have to drop the PTL and
* restart so we can process the PTE-mapped THP.
*/
split_huge_pmd_locked(vma, pvmw.address,
- pvmw.pmd, false);
+ pvmw.pmd, false, pgtable);
flags &= ~TTU_SPLIT_HUGE_PMD;
page_vma_mapped_walk_restart(&pvmw);
continue;
@@ -2346,6 +2361,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
+ if (prealloc_pte)
+ pte_free(mm, prealloc_pte);
+
mmu_notifier_invalidate_range_end(&range);
return ret;
@@ -2405,6 +2423,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
enum ttu_flags flags = (enum ttu_flags)(long)arg;
unsigned long pfn;
unsigned long hsz = 0;
+ pgtable_t prealloc_pte = NULL;
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -2439,6 +2458,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
}
mmu_notifier_invalidate_range_start(&range);
+ if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+ !arch_needs_pgtable_deposit())
+ prealloc_pte = pte_alloc_one(mm);
+
while (page_vma_mapped_walk(&pvmw)) {
/* PMD-mapped THP migration entry */
if (!pvmw.pte) {
@@ -2446,8 +2469,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
__maybe_unused pmd_t pmdval;
if (flags & TTU_SPLIT_HUGE_PMD) {
+ pgtable_t pgtable = prealloc_pte;
+
+ prealloc_pte = NULL;
+ if (!arch_needs_pgtable_deposit() && !pgtable &&
+ vma_is_anonymous(vma)) {
+ page_vma_mapped_walk_done(&pvmw);
+ ret = false;
+ break;
+ }
split_huge_pmd_locked(vma, pvmw.address,
- pvmw.pmd, true);
+ pvmw.pmd, true, pgtable);
ret = false;
page_vma_mapped_walk_done(&pvmw);
break;
@@ -2698,6 +2730,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
folio_put(folio);
}
+ if (prealloc_pte)
+ pte_free(mm, prealloc_pte);
+
mmu_notifier_invalidate_range_end(&range);
return ret;
--
2.47.3
* [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-02-11 12:49 ` Usama Arif
2026-02-11 13:27 ` David Hildenbrand (Arm)
` (2 more replies)
1 sibling, 3 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Usama Arif
Add a vmstat counter to track PTE allocation failures during PMD split.
This enables monitoring of split failures due to memory pressure after
the lazy PTE page table allocation change.
The counter is incremented in three places:
- __split_huge_pmd(): Main entry point for splitting a PMD
- try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
- try_to_migrate_one(): When migration needs to split a PMD-mapped THP
Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
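For monitoring, the counter can be picked out of /proc/vmstat-style text ("name value" per line). The following is a minimal userspace sketch; vmstat_lookup() is a made-up helper name, and reading the file into a buffer is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in /proc/vmstat-style text, where every line is
 * "name value". Returns 0 and stores the value on success, -1 if the
 * name is absent or its value cannot be parsed.
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Require a space after the name so prefixes do not match. */
		if (!strncmp(line, name, len) && line[len] == ' ')
			return sscanf(line + len, "%llu", value) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

The check for a trailing space matters here because thp_split_pmd is a prefix of thp_split_pmd_pte_alloc_failed.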
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 1 +
mm/rmap.c | 3 +++
mm/vmstat.c | 1 +
4 files changed, 6 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 22a139f82d75f..827c9a8c251de 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_DEFERRED_SPLIT_PAGE,
THP_UNDERUSED_SPLIT_PAGE,
THP_SPLIT_PMD,
+ THP_SPLIT_PMD_PTE_ALLOC_FAILED,
THP_SCAN_EXCEED_NONE_PTE,
THP_SCAN_EXCEED_SWAP_PTE,
THP_SCAN_EXCEED_SHARED_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4c9a8d89fc8aa..8d7c9f67f8a1d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3332,6 +3332,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
pgtable = pte_alloc_one(vma->vm_mm);
if (!pgtable) {
+ count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
mmu_notifier_invalidate_range_end(&range);
return;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index c6ff23fc12944..5c4afedb29d5a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2070,8 +2070,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
pgtable_t pgtable = prealloc_pte;
prealloc_pte = NULL;
+
if (!arch_needs_pgtable_deposit() && !pgtable &&
vma_is_anonymous(vma)) {
+ count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
page_vma_mapped_walk_done(&pvmw);
ret = false;
break;
@@ -2474,6 +2476,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
prealloc_pte = NULL;
if (!arch_needs_pgtable_deposit() && !pgtable &&
vma_is_anonymous(vma)) {
+ count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
page_vma_mapped_walk_done(&pvmw);
ret = false;
break;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 99270713e0c13..473edfa624a41 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1408,6 +1408,7 @@ const char * const vmstat_text[] = {
[I(THP_DEFERRED_SPLIT_PAGE)] = "thp_deferred_split_page",
[I(THP_UNDERUSED_SPLIT_PAGE)] = "thp_underused_split_page",
[I(THP_SPLIT_PMD)] = "thp_split_pmd",
+ [I(THP_SPLIT_PMD_PTE_ALLOC_FAILED)] = "thp_split_pmd_pte_alloc_failed",
[I(THP_SCAN_EXCEED_NONE_PTE)] = "thp_scan_exceed_none_pte",
[I(THP_SCAN_EXCEED_SWAP_PTE)] = "thp_scan_exceed_swap_pte",
[I(THP_SCAN_EXCEED_SHARED_PTE)] = "thp_scan_exceed_share_pte",
--
2.47.3
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-02-11 13:25 ` David Hildenbrand (Arm)
2026-02-11 13:38 ` Usama Arif
2026-02-12 12:13 ` Ritesh Harjani
2026-02-11 13:35 ` David Hildenbrand (Arm)
2026-02-11 19:28 ` Matthew Wilcox
2 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:25 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
Michael Ellerman, linuxppc-dev
CCing ppc folks
On 2/11/26 13:49, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages,
> it pre-allocates a PTE page table and deposits it via
> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> PMD split or zap. The rationale was that split must not fail—if the
> kernel decides to split a THP, it needs a PTE table to populate.
>
> However, every anon THP wastes 4KB (one page table page) that sits
> unused in the deposit list for the lifetime of the mapping. On systems
> with many THPs, this adds up to significant memory waste. The original
> rationale is also not an issue. It is ok for split to fail, and if the
> kernel can't find an order 0 allocation for split, there are much bigger
> problems. On large servers where you can easily have 100s of GBs of THPs,
> the memory usage for these tables is 200M per 100G. This memory could be
> used for any other usecase, which include allocating the pagetables
> required during split.
>
> This patch removes the pre-deposit for anonymous pages on architectures
> where arch_needs_pgtable_deposit() returns false (every arch apart from
> powerpc, and only when radix hash tables are not enabled) and allocates
> the PTE table lazily—only when a split actually occurs. The split path
> is modified to accept a caller-provided page table.
>
> PowerPC exception:
>
> It would have been great if we can completely remove the pagetable
> deposit code and this commit would mostly have been a code cleanup patch,
> unfortunately PowerPC has hash MMU, it stores hash slot information in
> the deposited page table and pre-deposit is necessary. All deposit/
> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> behavior is unchanged with this patch. On a better note,
> arch_needs_pgtable_deposit will always evaluate to false at compile time
> on non PowerPC architectures and the pre-deposit code will not be
> compiled in.
Is there a way to remove this? It's always been a confusing hack, now
it's unpleasant to have around :)
In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1
copied generic pgtable_trans_huge_deposit() hurts my belly.
IIUC, hash is mostly used on legacy power systems, radix on newer ones.
So one obvious solution: remove PMD THP support for hash MMUs along with
all this hacky deposit code.
The "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar
checks need to be wrapped in a reasonable helper and likely this all
needs to get cleaned up further.
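As an illustration only, such a helper might look like the sketch below. The name split_needs_pgtable_alloc() is hypothetical and does not come from the patch; struct vm_area_struct, vma_is_anonymous() and arch_needs_pgtable_deposit() are replaced by userspace stand-ins so the snippet compiles on its own.

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel definitions (assumptions). */
struct vm_area_struct {
	bool anonymous;
};

static bool arch_deposit_needed;	/* models the per-arch answer */

static bool vma_is_anonymous(const struct vm_area_struct *vma)
{
	return vma->anonymous;
}

static bool arch_needs_pgtable_deposit(void)
{
	return arch_deposit_needed;
}

/*
 * Hypothetical helper: true when a PMD split of this VMA has to bring
 * its own freshly allocated PTE table because nothing was deposited.
 */
static bool split_needs_pgtable_alloc(const struct vm_area_struct *vma)
{
	return vma_is_anonymous(vma) && !arch_needs_pgtable_deposit();
}
```

The callers in __split_huge_pmd(), try_to_unmap_one() and try_to_migrate_one() would then share one predicate instead of repeating the open-coded condition.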
The implementation of the generic pgtable_trans_huge_deposit() and the
radix handlers etc must be removed. If any code would trigger them it
would be a bug.
If we have to keep this around, pgtable_trans_huge_deposit() should
likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there
will not be generic support for it.
--
Cheers,
David
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
@ 2026-02-11 13:27 ` David Hildenbrand (Arm)
2026-02-11 13:31 ` Usama Arif
2026-02-12 21:40 ` kernel test robot
2026-02-12 21:40 ` kernel test robot
2 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:27 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team
On 2/11/26 13:49, Usama Arif wrote:
> Add a vmstat counter to track PTE allocation failures during PMD split.
> This enables monitoring of split failures due to memory pressure after
> the lazy PTE page table allocation change.
>
> The counter is incremented in three places:
> - __split_huge_pmd(): Main entry point for splitting a PMD
> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>
> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
> include/linux/vm_event_item.h | 1 +
> mm/huge_memory.c | 1 +
> mm/rmap.c | 3 +++
> mm/vmstat.c | 1 +
> 4 files changed, 6 insertions(+)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 22a139f82d75f..827c9a8c251de 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> THP_DEFERRED_SPLIT_PAGE,
> THP_UNDERUSED_SPLIT_PAGE,
> THP_SPLIT_PMD,
> + THP_SPLIT_PMD_PTE_ALLOC_FAILED,
Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any
(future) failures (if any) as well.
It's a shame that we called a remapping a "split" and keep causing
confusion.
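To illustrate the naming suggestion, here is a minimal userspace sketch of one generic failure counter that covers any reason a PMD split might fail, not only a PTE table allocation failure. Everything prefixed with model_ is hypothetical, and the "thp_split_pmd_failed" string is an assumed name derived from the proposed THP_SPLIT_PMD_FAILED. The sketch also counts only successful attempts under the first event, which is a simplification and may not match how the kernel accounts THP_SPLIT_PMD.

```c
#include <assert.h>

/* Hypothetical model of event items; not the kernel's enum vm_event_item. */
enum model_event {
	MODEL_THP_SPLIT_PMD,
	MODEL_THP_SPLIT_PMD_FAILED,
	MODEL_NR_EVENTS
};

/* Assumed display names, in the style of /proc/vmstat entries. */
static const char *const model_event_names[MODEL_NR_EVENTS] = {
	[MODEL_THP_SPLIT_PMD] = "thp_split_pmd",
	[MODEL_THP_SPLIT_PMD_FAILED] = "thp_split_pmd_failed",
};

static unsigned long model_events[MODEL_NR_EVENTS];

static void model_count_event(enum model_event ev)
{
	assert(ev < MODEL_NR_EVENTS);
	model_events[ev]++;
}

/*
 * Record the outcome of one split attempt. err is 0 on success or a
 * negative errno; every failure reason lands in the same counter.
 */
static void model_account_split(int err)
{
	model_count_event(err ? MODEL_THP_SPLIT_PMD_FAILED
			      : MODEL_THP_SPLIT_PMD);
}
```

Keeping the failure reason out of the counter name means a later failure mode can reuse the same event without a new vmstat entry.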
--
Cheers,
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 13:27 ` David Hildenbrand (Arm)
@ 2026-02-11 13:31 ` Usama Arif
2026-02-11 13:36 ` David Hildenbrand (Arm)
2026-02-11 13:38 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:31 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team
On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
>> Add a vmstat counter to track PTE allocation failures during PMD split.
>> This enables monitoring of split failures due to memory pressure after
>> the lazy PTE page table allocation change.
>>
>> The counter is incremented in three places:
>> - __split_huge_pmd(): Main entry point for splitting a PMD
>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>
>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>> include/linux/vm_event_item.h | 1 +
>> mm/huge_memory.c | 1 +
>> mm/rmap.c | 3 +++
>> mm/vmstat.c | 1 +
>> 4 files changed, 6 insertions(+)
>>
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 22a139f82d75f..827c9a8c251de 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> THP_DEFERRED_SPLIT_PAGE,
>> THP_UNDERUSED_SPLIT_PAGE,
>> THP_SPLIT_PMD,
>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>
> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>
Makes sense. This was just a patch I was using for testing and I wanted to share.
It was always 0 as I couldn't get split to fail :) But I can rename it to THP_SPLIT_PMD_FAILED
as suggested and we can use it for future split failures (hopefully none).
> It's a shame that we called a remapping a "split" and keep causing confusion.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-11 13:25 ` David Hildenbrand (Arm)
@ 2026-02-11 13:35 ` David Hildenbrand (Arm)
2026-02-11 13:46 ` Kiryl Shutsemau
2026-02-11 13:47 ` Usama Arif
2026-02-11 19:28 ` Matthew Wilcox
2 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:35 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team
On 2/11/26 13:49, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages,
> it pre-allocates a PTE page table and deposits it via
> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> PMD split or zap. The rationale was that split must not fail—if the
> kernel decides to split a THP, it needs a PTE table to populate.
>
> However, every anon THP wastes 4KB (one page table page) that sits
> unused in the deposit list for the lifetime of the mapping. On systems
> with many THPs, this adds up to significant memory waste. The original
> rationale is also not an issue. It is ok for split to fail, and if the
> kernel can't find an order 0 allocation for split, there are much bigger
> problems. On large servers where you can easily have 100s of GBs of THPs,
> the memory usage for these tables is 200M per 100G. This memory could be
> used for any other usecase, which include allocating the pagetables
> required during split.
>
> This patch removes the pre-deposit for anonymous pages on architectures
> where arch_needs_pgtable_deposit() returns false (every arch apart from
> powerpc, and only when radix hash tables are not enabled) and allocates
> the PTE table lazily—only when a split actually occurs. The split path
> is modified to accept a caller-provided page table.
>
> PowerPC exception:
>
> It would have been great if we can completely remove the pagetable
> deposit code and this commit would mostly have been a code cleanup patch,
> unfortunately PowerPC has hash MMU, it stores hash slot information in
> the deposited page table and pre-deposit is necessary. All deposit/
> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> behavior is unchanged with this patch. On a better note,
> arch_needs_pgtable_deposit will always evaluate to false at compile time
> on non PowerPC architectures and the pre-deposit code will not be
> compiled in.
>
> Why Split Failures Are Safe:
>
> If a system is under severe memory pressure that even a 4K allocation
> fails for a PTE table, there are far greater problems than a THP split
> being delayed. The OOM killer will likely intervene before this becomes an
> issue.
> When pte_alloc_one() fails due to not being able to allocate a 4K page,
> the PMD split is aborted and the THP remains intact. I could not get split
> to fail, as its very difficult to make order-0 allocation to fail.
> Code analysis of what would happen if it does:
>
> - mprotect(): If split fails in change_pmd_range, it will fallback
> to change_pte_range, which will return an error which will cause the
> whole function to be retried again.
>
> - munmap() (partial THP range): zap_pte_range() returns early when
> pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
> For full THP range, zap_huge_pmd() unmaps the entire PMD without
> split.
>
> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> LRU, retried in next reclaim cycle.
>
> - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
> skips this folio, retried later.
>
> - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
>
> - madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
> try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
> try_to_migrate() returns false, split_folio() returns -EAGAIN,
> and madvise returns 0 (success) silently skipping the region. This
> should be fine. madvise is just an advice and can fail for other
> reasons as well.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
> include/linux/huge_mm.h | 4 +-
> mm/huge_memory.c | 144 ++++++++++++++++++++++++++++------------
> mm/khugepaged.c | 7 +-
> mm/migrate_device.c | 15 +++--
> mm/rmap.c | 39 ++++++++++-
> 5 files changed, 156 insertions(+), 53 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a4d9f964dfdea..b21bb72a298c9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
> }
>
> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> - pmd_t *pmd, bool freeze);
> + pmd_t *pmd, bool freeze, pgtable_t pgtable);
> bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmdp, struct folio *folio);
> void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
> @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> unsigned long address, bool freeze) {}
> static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
> unsigned long address, pmd_t *pmd,
> - bool freeze) {}
> + bool freeze, pgtable_t pgtable) {}
>
> static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
> unsigned long addr, pmd_t *pmdp,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 44ff8a648afd5..4c9a8d89fc8aa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> struct vm_area_struct *vma = vmf->vma;
> struct folio *folio;
> - pgtable_t pgtable;
> + pgtable_t pgtable = NULL;
> vm_fault_t ret = 0;
>
> folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
> if (unlikely(!folio))
> return VM_FAULT_FALLBACK;
>
> - pgtable = pte_alloc_one(vma->vm_mm);
> - if (unlikely(!pgtable)) {
> - ret = VM_FAULT_OOM;
> - goto release;
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (unlikely(!pgtable)) {
> + ret = VM_FAULT_OOM;
> + goto release;
> + }
> }
>
> vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> if (userfaultfd_missing(vma)) {
> spin_unlock(vmf->ptl);
> folio_put(folio);
> - pte_free(vma->vm_mm, pgtable);
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> ret = handle_userfault(vmf, VM_UFFD_MISSING);
> VM_BUG_ON(ret & VM_FAULT_FALLBACK);
> return ret;
> }
> - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> + if (pgtable) {
> + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
> + pgtable);
> + mm_inc_nr_ptes(vma->vm_mm);
> + }
> map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
> - mm_inc_nr_ptes(vma->vm_mm);
> spin_unlock(vmf->ptl);
> }
>
> @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
> pmd_t entry;
> entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> entry = pmd_mkspecial(entry);
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + if (pgtable) {
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + mm_inc_nr_ptes(mm);
> + }
> set_pmd_at(mm, haddr, pmd, entry);
> - mm_inc_nr_ptes(mm);
> }
>
> vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> !mm_forbids_zeropage(vma->vm_mm) &&
> transparent_hugepage_use_zero_page()) {
> - pgtable_t pgtable;
> + pgtable_t pgtable = NULL;
> struct folio *zero_folio;
> vm_fault_t ret;
>
> - pgtable = pte_alloc_one(vma->vm_mm);
> - if (unlikely(!pgtable))
> - return VM_FAULT_OOM;
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (unlikely(!pgtable))
> + return VM_FAULT_OOM;
> + }
> zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
> if (unlikely(!zero_folio)) {
> - pte_free(vma->vm_mm, pgtable);
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> count_vm_event(THP_FAULT_FALLBACK);
> return VM_FAULT_FALLBACK;
> }
> @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> ret = check_stable_address_space(vma->vm_mm);
> if (ret) {
> spin_unlock(vmf->ptl);
> - pte_free(vma->vm_mm, pgtable);
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> } else if (userfaultfd_missing(vma)) {
> spin_unlock(vmf->ptl);
> - pte_free(vma->vm_mm, pgtable);
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> ret = handle_userfault(vmf, VM_UFFD_MISSING);
> VM_BUG_ON(ret & VM_FAULT_FALLBACK);
> } else {
> @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> }
> } else {
> spin_unlock(vmf->ptl);
> - pte_free(vma->vm_mm, pgtable);
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> }
> return ret;
> }
> @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
> }
>
> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> - mm_inc_nr_ptes(dst_mm);
> - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> + if (pgtable) {
> + mm_inc_nr_ptes(dst_mm);
> + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> + }
> if (!userfaultfd_wp(dst_vma))
> pmd = pmd_swp_clear_uffd_wp(pmd);
> set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (!vma_is_anonymous(dst_vma))
> return 0;
>
> - pgtable = pte_alloc_one(dst_mm);
> - if (unlikely(!pgtable))
> - goto out;
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(dst_mm);
> + if (unlikely(!pgtable))
> + goto out;
> + }
>
> dst_ptl = pmd_lock(dst_mm, dst_pmd);
> src_ptl = pmd_lockptr(src_mm, src_pmd);
> @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> }
>
> if (unlikely(!pmd_trans_huge(pmd))) {
> - pte_free(dst_mm, pgtable);
> + if (pgtable)
> + pte_free(dst_mm, pgtable);
> goto out_unlock;
> }
> /*
> @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
> /* Page maybe pinned: split and retry the fault on PTEs. */
> folio_put(src_folio);
> - pte_free(dst_mm, pgtable);
> + if (pgtable)
> + pte_free(dst_mm, pgtable);
> spin_unlock(src_ptl);
> spin_unlock(dst_ptl);
> __split_huge_pmd(src_vma, src_pmd, addr, false);
> @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> }
> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> out_zero_page:
> - mm_inc_nr_ptes(dst_mm);
> - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> + if (pgtable) {
> + mm_inc_nr_ptes(dst_mm);
> + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> + }
> pmdp_set_wrprotect(src_mm, addr, src_pmd);
> if (!userfaultfd_wp(dst_vma))
> pmd = pmd_clear_uffd_wp(pmd);
> @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> zap_deposited_table(tlb->mm, pmd);
> spin_unlock(ptl);
> } else if (is_huge_zero_pmd(orig_pmd)) {
> - if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
> + if (arch_needs_pgtable_deposit())
> zap_deposited_table(tlb->mm, pmd);
> spin_unlock(ptl);
> } else {
> @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> }
>
> if (folio_test_anon(folio)) {
> - zap_deposited_table(tlb->mm, pmd);
> + if (arch_needs_pgtable_deposit())
> + zap_deposited_table(tlb->mm, pmd);
> add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> } else {
> if (arch_needs_pgtable_deposit())
> @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> force_flush = true;
> VM_BUG_ON(!pmd_none(*new_pmd));
>
> - if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
> + if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
> + arch_needs_pgtable_deposit()) {
> pgtable_t pgtable;
> pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> }
> set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
>
> - src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> - pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> + if (arch_needs_pgtable_deposit()) {
> + src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> + }
> unlock_ptls:
> double_pt_unlock(src_ptl, dst_ptl);
> /* unblock rmap walks */
> @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>
> static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> - unsigned long haddr, pmd_t *pmd)
> + unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
> {
> struct mm_struct *mm = vma->vm_mm;
> - pgtable_t pgtable;
> pmd_t _pmd, old_pmd;
> unsigned long addr;
> pte_t *pte;
> @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> */
> old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
>
> - pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> + } else {
> + VM_BUG_ON(!pgtable);
> + /*
> + * Account for the freshly allocated (in __split_huge_pmd) pgtable
> + * being used in mm.
> + */
> + mm_inc_nr_ptes(mm);
> + }
> pmd_populate(mm, &_pmd, pgtable);
>
> pte = pte_offset_map(&_pmd, haddr);
> @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> }
>
> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> - unsigned long haddr, bool freeze)
> + unsigned long haddr, bool freeze, pgtable_t pgtable)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct folio *folio;
> struct page *page;
> - pgtable_t pgtable;
> pmd_t old_pmd, _pmd;
> bool soft_dirty, uffd_wp = false, young = false, write = false;
> bool anon_exclusive = false, dirty = false;
> @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> */
> if (arch_needs_pgtable_deposit())
> zap_deposited_table(mm, pmd);
> + if (pgtable)
> + pte_free(mm, pgtable);
> if (!vma_is_dax(vma) && vma_is_special_huge(vma))
> return;
> if (unlikely(pmd_is_migration_entry(old_pmd))) {
> @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * small page also write protected so it does not seems useful
> * to invalidate secondary mmu at this time.
> */
> - return __split_huge_zero_page_pmd(vma, haddr, pmd);
> + return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
> }
>
> if (pmd_is_migration_entry(*pmd)) {
> @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * Withdraw the table only after we mark the pmd entry invalid.
> * This's critical for some architectures (Power).
> */
> - pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> + } else {
> + VM_BUG_ON(!pgtable);
> + /*
> + * Account for the freshly allocated (in __split_huge_pmd) pgtable
> + * being used in mm.
> + */
> + mm_inc_nr_ptes(mm);
> + }
> pmd_populate(mm, &_pmd, pgtable);
>
> pte = pte_offset_map(&_pmd, haddr);
> @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> }
>
> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> - pmd_t *pmd, bool freeze)
> + pmd_t *pmd, bool freeze, pgtable_t pgtable)
> {
> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
> - __split_huge_pmd_locked(vma, pmd, address, freeze);
> + __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
> + else if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> }
>
> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> {
> spinlock_t *ptl;
> struct mmu_notifier_range range;
> + pgtable_t pgtable = NULL;
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> address & HPAGE_PMD_MASK,
> (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
> mmu_notifier_invalidate_range_start(&range);
> +
> + /* allocate pagetable before acquiring pmd lock */
> + if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (!pgtable) {
> + mmu_notifier_invalidate_range_end(&range);
When I last looked at this, I thought the clean thing to do was to let
__split_huge_pmd() and friends return an error.
Let's take a look at walk_pmd_range() as one example:
	if (walk->vma)
		split_huge_pmd(walk->vma, pmd, addr);
	else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
		continue;
	err = walk_pte_range(pmd, addr, next, walk);
Where walk_pte_range() just does a pte_offset_map_lock():
	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
But if that fails (as the remapping failed), we will silently skip this
range.
I don't think silently skipping is the right thing to do.
So I would think that all splitting functions have to be taught to
return an error and handle it accordingly. Then we can actually start
returning errors.
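A rough userspace sketch of that direction, assuming a split helper that returns 0 or a negative errno and a walker that propagates the error instead of moving on. All model_ names and the struct layout are hypothetical stand-ins, not kernel code. The one-shot failure flag exists only so the error path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PMD entry; not the kernel's pmd_t. */
struct model_pmd {
	bool huge;		/* true while one huge entry maps the range */
	void *pte_table;	/* table installed by a successful split */
};

/* Test hook: when set, the next table allocation fails once. */
static bool model_fail_next_alloc;

static void *model_pte_alloc(void)
{
	if (model_fail_next_alloc) {
		model_fail_next_alloc = false;
		return NULL;
	}
	return calloc(512, sizeof(unsigned long));
}

/* Split that reports failure to the caller instead of returning void. */
static int model_split_huge_pmd(struct model_pmd *pmd)
{
	void *table;

	if (!pmd->huge)
		return 0;	/* nothing to split */

	table = model_pte_alloc();
	if (!table)
		return -ENOMEM;	/* the huge mapping stays intact */

	pmd->pte_table = table;
	pmd->huge = false;
	return 0;
}

/* Walker that stops and returns the error rather than skipping a range. */
static int model_walk_pmd_range(struct model_pmd *pmds, size_t nr,
				size_t *visited)
{
	size_t i;
	int err;

	assert(pmds && visited);
	*visited = 0;
	for (i = 0; i < nr; i++) {
		err = model_split_huge_pmd(&pmds[i]);
		if (err)
			return err;
		(*visited)++;
	}
	return 0;
}
```

With this shape the caller decides whether to retry, fall back, or fail the operation, so a failed split is never mistaken for an empty range.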
--
Cheers,
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 13:31 ` Usama Arif
@ 2026-02-11 13:36 ` David Hildenbrand (Arm)
2026-02-11 13:42 ` Usama Arif
2026-02-11 13:38 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:36 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team
On 2/11/26 14:31, Usama Arif wrote:
>
>
> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>> On 2/11/26 13:49, Usama Arif wrote:
>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>> This enables monitoring of split failures due to memory pressure after
>>> the lazy PTE page table allocation change.
>>>
>>> The counter is incremented in three places:
>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>
>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>
>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>> ---
>>> include/linux/vm_event_item.h | 1 +
>>> mm/huge_memory.c | 1 +
>>> mm/rmap.c | 3 +++
>>> mm/vmstat.c | 1 +
>>> 4 files changed, 6 insertions(+)
>>>
>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>> index 22a139f82d75f..827c9a8c251de 100644
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>> THP_DEFERRED_SPLIT_PAGE,
>>> THP_UNDERUSED_SPLIT_PAGE,
>>> THP_SPLIT_PMD,
>>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>
>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>
>
> Makes sense. This was just a patch I was using for testing and I wanted to share.
> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED
> as suggested and we can use for future split failures (hopefully none).
I guess it might be reasonable to have because I am sure it will fail at
some point and maybe provoke weird issues we didn't think of. In that
case, having an indication that splitting failed at some point might be
reasonable.
--
Cheers,
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 13:25 ` David Hildenbrand (Arm)
@ 2026-02-11 13:38 ` Usama Arif
2026-02-12 12:13 ` Ritesh Harjani
1 sibling, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:38 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
Michael Ellerman, linuxppc-dev
On 11/02/2026 13:25, David Hildenbrand (Arm) wrote:
> CCing ppc folks
>
> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other usecase, which include allocating the pagetables
>> required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, and only when radix hash tables are not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we can completely remove the pagetable
>> deposit code and this commit would mostly have been a code cleanup patch,
>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>> the deposited page table and pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
>
> Is there a way to remove this? It's always been a confusing hack, now it's unpleasant to have around :)
I spent some time researching this (I haven't worked with PowerPC before)
as I really wanted to get rid of all the pre-deposit code. I can't really see a
way without removing PMD THP support. I was going to CC the PowerPC maintainers
but I see that you already did!
>
> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 copied generic pgtable_trans_huge_deposit() hurts my belly.
>
>
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>
Yes that is what I found as well.
> So one obvious solution: remove PMD THP support for hash MMUs along with all this hacky deposit code.
>
I would be happy with that!
>
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar checks need to be wrapped in a reasonable helper and likely this all needs to get cleaned up further.
Ack. The code will definitely look a lot cleaner and won't have much of this if we decide to remove
PMD THP support for hash MMU.
>
> The implementation if the generic pgtable_trans_huge_deposit and the radix handlers etc must be removed. If any code would trigger them it would be a bug.
>
> If we have to keep this around, pgtable_trans_huge_deposit() should likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there will not be generic support for it.
>
Ack.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 13:31 ` Usama Arif
2026-02-11 13:36 ` David Hildenbrand (Arm)
@ 2026-02-11 13:38 ` David Hildenbrand (Arm)
2026-02-11 13:43 ` Usama Arif
1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:38 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team
On 2/11/26 14:31, Usama Arif wrote:
>
>
> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>> On 2/11/26 13:49, Usama Arif wrote:
>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>> This enables monitoring of split failures due to memory pressure after
>>> the lazy PTE page table allocation change.
>>>
>>> The counter is incremented in three places:
>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>
>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>
>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>> ---
>>> include/linux/vm_event_item.h | 1 +
>>> mm/huge_memory.c | 1 +
>>> mm/rmap.c | 3 +++
>>> mm/vmstat.c | 1 +
>>> 4 files changed, 6 insertions(+)
>>>
>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>> index 22a139f82d75f..827c9a8c251de 100644
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>> THP_DEFERRED_SPLIT_PAGE,
>>> THP_UNDERUSED_SPLIT_PAGE,
>>> THP_SPLIT_PMD,
>>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>
>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>
>
> Makes sense. This was just a patch I was using for testing and I wanted to share.
> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED
> as suggested and we can use for future split failures (hopefully none).
Btw, you can use the allocation fault injection framework to find weird
issues, if you haven't heard of that yet.
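As a simplified userspace model of the idea behind allocation fault injection, the sketch below makes an allocator fail on a fixed schedule so the failure path can be exercised on demand. The interval and times fields are named after attributes that the kernel framework is believed to expose, but the struct, the functions and the exact semantics here are hypothetical and much simpler than the real thing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified injector; not the kernel's struct fault_attr. */
struct model_fault_attr {
	unsigned long interval;	/* fail every interval-th eligible call */
	long times;		/* failures still allowed; -1 means unlimited */
	unsigned long calls;	/* eligible calls seen so far */
};

/* Decide whether the current call should be made to fail. */
static bool model_should_fail(struct model_fault_attr *attr)
{
	assert(attr);
	if (attr->interval == 0 || attr->times == 0)
		return false;	/* injection disabled or budget used up */

	attr->calls++;
	if (attr->calls % attr->interval)
		return false;

	if (attr->times > 0)
		attr->times--;
	return true;
}

/* Allocation wrapper: returns NULL whenever the injector fires. */
static void *model_alloc_with_injection(struct model_fault_attr *attr)
{
	if (model_should_fail(attr))
		return NULL;
	return calloc(512, sizeof(unsigned long));
}
```

Driving the split path through a wrapper like this gives a deterministic way to hit the failure branch that was hard to reach under real memory pressure.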
--
Cheers,
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 13:36 ` David Hildenbrand (Arm)
@ 2026-02-11 13:42 ` Usama Arif
0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:42 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team
On 11/02/2026 13:36, David Hildenbrand (Arm) wrote:
> On 2/11/26 14:31, Usama Arif wrote:
>>
>>
>> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>>> On 2/11/26 13:49, Usama Arif wrote:
>>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>>> This enables monitoring of split failures due to memory pressure after
>>>> the lazy PTE page table allocation change.
>>>>
>>>> The counter is incremented in three places:
>>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>>
>>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>> include/linux/vm_event_item.h | 1 +
>>>> mm/huge_memory.c | 1 +
>>>> mm/rmap.c | 3 +++
>>>> mm/vmstat.c | 1 +
>>>> 4 files changed, 6 insertions(+)
>>>>
>>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>>> index 22a139f82d75f..827c9a8c251de 100644
>>>> --- a/include/linux/vm_event_item.h
>>>> +++ b/include/linux/vm_event_item.h
>>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>> THP_DEFERRED_SPLIT_PAGE,
>>>> THP_UNDERUSED_SPLIT_PAGE,
>>>> THP_SPLIT_PMD,
>>>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>>
>>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>>
>>
>> Makes sense. This was just a patch I was using for testing and I wanted to share.
>> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED
>> as suggested and we can use for future split failures (hopefully none).
>
> I guess it might be reasonable to have because I am sure it will fail at some point and maybe provoke weird issues we didn't think of. In that case, having an indication that splitting failed at some point might be reasonable.
>
ack
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 13:38 ` David Hildenbrand (Arm)
@ 2026-02-11 13:43 ` Usama Arif
0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:43 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team
On 11/02/2026 13:38, David Hildenbrand (Arm) wrote:
> On 2/11/26 14:31, Usama Arif wrote:
>>
>>
>> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>>> On 2/11/26 13:49, Usama Arif wrote:
>>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>>> This enables monitoring of split failures due to memory pressure after
>>>> the lazy PTE page table allocation change.
>>>>
>>>> The counter is incremented in three places:
>>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>>
>>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>> include/linux/vm_event_item.h | 1 +
>>>> mm/huge_memory.c | 1 +
>>>> mm/rmap.c | 3 +++
>>>> mm/vmstat.c | 1 +
>>>> 4 files changed, 6 insertions(+)
>>>>
>>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>>> index 22a139f82d75f..827c9a8c251de 100644
>>>> --- a/include/linux/vm_event_item.h
>>>> +++ b/include/linux/vm_event_item.h
>>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>> THP_DEFERRED_SPLIT_PAGE,
>>>> THP_UNDERUSED_SPLIT_PAGE,
>>>> THP_SPLIT_PMD,
>>>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>>
>>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>>
>>
>> Makes sense. This was just a patch I was using for testing and I wanted to share.
>> It was always 0 as I couldn't get split to fail :) But I can rename it to THP_SPLIT_PMD_FAILED
>> as suggested and we can use it for future split failures (hopefully none).
>
> Btw, you can use the allocation fault injection framework to find weird issues, if you haven't heard of that yet.
>
This looks very interesting, thanks! Let me have a look.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 13:35 ` David Hildenbrand (Arm)
@ 2026-02-11 13:46 ` Kiryl Shutsemau
2026-02-11 13:47 ` Usama Arif
1 sibling, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2026-02-11 13:46 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm, fvdl,
hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang, npache,
Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
kernel-team
On Wed, Feb 11, 2026 at 02:35:07PM +0100, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
> > When the kernel creates a PMD-level THP mapping for anonymous pages,
> > it pre-allocates a PTE page table and deposits it via
> > pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> > PMD split or zap. The rationale was that split must not fail—if the
> > kernel decides to split a THP, it needs a PTE table to populate.
> >
> > However, every anon THP wastes 4KB (one page table page) that sits
> > unused in the deposit list for the lifetime of the mapping. On systems
> > with many THPs, this adds up to significant memory waste. The original
> > rationale is also not a real concern: it is ok for split to fail, and if
> > the kernel can't find an order-0 allocation for split, there are much
> > bigger problems. On large servers where you can easily have 100s of GBs
> > of THPs, the memory usage for these tables is 200M per 100G. This memory
> > could be used for any other use case, including allocating the page
> > tables required during split.
> >
> > This patch removes the pre-deposit for anonymous pages on architectures
> > where arch_needs_pgtable_deposit() returns false (every arch apart from
> > powerpc, and powerpc itself when the radix MMU is enabled) and allocates
> > the PTE table lazily, only when a split actually occurs. The split path
> > is modified to accept a caller-provided page table.
> >
> > PowerPC exception:
> >
> > It would have been great if we could completely remove the pagetable
> > deposit code, making this commit mostly a code cleanup patch.
> > Unfortunately, PowerPC has the hash MMU, which stores hash slot
> > information in the deposited page table, so pre-deposit is necessary
> > there. All deposit/withdraw paths are guarded by
> > arch_needs_pgtable_deposit(), so PowerPC behavior is unchanged with this
> > patch. On a better note, arch_needs_pgtable_deposit() will always
> > evaluate to false at compile time on non-PowerPC architectures and the
> > pre-deposit code will not be compiled in.
> >
> > Why Split Failures Are Safe:
> >
> > If a system is under such severe memory pressure that even a 4K
> > allocation for a PTE table fails, there are far greater problems than a
> > THP split being delayed. The OOM killer will likely intervene before this
> > becomes an issue.
> > When pte_alloc_one() fails due to not being able to allocate a 4K page,
> > the PMD split is aborted and the THP remains intact. I could not get split
> > to fail, as it's very difficult to make an order-0 allocation fail.
> > Code analysis of what would happen if it does:
> >
> > - mprotect(): If split fails in change_pmd_range(), it will fall back
> > to change_pte_range(), which will return an error that causes the
> > whole function to be retried.
> >
> > - munmap() (partial THP range): zap_pte_range() returns early when
> > pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
> > For full THP range, zap_huge_pmd() unmaps the entire PMD without
> > split.
> >
> > - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> > to the LRU, retried in next reclaim cycle.
> >
> > - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
> > skips this folio, retried later.
> >
> > - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
> >
> > - madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
> > try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
> > try_to_migrate() returns false, split_folio() returns -EAGAIN,
> > and madvise returns 0 (success), silently skipping the region. This
> > should be fine: madvise is only advisory and can fail for other
> > reasons as well.
> >
> > Suggested-by: David Hildenbrand <david@kernel.org>
> > Signed-off-by: Usama Arif <usama.arif@linux.dev>
> > ---
> > include/linux/huge_mm.h | 4 +-
> > mm/huge_memory.c | 144 ++++++++++++++++++++++++++++------------
> > mm/khugepaged.c | 7 +-
> > mm/migrate_device.c | 15 +++--
> > mm/rmap.c | 39 ++++++++++-
> > 5 files changed, 156 insertions(+), 53 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index a4d9f964dfdea..b21bb72a298c9 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
> > }
> > void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > - pmd_t *pmd, bool freeze);
> > + pmd_t *pmd, bool freeze, pgtable_t pgtable);
> > bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> > pmd_t *pmdp, struct folio *folio);
> > void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
> > @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> > unsigned long address, bool freeze) {}
> > static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
> > unsigned long address, pmd_t *pmd,
> > - bool freeze) {}
> > + bool freeze, pgtable_t pgtable) {}
> > static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
> > unsigned long addr, pmd_t *pmdp,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 44ff8a648afd5..4c9a8d89fc8aa 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > struct vm_area_struct *vma = vmf->vma;
> > struct folio *folio;
> > - pgtable_t pgtable;
> > + pgtable_t pgtable = NULL;
> > vm_fault_t ret = 0;
> > folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
> > if (unlikely(!folio))
> > return VM_FAULT_FALLBACK;
> > - pgtable = pte_alloc_one(vma->vm_mm);
> > - if (unlikely(!pgtable)) {
> > - ret = VM_FAULT_OOM;
> > - goto release;
> > + if (arch_needs_pgtable_deposit()) {
> > + pgtable = pte_alloc_one(vma->vm_mm);
> > + if (unlikely(!pgtable)) {
> > + ret = VM_FAULT_OOM;
> > + goto release;
> > + }
> > }
> > vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > if (userfaultfd_missing(vma)) {
> > spin_unlock(vmf->ptl);
> > folio_put(folio);
> > - pte_free(vma->vm_mm, pgtable);
> > + if (pgtable)
> > + pte_free(vma->vm_mm, pgtable);
> > ret = handle_userfault(vmf, VM_UFFD_MISSING);
> > VM_BUG_ON(ret & VM_FAULT_FALLBACK);
> > return ret;
> > }
> > - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> > + if (pgtable) {
> > + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
> > + pgtable);
> > + mm_inc_nr_ptes(vma->vm_mm);
> > + }
> > map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
> > - mm_inc_nr_ptes(vma->vm_mm);
> > spin_unlock(vmf->ptl);
> > }
> > @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
> > pmd_t entry;
> > entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> > entry = pmd_mkspecial(entry);
> > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > + if (pgtable) {
> > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > + mm_inc_nr_ptes(mm);
> > + }
> > set_pmd_at(mm, haddr, pmd, entry);
> > - mm_inc_nr_ptes(mm);
> > }
> > vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> > !mm_forbids_zeropage(vma->vm_mm) &&
> > transparent_hugepage_use_zero_page()) {
> > - pgtable_t pgtable;
> > + pgtable_t pgtable = NULL;
> > struct folio *zero_folio;
> > vm_fault_t ret;
> > - pgtable = pte_alloc_one(vma->vm_mm);
> > - if (unlikely(!pgtable))
> > - return VM_FAULT_OOM;
> > + if (arch_needs_pgtable_deposit()) {
> > + pgtable = pte_alloc_one(vma->vm_mm);
> > + if (unlikely(!pgtable))
> > + return VM_FAULT_OOM;
> > + }
> > zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
> > if (unlikely(!zero_folio)) {
> > - pte_free(vma->vm_mm, pgtable);
> > + if (pgtable)
> > + pte_free(vma->vm_mm, pgtable);
> > count_vm_event(THP_FAULT_FALLBACK);
> > return VM_FAULT_FALLBACK;
> > }
> > @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > ret = check_stable_address_space(vma->vm_mm);
> > if (ret) {
> > spin_unlock(vmf->ptl);
> > - pte_free(vma->vm_mm, pgtable);
> > + if (pgtable)
> > + pte_free(vma->vm_mm, pgtable);
> > } else if (userfaultfd_missing(vma)) {
> > spin_unlock(vmf->ptl);
> > - pte_free(vma->vm_mm, pgtable);
> > + if (pgtable)
> > + pte_free(vma->vm_mm, pgtable);
> > ret = handle_userfault(vmf, VM_UFFD_MISSING);
> > VM_BUG_ON(ret & VM_FAULT_FALLBACK);
> > } else {
> > @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > }
> > } else {
> > spin_unlock(vmf->ptl);
> > - pte_free(vma->vm_mm, pgtable);
> > + if (pgtable)
> > + pte_free(vma->vm_mm, pgtable);
> > }
> > return ret;
> > }
> > @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
> > }
> > add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> > - mm_inc_nr_ptes(dst_mm);
> > - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > + if (pgtable) {
> > + mm_inc_nr_ptes(dst_mm);
> > + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > + }
> > if (!userfaultfd_wp(dst_vma))
> > pmd = pmd_swp_clear_uffd_wp(pmd);
> > set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > if (!vma_is_anonymous(dst_vma))
> > return 0;
> > - pgtable = pte_alloc_one(dst_mm);
> > - if (unlikely(!pgtable))
> > - goto out;
> > + if (arch_needs_pgtable_deposit()) {
> > + pgtable = pte_alloc_one(dst_mm);
> > + if (unlikely(!pgtable))
> > + goto out;
> > + }
> > dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > src_ptl = pmd_lockptr(src_mm, src_pmd);
> > @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > }
> > if (unlikely(!pmd_trans_huge(pmd))) {
> > - pte_free(dst_mm, pgtable);
> > + if (pgtable)
> > + pte_free(dst_mm, pgtable);
> > goto out_unlock;
> > }
> > /*
> > @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
> > /* Page maybe pinned: split and retry the fault on PTEs. */
> > folio_put(src_folio);
> > - pte_free(dst_mm, pgtable);
> > + if (pgtable)
> > + pte_free(dst_mm, pgtable);
> > spin_unlock(src_ptl);
> > spin_unlock(dst_ptl);
> > __split_huge_pmd(src_vma, src_pmd, addr, false);
> > @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > }
> > add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> > out_zero_page:
> > - mm_inc_nr_ptes(dst_mm);
> > - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > + if (pgtable) {
> > + mm_inc_nr_ptes(dst_mm);
> > + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > + }
> > pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > if (!userfaultfd_wp(dst_vma))
> > pmd = pmd_clear_uffd_wp(pmd);
> > @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > zap_deposited_table(tlb->mm, pmd);
> > spin_unlock(ptl);
> > } else if (is_huge_zero_pmd(orig_pmd)) {
> > - if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
> > + if (arch_needs_pgtable_deposit())
> > zap_deposited_table(tlb->mm, pmd);
> > spin_unlock(ptl);
> > } else {
> > @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > }
> > if (folio_test_anon(folio)) {
> > - zap_deposited_table(tlb->mm, pmd);
> > + if (arch_needs_pgtable_deposit())
> > + zap_deposited_table(tlb->mm, pmd);
> > add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > } else {
> > if (arch_needs_pgtable_deposit())
> > @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > force_flush = true;
> > VM_BUG_ON(!pmd_none(*new_pmd));
> > - if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
> > + if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
> > + arch_needs_pgtable_deposit()) {
> > pgtable_t pgtable;
> > pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> > pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> > @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> > }
> > set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
> > - src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> > - pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> > + if (arch_needs_pgtable_deposit()) {
> > + src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> > + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> > + }
> > unlock_ptls:
> > double_pt_unlock(src_ptl, dst_ptl);
> > /* unblock rmap walks */
> > @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> > #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> > static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > - unsigned long haddr, pmd_t *pmd)
> > + unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > - pgtable_t pgtable;
> > pmd_t _pmd, old_pmd;
> > unsigned long addr;
> > pte_t *pte;
> > @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > */
> > old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
> > - pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > + if (arch_needs_pgtable_deposit()) {
> > + pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > + } else {
> > + VM_BUG_ON(!pgtable);
> > + /*
> > + * Account for the freshly allocated (in __split_huge_pmd) pgtable
> > + * being used in mm.
> > + */
> > + mm_inc_nr_ptes(mm);
> > + }
> > pmd_populate(mm, &_pmd, pgtable);
> > pte = pte_offset_map(&_pmd, haddr);
> > @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > }
> > static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > - unsigned long haddr, bool freeze)
> > + unsigned long haddr, bool freeze, pgtable_t pgtable)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct folio *folio;
> > struct page *page;
> > - pgtable_t pgtable;
> > pmd_t old_pmd, _pmd;
> > bool soft_dirty, uffd_wp = false, young = false, write = false;
> > bool anon_exclusive = false, dirty = false;
> > @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > */
> > if (arch_needs_pgtable_deposit())
> > zap_deposited_table(mm, pmd);
> > + if (pgtable)
> > + pte_free(mm, pgtable);
> > if (!vma_is_dax(vma) && vma_is_special_huge(vma))
> > return;
> > if (unlikely(pmd_is_migration_entry(old_pmd))) {
> > @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > * small page also write protected so it does not seems useful
> > * to invalidate secondary mmu at this time.
> > */
> > - return __split_huge_zero_page_pmd(vma, haddr, pmd);
> > + return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
> > }
> > if (pmd_is_migration_entry(*pmd)) {
> > @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > * Withdraw the table only after we mark the pmd entry invalid.
> > * This's critical for some architectures (Power).
> > */
> > - pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > + if (arch_needs_pgtable_deposit()) {
> > + pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > + } else {
> > + VM_BUG_ON(!pgtable);
> > + /*
> > + * Account for the freshly allocated (in __split_huge_pmd) pgtable
> > + * being used in mm.
> > + */
> > + mm_inc_nr_ptes(mm);
> > + }
> > pmd_populate(mm, &_pmd, pgtable);
> > pte = pte_offset_map(&_pmd, haddr);
> > @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > }
> > void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > - pmd_t *pmd, bool freeze)
> > + pmd_t *pmd, bool freeze, pgtable_t pgtable)
> > {
> > VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> > if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
> > - __split_huge_pmd_locked(vma, pmd, address, freeze);
> > + __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
> > + else if (pgtable)
> > + pte_free(vma->vm_mm, pgtable);
> > }
> > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > {
> > spinlock_t *ptl;
> > struct mmu_notifier_range range;
> > + pgtable_t pgtable = NULL;
> > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> > address & HPAGE_PMD_MASK,
> > (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
> > mmu_notifier_invalidate_range_start(&range);
> > +
> > + /* allocate pagetable before acquiring pmd lock */
> > + if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
> > + pgtable = pte_alloc_one(vma->vm_mm);
> > + if (!pgtable) {
> > + mmu_notifier_invalidate_range_end(&range);
>
> When I last looked at this, I thought the clean thing to do is to let
> __split_huge_pmd() and friends return an error.
>
> Let's take a look at walk_pmd_range() as one example:
>
> if (walk->vma)
> split_huge_pmd(walk->vma, pmd, addr);
> else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> continue;
>
> err = walk_pte_range(pmd, addr, next, walk);
>
> Where walk_pte_range() just does a pte_offset_map_lock.
>
> pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>
> But if that fails (as the remapping failed), we will silently skip this
> range.
>
> I don't think silently skipping is the right thing to do.
>
> So I would think that all splitting functions have to be taught to return an
> error and handle it accordingly. Then we can actually start returning
> errors.
Yeah, I am also confused by the silent PMD split failure. It has to be
communicated to the caller cleanly.
It is also an opportunity to audit all callers and check whether they can
deal with the failure.
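
To make the shape of that concrete, here is a rough userspace model of a split that reports failure and a walker that propagates it. This is only a sketch under assumptions: struct toy_pmd and every toy_* function are hypothetical stand-ins, not kernel APIs, and the real conversion would have to touch each caller of the split helpers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a PMD entry: either one huge mapping or a PTE table. */
struct toy_pmd {
	bool huge;
	void *pte_table;
};

/*
 * Hypothetical error-returning split: allocate the PTE table first and
 * leave the huge mapping untouched if the allocation fails.
 */
int toy_split_huge_pmd(struct toy_pmd *pmd, void *(*alloc)(void))
{
	void *table;

	if (!pmd->huge)
		return 0;		/* nothing to split */

	table = alloc();
	if (!table)
		return -ENOMEM;		/* mapping stays intact */

	pmd->pte_table = table;
	pmd->huge = false;
	return 0;
}

/*
 * Caller in the style of a page table walk: propagate the failure instead
 * of moving on to the PTE level and skipping the range.
 */
int toy_walk_pmd(struct toy_pmd *pmd, void *(*alloc)(void), int *visited)
{
	int err = toy_split_huge_pmd(pmd, alloc);

	if (err)
		return err;
	(*visited)++;			/* stands in for walk_pte_range() */
	return 0;
}

/* Allocators used to exercise both outcomes. */
void *toy_alloc_ok(void)   { return malloc(64); }
void *toy_alloc_fail(void) { return NULL; }
```

With toy_alloc_fail the walker returns -ENOMEM and the entry stays huge, so the caller can decide whether to retry, bail out, or report the error, which is the audit each real call site would need.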
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 13:35 ` David Hildenbrand (Arm)
2026-02-11 13:46 ` Kiryl Shutsemau
@ 2026-02-11 13:47 ` Usama Arif
1 sibling, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 13:47 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy,
linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team
On 11/02/2026 13:35, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not a real concern: it is ok for split to fail, and if
>> the kernel can't find an order-0 allocation for split, there are much
>> bigger problems. On large servers where you can easily have 100s of GBs
>> of THPs, the memory usage for these tables is 200M per 100G. This memory
>> could be used for any other use case, including allocating the page
>> tables required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, and powerpc itself when the radix MMU is enabled) and allocates
>> the PTE table lazily, only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we could completely remove the pagetable
>> deposit code, making this commit mostly a code cleanup patch.
>> Unfortunately, PowerPC has the hash MMU, which stores hash slot
>> information in the deposited page table, so pre-deposit is necessary
>> there. All deposit/withdraw paths are guarded by
>> arch_needs_pgtable_deposit(), so PowerPC behavior is unchanged with this
>> patch. On a better note, arch_needs_pgtable_deposit() will always
>> evaluate to false at compile time on non-PowerPC architectures and the
>> pre-deposit code will not be compiled in.
>>
>> Why Split Failures Are Safe:
>>
>> If a system is under such severe memory pressure that even a 4K
>> allocation for a PTE table fails, there are far greater problems than a
>> THP split being delayed. The OOM killer will likely intervene before this
>> becomes an issue.
>> When pte_alloc_one() fails due to not being able to allocate a 4K page,
>> the PMD split is aborted and the THP remains intact. I could not get split
>> to fail, as it's very difficult to make an order-0 allocation fail.
>> Code analysis of what would happen if it does:
>>
>> - mprotect(): If split fails in change_pmd_range(), it will fall back
>> to change_pte_range(), which will return an error that causes the
>> whole function to be retried.
>>
>> - munmap() (partial THP range): zap_pte_range() returns early when
>> pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
>> For full THP range, zap_huge_pmd() unmaps the entire PMD without
>> split.
>>
>> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
>> to the LRU, retried in next reclaim cycle.
>>
>> - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
>> skips this folio, retried later.
>>
>> - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
>>
>> - madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
>> try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
>> try_to_migrate() returns false, split_folio() returns -EAGAIN,
>> and madvise returns 0 (success), silently skipping the region. This
>> should be fine: madvise is only advisory and can fail for other
>> reasons as well.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>> include/linux/huge_mm.h | 4 +-
>> mm/huge_memory.c | 144 ++++++++++++++++++++++++++++------------
>> mm/khugepaged.c | 7 +-
>> mm/migrate_device.c | 15 +++--
>> mm/rmap.c | 39 ++++++++++-
>> 5 files changed, 156 insertions(+), 53 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index a4d9f964dfdea..b21bb72a298c9 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
>> }
>> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> - pmd_t *pmd, bool freeze);
>> + pmd_t *pmd, bool freeze, pgtable_t pgtable);
>> bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>> pmd_t *pmdp, struct folio *folio);
>> void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
>> @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>> unsigned long address, bool freeze) {}
>> static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
>> unsigned long address, pmd_t *pmd,
>> - bool freeze) {}
>> + bool freeze, pgtable_t pgtable) {}
>> static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>> unsigned long addr, pmd_t *pmdp,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 44ff8a648afd5..4c9a8d89fc8aa 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>> struct vm_area_struct *vma = vmf->vma;
>> struct folio *folio;
>> - pgtable_t pgtable;
>> + pgtable_t pgtable = NULL;
>> vm_fault_t ret = 0;
>> folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
>> if (unlikely(!folio))
>> return VM_FAULT_FALLBACK;
>> - pgtable = pte_alloc_one(vma->vm_mm);
>> - if (unlikely(!pgtable)) {
>> - ret = VM_FAULT_OOM;
>> - goto release;
>> + if (arch_needs_pgtable_deposit()) {
>> + pgtable = pte_alloc_one(vma->vm_mm);
>> + if (unlikely(!pgtable)) {
>> + ret = VM_FAULT_OOM;
>> + goto release;
>> + }
>> }
>> vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>> @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> if (userfaultfd_missing(vma)) {
>> spin_unlock(vmf->ptl);
>> folio_put(folio);
>> - pte_free(vma->vm_mm, pgtable);
>> + if (pgtable)
>> + pte_free(vma->vm_mm, pgtable);
>> ret = handle_userfault(vmf, VM_UFFD_MISSING);
>> VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>> return ret;
>> }
>> - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
>> + if (pgtable) {
>> + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
>> + pgtable);
>> + mm_inc_nr_ptes(vma->vm_mm);
>> + }
>> map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
>> - mm_inc_nr_ptes(vma->vm_mm);
>> spin_unlock(vmf->ptl);
>> }
>> @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>> pmd_t entry;
>> entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
>> entry = pmd_mkspecial(entry);
>> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> + if (pgtable) {
>> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> + mm_inc_nr_ptes(mm);
>> + }
>> set_pmd_at(mm, haddr, pmd, entry);
>> - mm_inc_nr_ptes(mm);
>> }
>> vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>> !mm_forbids_zeropage(vma->vm_mm) &&
>> transparent_hugepage_use_zero_page()) {
>> - pgtable_t pgtable;
>> + pgtable_t pgtable = NULL;
>> struct folio *zero_folio;
>> vm_fault_t ret;
>> - pgtable = pte_alloc_one(vma->vm_mm);
>> - if (unlikely(!pgtable))
>> - return VM_FAULT_OOM;
>> + if (arch_needs_pgtable_deposit()) {
>> + pgtable = pte_alloc_one(vma->vm_mm);
>> + if (unlikely(!pgtable))
>> + return VM_FAULT_OOM;
>> + }
>> zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
>> if (unlikely(!zero_folio)) {
>> - pte_free(vma->vm_mm, pgtable);
>> + if (pgtable)
>> + pte_free(vma->vm_mm, pgtable);
>> count_vm_event(THP_FAULT_FALLBACK);
>> return VM_FAULT_FALLBACK;
>> }
>> @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> ret = check_stable_address_space(vma->vm_mm);
>> if (ret) {
>> spin_unlock(vmf->ptl);
>> - pte_free(vma->vm_mm, pgtable);
>> + if (pgtable)
>> + pte_free(vma->vm_mm, pgtable);
>> } else if (userfaultfd_missing(vma)) {
>> spin_unlock(vmf->ptl);
>> - pte_free(vma->vm_mm, pgtable);
>> + if (pgtable)
>> + pte_free(vma->vm_mm, pgtable);
>> ret = handle_userfault(vmf, VM_UFFD_MISSING);
>> VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>> } else {
>> @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> }
>> } else {
>> spin_unlock(vmf->ptl);
>> - pte_free(vma->vm_mm, pgtable);
>> + if (pgtable)
>> + pte_free(vma->vm_mm, pgtable);
>> }
>> return ret;
>> }
>> @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
>> }
>> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>> - mm_inc_nr_ptes(dst_mm);
>> - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> + if (pgtable) {
>> + mm_inc_nr_ptes(dst_mm);
>> + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> + }
>> if (!userfaultfd_wp(dst_vma))
>> pmd = pmd_swp_clear_uffd_wp(pmd);
>> set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>> @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> if (!vma_is_anonymous(dst_vma))
>> return 0;
>> - pgtable = pte_alloc_one(dst_mm);
>> - if (unlikely(!pgtable))
>> - goto out;
>> + if (arch_needs_pgtable_deposit()) {
>> + pgtable = pte_alloc_one(dst_mm);
>> + if (unlikely(!pgtable))
>> + goto out;
>> + }
>> dst_ptl = pmd_lock(dst_mm, dst_pmd);
>> src_ptl = pmd_lockptr(src_mm, src_pmd);
>> @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> }
>> if (unlikely(!pmd_trans_huge(pmd))) {
>> - pte_free(dst_mm, pgtable);
>> + if (pgtable)
>> + pte_free(dst_mm, pgtable);
>> goto out_unlock;
>> }
>> /*
>> @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
>> /* Page maybe pinned: split and retry the fault on PTEs. */
>> folio_put(src_folio);
>> - pte_free(dst_mm, pgtable);
>> + if (pgtable)
>> + pte_free(dst_mm, pgtable);
>> spin_unlock(src_ptl);
>> spin_unlock(dst_ptl);
>> __split_huge_pmd(src_vma, src_pmd, addr, false);
>> @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> }
>> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>> out_zero_page:
>> - mm_inc_nr_ptes(dst_mm);
>> - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> + if (pgtable) {
>> + mm_inc_nr_ptes(dst_mm);
>> + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> + }
>> pmdp_set_wrprotect(src_mm, addr, src_pmd);
>> if (!userfaultfd_wp(dst_vma))
>> pmd = pmd_clear_uffd_wp(pmd);
>> @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> zap_deposited_table(tlb->mm, pmd);
>> spin_unlock(ptl);
>> } else if (is_huge_zero_pmd(orig_pmd)) {
>> - if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
>> + if (arch_needs_pgtable_deposit())
>> zap_deposited_table(tlb->mm, pmd);
>> spin_unlock(ptl);
>> } else {
>> @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> }
>> if (folio_test_anon(folio)) {
>> - zap_deposited_table(tlb->mm, pmd);
>> + if (arch_needs_pgtable_deposit())
>> + zap_deposited_table(tlb->mm, pmd);
>> add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>> } else {
>> if (arch_needs_pgtable_deposit())
>> @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>> force_flush = true;
>> VM_BUG_ON(!pmd_none(*new_pmd));
>> - if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
>> + if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
>> + arch_needs_pgtable_deposit()) {
>> pgtable_t pgtable;
>> pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
>> pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
>> @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>> }
>> set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
>> - src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
>> - pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
>> + if (arch_needs_pgtable_deposit()) {
>> + src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
>> + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
>> + }
>> unlock_ptls:
>> double_pt_unlock(src_ptl, dst_ptl);
>> /* unblock rmap walks */
>> @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>> static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> - unsigned long haddr, pmd_t *pmd)
>> + unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
>> {
>> struct mm_struct *mm = vma->vm_mm;
>> - pgtable_t pgtable;
>> pmd_t _pmd, old_pmd;
>> unsigned long addr;
>> pte_t *pte;
>> @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> */
>> old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
>> - pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> + if (arch_needs_pgtable_deposit()) {
>> + pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> + } else {
>> + VM_BUG_ON(!pgtable);
>> + /*
>> + * Account for the freshly allocated (in __split_huge_pmd) pgtable
>> + * being used in mm.
>> + */
>> + mm_inc_nr_ptes(mm);
>> + }
>> pmd_populate(mm, &_pmd, pgtable);
>> pte = pte_offset_map(&_pmd, haddr);
>> @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> }
>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> - unsigned long haddr, bool freeze)
>> + unsigned long haddr, bool freeze, pgtable_t pgtable)
>> {
>> struct mm_struct *mm = vma->vm_mm;
>> struct folio *folio;
>> struct page *page;
>> - pgtable_t pgtable;
>> pmd_t old_pmd, _pmd;
>> bool soft_dirty, uffd_wp = false, young = false, write = false;
>> bool anon_exclusive = false, dirty = false;
>> @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> */
>> if (arch_needs_pgtable_deposit())
>> zap_deposited_table(mm, pmd);
>> + if (pgtable)
>> + pte_free(mm, pgtable);
>> if (!vma_is_dax(vma) && vma_is_special_huge(vma))
>> return;
>> if (unlikely(pmd_is_migration_entry(old_pmd))) {
>> @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> * small page also write protected so it does not seems useful
>> * to invalidate secondary mmu at this time.
>> */
>> - return __split_huge_zero_page_pmd(vma, haddr, pmd);
>> + return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
>> }
>> if (pmd_is_migration_entry(*pmd)) {
>> @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> * Withdraw the table only after we mark the pmd entry invalid.
>> * This's critical for some architectures (Power).
>> */
>> - pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> + if (arch_needs_pgtable_deposit()) {
>> + pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> + } else {
>> + VM_BUG_ON(!pgtable);
>> + /*
>> + * Account for the freshly allocated (in __split_huge_pmd) pgtable
>> + * being used in mm.
>> + */
>> + mm_inc_nr_ptes(mm);
>> + }
>> pmd_populate(mm, &_pmd, pgtable);
>> pte = pte_offset_map(&_pmd, haddr);
>> @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> }
>> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> - pmd_t *pmd, bool freeze)
>> + pmd_t *pmd, bool freeze, pgtable_t pgtable)
>> {
>> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>> if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
>> - __split_huge_pmd_locked(vma, pmd, address, freeze);
>> + __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
>> + else if (pgtable)
>> + pte_free(vma->vm_mm, pgtable);
>> }
>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> {
>> spinlock_t *ptl;
>> struct mmu_notifier_range range;
>> + pgtable_t pgtable = NULL;
>> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>> address & HPAGE_PMD_MASK,
>> (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
>> mmu_notifier_invalidate_range_start(&range);
>> +
>> + /* allocate pagetable before acquiring pmd lock */
>> + if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
>> + pgtable = pte_alloc_one(vma->vm_mm);
>> + if (!pgtable) {
>> + mmu_notifier_invalidate_range_end(&range);
>
> When I last looked at this, I thought the clean thing to do was to let __split_huge_pmd() and friends return an error.
>
> Let's take a look at walk_pmd_range() as one example:
>
> if (walk->vma)
> split_huge_pmd(walk->vma, pmd, addr);
> else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> continue;
>
> err = walk_pte_range(pmd, addr, next, walk);
>
> Where walk_pte_range() just does a pte_offset_map_lock.
>
> pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>
> But if that fails (as the remapping failed), we will silently skip this range.
>
> I don't think silently skipping is the right thing to do.
>
> So I would think that all splitting functions have to be taught to return an error and handle it accordingly. Then we can actually start returning errors.
>
Ack. This was one of the cases where we would try again if needed.
I did a manual code analysis, which I included at the end of the commit
message, but agreed, it's best to return an error and handle it accordingly.
I will look into doing this for the next revision.
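To make the direction concrete, here is a rough userspace sketch of a
split that reports allocation failure and a caller that propagates it
instead of silently skipping the range. All names (pmd_model,
split_pmd_model(), walk_model(), alloc_pte_table()) are made up for
illustration and are assumptions, not the actual kernel change:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PMD entry; not a kernel type. */
struct pmd_model {
	bool huge;		/* true while one huge entry maps the range */
	void *pte_table;	/* PTE table installed by a successful split */
};

/* Hypothetical allocator hook so that failure can be simulated. */
static void *alloc_pte_table(bool simulate_failure)
{
	return simulate_failure ? NULL : calloc(512, sizeof(unsigned long));
}

/*
 * Returns 0 on success and -ENOMEM if no PTE table could be allocated.
 * On failure the huge mapping is left untouched.
 */
static int split_pmd_model(struct pmd_model *pmd, bool simulate_failure)
{
	void *table;

	if (!pmd->huge)
		return 0;	/* nothing to split */

	table = alloc_pte_table(simulate_failure);
	if (!table)
		return -ENOMEM;

	pmd->pte_table = table;
	pmd->huge = false;
	return 0;
}

/*
 * Caller modelled loosely on the walk_pmd_range() example: the error is
 * propagated rather than the range being skipped.
 */
static int walk_model(struct pmd_model *pmd, bool simulate_failure)
{
	int err = split_pmd_model(pmd, simulate_failure);

	if (err)
		return err;
	/* ... the PTEs in pmd->pte_table would be walked here ... */
	return 0;
}
```

Each real caller would still need its own decision on what to do with
the error (retry, abort the walk, or fail the operation).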
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-11 13:25 ` David Hildenbrand (Arm)
2026-02-11 13:35 ` David Hildenbrand (Arm)
@ 2026-02-11 19:28 ` Matthew Wilcox
2026-02-11 19:55 ` David Hildenbrand (Arm)
2 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2026-02-11 19:28 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, lorenzo.stoakes, linux-mm, fvdl, hannes,
riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
kernel-team
On Wed, Feb 11, 2026 at 04:49:45AM -0800, Usama Arif wrote:
> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> to the LRU, retried in next reclaim cycle.
I was advised to ask my stupid question ...
Why do we still try to split the PMD in reclaim? I understand we're
about to swap the folio out and we'll need to put a swap entry in the page
table so we can find it again. But can't we now store swap entries at the
PMD level, or are we still forced to store 512 entries at the PTE level?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 19:28 ` Matthew Wilcox
@ 2026-02-11 19:55 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 19:55 UTC (permalink / raw)
To: Matthew Wilcox, Usama Arif
Cc: Andrew Morton, lorenzo.stoakes, linux-mm, fvdl, hannes, riel,
shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
kernel-team
On 2/11/26 20:28, Matthew Wilcox wrote:
> On Wed, Feb 11, 2026 at 04:49:45AM -0800, Usama Arif wrote:
>> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
>> to the LRU, retried in next reclaim cycle.
>
> I was advised to ask my stupid question ...
>
> Why do we still try to split the PMD in reclaim? I understand we're
> about to swap the folio out and we'll need to put a swap entry in the page
> table so we can find it again. But can't we now store swap entries at the
> PMD level, or are we still forced to store 512 entries at the PTE level?
Yes, we are still forced to use PTE-level entries: we don't support PMD
swap entries yet.
I don't know all the historical details. I suspect there are some rough
edges around swapin (assume we cannot swap in a 2M THP), and maybe it was
just easier to not deal with splitting of PMD swap entries (which we
would similarly have to support).
For sure an interesting project to look into.
--
Cheers,
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-11 13:25 ` David Hildenbrand (Arm)
2026-02-11 13:38 ` Usama Arif
@ 2026-02-12 12:13 ` Ritesh Harjani
2026-02-12 15:25 ` Usama Arif
2026-02-12 15:39 ` David Hildenbrand (Arm)
1 sibling, 2 replies; 22+ messages in thread
From: Ritesh Harjani @ 2026-02-12 12:13 UTC (permalink / raw)
To: David Hildenbrand (Arm), Usama Arif, Andrew Morton,
lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
Michael Ellerman, linuxppc-dev
"David Hildenbrand (Arm)" <david@kernel.org> writes:
> CCing ppc folks
>
Thanks David!
> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other usecase, which include allocating the pagetables
>> required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, where it returns true only when radix is not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we can completely remove the pagetable
>> deposit code and this commit would mostly have been a code cleanup patch,
>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>> the deposited page table and pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
>
> Is there a way to remove this? It's always been a confusing hack, now
> it's unpleasant to have around :)
>
Hash MMU on PowerPC works fundamentally differently from other MMUs
(unlike Radix MMU on PowerPC). So yes, it requires a few tricks to fit
into Linux's multi-level SW page table model. ;)
> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1
> copied generic pgtable_trans_huge_deposit() hurts my belly.
>
On PowerPC, pgtable_t can be a pte fragment.
typedef pte_t *pgtable_t;
That means a single page can be shared by several PTE page tables. So, we
cannot use page->lru, which the generic implementation uses. I guess that
is why the implementation of radix__pgtable_trans_huge_deposit() differs
slightly.
Doing a grep search, I think that's the same for sparc and s390 as well.
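As a rough userspace sketch of that idea: if the list linkage cannot live
in struct page, it can be kept in the first bytes of the (still unused)
deposited table itself. The names below (list_node, deposit_slot,
deposit_model(), withdraw_model()) are made up for illustration, and this
is an assumption-laden model, not the powerpc code:

```c
#include <stddef.h>

/* Minimal list node, standing in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical per-PMD slot pointing at the most recently deposited table. */
struct deposit_slot {
	void *head;
};

/*
 * Deposit an unused PTE table. The list node is stored in the table's
 * own memory, so no per-page field is required.
 */
static void deposit_model(struct deposit_slot *slot, void *table)
{
	struct list_node *node = table;

	if (!slot->head) {
		node->next = node;
		node->prev = node;
	} else {
		struct list_node *head = slot->head;

		/* link the new node right after the current head */
		node->next = head->next;
		node->prev = head;
		head->next->prev = node;
		head->next = node;
	}
	slot->head = table;
}

/* Withdraw one table, or return NULL if nothing is deposited. */
static void *withdraw_model(struct deposit_slot *slot)
{
	struct list_node *node = slot->head;

	if (!node)
		return NULL;

	if (node->next == node) {
		slot->head = NULL;
	} else {
		slot->head = node->next;
		node->prev->next = node->next;
		node->next->prev = node->prev;
	}
	/* clear the linkage so the table does not carry stale pointers */
	node->next = NULL;
	node->prev = NULL;
	return node;
}
```

Because the linkage sits inside the table, two fragments carved from the
same page can be deposited independently.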
>
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>
> So one obvious solution: remove PMD THP support for hash MMUs along with
> all this hacky deposit code.
>
Unfortunately, please no. There are real customers using Hash MMU on
Power9 and even on older generations and this would mean breaking Hash
PMD THP support for them.
>
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar
> checks need to be wrapped in a reasonable helper and likely this all
> needs to get cleaned up further.
>
> The implementation if the generic pgtable_trans_huge_deposit and the
> radix handlers etc must be removed. If any code would trigger them it
> would be a bug.
>
Sure, I think after this patch series, radix__pgtable_trans_huge_deposit()
will mostly be dead code anyway. I will spend some time going
through this series and will also test it on powerpc HW (with
both Hash and Radix MMU).
I guess we should also look at removing the pgtable_trans_huge_deposit() and
pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
those too will be dead code after this.
> If we have to keep this around, pgtable_trans_huge_deposit() should
> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there
> will not be generic support for it.
>
Sure. That makes sense, since PowerPC Hash MMU will still need this.
-ritesh
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-12 12:13 ` Ritesh Harjani
@ 2026-02-12 15:25 ` Usama Arif
2026-02-12 15:39 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-12 15:25 UTC (permalink / raw)
To: Ritesh Harjani (IBM), David Hildenbrand (Arm), Andrew Morton,
lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
Michael Ellerman, linuxppc-dev
On 12/02/2026 12:13, Ritesh Harjani (IBM) wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>
>> CCing ppc folks
>>
>
> Thanks David!
>
>> On 2/11/26 13:49, Usama Arif wrote:
>>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>>> it pre-allocates a PTE page table and deposits it via
>>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>>> PMD split or zap. The rationale was that split must not fail—if the
>>> kernel decides to split a THP, it needs a PTE table to populate.
>>>
>>> However, every anon THP wastes 4KB (one page table page) that sits
>>> unused in the deposit list for the lifetime of the mapping. On systems
>>> with many THPs, this adds up to significant memory waste. The original
>>> rationale is also not an issue. It is ok for split to fail, and if the
>>> kernel can't find an order 0 allocation for split, there are much bigger
>>> problems. On large servers where you can easily have 100s of GBs of THPs,
>>> the memory usage for these tables is 200M per 100G. This memory could be
>>> used for any other usecase, which include allocating the pagetables
>>> required during split.
>>>
>>> This patch removes the pre-deposit for anonymous pages on architectures
>>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>>> powerpc, where it returns true only when radix is not enabled) and allocates
>>> the PTE table lazily—only when a split actually occurs. The split path
>>> is modified to accept a caller-provided page table.
>>>
>>> PowerPC exception:
>>>
>>> It would have been great if we can completely remove the pagetable
>>> deposit code and this commit would mostly have been a code cleanup patch,
>>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>>> the deposited page table and pre-deposit is necessary. All deposit/
>>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>>> behavior is unchanged with this patch. On a better note,
>>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>>> on non PowerPC architectures and the pre-deposit code will not be
>>> compiled in.
>>
>> Is there a way to remove this? It's always been a confusing hack, now
>> it's unpleasant to have around :)
>>
>
> Hash MMU on PowerPC works fundamentally different than other MMUs
> (unlike Radix MMU on PowerPC). So yes, it requires few tricks to fit
> into the Linux's multi-level SW page table model. ;)
>
>
>> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1
>> copied generic pgtable_trans_huge_deposit() hurts my belly.
>>
>
> On PowerPC, pgtable_t can be a pte fragment.
>
> typedef pte_t *pgtable_t;
>
> That means a single page can be shared among other PTE page tables. So, we
> cannot use page->lru which the generic implementation uses. I guess due
> to this, there is a slight change in implementation of
> radix__pgtable_trans_huge_deposit().
>
> Doing a grep search, I think that's the same for sparc and s390 as well.
>
>>
>> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>>
>> So one obvious solution: remove PMD THP support for hash MMUs along with
>> all this hacky deposit code.
>>
>
> Unfortunately, please no. There are real customers using Hash MMU on
> Power9 and even on older generations and this would mean breaking Hash
> PMD THP support for them.
>
>
Thanks for confirming! I will keep the pagetable deposit for powerpc
in the next revision.
I will rename pgtable_trans_huge_deposit() to
arch_pgtable_trans_huge_deposit() and move it to arch/powerpc. It will be
an empty function for the rest of the architectures.
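Something along these lines, as a userspace sketch only. The macro
MODEL_ARCH_NEEDS_DEPOSIT, the *_model names and the helper name are all
placeholders I made up here, not what the next revision will necessarily
use:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical build-time switch standing in for "powerpc with hash MMU". */
#ifndef MODEL_ARCH_NEEDS_DEPOSIT
#define MODEL_ARCH_NEEDS_DEPOSIT 0
#endif

/* Hypothetical stand-in for the bits of a VMA that matter here. */
struct vma_model {
	bool anonymous;
};

static inline bool arch_needs_deposit_model(void)
{
	return MODEL_ARCH_NEEDS_DEPOSIT;
}

#if MODEL_ARCH_NEEDS_DEPOSIT
/* An arch that needs the deposit keeps a real implementation. */
static inline void arch_deposit_model(void **slot, void *table)
{
	*slot = table;
}
#else
/* Every other arch gets an empty stub that compiles away. */
static inline void arch_deposit_model(void **slot, void *table)
{
	(void)slot;
	(void)table;
}
#endif

/*
 * One helper for the open-coded
 * "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" check.
 */
static inline bool split_needs_fresh_pgtable_model(const struct vma_model *vma)
{
	return vma->anonymous && !arch_needs_deposit_model();
}
```

With the switch at its default of 0, the stub does nothing and the helper
is true only for anonymous VMAs.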
>>
>> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar
>> checks need to be wrapped in a reasonable helper and likely this all
>> needs to get cleaned up further.
>>
>> The implementation if the generic pgtable_trans_huge_deposit and the
>> radix handlers etc must be removed. If any code would trigger them it
>> would be a bug.
>>
>
> Sure, I think after this patch series, the radix__pgtable_trans_huge_deposit()
> will mostly be a dead code anyways. I will spend some time going
> through this series and will also give it a test on powerpc HW (with
> both Hash and Radix MMU).
>
> I guess, we should also look at removing pgtable_trans_huge_deposit() and
> pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
> those too will be dead code after this.
>
>
>> If we have to keep this around, pgtable_trans_huge_deposit() should
>> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there
>> will not be generic support for it.
>>
>
> Sure. That make sense since PowerPC Hash MMU will still need this.
>
> -ritesh
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-12 12:13 ` Ritesh Harjani
2026-02-12 15:25 ` Usama Arif
@ 2026-02-12 15:39 ` David Hildenbrand (Arm)
2026-02-12 16:46 ` Ritesh Harjani
1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-12 15:39 UTC (permalink / raw)
To: Ritesh Harjani (IBM), Usama Arif, Andrew Morton, lorenzo.stoakes,
willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
Michael Ellerman, linuxppc-dev
>>
>> Is there a way to remove this? It's always been a confusing hack, now
>> it's unpleasant to have around :)
>>
>
> Hash MMU on PowerPC works fundamentally different than other MMUs
> (unlike Radix MMU on PowerPC). So yes, it requires few tricks to fit
> into the Linux's multi-level SW page table model. ;)
:)
>
>> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1
>> copied generic pgtable_trans_huge_deposit() hurts my belly.
>>
>
> On PowerPC, pgtable_t can be a pte fragment.
>
> typedef pte_t *pgtable_t;
>
> That means a single page can be shared among other PTE page tables. So, we
> cannot use page->lru which the generic implementation uses. I guess due
> to this, there is a slight change in implementation of
> radix__pgtable_trans_huge_deposit().
Ah, I did not spot this difference, but it makes sense. Still ugly,
though. Fortunately, it would go away with this RFC.
>
> Doing a grep search, I think that's the same for sparc and s390 as well.
... and I also did not realize that s390x+sparc have separate
implementations we can now get rid of as well.
>
>>
>> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>>
>> So one obvious solution: remove PMD THP support for hash MMUs along with
>> all this hacky deposit code.
>>
>
> Unfortunately, please no. There are real customers using Hash MMU on
> Power9 and even on older generations and this would mean breaking Hash
> PMD THP support for them.
>
I was expecting this answer :)
>
>>
>> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar
>> checks need to be wrapped in a reasonable helper and likely this all
>> needs to get cleaned up further.
>>
>> The implementation if the generic pgtable_trans_huge_deposit and the
>> radix handlers etc must be removed. If any code would trigger them it
>> would be a bug.
>>
>
> Sure, I think after this patch series, the radix__pgtable_trans_huge_deposit()
> will mostly be a dead code anyways. I will spend some time going
> through this series and will also give it a test on powerpc HW (with
> both Hash and Radix MMU).
Thanks! The series will grow quite a bit I think, so retesting new
revisions would be much appreciated!
>
> I guess, we should also look at removing pgtable_trans_huge_deposit() and
> pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
> those too will be dead code after this.
Exactly.
--
Cheers,
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
2026-02-12 15:39 ` David Hildenbrand (Arm)
@ 2026-02-12 16:46 ` Ritesh Harjani
0 siblings, 0 replies; 22+ messages in thread
From: Ritesh Harjani @ 2026-02-12 16:46 UTC (permalink / raw)
To: David Hildenbrand (Arm), Usama Arif, Andrew Morton,
lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
Michael Ellerman, linuxppc-dev
"David Hildenbrand (Arm)" <david@kernel.org> writes:
>
> Thanks! The series will grow quite a bit I think, so retesting new
> revisions will be very appreciated!
>
Definitely. Thanks!
-ritesh
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
2026-02-11 13:27 ` David Hildenbrand (Arm)
@ 2026-02-12 21:40 ` kernel test robot
2026-02-12 21:40 ` kernel test robot
2 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-02-12 21:40 UTC (permalink / raw)
To: Usama Arif; +Cc: llvm, oe-kbuild-all
Hi Usama,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19 next-20260212]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-allocate-PTE-page-tables-lazily-at-split-time/20260211-205726
base: linus/master
patch link: https://lore.kernel.org/r/20260211125507.4175026-3-usama.arif%40linux.dev
patch subject: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20260213/202602130506.7Tm8CJkW-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260213/202602130506.7Tm8CJkW-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602130506.7Tm8CJkW-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/rmap.c:1961:21: error: use of undeclared identifier 'THP_SPLIT_PMD_PTE_ALLOC_FAILED'
1961 | count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
| ^
mm/rmap.c:2364:21: error: use of undeclared identifier 'THP_SPLIT_PMD_PTE_ALLOC_FAILED'
2364 | count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
| ^
2 errors generated.
vim +/THP_SPLIT_PMD_PTE_ALLOC_FAILED +1961 mm/rmap.c
1849
1850 /*
1851 * @arg: enum ttu_flags will be passed to this argument
1852 */
1853 static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
1854 unsigned long address, void *arg)
1855 {
1856 struct mm_struct *mm = vma->vm_mm;
1857 DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
1858 bool anon_exclusive, ret = true;
1859 pte_t pteval;
1860 struct page *subpage;
1861 struct mmu_notifier_range range;
1862 enum ttu_flags flags = (enum ttu_flags)(long)arg;
1863 unsigned long nr_pages = 1, end_addr;
1864 unsigned long pfn;
1865 unsigned long hsz = 0;
1866 int ptes = 0;
1867 pgtable_t prealloc_pte = NULL;
1868
1869 /*
1870 * When racing against e.g. zap_pte_range() on another cpu,
1871 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
1872 * try_to_unmap() may return before page_mapped() has become false,
1873 * if page table locking is skipped: use TTU_SYNC to wait for that.
1874 */
1875 if (flags & TTU_SYNC)
1876 pvmw.flags = PVMW_SYNC;
1877
1878 /*
1879 * For THP, we have to assume the worse case ie pmd for invalidation.
1880 * For hugetlb, it could be much worse if we need to do pud
1881 * invalidation in the case of pmd sharing.
1882 *
1883 * Note that the folio can not be freed in this function as call of
1884 * try_to_unmap() must hold a reference on the folio.
1885 */
1886 range.end = vma_address_end(&pvmw);
1887 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
1888 address, range.end);
1889 if (folio_test_hugetlb(folio)) {
1890 /*
1891 * If sharing is possible, start and end will be adjusted
1892 * accordingly.
1893 */
1894 adjust_range_if_pmd_sharing_possible(vma, &range.start,
1895 &range.end);
1896
1897 /* We need the huge page size for set_huge_pte_at() */
1898 hsz = huge_page_size(hstate_vma(vma));
1899 }
1900 mmu_notifier_invalidate_range_start(&range);
1901
1902 if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
1903 !arch_needs_pgtable_deposit())
1904 prealloc_pte = pte_alloc_one(mm);
1905
1906 while (page_vma_mapped_walk(&pvmw)) {
1907 /*
1908 * If the folio is in an mlock()d vma, we must not swap it out.
1909 */
1910 if (!(flags & TTU_IGNORE_MLOCK) &&
1911 (vma->vm_flags & VM_LOCKED)) {
1912 ptes++;
1913
1914 /*
1915 * Set 'ret' to indicate the page cannot be unmapped.
1916 *
1917 * Do not jump to walk_abort immediately as additional
1918 * iteration might be required to detect fully mapped
1919 * folio an mlock it.
1920 */
1921 ret = false;
1922
1923 /* Only mlock fully mapped pages */
1924 if (pvmw.pte && ptes != pvmw.nr_pages)
1925 continue;
1926
1927 /*
1928 * All PTEs must be protected by page table lock in
1929 * order to mlock the page.
1930 *
1931 * If page table boundary has been cross, current ptl
1932 * only protect part of ptes.
1933 */
1934 if (pvmw.flags & PVMW_PGTABLE_CROSSED)
1935 goto walk_done;
1936
1937 /* Restore the mlock which got missed */
1938 mlock_vma_folio(folio, vma);
1939 goto walk_done;
1940 }
1941
1942 if (!pvmw.pte) {
1943 if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
1944 if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
1945 goto walk_done;
1946 /*
1947 * unmap_huge_pmd_locked has either already marked
1948 * the folio as swap-backed or decided to retain it
1949 * due to GUP or speculative references.
1950 */
1951 goto walk_abort;
1952 }
1953
1954 if (flags & TTU_SPLIT_HUGE_PMD) {
1955 pgtable_t pgtable = prealloc_pte;
1956
1957 prealloc_pte = NULL;
1958
1959 if (!arch_needs_pgtable_deposit() && !pgtable &&
1960 vma_is_anonymous(vma)) {
> 1961 count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
1962 page_vma_mapped_walk_done(&pvmw);
1963 ret = false;
1964 break;
1965 }
1966 /*
1967 * We temporarily have to drop the PTL and
1968 * restart so we can process the PTE-mapped THP.
1969 */
1970 split_huge_pmd_locked(vma, pvmw.address,
1971 pvmw.pmd, false, pgtable);
1972 flags &= ~TTU_SPLIT_HUGE_PMD;
1973 page_vma_mapped_walk_restart(&pvmw);
1974 continue;
1975 }
1976 }
1977
1978 /* Unexpected PMD-mapped THP? */
1979 VM_BUG_ON_FOLIO(!pvmw.pte, folio);
1980
1981 /*
1982 * Handle PFN swap PTEs, such as device-exclusive ones, that
1983 * actually map pages.
1984 */
1985 pteval = ptep_get(pvmw.pte);
1986 if (likely(pte_present(pteval))) {
1987 pfn = pte_pfn(pteval);
1988 } else {
1989 const softleaf_t entry = softleaf_from_pte(pteval);
1990
1991 pfn = softleaf_to_pfn(entry);
1992 VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
1993 }
1994
1995 subpage = folio_page(folio, pfn - folio_pfn(folio));
1996 address = pvmw.address;
1997 anon_exclusive = folio_test_anon(folio) &&
1998 PageAnonExclusive(subpage);
1999
2000 if (folio_test_hugetlb(folio)) {
2001 bool anon = folio_test_anon(folio);
2002
2003 /*
2004 * The try_to_unmap() is only passed a hugetlb page
2005 * in the case where the hugetlb page is poisoned.
2006 */
2007 VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
2008 /*
2009 * huge_pmd_unshare may unmap an entire PMD page.
2010 * There is no way of knowing exactly which PMDs may
2011 * be cached for this mm, so we must flush them all.
2012 * start/end were already adjusted above to cover this
2013 * range.
2014 */
2015 flush_cache_range(vma, range.start, range.end);
2016
2017 /*
2018 * To call huge_pmd_unshare, i_mmap_rwsem must be
2019 * held in write mode. Caller needs to explicitly
2020 * do this outside rmap routines.
2021 *
2022 * We also must hold hugetlb vma_lock in write mode.
2023 * Lock order dictates acquiring vma_lock BEFORE
2024 * i_mmap_rwsem. We can only try lock here and fail
2025 * if unsuccessful.
2026 */
2027 if (!anon) {
2028 struct mmu_gather tlb;
2029
2030 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
2031 if (!hugetlb_vma_trylock_write(vma))
2032 goto walk_abort;
2033
2034 tlb_gather_mmu_vma(&tlb, vma);
2035 if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
2036 hugetlb_vma_unlock_write(vma);
2037 huge_pmd_unshare_flush(&tlb, vma);
2038 tlb_finish_mmu(&tlb);
2039 /*
2040 * The PMD table was unmapped,
2041 * consequently unmapping the folio.
2042 */
2043 goto walk_done;
2044 }
2045 hugetlb_vma_unlock_write(vma);
2046 tlb_finish_mmu(&tlb);
2047 }
2048 pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
2049 if (pte_dirty(pteval))
2050 folio_mark_dirty(folio);
2051 } else if (likely(pte_present(pteval))) {
2052 nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
2053 end_addr = address + nr_pages * PAGE_SIZE;
2054 flush_cache_range(vma, address, end_addr);
2055
2056 /* Nuke the page table entry. */
2057 pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
2058 /*
2059 * We clear the PTE but do not flush so potentially
2060 * a remote CPU could still be writing to the folio.
2061 * If the entry was previously clean then the
2062 * architecture must guarantee that a clear->dirty
2063 * transition on a cached TLB entry is written through
2064 * and traps if the PTE is unmapped.
2065 */
2066 if (should_defer_flush(mm, flags))
2067 set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
2068 else
2069 flush_tlb_range(vma, address, end_addr);
2070 if (pte_dirty(pteval))
2071 folio_mark_dirty(folio);
2072 } else {
2073 pte_clear(mm, address, pvmw.pte);
2074 }
2075
2076 /*
2077 * Now the pte is cleared. If this pte was uffd-wp armed,
2078 * we may want to replace a none pte with a marker pte if
2079 * it's file-backed, so we don't lose the tracking info.
2080 */
2081 pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
2082
2083 /* Update high watermark before we lower rss */
2084 update_hiwater_rss(mm);
2085
2086 if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
2087 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
2088 if (folio_test_hugetlb(folio)) {
2089 hugetlb_count_sub(folio_nr_pages(folio), mm);
2090 set_huge_pte_at(mm, address, pvmw.pte, pteval,
2091 hsz);
2092 } else {
2093 dec_mm_counter(mm, mm_counter(folio));
2094 set_pte_at(mm, address, pvmw.pte, pteval);
2095 }
2096 } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
2097 !userfaultfd_armed(vma)) {
2098 /*
2099 * The guest indicated that the page content is of no
2100 * interest anymore. Simply discard the pte, vmscan
2101 * will take care of the rest.
2102 * A future reference will then fault in a new zero
2103 * page. When userfaultfd is active, we must not drop
2104 * this page though, as its main user (postcopy
2105 * migration) will not expect userfaults on already
2106 * copied pages.
2107 */
2108 dec_mm_counter(mm, mm_counter(folio));
2109 } else if (folio_test_anon(folio)) {
2110 swp_entry_t entry = page_swap_entry(subpage);
2111 pte_t swp_pte;
2112 /*
2113 * Store the swap location in the pte.
2114 * See handle_pte_fault() ...
2115 */
2116 if (unlikely(folio_test_swapbacked(folio) !=
2117 folio_test_swapcache(folio))) {
2118 WARN_ON_ONCE(1);
2119 goto walk_abort;
2120 }
2121
2122 /* MADV_FREE page check */
2123 if (!folio_test_swapbacked(folio)) {
2124 int ref_count, map_count;
2125
2126 /*
2127 * Synchronize with gup_pte_range():
2128 * - clear PTE; barrier; read refcount
2129 * - inc refcount; barrier; read PTE
2130 */
2131 smp_mb();
2132
2133 ref_count = folio_ref_count(folio);
2134 map_count = folio_mapcount(folio);
2135
2136 /*
2137 * Order reads for page refcount and dirty flag
2138 * (see comments in __remove_mapping()).
2139 */
2140 smp_rmb();
2141
2142 if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
2143 /*
2144 * redirtied either using the page table or a previously
2145 * obtained GUP reference.
2146 */
2147 set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
2148 folio_set_swapbacked(folio);
2149 goto walk_abort;
2150 } else if (ref_count != 1 + map_count) {
2151 /*
2152 * Additional reference. Could be a GUP reference or any
2153 * speculative reference. GUP users must mark the folio
2154 * dirty if there was a modification. This folio cannot be
2155 * reclaimed right now either way, so act just like nothing
2156 * happened.
2157 * We'll come back here later and detect if the folio was
2158 * dirtied when the additional reference is gone.
2159 */
2160 set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
2161 goto walk_abort;
2162 }
2163 add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
2164 goto discard;
2165 }
2166
2167 if (swap_duplicate(entry) < 0) {
2168 set_pte_at(mm, address, pvmw.pte, pteval);
2169 goto walk_abort;
2170 }
2171
2172 /*
2173 * arch_unmap_one() is expected to be a NOP on
2174 * architectures where we could have PFN swap PTEs,
2175 * so we'll not check/care.
2176 */
2177 if (arch_unmap_one(mm, vma, address, pteval) < 0) {
2178 swap_free(entry);
2179 set_pte_at(mm, address, pvmw.pte, pteval);
2180 goto walk_abort;
2181 }
2182
2183 /* See folio_try_share_anon_rmap(): clear PTE first. */
2184 if (anon_exclusive &&
2185 folio_try_share_anon_rmap_pte(folio, subpage)) {
2186 swap_free(entry);
2187 set_pte_at(mm, address, pvmw.pte, pteval);
2188 goto walk_abort;
2189 }
2190 if (list_empty(&mm->mmlist)) {
2191 spin_lock(&mmlist_lock);
2192 if (list_empty(&mm->mmlist))
2193 list_add(&mm->mmlist, &init_mm.mmlist);
2194 spin_unlock(&mmlist_lock);
2195 }
2196 dec_mm_counter(mm, MM_ANONPAGES);
2197 inc_mm_counter(mm, MM_SWAPENTS);
2198 swp_pte = swp_entry_to_pte(entry);
2199 if (anon_exclusive)
2200 swp_pte = pte_swp_mkexclusive(swp_pte);
2201 if (likely(pte_present(pteval))) {
2202 if (pte_soft_dirty(pteval))
2203 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2204 if (pte_uffd_wp(pteval))
2205 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2206 } else {
2207 if (pte_swp_soft_dirty(pteval))
2208 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2209 if (pte_swp_uffd_wp(pteval))
2210 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2211 }
2212 set_pte_at(mm, address, pvmw.pte, swp_pte);
2213 } else {
2214 /*
2215 * This is a locked file-backed folio,
2216 * so it cannot be removed from the page
2217 * cache and replaced by a new folio before
2218 * mmu_notifier_invalidate_range_end, so no
2219 * concurrent thread might update its page table
2220 * to point at a new folio while a device is
2221 * still using this folio.
2222 *
2223 * See Documentation/mm/mmu_notifier.rst
2224 */
2225 dec_mm_counter(mm, mm_counter_file(folio));
2226 }
2227 discard:
2228 if (unlikely(folio_test_hugetlb(folio))) {
2229 hugetlb_remove_rmap(folio);
2230 } else {
2231 folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
2232 }
2233 if (vma->vm_flags & VM_LOCKED)
2234 mlock_drain_local();
2235 folio_put_refs(folio, nr_pages);
2236
2237 /*
2238 * If we are sure that we batched the entire folio and cleared
2239 * all PTEs, we can just optimize and stop right here.
2240 */
2241 if (nr_pages == folio_nr_pages(folio))
2242 goto walk_done;
2243 continue;
2244 walk_abort:
2245 ret = false;
2246 walk_done:
2247 page_vma_mapped_walk_done(&pvmw);
2248 break;
2249 }
2250
2251 if (prealloc_pte)
2252 pte_free(mm, prealloc_pte);
2253
2254 mmu_notifier_invalidate_range_end(&range);
2255
2256 return ret;
2257 }
2258
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
2026-02-11 13:27 ` David Hildenbrand (Arm)
2026-02-12 21:40 ` kernel test robot
@ 2026-02-12 21:40 ` kernel test robot
2 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-02-12 21:40 UTC (permalink / raw)
To: Usama Arif; +Cc: oe-kbuild-all
Hi Usama,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19 next-20260212]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-allocate-PTE-page-tables-lazily-at-split-time/20260211-205726
base: linus/master
patch link: https://lore.kernel.org/r/20260211125507.4175026-3-usama.arif%40linux.dev
patch subject: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260213/202602130520.mofTmHuk-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260213/202602130520.mofTmHuk-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602130520.mofTmHuk-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1961:56: error: 'THP_SPLIT_PMD_PTE_ALLOC_FAILED' undeclared (first use in this function)
1961 | count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/rmap.c:1961:56: note: each undeclared identifier is reported only once for each function it appears in
mm/rmap.c: In function 'try_to_migrate_one':
mm/rmap.c:2364:56: error: 'THP_SPLIT_PMD_PTE_ALLOC_FAILED' undeclared (first use in this function)
2364 | count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
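
Both errors above have the same shape: mm/rmap.c names the counter unconditionally, but in this allnoconfig build the enumerator is not declared. The report does not say where the enumerator is declared, so the cause sketched here is an assumption: the declaration is probably guarded by a config option that this build leaves off. The stand-alone user-space sketch below illustrates one common way to keep callers compiling in both configurations, by hiding the guarded enumerator behind a helper that becomes a no-op when the feature is compiled out. All names in it (`CONFIG_FEATURE_THP`, `count_event`, `count_thp_split_alloc_failed`, `try_split`) are hypothetical and are not kernel APIs.

```c
#include <assert.h>

/* Hypothetical stand-in for a Kconfig option; uncomment to enable the feature. */
/* #define CONFIG_FEATURE_THP 1 */

enum event_item {
	EV_PGFAULT,
#ifdef CONFIG_FEATURE_THP
	EV_THP_SPLIT_ALLOC_FAILED,	/* only declared when the feature is built in */
#endif
	NR_EVENTS
};

static unsigned long events[NR_EVENTS];

static inline void count_event(enum event_item item)
{
	assert(item < NR_EVENTS);
	events[item]++;
}

#ifdef CONFIG_FEATURE_THP
static inline void count_thp_split_alloc_failed(void)
{
	count_event(EV_THP_SPLIT_ALLOC_FAILED);
}
#else
/* Feature compiled out: the helper still exists, but does nothing. */
static inline void count_thp_split_alloc_failed(void)
{
}
#endif

/*
 * The caller never names the guarded enumerator directly, so it
 * compiles whether or not CONFIG_FEATURE_THP is defined.
 */
static int try_split(int have_table)
{
	if (!have_table) {
		count_thp_split_alloc_failed();
		return 0;
	}
	return 1;
}
```

Guarding each call site with the same `#ifdef` would also work, but a stub helper keeps the conditional in one place.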
vim +/THP_SPLIT_PMD_PTE_ALLOC_FAILED +1961 mm/rmap.c
1849
1850 /*
1851 * @arg: enum ttu_flags will be passed to this argument
1852 */
1853 static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
1854 unsigned long address, void *arg)
1855 {
1856 struct mm_struct *mm = vma->vm_mm;
1857 DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
1858 bool anon_exclusive, ret = true;
1859 pte_t pteval;
1860 struct page *subpage;
1861 struct mmu_notifier_range range;
1862 enum ttu_flags flags = (enum ttu_flags)(long)arg;
1863 unsigned long nr_pages = 1, end_addr;
1864 unsigned long pfn;
1865 unsigned long hsz = 0;
1866 int ptes = 0;
1867 pgtable_t prealloc_pte = NULL;
1868
1869 /*
1870 * When racing against e.g. zap_pte_range() on another cpu,
1871 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
1872 * try_to_unmap() may return before page_mapped() has become false,
1873 * if page table locking is skipped: use TTU_SYNC to wait for that.
1874 */
1875 if (flags & TTU_SYNC)
1876 pvmw.flags = PVMW_SYNC;
1877
1878 /*
1879 * For THP, we have to assume the worse case ie pmd for invalidation.
1880 * For hugetlb, it could be much worse if we need to do pud
1881 * invalidation in the case of pmd sharing.
1882 *
1883 * Note that the folio can not be freed in this function as call of
1884 * try_to_unmap() must hold a reference on the folio.
1885 */
1886 range.end = vma_address_end(&pvmw);
1887 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
1888 address, range.end);
1889 if (folio_test_hugetlb(folio)) {
1890 /*
1891 * If sharing is possible, start and end will be adjusted
1892 * accordingly.
1893 */
1894 adjust_range_if_pmd_sharing_possible(vma, &range.start,
1895 &range.end);
1896
1897 /* We need the huge page size for set_huge_pte_at() */
1898 hsz = huge_page_size(hstate_vma(vma));
1899 }
1900 mmu_notifier_invalidate_range_start(&range);
1901
1902 if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
1903 !arch_needs_pgtable_deposit())
1904 prealloc_pte = pte_alloc_one(mm);
1905
1906 while (page_vma_mapped_walk(&pvmw)) {
1907 /*
1908 * If the folio is in an mlock()d vma, we must not swap it out.
1909 */
1910 if (!(flags & TTU_IGNORE_MLOCK) &&
1911 (vma->vm_flags & VM_LOCKED)) {
1912 ptes++;
1913
1914 /*
1915 * Set 'ret' to indicate the page cannot be unmapped.
1916 *
1917 * Do not jump to walk_abort immediately as additional
1918 * iteration might be required to detect fully mapped
1919 * folio an mlock it.
1920 */
1921 ret = false;
1922
1923 /* Only mlock fully mapped pages */
1924 if (pvmw.pte && ptes != pvmw.nr_pages)
1925 continue;
1926
1927 /*
1928 * All PTEs must be protected by page table lock in
1929 * order to mlock the page.
1930 *
1931 * If page table boundary has been cross, current ptl
1932 * only protect part of ptes.
1933 */
1934 if (pvmw.flags & PVMW_PGTABLE_CROSSED)
1935 goto walk_done;
1936
1937 /* Restore the mlock which got missed */
1938 mlock_vma_folio(folio, vma);
1939 goto walk_done;
1940 }
1941
1942 if (!pvmw.pte) {
1943 if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
1944 if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
1945 goto walk_done;
1946 /*
1947 * unmap_huge_pmd_locked has either already marked
1948 * the folio as swap-backed or decided to retain it
1949 * due to GUP or speculative references.
1950 */
1951 goto walk_abort;
1952 }
1953
1954 if (flags & TTU_SPLIT_HUGE_PMD) {
1955 pgtable_t pgtable = prealloc_pte;
1956
1957 prealloc_pte = NULL;
1958
1959 if (!arch_needs_pgtable_deposit() && !pgtable &&
1960 vma_is_anonymous(vma)) {
> 1961 count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
1962 page_vma_mapped_walk_done(&pvmw);
1963 ret = false;
1964 break;
1965 }
1966 /*
1967 * We temporarily have to drop the PTL and
1968 * restart so we can process the PTE-mapped THP.
1969 */
1970 split_huge_pmd_locked(vma, pvmw.address,
1971 pvmw.pmd, false, pgtable);
1972 flags &= ~TTU_SPLIT_HUGE_PMD;
1973 page_vma_mapped_walk_restart(&pvmw);
1974 continue;
1975 }
1976 }
1977
1978 /* Unexpected PMD-mapped THP? */
1979 VM_BUG_ON_FOLIO(!pvmw.pte, folio);
1980
1981 /*
1982 * Handle PFN swap PTEs, such as device-exclusive ones, that
1983 * actually map pages.
1984 */
1985 pteval = ptep_get(pvmw.pte);
1986 if (likely(pte_present(pteval))) {
1987 pfn = pte_pfn(pteval);
1988 } else {
1989 const softleaf_t entry = softleaf_from_pte(pteval);
1990
1991 pfn = softleaf_to_pfn(entry);
1992 VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
1993 }
1994
1995 subpage = folio_page(folio, pfn - folio_pfn(folio));
1996 address = pvmw.address;
1997 anon_exclusive = folio_test_anon(folio) &&
1998 PageAnonExclusive(subpage);
1999
2000 if (folio_test_hugetlb(folio)) {
2001 bool anon = folio_test_anon(folio);
2002
2003 /*
2004 * The try_to_unmap() is only passed a hugetlb page
2005 * in the case where the hugetlb page is poisoned.
2006 */
2007 VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
2008 /*
2009 * huge_pmd_unshare may unmap an entire PMD page.
2010 * There is no way of knowing exactly which PMDs may
2011 * be cached for this mm, so we must flush them all.
2012 * start/end were already adjusted above to cover this
2013 * range.
2014 */
2015 flush_cache_range(vma, range.start, range.end);
2016
2017 /*
2018 * To call huge_pmd_unshare, i_mmap_rwsem must be
2019 * held in write mode. Caller needs to explicitly
2020 * do this outside rmap routines.
2021 *
2022 * We also must hold hugetlb vma_lock in write mode.
2023 * Lock order dictates acquiring vma_lock BEFORE
2024 * i_mmap_rwsem. We can only try lock here and fail
2025 * if unsuccessful.
2026 */
2027 if (!anon) {
2028 struct mmu_gather tlb;
2029
2030 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
2031 if (!hugetlb_vma_trylock_write(vma))
2032 goto walk_abort;
2033
2034 tlb_gather_mmu_vma(&tlb, vma);
2035 if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
2036 hugetlb_vma_unlock_write(vma);
2037 huge_pmd_unshare_flush(&tlb, vma);
2038 tlb_finish_mmu(&tlb);
2039 /*
2040 * The PMD table was unmapped,
2041 * consequently unmapping the folio.
2042 */
2043 goto walk_done;
2044 }
2045 hugetlb_vma_unlock_write(vma);
2046 tlb_finish_mmu(&tlb);
2047 }
2048 pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
2049 if (pte_dirty(pteval))
2050 folio_mark_dirty(folio);
2051 } else if (likely(pte_present(pteval))) {
2052 nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
2053 end_addr = address + nr_pages * PAGE_SIZE;
2054 flush_cache_range(vma, address, end_addr);
2055
2056 /* Nuke the page table entry. */
2057 pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
2058 /*
2059 * We clear the PTE but do not flush so potentially
2060 * a remote CPU could still be writing to the folio.
2061 * If the entry was previously clean then the
2062 * architecture must guarantee that a clear->dirty
2063 * transition on a cached TLB entry is written through
2064 * and traps if the PTE is unmapped.
2065 */
2066 if (should_defer_flush(mm, flags))
2067 set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
2068 else
2069 flush_tlb_range(vma, address, end_addr);
2070 if (pte_dirty(pteval))
2071 folio_mark_dirty(folio);
2072 } else {
2073 pte_clear(mm, address, pvmw.pte);
2074 }
2075
2076 /*
2077 * Now the pte is cleared. If this pte was uffd-wp armed,
2078 * we may want to replace a none pte with a marker pte if
2079 * it's file-backed, so we don't lose the tracking info.
2080 */
2081 pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
2082
2083 /* Update high watermark before we lower rss */
2084 update_hiwater_rss(mm);
2085
2086 if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
2087 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
2088 if (folio_test_hugetlb(folio)) {
2089 hugetlb_count_sub(folio_nr_pages(folio), mm);
2090 set_huge_pte_at(mm, address, pvmw.pte, pteval,
2091 hsz);
2092 } else {
2093 dec_mm_counter(mm, mm_counter(folio));
2094 set_pte_at(mm, address, pvmw.pte, pteval);
2095 }
2096 } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
2097 !userfaultfd_armed(vma)) {
2098 /*
2099 * The guest indicated that the page content is of no
2100 * interest anymore. Simply discard the pte, vmscan
2101 * will take care of the rest.
2102 * A future reference will then fault in a new zero
2103 * page. When userfaultfd is active, we must not drop
2104 * this page though, as its main user (postcopy
2105 * migration) will not expect userfaults on already
2106 * copied pages.
2107 */
2108 dec_mm_counter(mm, mm_counter(folio));
2109 } else if (folio_test_anon(folio)) {
2110 swp_entry_t entry = page_swap_entry(subpage);
2111 pte_t swp_pte;
2112 /*
2113 * Store the swap location in the pte.
2114 * See handle_pte_fault() ...
2115 */
2116 if (unlikely(folio_test_swapbacked(folio) !=
2117 folio_test_swapcache(folio))) {
2118 WARN_ON_ONCE(1);
2119 goto walk_abort;
2120 }
2121
2122 /* MADV_FREE page check */
2123 if (!folio_test_swapbacked(folio)) {
2124 int ref_count, map_count;
2125
2126 /*
2127 * Synchronize with gup_pte_range():
2128 * - clear PTE; barrier; read refcount
2129 * - inc refcount; barrier; read PTE
2130 */
2131 smp_mb();
2132
2133 ref_count = folio_ref_count(folio);
2134 map_count = folio_mapcount(folio);
2135
2136 /*
2137 * Order reads for page refcount and dirty flag
2138 * (see comments in __remove_mapping()).
2139 */
2140 smp_rmb();
2141
2142 if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
2143 /*
2144 * redirtied either using the page table or a previously
2145 * obtained GUP reference.
2146 */
2147 set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
2148 folio_set_swapbacked(folio);
2149 goto walk_abort;
2150 } else if (ref_count != 1 + map_count) {
2151 /*
2152 * Additional reference. Could be a GUP reference or any
2153 * speculative reference. GUP users must mark the folio
2154 * dirty if there was a modification. This folio cannot be
2155 * reclaimed right now either way, so act just like nothing
2156 * happened.
2157 * We'll come back here later and detect if the folio was
2158 * dirtied when the additional reference is gone.
2159 */
2160 set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
2161 goto walk_abort;
2162 }
2163 add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
2164 goto discard;
2165 }
2166
2167 if (swap_duplicate(entry) < 0) {
2168 set_pte_at(mm, address, pvmw.pte, pteval);
2169 goto walk_abort;
2170 }
2171
2172 /*
2173 * arch_unmap_one() is expected to be a NOP on
2174 * architectures where we could have PFN swap PTEs,
2175 * so we'll not check/care.
2176 */
2177 if (arch_unmap_one(mm, vma, address, pteval) < 0) {
2178 swap_free(entry);
2179 set_pte_at(mm, address, pvmw.pte, pteval);
2180 goto walk_abort;
2181 }
2182
2183 /* See folio_try_share_anon_rmap(): clear PTE first. */
2184 if (anon_exclusive &&
2185 folio_try_share_anon_rmap_pte(folio, subpage)) {
2186 swap_free(entry);
2187 set_pte_at(mm, address, pvmw.pte, pteval);
2188 goto walk_abort;
2189 }
2190 if (list_empty(&mm->mmlist)) {
2191 spin_lock(&mmlist_lock);
2192 if (list_empty(&mm->mmlist))
2193 list_add(&mm->mmlist, &init_mm.mmlist);
2194 spin_unlock(&mmlist_lock);
2195 }
2196 dec_mm_counter(mm, MM_ANONPAGES);
2197 inc_mm_counter(mm, MM_SWAPENTS);
2198 swp_pte = swp_entry_to_pte(entry);
2199 if (anon_exclusive)
2200 swp_pte = pte_swp_mkexclusive(swp_pte);
2201 if (likely(pte_present(pteval))) {
2202 if (pte_soft_dirty(pteval))
2203 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2204 if (pte_uffd_wp(pteval))
2205 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2206 } else {
2207 if (pte_swp_soft_dirty(pteval))
2208 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2209 if (pte_swp_uffd_wp(pteval))
2210 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2211 }
2212 set_pte_at(mm, address, pvmw.pte, swp_pte);
2213 } else {
2214 /*
2215 * This is a locked file-backed folio,
2216 * so it cannot be removed from the page
2217 * cache and replaced by a new folio before
2218 * mmu_notifier_invalidate_range_end, so no
2219 * concurrent thread might update its page table
2220 * to point at a new folio while a device is
2221 * still using this folio.
2222 *
2223 * See Documentation/mm/mmu_notifier.rst
2224 */
2225 dec_mm_counter(mm, mm_counter_file(folio));
2226 }
2227 discard:
2228 if (unlikely(folio_test_hugetlb(folio))) {
2229 hugetlb_remove_rmap(folio);
2230 } else {
2231 folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
2232 }
2233 if (vma->vm_flags & VM_LOCKED)
2234 mlock_drain_local();
2235 folio_put_refs(folio, nr_pages);
2236
2237 /*
2238 * If we are sure that we batched the entire folio and cleared
2239 * all PTEs, we can just optimize and stop right here.
2240 */
2241 if (nr_pages == folio_nr_pages(folio))
2242 goto walk_done;
2243 continue;
2244 walk_abort:
2245 ret = false;
2246 walk_done:
2247 page_vma_mapped_walk_done(&pvmw);
2248 break;
2249 }
2250
2251 if (prealloc_pte)
2252 pte_free(mm, prealloc_pte);
2253
2254 mmu_notifier_invalidate_range_end(&range);
2255
2256 return ret;
2257 }
2258
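
The listing above follows a preallocate-then-hand-off pattern: the PTE page table is allocated before the walk starts (lines 1902-1904), passed to split_huge_pmd_locked() at most once (lines 1954-1975), and freed after the loop if it was never consumed (lines 2251-2252). The user-space sketch below illustrates only that ownership pattern, with `malloc()` and a pthread mutex standing in for the page-table allocator and the page table lock. The names `struct slot` and `install_table` are hypothetical, and the rationale in the comments (allocation may block, so it stays outside the lock) is a general assumption rather than something the report states.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct slot {
	pthread_mutex_t lock;
	void *table;	/* installed under the lock, owned by the slot afterwards */
};

/* Returns true if the slot has a table on return, false if none could be installed. */
static bool install_table(struct slot *s, size_t size)
{
	/* Allocate up front: allocation may block, so keep it outside the lock. */
	void *prealloc = malloc(size);
	bool ok = true;

	pthread_mutex_lock(&s->lock);
	if (!s->table) {
		if (prealloc) {
			s->table = prealloc;	/* ownership moves to the slot */
			prealloc = NULL;
		} else {
			ok = false;	/* nothing to install: report failure */
		}
	}
	pthread_mutex_unlock(&s->lock);

	/* Free a preallocation that was never consumed; free(NULL) is a no-op. */
	free(prealloc);
	return ok;
}
```

Clearing the local pointer at hand-off is what makes the unconditional cleanup safe: the final `free()` can only ever see memory that was not installed.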
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
end of thread (newest message: 2026-02-12 21:40 UTC)
Thread overview: 22+ messages
2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-11 13:25 ` David Hildenbrand (Arm)
2026-02-11 13:38 ` Usama Arif
2026-02-12 12:13 ` Ritesh Harjani
2026-02-12 15:25 ` Usama Arif
2026-02-12 15:39 ` David Hildenbrand (Arm)
2026-02-12 16:46 ` Ritesh Harjani
2026-02-11 13:35 ` David Hildenbrand (Arm)
2026-02-11 13:46 ` Kiryl Shutsemau
2026-02-11 13:47 ` Usama Arif
2026-02-11 19:28 ` Matthew Wilcox
2026-02-11 19:55 ` David Hildenbrand (Arm)
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
2026-02-11 13:27 ` David Hildenbrand (Arm)
2026-02-11 13:31 ` Usama Arif
2026-02-11 13:36 ` David Hildenbrand (Arm)
2026-02-11 13:42 ` Usama Arif
2026-02-11 13:38 ` David Hildenbrand (Arm)
2026-02-11 13:43 ` Usama Arif
2026-02-12 21:40 ` kernel test robot
2026-02-12 21:40 ` kernel test robot