* [RFC 0/2] mm: thp: split time allocation of page table for THPs
@ 2026-02-11 12:49 Usama Arif
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
  0 siblings, 2 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC
To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Usama Arif

This is an RFC series that allocates the PTE page table only at split
time instead of pre-depositing it for THPs, as suggested by David [1].

The core patch is the first one. The second one is not needed: it just
adds the vmstat counter I used to show that split doesn't fail. It is
going to be 0 all the time, and I won't include it in future revisions.

It would have been ideal to remove all of the pre-deposit code, but that
is not possible due to PowerPC. The rationale and further details,
including why the patch is safe, are covered in the commit message of
the first patch.

[1] https://lore.kernel.org/all/ee5bd77f-87ad-4640-a974-304b488e4c64@kernel.org/

Usama Arif (2):
  mm: thp: allocate PTE page tables lazily at split time
  mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter

 include/linux/huge_mm.h       |   4 +-
 include/linux/vm_event_item.h |   1 +
 mm/huge_memory.c              | 145 ++++++++++++++++++++++++----------
 mm/khugepaged.c               |   7 +-
 mm/migrate_device.c           |  15 ++--
 mm/rmap.c                     |  42 +++++++++-
 mm/vmstat.c                   |   1 +
 7 files changed, 162 insertions(+), 53 deletions(-)

-- 
2.47.3

* [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
@ 2026-02-11 12:49 ` Usama Arif
  2026-02-11 13:25   ` David Hildenbrand (Arm)
                     ` (2 more replies)
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
  1 sibling, 3 replies; 22+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC
To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages,
it pre-allocates a PTE page table and deposits it via
pgtable_trans_huge_deposit(). This deposited table is withdrawn during
PMD split or zap. The rationale was that split must not fail: if the
kernel decides to split a THP, it needs a PTE table to populate.

However, every anon THP wastes 4KB (one page table page) that sits
unused in the deposit list for the lifetime of the mapping. On systems
with many THPs, this adds up to significant memory waste. The original
rationale is also not a real concern: it is OK for split to fail, and if
the kernel can't satisfy an order-0 allocation for split, there are much
bigger problems. On large servers, where you can easily have 100s of GBs
of THPs, these tables use 200M per 100G (one 4K table per 2M THP). This
memory could be used for anything else, including allocating the page
tables required during split.

This patch removes the pre-deposit for anonymous pages on architectures
where arch_needs_pgtable_deposit() returns false (every arch apart from
powerpc, which needs the deposit only when radix page tables are not
enabled, i.e. with the hash MMU) and allocates the PTE table lazily, only
when a split actually occurs. The split path is modified to accept a
caller-provided page table.

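As a quick check of the 200M-per-100G figure above, the arithmetic can be written out as a small userspace C helper. This is only a sketch: the constants assume 4 KiB base pages and 2 MiB PMD-sized THPs, and the macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed sizes: 4 KiB base pages and 2 MiB PMD-sized THPs. */
#define PTE_TABLE_BYTES (4ULL << 10)
#define THP_BYTES       (2ULL << 20)

/*
 * Bytes of pre-deposited PTE tables that sit idle while thp_bytes of
 * anonymous memory are mapped by PMD-level THPs: one table per THP.
 */
static uint64_t deposited_table_bytes(uint64_t thp_bytes)
{
	return (thp_bytes / THP_BYTES) * PTE_TABLE_BYTES;
}
```

With these assumptions, 100 GiB of THPs is 51200 huge pages, each holding one idle 4 KiB table, which gives 200 MiB.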
PowerPC exception:

It would have been great to remove the page table deposit code
completely, which would have made this commit mostly a code cleanup
patch. Unfortunately, PowerPC has the hash MMU, which stores hash slot
information in the deposited page table, so pre-deposit is necessary
there. All deposit/withdraw paths are guarded by
arch_needs_pgtable_deposit(), so PowerPC behavior is unchanged with this
patch. On a better note, arch_needs_pgtable_deposit() always evaluates to
false at compile time on non-PowerPC architectures, so the pre-deposit
code will not be compiled in.

Why Split Failures Are Safe:

If a system is under such severe memory pressure that even a 4K
allocation for a PTE table fails, there are far greater problems than a
THP split being delayed. The OOM killer will likely intervene before
this becomes an issue. When pte_alloc_one() fails because it cannot
allocate a 4K page, the PMD split is aborted and the THP remains intact.

I could not get split to fail, as it is very difficult to make an
order-0 allocation fail. Code analysis of what would happen if it does:

- mprotect(): If split fails in change_pmd_range(), it falls back to
  change_pte_range(), which returns an error that causes the whole
  function to be retried.
- munmap() (partial THP range): zap_pte_range() returns early when
  pte_offset_map_lock() fails, causing zap_pmd_range() to retry via
  pmd--. For a full THP range, zap_huge_pmd() unmaps the entire PMD
  without a split.
- Memory reclaim (try_to_unmap()): Returns false, the folio is rotated
  back onto the LRU and retried in the next reclaim cycle.
- Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
  skips this folio and retries later.
- CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
- madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
  try_to_migrate() with TTU_SPLIT_HUGE_PMD.
If PMD split fails, try_to_migrate() returns false, split_folio() returns -EAGAIN, and madvise returns 0 (success) silently skipping the region. This should be fine. madvise is just an advice and can fail for other reasons as well. Suggested-by: David Hildenbrand <david@kernel.org> Signed-off-by: Usama Arif <usama.arif@linux.dev> --- include/linux/huge_mm.h | 4 +- mm/huge_memory.c | 144 ++++++++++++++++++++++++++++------------ mm/khugepaged.c | 7 +- mm/migrate_device.c | 15 +++-- mm/rmap.c | 39 ++++++++++- 5 files changed, 156 insertions(+), 53 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index a4d9f964dfdea..b21bb72a298c9 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void) } void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, - pmd_t *pmd, bool freeze); + pmd_t *pmd, bool freeze, pgtable_t pgtable); bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, struct folio *folio); void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd, @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, bool freeze) {} static inline void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, - bool freeze) {} + bool freeze, pgtable_t pgtable) {} static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 44ff8a648afd5..4c9a8d89fc8aa 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) unsigned long haddr = vmf->address & HPAGE_PMD_MASK; struct vm_area_struct *vma = vmf->vma; struct folio *folio; - pgtable_t pgtable; + pgtable_t pgtable = NULL; vm_fault_t ret = 0; folio = vma_alloc_anon_folio_pmd(vma, vmf->address); if 
(unlikely(!folio)) return VM_FAULT_FALLBACK; - pgtable = pte_alloc_one(vma->vm_mm); - if (unlikely(!pgtable)) { - ret = VM_FAULT_OOM; - goto release; + if (arch_needs_pgtable_deposit()) { + pgtable = pte_alloc_one(vma->vm_mm); + if (unlikely(!pgtable)) { + ret = VM_FAULT_OOM; + goto release; + } } vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) if (userfaultfd_missing(vma)) { spin_unlock(vmf->ptl); folio_put(folio); - pte_free(vma->vm_mm, pgtable); + if (pgtable) + pte_free(vma->vm_mm, pgtable); ret = handle_userfault(vmf, VM_UFFD_MISSING); VM_BUG_ON(ret & VM_FAULT_FALLBACK); return ret; } - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); + if (pgtable) { + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, + pgtable); + mm_inc_nr_ptes(vma->vm_mm); + } map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr); - mm_inc_nr_ptes(vma->vm_mm); spin_unlock(vmf->ptl); } @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm, pmd_t entry; entry = folio_mk_pmd(zero_folio, vma->vm_page_prot); entry = pmd_mkspecial(entry); - pgtable_trans_huge_deposit(mm, pmd, pgtable); + if (pgtable) { + pgtable_trans_huge_deposit(mm, pmd, pgtable); + mm_inc_nr_ptes(mm); + } set_pmd_at(mm, haddr, pmd, entry); - mm_inc_nr_ptes(mm); } vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) if (!(vmf->flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(vma->vm_mm) && transparent_hugepage_use_zero_page()) { - pgtable_t pgtable; + pgtable_t pgtable = NULL; struct folio *zero_folio; vm_fault_t ret; - pgtable = pte_alloc_one(vma->vm_mm); - if (unlikely(!pgtable)) - return VM_FAULT_OOM; + if (arch_needs_pgtable_deposit()) { + pgtable = pte_alloc_one(vma->vm_mm); + if (unlikely(!pgtable)) + return VM_FAULT_OOM; + } zero_folio = mm_get_huge_zero_folio(vma->vm_mm); if 
(unlikely(!zero_folio)) { - pte_free(vma->vm_mm, pgtable); + if (pgtable) + pte_free(vma->vm_mm, pgtable); count_vm_event(THP_FAULT_FALLBACK); return VM_FAULT_FALLBACK; } @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) ret = check_stable_address_space(vma->vm_mm); if (ret) { spin_unlock(vmf->ptl); - pte_free(vma->vm_mm, pgtable); + if (pgtable) + pte_free(vma->vm_mm, pgtable); } else if (userfaultfd_missing(vma)) { spin_unlock(vmf->ptl); - pte_free(vma->vm_mm, pgtable); + if (pgtable) + pte_free(vma->vm_mm, pgtable); ret = handle_userfault(vmf, VM_UFFD_MISSING); VM_BUG_ON(ret & VM_FAULT_FALLBACK); } else { @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) } } else { spin_unlock(vmf->ptl); - pte_free(vma->vm_mm, pgtable); + if (pgtable) + pte_free(vma->vm_mm, pgtable); } return ret; } @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd( } add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); - mm_inc_nr_ptes(dst_mm); - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); + if (pgtable) { + mm_inc_nr_ptes(dst_mm); + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); + } if (!userfaultfd_wp(dst_vma)) pmd = pmd_swp_clear_uffd_wp(pmd); set_pmd_at(dst_mm, addr, dst_pmd, pmd); @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (!vma_is_anonymous(dst_vma)) return 0; - pgtable = pte_alloc_one(dst_mm); - if (unlikely(!pgtable)) - goto out; + if (arch_needs_pgtable_deposit()) { + pgtable = pte_alloc_one(dst_mm); + if (unlikely(!pgtable)) + goto out; + } dst_ptl = pmd_lock(dst_mm, dst_pmd); src_ptl = pmd_lockptr(src_mm, src_pmd); @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, } if (unlikely(!pmd_trans_huge(pmd))) { - pte_free(dst_mm, pgtable); + if (pgtable) + pte_free(dst_mm, pgtable); goto out_unlock; } /* @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, if 
(unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) { /* Page maybe pinned: split and retry the fault on PTEs. */ folio_put(src_folio); - pte_free(dst_mm, pgtable); + if (pgtable) + pte_free(dst_mm, pgtable); spin_unlock(src_ptl); spin_unlock(dst_ptl); __split_huge_pmd(src_vma, src_pmd, addr, false); @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, } add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); out_zero_page: - mm_inc_nr_ptes(dst_mm); - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); + if (pgtable) { + mm_inc_nr_ptes(dst_mm); + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); + } pmdp_set_wrprotect(src_mm, addr, src_pmd); if (!userfaultfd_wp(dst_vma)) pmd = pmd_clear_uffd_wp(pmd); @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); } else if (is_huge_zero_pmd(orig_pmd)) { - if (!vma_is_dax(vma) || arch_needs_pgtable_deposit()) + if (arch_needs_pgtable_deposit()) zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); } else { @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, } if (folio_test_anon(folio)) { - zap_deposited_table(tlb->mm, pmd); + if (arch_needs_pgtable_deposit()) + zap_deposited_table(tlb->mm, pmd); add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); } else { if (arch_needs_pgtable_deposit()) @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, force_flush = true; VM_BUG_ON(!pmd_none(*new_pmd)); - if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) { + if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) && + arch_needs_pgtable_deposit()) { pgtable_t pgtable; pgtable = pgtable_trans_huge_withdraw(mm, old_pmd); pgtable_trans_huge_deposit(mm, new_pmd, pgtable); @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm } set_pmd_at(mm, dst_addr, dst_pmd, 
_dst_pmd); - src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd); - pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); + if (arch_needs_pgtable_deposit()) { + src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd); + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); + } unlock_ptls: double_pt_unlock(src_ptl, dst_ptl); /* unblock rmap walks */ @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, - unsigned long haddr, pmd_t *pmd) + unsigned long haddr, pmd_t *pmd, pgtable_t pgtable) { struct mm_struct *mm = vma->vm_mm; - pgtable_t pgtable; pmd_t _pmd, old_pmd; unsigned long addr; pte_t *pte; @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, */ old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd); - pgtable = pgtable_trans_huge_withdraw(mm, pmd); + if (arch_needs_pgtable_deposit()) { + pgtable = pgtable_trans_huge_withdraw(mm, pmd); + } else { + VM_BUG_ON(!pgtable); + /* + * Account for the freshly allocated (in __split_huge_pmd) pgtable + * being used in mm. 
+ */ + mm_inc_nr_ptes(mm); + } pmd_populate(mm, &_pmd, pgtable); pte = pte_offset_map(&_pmd, haddr); @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, } static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long haddr, bool freeze) + unsigned long haddr, bool freeze, pgtable_t pgtable) { struct mm_struct *mm = vma->vm_mm; struct folio *folio; struct page *page; - pgtable_t pgtable; pmd_t old_pmd, _pmd; bool soft_dirty, uffd_wp = false, young = false, write = false; bool anon_exclusive = false, dirty = false; @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, */ if (arch_needs_pgtable_deposit()) zap_deposited_table(mm, pmd); + if (pgtable) + pte_free(mm, pgtable); if (!vma_is_dax(vma) && vma_is_special_huge(vma)) return; if (unlikely(pmd_is_migration_entry(old_pmd))) { @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, * small page also write protected so it does not seems useful * to invalidate secondary mmu at this time. */ - return __split_huge_zero_page_pmd(vma, haddr, pmd); + return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable); } if (pmd_is_migration_entry(*pmd)) { @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, * Withdraw the table only after we mark the pmd entry invalid. * This's critical for some architectures (Power). */ - pgtable = pgtable_trans_huge_withdraw(mm, pmd); + if (arch_needs_pgtable_deposit()) { + pgtable = pgtable_trans_huge_withdraw(mm, pmd); + } else { + VM_BUG_ON(!pgtable); + /* + * Account for the freshly allocated (in __split_huge_pmd) pgtable + * being used in mm. 
+ */ + mm_inc_nr_ptes(mm); + } pmd_populate(mm, &_pmd, pgtable); pte = pte_offset_map(&_pmd, haddr); @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, - pmd_t *pmd, bool freeze) + pmd_t *pmd, bool freeze, pgtable_t pgtable) { VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE)); if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd)) - __split_huge_pmd_locked(vma, pmd, address, freeze); + __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable); + else if (pgtable) + pte_free(vma->vm_mm, pgtable); } void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, { spinlock_t *ptl; struct mmu_notifier_range range; + pgtable_t pgtable = NULL; mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, address & HPAGE_PMD_MASK, (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); mmu_notifier_invalidate_range_start(&range); + + /* allocate pagetable before acquiring pmd lock */ + if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) { + pgtable = pte_alloc_one(vma->vm_mm); + if (!pgtable) { + mmu_notifier_invalidate_range_end(&range); + return; + } + } + ptl = pmd_lock(vma->vm_mm, pmd); - split_huge_pmd_locked(vma, range.start, pmd, freeze); + split_huge_pmd_locked(vma, range.start, pmd, freeze, pgtable); spin_unlock(ptl); mmu_notifier_invalidate_range_end(&range); } @@ -3402,7 +3459,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma, } folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma); - zap_deposited_table(mm, pmdp); + if (arch_needs_pgtable_deposit()) + zap_deposited_table(mm, pmdp); add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR); if (vma->vm_flags & VM_LOCKED) mlock_drain_local(); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index fa1e57fd2c469..0e976e4c975ef 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ 
-1223,7 +1223,12 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); - pgtable_trans_huge_deposit(mm, pmd, pgtable); + if (arch_needs_pgtable_deposit()) { + pgtable_trans_huge_deposit(mm, pmd, pgtable); + } else { + mm_dec_nr_ptes(mm); + pte_free(mm, pgtable); + } map_anon_folio_pmd_nopf(folio, pmd, vma, address); spin_unlock(pmd_ptl); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 0a8b31939640f..053db74303e36 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -829,9 +829,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate, __folio_mark_uptodate(folio); - pgtable = pte_alloc_one(vma->vm_mm); - if (unlikely(!pgtable)) - goto abort; + if (arch_needs_pgtable_deposit()) { + pgtable = pte_alloc_one(vma->vm_mm); + if (unlikely(!pgtable)) + goto abort; + } else { + pgtable = NULL; + } if (folio_is_device_private(folio)) { swp_entry_t swp_entry; @@ -879,10 +883,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate, folio_get(folio); if (flush) { - pte_free(vma->vm_mm, pgtable); + if (pgtable) + pte_free(vma->vm_mm, pgtable); flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE); pmdp_invalidate(vma, addr, pmdp); - } else { + } else if (pgtable) { pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable); mm_inc_nr_ptes(vma->vm_mm); } diff --git a/mm/rmap.c b/mm/rmap.c index edf5d32f46042..c6ff23fc12944 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -76,6 +76,7 @@ #include <linux/mm_inline.h> #include <linux/oom.h> +#include <asm/pgalloc.h> #include <asm/tlb.h> #define CREATE_TRACE_POINTS @@ -1978,6 +1979,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, unsigned long pfn; unsigned long hsz = 0; int ptes = 0; + pgtable_t prealloc_pte = NULL; /* * When racing against e.g. 
zap_pte_range() on another cpu, @@ -2012,6 +2014,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, } mmu_notifier_invalidate_range_start(&range); + if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) && + !arch_needs_pgtable_deposit()) + prealloc_pte = pte_alloc_one(mm); + while (page_vma_mapped_walk(&pvmw)) { /* * If the folio is in an mlock()d vma, we must not swap it out. @@ -2061,12 +2067,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, } if (flags & TTU_SPLIT_HUGE_PMD) { + pgtable_t pgtable = prealloc_pte; + + prealloc_pte = NULL; + if (!arch_needs_pgtable_deposit() && !pgtable && + vma_is_anonymous(vma)) { + page_vma_mapped_walk_done(&pvmw); + ret = false; + break; + } /* * We temporarily have to drop the PTL and * restart so we can process the PTE-mapped THP. */ split_huge_pmd_locked(vma, pvmw.address, - pvmw.pmd, false); + pvmw.pmd, false, pgtable); flags &= ~TTU_SPLIT_HUGE_PMD; page_vma_mapped_walk_restart(&pvmw); continue; @@ -2346,6 +2361,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, break; } + if (prealloc_pte) + pte_free(mm, prealloc_pte); + mmu_notifier_invalidate_range_end(&range); return ret; @@ -2405,6 +2423,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, enum ttu_flags flags = (enum ttu_flags)(long)arg; unsigned long pfn; unsigned long hsz = 0; + pgtable_t prealloc_pte = NULL; /* * When racing against e.g. 
zap_pte_range() on another cpu, @@ -2439,6 +2458,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, } mmu_notifier_invalidate_range_start(&range); + if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) && + !arch_needs_pgtable_deposit()) + prealloc_pte = pte_alloc_one(mm); + while (page_vma_mapped_walk(&pvmw)) { /* PMD-mapped THP migration entry */ if (!pvmw.pte) { @@ -2446,8 +2469,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, __maybe_unused pmd_t pmdval; if (flags & TTU_SPLIT_HUGE_PMD) { + pgtable_t pgtable = prealloc_pte; + + prealloc_pte = NULL; + if (!arch_needs_pgtable_deposit() && !pgtable && + vma_is_anonymous(vma)) { + page_vma_mapped_walk_done(&pvmw); + ret = false; + break; + } split_huge_pmd_locked(vma, pvmw.address, - pvmw.pmd, true); + pvmw.pmd, true, pgtable); ret = false; page_vma_mapped_walk_done(&pvmw); break; @@ -2698,6 +2730,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, folio_put(folio); } + if (prealloc_pte) + pte_free(mm, prealloc_pte); + mmu_notifier_invalidate_range_end(&range); return ret; -- 2.47.3 ^ permalink raw reply related [flat|nested] 22+ messages in thread
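The ownership rule that the patch above introduces (allocate the table before taking the PMD lock, hand it to the locked split, and have the split either install or free it) can be sketched as a userspace C model. Everything here is a stand-in: `struct pmd_model`, `table_alloc()`, `split_locked()`, `split()` and the `fail_next_alloc` switch are hypothetical names, there is no real locking, and unlike `__split_huge_pmd()` the model returns a bool so the abort case is visible.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for one PMD slot. */
struct pmd_model {
	bool huge;       /* true while the range is mapped by one huge entry */
	void *pte_table; /* PTE table installed by a successful split */
};

/* Hypothetical switch so that an order-0 allocation failure can be simulated. */
static bool fail_next_alloc;

static void *table_alloc(void)
{
	if (fail_next_alloc) {
		fail_next_alloc = false;
		return NULL;
	}
	return calloc(512, sizeof(unsigned long));
}

/* Always consumes 'table': installs it on split, frees it otherwise. */
static void split_locked(struct pmd_model *pmd, void *table)
{
	if (pmd->huge) {
		pmd->pte_table = table;
		pmd->huge = false;
	} else {
		free(table); /* entry was already split: table not needed */
	}
}

/* Returns false when the split was aborted and the entry is still huge. */
static bool split(struct pmd_model *pmd)
{
	/* Allocate first: in the kernel this may sleep, so no lock is held. */
	void *table = table_alloc();

	if (!table)
		return false;
	/* The PMD lock would be taken here ... */
	split_locked(pmd, table);
	/* ... and dropped here. */
	return true;
}
```

In this model a failed allocation leaves the mapping huge and untouched, which is the property the failure analysis in the commit message relies on; a caller simply sees "not split yet" and can retry later.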
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif @ 2026-02-11 13:25 ` David Hildenbrand (Arm) 2026-02-11 13:38 ` Usama Arif 2026-02-12 12:13 ` Ritesh Harjani 2026-02-11 13:35 ` David Hildenbrand (Arm) 2026-02-11 19:28 ` Matthew Wilcox 2 siblings, 2 replies; 22+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-11 13:25 UTC (permalink / raw) To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan, Michael Ellerman, linuxppc-dev CCing ppc folks On 2/11/26 13:49, Usama Arif wrote: > When the kernel creates a PMD-level THP mapping for anonymous pages, > it pre-allocates a PTE page table and deposits it via > pgtable_trans_huge_deposit(). This deposited table is withdrawn during > PMD split or zap. The rationale was that split must not fail—if the > kernel decides to split a THP, it needs a PTE table to populate. > > However, every anon THP wastes 4KB (one page table page) that sits > unused in the deposit list for the lifetime of the mapping. On systems > with many THPs, this adds up to significant memory waste. The original > rationale is also not an issue. It is ok for split to fail, and if the > kernel can't find an order 0 allocation for split, there are much bigger > problems. On large servers where you can easily have 100s of GBs of THPs, > the memory usage for these tables is 200M per 100G. This memory could be > used for any other usecase, which include allocating the pagetables > required during split. 
>
> This patch removes the pre-deposit for anonymous pages on architectures
> where arch_needs_pgtable_deposit() returns false (every arch apart from
> powerpc, and only when radix hash tables are not enabled) and allocates
> the PTE table lazily, only when a split actually occurs. The split path
> is modified to accept a caller-provided page table.
>
> PowerPC exception:
>
> It would have been great if we can completely remove the pagetable
> deposit code and this commit would mostly have been a code cleanup patch,
> unfortunately PowerPC has hash MMU, it stores hash slot information in
> the deposited page table and pre-deposit is necessary. All deposit/
> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> behavior is unchanged with this patch. On a better note,
> arch_needs_pgtable_deposit will always evaluate to false at compile time
> on non PowerPC architectures and the pre-deposit code will not be
> compiled in.

Is there a way to remove this? It's always been a confusing hack, and now
it's unpleasant to have around :)

In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1
copied the generic pgtable_trans_huge_deposit() hurts my belly.

IIUC, hash is mostly used on legacy power systems, radix on newer ones.

So one obvious solution: remove PMD THP support for hash MMUs along with
all this hacky deposit code.

The "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar
checks need to be wrapped in a reasonable helper, and likely this all
needs to get cleaned up further.

The implementation of the generic pgtable_trans_huge_deposit() and the
radix handlers etc. must be removed. If any code triggered them, it
would be a bug.

If we have to keep this around, pgtable_trans_huge_deposit() should
likely get renamed to arch_pgtable_trans_huge_deposit() etc., as there
will not be generic support for it.

-- 
Cheers,

David

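The helper suggested in the reply above might look roughly like the following. This is a userspace sketch only: the name `thp_split_needs_pgtable_alloc()` is hypothetical and does not come from the thread, and the two kernel predicates are replaced by small stand-ins (a NULL `vm_ops` marks an anonymous VMA, and a runtime flag replaces what is a per-architecture constant in the kernel).

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel types, for illustration only. */
struct vm_ops { int unused; };
struct vm_area_struct { const struct vm_ops *vm_ops; };

/* Stand-in knob; in the kernel this is decided per architecture. */
static bool arch_deposit_needed;

static inline bool vma_is_anonymous(const struct vm_area_struct *vma)
{
	return !vma->vm_ops;
}

static inline bool arch_needs_pgtable_deposit(void)
{
	return arch_deposit_needed;
}

/*
 * Hypothetical helper: true when a PMD split of this VMA must bring its
 * own freshly allocated PTE table because nothing was deposited for it.
 */
static inline bool thp_split_needs_pgtable_alloc(const struct vm_area_struct *vma)
{
	return vma_is_anonymous(vma) && !arch_needs_pgtable_deposit();
}
```

Call sites such as the preallocation in `__split_huge_pmd()` and the two rmap walkers could then test a single predicate instead of repeating the compound condition.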
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-11 13:25 ` David Hildenbrand (Arm) @ 2026-02-11 13:38 ` Usama Arif 2026-02-12 12:13 ` Ritesh Harjani 1 sibling, 0 replies; 22+ messages in thread From: Usama Arif @ 2026-02-11 13:38 UTC (permalink / raw) To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan, Michael Ellerman, linuxppc-dev On 11/02/2026 13:25, David Hildenbrand (Arm) wrote: > CCing ppc folks > > On 2/11/26 13:49, Usama Arif wrote: >> When the kernel creates a PMD-level THP mapping for anonymous pages, >> it pre-allocates a PTE page table and deposits it via >> pgtable_trans_huge_deposit(). This deposited table is withdrawn during >> PMD split or zap. The rationale was that split must not fail—if the >> kernel decides to split a THP, it needs a PTE table to populate. >> >> However, every anon THP wastes 4KB (one page table page) that sits >> unused in the deposit list for the lifetime of the mapping. On systems >> with many THPs, this adds up to significant memory waste. The original >> rationale is also not an issue. It is ok for split to fail, and if the >> kernel can't find an order 0 allocation for split, there are much bigger >> problems. On large servers where you can easily have 100s of GBs of THPs, >> the memory usage for these tables is 200M per 100G. This memory could be >> used for any other usecase, which include allocating the pagetables >> required during split. >> >> This patch removes the pre-deposit for anonymous pages on architectures >> where arch_needs_pgtable_deposit() returns false (every arch apart from >> powerpc, and only when radix hash tables are not enabled) and allocates >> the PTE table lazily—only when a split actually occurs. The split path >> is modified to accept a caller-provided page table. 
>>
>> PowerPC exception:
>>
>> It would have been great if we can completely remove the pagetable
>> deposit code and this commit would mostly have been a code cleanup patch,
>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>> the deposited page table and pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
>
> Is there a way to remove this? It's always been a confusing hack, now it's unpleasant to have around :)

I spent some time researching this (I haven't worked with PowerPC before)
as I really wanted to get rid of all the pre-deposit code. I can't really
see a way without removing PMD THP support.

I was going to CC the PowerPC maintainers, but I see that you already did!

>
> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 copied generic pgtable_trans_huge_deposit() hurts my belly.
>
>
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>

Yes, that is what I found as well.

> So one obvious solution: remove PMD THP support for hash MMUs along with all this hacky deposit code.
>

I would be happy with that!

>
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar checks need to be wrapped in a reasonable helper and likely this all needs to get cleaned up further.

Ack. The code will definitely look a lot cleaner and won't have much of
this if we decide to remove PMD THP support for hash MMU.

>
> The implementation if the generic pgtable_trans_huge_deposit and the radix handlers etc must be removed. If any code would trigger them it would be a bug.
>
> If we have to keep this around, pgtable_trans_huge_deposit() should likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there will not be generic support for it.
>

Ack.

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-11 13:25 ` David Hildenbrand (Arm) 2026-02-11 13:38 ` Usama Arif @ 2026-02-12 12:13 ` Ritesh Harjani 2026-02-12 15:25 ` Usama Arif 2026-02-12 15:39 ` David Hildenbrand (Arm) 1 sibling, 2 replies; 22+ messages in thread From: Ritesh Harjani @ 2026-02-12 12:13 UTC (permalink / raw) To: David Hildenbrand (Arm), Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan, Michael Ellerman, linuxppc-dev "David Hildenbrand (Arm)" <david@kernel.org> writes: > CCing ppc folks > Thanks David! > On 2/11/26 13:49, Usama Arif wrote: >> When the kernel creates a PMD-level THP mapping for anonymous pages, >> it pre-allocates a PTE page table and deposits it via >> pgtable_trans_huge_deposit(). This deposited table is withdrawn during >> PMD split or zap. The rationale was that split must not fail—if the >> kernel decides to split a THP, it needs a PTE table to populate. >> >> However, every anon THP wastes 4KB (one page table page) that sits >> unused in the deposit list for the lifetime of the mapping. On systems >> with many THPs, this adds up to significant memory waste. The original >> rationale is also not an issue. It is ok for split to fail, and if the >> kernel can't find an order 0 allocation for split, there are much bigger >> problems. On large servers where you can easily have 100s of GBs of THPs, >> the memory usage for these tables is 200M per 100G. This memory could be >> used for any other usecase, which include allocating the pagetables >> required during split. 
>> >> This patch removes the pre-deposit for anonymous pages on architectures >> where arch_needs_pgtable_deposit() returns false (every arch apart from >> powerpc, and only when radix hash tables are not enabled) and allocates >> the PTE table lazily—only when a split actually occurs. The split path >> is modified to accept a caller-provided page table. >> >> PowerPC exception: >> >> It would have been great if we can completely remove the pagetable >> deposit code and this commit would mostly have been a code cleanup patch, >> unfortunately PowerPC has hash MMU, it stores hash slot information in >> the deposited page table and pre-deposit is necessary. All deposit/ >> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC >> behavior is unchanged with this patch. On a better note, >> arch_needs_pgtable_deposit will always evaluate to false at compile time >> on non PowerPC architectures and the pre-deposit code will not be >> compiled in. > > Is there a way to remove this? It's always been a confusing hack, now > it's unpleasant to have around :) > Hash MMU on PowerPC works fundamentally different than other MMUs (unlike Radix MMU on PowerPC). So yes, it requires few tricks to fit into the Linux's multi-level SW page table model. ;) > In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 > copied generic pgtable_trans_huge_deposit() hurts my belly. > On PowerPC, pgtable_t can be a pte fragment. typedef pte_t *pgtable_t; That means a single page can be shared among other PTE page tables. So, we cannot use page->lru which the generic implementation uses. I guess due to this, there is a slight change in implementation of radix__pgtable_trans_huge_deposit(). Doing a grep search, I think that's the same for sparc and s390 as well. > > IIUC, hash is mostly used on legacy power systems, radix on newer ones. > > So one obvious solution: remove PMD THP support for hash MMUs along with > all this hacky deposit code. 
> Unfortunately, please no. There are real customers using Hash MMU on Power9 and even on older generations and this would mean breaking Hash PMD THP support for them. > > the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar > checks need to be wrapped in a reasonable helper and likely this all > needs to get cleaned up further. > > The implementation if the generic pgtable_trans_huge_deposit and the > radix handlers etc must be removed. If any code would trigger them it > would be a bug. > Sure, I think after this patch series, the radix__pgtable_trans_huge_deposit() will mostly be dead code anyway. I will spend some time going through this series and will also give it a test on powerpc HW (with both Hash and Radix MMU). I guess we should also look at removing pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw() implementations from s390 and sparc, since those too will be dead code after this. > If we have to keep this around, pgtable_trans_huge_deposit() should > likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there > will not be generic support for it. > Sure. That makes sense since PowerPC Hash MMU will still need this. -ritesh
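To illustrate the pte-fragment point made above: when pgtable_t is a bare pte_t pointer and several tables can share one page, the deposit list cannot be linked through per-page metadata such as page->lru. One way around this is to keep the linkage inside the unused table memory itself. The following is a simplified userspace sketch of that idea and not the powerpc implementation: frag_deposit(), frag_withdraw(), struct deposit_slot and PTRS_PER_FRAGMENT are hypothetical names, and a LIFO stack is used instead of a kernel list_head.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PTRS_PER_FRAGMENT 64	/* hypothetical fragment size, in entries */

typedef unsigned long pte_t;	/* stand-in, not the kernel type */
typedef pte_t *pgtable_t;	/* mirrors the typedef quoted above */

/* The link must fit inside the table it is stored in. */
_Static_assert(sizeof(pgtable_t) <= PTRS_PER_FRAGMENT * sizeof(pte_t),
	       "table too small to hold a link");

/* Hypothetical per-PMD slot holding the most recently deposited table. */
struct deposit_slot {
	pgtable_t head;
};

/* Push a table: its first bytes store the address of the previous head. */
static void frag_deposit(struct deposit_slot *slot, pgtable_t table)
{
	assert(table != NULL);
	memcpy(table, &slot->head, sizeof(slot->head));
	slot->head = table;
}

/* Pop a table and wipe the link so it is an all-zero PTE table again. */
static pgtable_t frag_withdraw(struct deposit_slot *slot)
{
	pgtable_t table = slot->head;

	if (!table)
		return NULL;
	memcpy(&slot->head, table, sizeof(slot->head));
	memset(table, 0, sizeof(slot->head));
	return table;
}
```

The design point is that nothing outside the table is needed to chain deposited tables together, which is why a shared backing page is not a problem for this scheme.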
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-12 12:13 ` Ritesh Harjani @ 2026-02-12 15:25 ` Usama Arif 2026-02-12 15:39 ` David Hildenbrand (Arm) 1 sibling, 0 replies; 22+ messages in thread From: Usama Arif @ 2026-02-12 15:25 UTC (permalink / raw) To: Ritesh Harjani (IBM), David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan, Michael Ellerman, linuxppc-dev On 12/02/2026 12:13, Ritesh Harjani (IBM) wrote: > "David Hildenbrand (Arm)" <david@kernel.org> writes: > >> CCing ppc folks >> > > Thanks David! > >> On 2/11/26 13:49, Usama Arif wrote: >>> When the kernel creates a PMD-level THP mapping for anonymous pages, >>> it pre-allocates a PTE page table and deposits it via >>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during >>> PMD split or zap. The rationale was that split must not fail—if the >>> kernel decides to split a THP, it needs a PTE table to populate. >>> >>> However, every anon THP wastes 4KB (one page table page) that sits >>> unused in the deposit list for the lifetime of the mapping. On systems >>> with many THPs, this adds up to significant memory waste. The original >>> rationale is also not an issue. It is ok for split to fail, and if the >>> kernel can't find an order 0 allocation for split, there are much bigger >>> problems. On large servers where you can easily have 100s of GBs of THPs, >>> the memory usage for these tables is 200M per 100G. This memory could be >>> used for any other usecase, which include allocating the pagetables >>> required during split. 
>>> >>> This patch removes the pre-deposit for anonymous pages on architectures >>> where arch_needs_pgtable_deposit() returns false (every arch apart from >>> powerpc, and only when radix hash tables are not enabled) and allocates >>> the PTE table lazily—only when a split actually occurs. The split path >>> is modified to accept a caller-provided page table. >>> >>> PowerPC exception: >>> >>> It would have been great if we can completely remove the pagetable >>> deposit code and this commit would mostly have been a code cleanup patch, >>> unfortunately PowerPC has hash MMU, it stores hash slot information in >>> the deposited page table and pre-deposit is necessary. All deposit/ >>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC >>> behavior is unchanged with this patch. On a better note, >>> arch_needs_pgtable_deposit will always evaluate to false at compile time >>> on non PowerPC architectures and the pre-deposit code will not be >>> compiled in. >> >> Is there a way to remove this? It's always been a confusing hack, now >> it's unpleasant to have around :) >> > > Hash MMU on PowerPC works fundamentally different than other MMUs > (unlike Radix MMU on PowerPC). So yes, it requires few tricks to fit > into the Linux's multi-level SW page table model. ;) > > >> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 >> copied generic pgtable_trans_huge_deposit() hurts my belly. >> > > On PowerPC, pgtable_t can be a pte fragment. > > typedef pte_t *pgtable_t; > > That means a single page can be shared among other PTE page tables. So, we > cannot use page->lru which the generic implementation uses. I guess due > to this, there is a slight change in implementation of > radix__pgtable_trans_huge_deposit(). > > Doing a grep search, I think that's the same for sparc and s390 as well. > >> >> IIUC, hash is mostly used on legacy power systems, radix on newer ones. 
>> >> So one obvious solution: remove PMD THP support for hash MMUs along with >> all this hacky deposit code. >> > > Unfortunately, please no. There are real customers using Hash MMU on > Power9 and even on older generations and this would mean breaking Hash > PMD THP support for them. > > Thanks for confirming! I will keep the pagetable deposit for powerpc in the next revision. I will rename pgtable_trans_huge_deposit to arch_pgtable_trans_huge_deposit and move it to arch/powerpc. It will be an empty function for the rest of the architectures. >> >> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar >> checks need to be wrapped in a reasonable helper and likely this all >> needs to get cleaned up further. >> >> The implementation if the generic pgtable_trans_huge_deposit and the >> radix handlers etc must be removed. If any code would trigger them it >> would be a bug. >> > > Sure, I think after this patch series, the radix__pgtable_trans_huge_deposit() > will mostly be a dead code anyways. I will spend some time going > through this series and will also give it a test on powerpc HW (with > both Hash and Radix MMU). > > I guess, we should also look at removing pgtable_trans_huge_deposit() and > pgtable_trans_huge_withdraw() implementations from s390 and sparc, since > those too will be dead code after this. > > >> If we have to keep this around, pgtable_trans_huge_deposit() should >> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there >> will not be generic support for it. >> > > Sure. That make sense since PowerPC Hash MMU will still need this. > > -ritesh
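As a quick check of the figure in the quoted commit message (200M of page tables per 100G of THPs), the arithmetic can be sketched as below. The sizes are assumptions, 4 KiB base pages and 2 MiB PMD-sized THPs as on x86-64; other page-size configurations give different numbers. The function name deposited_pgtable_bytes() is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: 4 KiB base page, 2 MiB PMD-mapped THP. */
#define BASE_PAGE_SIZE	(4ULL << 10)
#define PMD_THP_SIZE	(2ULL << 20)

/*
 * Bytes held in deposited PTE tables when thp_bytes of memory is mapped
 * by PMD-level anonymous THPs: one base page per THP.
 */
static uint64_t deposited_pgtable_bytes(uint64_t thp_bytes)
{
	assert(thp_bytes % PMD_THP_SIZE == 0);
	return (thp_bytes / PMD_THP_SIZE) * BASE_PAGE_SIZE;
}
```

For 100 GiB this is 51200 THPs times 4 KiB, which is 200 MiB, so the overhead is a fixed 1/512 of the THP-mapped memory under these assumptions.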
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-12 12:13 ` Ritesh Harjani 2026-02-12 15:25 ` Usama Arif @ 2026-02-12 15:39 ` David Hildenbrand (Arm) 2026-02-12 16:46 ` Ritesh Harjani 1 sibling, 1 reply; 22+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-12 15:39 UTC (permalink / raw) To: Ritesh Harjani (IBM), Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan, Michael Ellerman, linuxppc-dev >> >> Is there a way to remove this? It's always been a confusing hack, now >> it's unpleasant to have around :) >> > > Hash MMU on PowerPC works fundamentally different than other MMUs > (unlike Radix MMU on PowerPC). So yes, it requires few tricks to fit > into the Linux's multi-level SW page table model. ;) :) > >> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 >> copied generic pgtable_trans_huge_deposit() hurts my belly. >> > > On PowerPC, pgtable_t can be a pte fragment. > > typedef pte_t *pgtable_t; > > That means a single page can be shared among other PTE page tables. So, we > cannot use page->lru which the generic implementation uses. I guess due > to this, there is a slight change in implementation of > radix__pgtable_trans_huge_deposit(). Ah, did not spot this difference, but makes sense. Still ugly, but make sense. Fortunately it would go away with this RFC. > > Doing a grep search, I think that's the same for sparc and s390 as well. ... and I also did not realize that s390x+sparc have separate implementations we can now get rid of as well. > >> >> IIUC, hash is mostly used on legacy power systems, radix on newer ones. >> >> So one obvious solution: remove PMD THP support for hash MMUs along with >> all this hacky deposit code. >> > > Unfortunately, please no. 
There are real customers using Hash MMU on > Power9 and even on older generations and this would mean breaking Hash > PMD THP support for them. > I was expecting this answer :) > >> >> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar >> checks need to be wrapped in a reasonable helper and likely this all >> needs to get cleaned up further. >> >> The implementation if the generic pgtable_trans_huge_deposit and the >> radix handlers etc must be removed. If any code would trigger them it >> would be a bug. >> > > Sure, I think after this patch series, the radix__pgtable_trans_huge_deposit() > will mostly be a dead code anyways. I will spend some time going > through this series and will also give it a test on powerpc HW (with > both Hash and Radix MMU). Thanks! The series will grow quite a bit I think, so retesting new revisions will be very appreciated! > > I guess, we should also look at removing pgtable_trans_huge_deposit() and > pgtable_trans_huge_withdraw() implementations from s390 and sparc, since > those too will be dead code after this. Exactly. -- Cheers, David
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-12 15:39 ` David Hildenbrand (Arm) @ 2026-02-12 16:46 ` Ritesh Harjani 0 siblings, 0 replies; 22+ messages in thread From: Ritesh Harjani @ 2026-02-12 16:46 UTC (permalink / raw) To: David Hildenbrand (Arm), Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan, Michael Ellerman, linuxppc-dev "David Hildenbrand (Arm)" <david@kernel.org> writes: > > Thanks! The series will grow quite a bit I think, so retesting new > revisions will be very appreciated! > Definitely. Thanks! -ritesh ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif 2026-02-11 13:25 ` David Hildenbrand (Arm) @ 2026-02-11 13:35 ` David Hildenbrand (Arm) 2026-02-11 13:46 ` Kiryl Shutsemau 2026-02-11 13:47 ` Usama Arif 2026-02-11 19:28 ` Matthew Wilcox 2 siblings, 2 replies; 22+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-11 13:35 UTC (permalink / raw) To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 2/11/26 13:49, Usama Arif wrote: > When the kernel creates a PMD-level THP mapping for anonymous pages, > it pre-allocates a PTE page table and deposits it via > pgtable_trans_huge_deposit(). This deposited table is withdrawn during > PMD split or zap. The rationale was that split must not fail—if the > kernel decides to split a THP, it needs a PTE table to populate. > > However, every anon THP wastes 4KB (one page table page) that sits > unused in the deposit list for the lifetime of the mapping. On systems > with many THPs, this adds up to significant memory waste. The original > rationale is also not an issue. It is ok for split to fail, and if the > kernel can't find an order 0 allocation for split, there are much bigger > problems. On large servers where you can easily have 100s of GBs of THPs, > the memory usage for these tables is 200M per 100G. This memory could be > used for any other usecase, which include allocating the pagetables > required during split. > > This patch removes the pre-deposit for anonymous pages on architectures > where arch_needs_pgtable_deposit() returns false (every arch apart from > powerpc, and only when radix hash tables are not enabled) and allocates > the PTE table lazily—only when a split actually occurs. 
The split path > is modified to accept a caller-provided page table. > > PowerPC exception: > > It would have been great if we can completely remove the pagetable > deposit code and this commit would mostly have been a code cleanup patch, > unfortunately PowerPC has hash MMU, it stores hash slot information in > the deposited page table and pre-deposit is necessary. All deposit/ > withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC > behavior is unchanged with this patch. On a better note, > arch_needs_pgtable_deposit will always evaluate to false at compile time > on non PowerPC architectures and the pre-deposit code will not be > compiled in. > > Why Split Failures Are Safe: > > If a system is under severe memory pressure that even a 4K allocation > fails for a PTE table, there are far greater problems than a THP split > being delayed. The OOM killer will likely intervene before this becomes an > issue. > When pte_alloc_one() fails due to not being able to allocate a 4K page, > the PMD split is aborted and the THP remains intact. I could not get split > to fail, as its very difficult to make order-0 allocation to fail. > Code analysis of what would happen if it does: > > - mprotect(): If split fails in change_pmd_range, it will fallback > to change_pte_range, which will return an error which will cause the > whole function to be retried again. > > - munmap() (partial THP range): zap_pte_range() returns early when > pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--. > For full THP range, zap_huge_pmd() unmaps the entire PMD without > split. > > - Memory reclaim (try_to_unmap()): Returns false, folio rotated back > LRU, retried in next reclaim cycle. > > - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration > skips this folio, retried later. > > - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried. 
> > - madvise (MADV_COLD/PAGEOUT): split_folio() internally calls > try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails, > try_to_migrate() returns false, split_folio() returns -EAGAIN, > and madvise returns 0 (success) silently skipping the region. This > should be fine. madvise is just an advice and can fail for other > reasons as well. > > Suggested-by: David Hildenbrand <david@kernel.org> > Signed-off-by: Usama Arif <usama.arif@linux.dev> > --- > include/linux/huge_mm.h | 4 +- > mm/huge_memory.c | 144 ++++++++++++++++++++++++++++------------ > mm/khugepaged.c | 7 +- > mm/migrate_device.c | 15 +++-- > mm/rmap.c | 39 ++++++++++- > 5 files changed, 156 insertions(+), 53 deletions(-) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index a4d9f964dfdea..b21bb72a298c9 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void) > } > > void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, > - pmd_t *pmd, bool freeze); > + pmd_t *pmd, bool freeze, pgtable_t pgtable); > bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, > pmd_t *pmdp, struct folio *folio); > void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd, > @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma, > unsigned long address, bool freeze) {} > static inline void split_huge_pmd_locked(struct vm_area_struct *vma, > unsigned long address, pmd_t *pmd, > - bool freeze) {} > + bool freeze, pgtable_t pgtable) {} > > static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma, > unsigned long addr, pmd_t *pmdp, > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 44ff8a648afd5..4c9a8d89fc8aa 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) > unsigned long haddr = vmf->address & HPAGE_PMD_MASK; > 
struct vm_area_struct *vma = vmf->vma; > struct folio *folio; > - pgtable_t pgtable; > + pgtable_t pgtable = NULL; > vm_fault_t ret = 0; > > folio = vma_alloc_anon_folio_pmd(vma, vmf->address); > if (unlikely(!folio)) > return VM_FAULT_FALLBACK; > > - pgtable = pte_alloc_one(vma->vm_mm); > - if (unlikely(!pgtable)) { > - ret = VM_FAULT_OOM; > - goto release; > + if (arch_needs_pgtable_deposit()) { > + pgtable = pte_alloc_one(vma->vm_mm); > + if (unlikely(!pgtable)) { > + ret = VM_FAULT_OOM; > + goto release; > + } > } > > vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); > @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) > if (userfaultfd_missing(vma)) { > spin_unlock(vmf->ptl); > folio_put(folio); > - pte_free(vma->vm_mm, pgtable); > + if (pgtable) > + pte_free(vma->vm_mm, pgtable); > ret = handle_userfault(vmf, VM_UFFD_MISSING); > VM_BUG_ON(ret & VM_FAULT_FALLBACK); > return ret; > } > - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); > + if (pgtable) { > + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, > + pgtable); > + mm_inc_nr_ptes(vma->vm_mm); > + } > map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr); > - mm_inc_nr_ptes(vma->vm_mm); > spin_unlock(vmf->ptl); > } > > @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm, > pmd_t entry; > entry = folio_mk_pmd(zero_folio, vma->vm_page_prot); > entry = pmd_mkspecial(entry); > - pgtable_trans_huge_deposit(mm, pmd, pgtable); > + if (pgtable) { > + pgtable_trans_huge_deposit(mm, pmd, pgtable); > + mm_inc_nr_ptes(mm); > + } > set_pmd_at(mm, haddr, pmd, entry); > - mm_inc_nr_ptes(mm); > } > > vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > if (!(vmf->flags & FAULT_FLAG_WRITE) && > !mm_forbids_zeropage(vma->vm_mm) && > transparent_hugepage_use_zero_page()) { > - pgtable_t pgtable; > + pgtable_t pgtable = NULL; > struct folio 
*zero_folio; > vm_fault_t ret; > > - pgtable = pte_alloc_one(vma->vm_mm); > - if (unlikely(!pgtable)) > - return VM_FAULT_OOM; > + if (arch_needs_pgtable_deposit()) { > + pgtable = pte_alloc_one(vma->vm_mm); > + if (unlikely(!pgtable)) > + return VM_FAULT_OOM; > + } > zero_folio = mm_get_huge_zero_folio(vma->vm_mm); > if (unlikely(!zero_folio)) { > - pte_free(vma->vm_mm, pgtable); > + if (pgtable) > + pte_free(vma->vm_mm, pgtable); > count_vm_event(THP_FAULT_FALLBACK); > return VM_FAULT_FALLBACK; > } > @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > ret = check_stable_address_space(vma->vm_mm); > if (ret) { > spin_unlock(vmf->ptl); > - pte_free(vma->vm_mm, pgtable); > + if (pgtable) > + pte_free(vma->vm_mm, pgtable); > } else if (userfaultfd_missing(vma)) { > spin_unlock(vmf->ptl); > - pte_free(vma->vm_mm, pgtable); > + if (pgtable) > + pte_free(vma->vm_mm, pgtable); > ret = handle_userfault(vmf, VM_UFFD_MISSING); > VM_BUG_ON(ret & VM_FAULT_FALLBACK); > } else { > @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > } > } else { > spin_unlock(vmf->ptl); > - pte_free(vma->vm_mm, pgtable); > + if (pgtable) > + pte_free(vma->vm_mm, pgtable); > } > return ret; > } > @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd( > } > > add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); > - mm_inc_nr_ptes(dst_mm); > - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > + if (pgtable) { > + mm_inc_nr_ptes(dst_mm); > + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > + } > if (!userfaultfd_wp(dst_vma)) > pmd = pmd_swp_clear_uffd_wp(pmd); > set_pmd_at(dst_mm, addr, dst_pmd, pmd); > @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > if (!vma_is_anonymous(dst_vma)) > return 0; > > - pgtable = pte_alloc_one(dst_mm); > - if (unlikely(!pgtable)) > - goto out; > + if (arch_needs_pgtable_deposit()) { > + pgtable = pte_alloc_one(dst_mm); > + if 
(unlikely(!pgtable)) > + goto out; > + } > > dst_ptl = pmd_lock(dst_mm, dst_pmd); > src_ptl = pmd_lockptr(src_mm, src_pmd); > @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > } > > if (unlikely(!pmd_trans_huge(pmd))) { > - pte_free(dst_mm, pgtable); > + if (pgtable) > + pte_free(dst_mm, pgtable); > goto out_unlock; > } > /* > @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) { > /* Page maybe pinned: split and retry the fault on PTEs. */ > folio_put(src_folio); > - pte_free(dst_mm, pgtable); > + if (pgtable) > + pte_free(dst_mm, pgtable); > spin_unlock(src_ptl); > spin_unlock(dst_ptl); > __split_huge_pmd(src_vma, src_pmd, addr, false); > @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > } > add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); > out_zero_page: > - mm_inc_nr_ptes(dst_mm); > - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > + if (pgtable) { > + mm_inc_nr_ptes(dst_mm); > + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > + } > pmdp_set_wrprotect(src_mm, addr, src_pmd); > if (!userfaultfd_wp(dst_vma)) > pmd = pmd_clear_uffd_wp(pmd); > @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, > zap_deposited_table(tlb->mm, pmd); > spin_unlock(ptl); > } else if (is_huge_zero_pmd(orig_pmd)) { > - if (!vma_is_dax(vma) || arch_needs_pgtable_deposit()) > + if (arch_needs_pgtable_deposit()) > zap_deposited_table(tlb->mm, pmd); > spin_unlock(ptl); > } else { > @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, > } > > if (folio_test_anon(folio)) { > - zap_deposited_table(tlb->mm, pmd); > + if (arch_needs_pgtable_deposit()) > + zap_deposited_table(tlb->mm, pmd); > add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); > } else { > if (arch_needs_pgtable_deposit()) > 
@@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, > force_flush = true; > VM_BUG_ON(!pmd_none(*new_pmd)); > > - if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) { > + if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) && > + arch_needs_pgtable_deposit()) { > pgtable_t pgtable; > pgtable = pgtable_trans_huge_withdraw(mm, old_pmd); > pgtable_trans_huge_deposit(mm, new_pmd, pgtable); > @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm > } > set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd); > > - src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd); > - pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); > + if (arch_needs_pgtable_deposit()) { > + src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd); > + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); > + } > unlock_ptls: > double_pt_unlock(src_ptl, dst_ptl); > /* unblock rmap walks */ > @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, > #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ > > static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > - unsigned long haddr, pmd_t *pmd) > + unsigned long haddr, pmd_t *pmd, pgtable_t pgtable) > { > struct mm_struct *mm = vma->vm_mm; > - pgtable_t pgtable; > pmd_t _pmd, old_pmd; > unsigned long addr; > pte_t *pte; > @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > */ > old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd); > > - pgtable = pgtable_trans_huge_withdraw(mm, pmd); > + if (arch_needs_pgtable_deposit()) { > + pgtable = pgtable_trans_huge_withdraw(mm, pmd); > + } else { > + VM_BUG_ON(!pgtable); > + /* > + * Account for the freshly allocated (in __split_huge_pmd) pgtable > + * being used in mm. 
> + */ > + mm_inc_nr_ptes(mm); > + } > pmd_populate(mm, &_pmd, pgtable); > > pte = pte_offset_map(&_pmd, haddr); > @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > } > > static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > - unsigned long haddr, bool freeze) > + unsigned long haddr, bool freeze, pgtable_t pgtable) > { > struct mm_struct *mm = vma->vm_mm; > struct folio *folio; > struct page *page; > - pgtable_t pgtable; > pmd_t old_pmd, _pmd; > bool soft_dirty, uffd_wp = false, young = false, write = false; > bool anon_exclusive = false, dirty = false; > @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > */ > if (arch_needs_pgtable_deposit()) > zap_deposited_table(mm, pmd); > + if (pgtable) > + pte_free(mm, pgtable); > if (!vma_is_dax(vma) && vma_is_special_huge(vma)) > return; > if (unlikely(pmd_is_migration_entry(old_pmd))) { > @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > * small page also write protected so it does not seems useful > * to invalidate secondary mmu at this time. > */ > - return __split_huge_zero_page_pmd(vma, haddr, pmd); > + return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable); > } > > if (pmd_is_migration_entry(*pmd)) { > @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > * Withdraw the table only after we mark the pmd entry invalid. > * This's critical for some architectures (Power). > */ > - pgtable = pgtable_trans_huge_withdraw(mm, pmd); > + if (arch_needs_pgtable_deposit()) { > + pgtable = pgtable_trans_huge_withdraw(mm, pmd); > + } else { > + VM_BUG_ON(!pgtable); > + /* > + * Account for the freshly allocated (in __split_huge_pmd) pgtable > + * being used in mm. 
> + */ > + mm_inc_nr_ptes(mm); > + } > pmd_populate(mm, &_pmd, pgtable); > > pte = pte_offset_map(&_pmd, haddr); > @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > } > > void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, > - pmd_t *pmd, bool freeze) > + pmd_t *pmd, bool freeze, pgtable_t pgtable) > { > VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE)); > if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd)) > - __split_huge_pmd_locked(vma, pmd, address, freeze); > + __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable); > + else if (pgtable) > + pte_free(vma->vm_mm, pgtable); > } > > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > { > spinlock_t *ptl; > struct mmu_notifier_range range; > + pgtable_t pgtable = NULL; > > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, > address & HPAGE_PMD_MASK, > (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); > mmu_notifier_invalidate_range_start(&range); > + > + /* allocate pagetable before acquiring pmd lock */ > + if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) { > + pgtable = pte_alloc_one(vma->vm_mm); > + if (!pgtable) { > + mmu_notifier_invalidate_range_end(&range); When I last looked at this, I thought the clean thing to do is to let __split_huge_pmd() and friends return an error. Let's take a look at walk_pmd_range() as one example: if (walk->vma) split_huge_pmd(walk->vma, pmd, addr); else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) continue; err = walk_pte_range(pmd, addr, next, walk); Where walk_pte_range() just does a pte_offset_map_lock(): pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); But if that fails (as the remapping failed), we will silently skip this range. I don't think silently skipping is the right thing to do. 
So I would think that all splitting functions have to be taught to return an error, and their callers to handle it accordingly. Then we can actually start returning errors. -- Cheers, David
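The direction suggested above, having the split return an error that the page walker propagates rather than silently skipping the range, can be sketched in userspace as follows. This is a hedged illustration, not kernel code: every mock_* name and struct mock_pmd are hypothetical stand-ins, and the real split_huge_pmd() currently returns void.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for one PMD slot. */
struct mock_pmd {
	bool huge;		/* currently mapped as a PMD-level THP */
	void *pte_table;	/* PTE table once the mapping is split */
};

/* Test knob: makes the mock order-0 allocation fail. */
static bool mock_alloc_fails;

static void *mock_pte_alloc_one(void)
{
	return mock_alloc_fails ? NULL : calloc(512, sizeof(unsigned long));
}

/* Returns 0 if split (or nothing to split), -ENOMEM if no table. */
static int mock_split_huge_pmd(struct mock_pmd *pmd)
{
	void *table;

	if (!pmd->huge)
		return 0;
	table = mock_pte_alloc_one();
	if (!table)
		return -ENOMEM;	/* the THP mapping stays intact */
	pmd->pte_table = table;
	pmd->huge = false;
	return 0;
}

/* Walker that reports a failed split instead of skipping the range. */
static int mock_walk_pmd_range(struct mock_pmd *pmds, size_t nr, size_t *walked)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int err = mock_split_huge_pmd(&pmds[i]);

		if (err)
			return err;
		assert(!pmds[i].huge);
		if (pmds[i].pte_table)
			(*walked)++;	/* stands in for walking the PTE level */
	}
	return 0;
}
```

In this sketch a failed allocation leaves the huge mapping untouched and surfaces -ENOMEM to the caller, which can then decide whether to retry or fail the operation.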
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-11 13:35 ` David Hildenbrand (Arm) @ 2026-02-11 13:46 ` Kiryl Shutsemau 2026-02-11 13:47 ` Usama Arif 1 sibling, 0 replies; 22+ messages in thread From: Kiryl Shutsemau @ 2026-02-11 13:46 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm, fvdl, hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On Wed, Feb 11, 2026 at 02:35:07PM +0100, David Hildenbrand (Arm) wrote: > On 2/11/26 13:49, Usama Arif wrote: > > When the kernel creates a PMD-level THP mapping for anonymous pages, > > it pre-allocates a PTE page table and deposits it via > > pgtable_trans_huge_deposit(). This deposited table is withdrawn during > > PMD split or zap. The rationale was that split must not fail—if the > > kernel decides to split a THP, it needs a PTE table to populate. > > > > However, every anon THP wastes 4KB (one page table page) that sits > > unused in the deposit list for the lifetime of the mapping. On systems > > with many THPs, this adds up to significant memory waste. The original > > rationale is also not an issue. It is ok for split to fail, and if the > > kernel can't find an order 0 allocation for split, there are much bigger > > problems. On large servers where you can easily have 100s of GBs of THPs, > > the memory usage for these tables is 200M per 100G. This memory could be > > used for any other usecase, which include allocating the pagetables > > required during split. > > > > This patch removes the pre-deposit for anonymous pages on architectures > > where arch_needs_pgtable_deposit() returns false (every arch apart from > > powerpc, and only when radix hash tables are not enabled) and allocates > > the PTE table lazily—only when a split actually occurs. The split path > > is modified to accept a caller-provided page table. 
> > > > PowerPC exception: > > > > It would have been great if we can completely remove the pagetable > > deposit code and this commit would mostly have been a code cleanup patch, > > unfortunately PowerPC has hash MMU, it stores hash slot information in > > the deposited page table and pre-deposit is necessary. All deposit/ > > withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC > > behavior is unchanged with this patch. On a better note, > > arch_needs_pgtable_deposit will always evaluate to false at compile time > > on non PowerPC architectures and the pre-deposit code will not be > > compiled in. > > > > Why Split Failures Are Safe: > > > > If a system is under severe memory pressure that even a 4K allocation > > fails for a PTE table, there are far greater problems than a THP split > > being delayed. The OOM killer will likely intervene before this becomes an > > issue. > > When pte_alloc_one() fails due to not being able to allocate a 4K page, > > the PMD split is aborted and the THP remains intact. I could not get split > > to fail, as its very difficult to make order-0 allocation to fail. > > Code analysis of what would happen if it does: > > > > - mprotect(): If split fails in change_pmd_range, it will fallback > > to change_pte_range, which will return an error which will cause the > > whole function to be retried again. > > > > - munmap() (partial THP range): zap_pte_range() returns early when > > pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--. > > For full THP range, zap_huge_pmd() unmaps the entire PMD without > > split. > > > > - Memory reclaim (try_to_unmap()): Returns false, folio rotated back > > LRU, retried in next reclaim cycle. > > > > - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration > > skips this folio, retried later. > > > > - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried. 
> > > > - madvise (MADV_COLD/PAGEOUT): split_folio() internally calls > > try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails, > > try_to_migrate() returns false, split_folio() returns -EAGAIN, > > and madvise returns 0 (success) silently skipping the region. This > > should be fine. madvise is just an advice and can fail for other > > reasons as well. > > > > Suggested-by: David Hildenbrand <david@kernel.org> > > Signed-off-by: Usama Arif <usama.arif@linux.dev> > > --- > > include/linux/huge_mm.h | 4 +- > > mm/huge_memory.c | 144 ++++++++++++++++++++++++++++------------ > > mm/khugepaged.c | 7 +- > > mm/migrate_device.c | 15 +++-- > > mm/rmap.c | 39 ++++++++++- > > 5 files changed, 156 insertions(+), 53 deletions(-) > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > index a4d9f964dfdea..b21bb72a298c9 100644 > > --- a/include/linux/huge_mm.h > > +++ b/include/linux/huge_mm.h > > @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void) > > } > > void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, > > - pmd_t *pmd, bool freeze); > > + pmd_t *pmd, bool freeze, pgtable_t pgtable); > > bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, > > pmd_t *pmdp, struct folio *folio); > > void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd, > > @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma, > > unsigned long address, bool freeze) {} > > static inline void split_huge_pmd_locked(struct vm_area_struct *vma, > > unsigned long address, pmd_t *pmd, > > - bool freeze) {} > > + bool freeze, pgtable_t pgtable) {} > > static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma, > > unsigned long addr, pmd_t *pmdp, > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index 44ff8a648afd5..4c9a8d89fc8aa 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -1322,17 +1322,19 @@ static vm_fault_t 
__do_huge_pmd_anonymous_page(struct vm_fault *vmf) > > unsigned long haddr = vmf->address & HPAGE_PMD_MASK; > > struct vm_area_struct *vma = vmf->vma; > > struct folio *folio; > > - pgtable_t pgtable; > > + pgtable_t pgtable = NULL; > > vm_fault_t ret = 0; > > folio = vma_alloc_anon_folio_pmd(vma, vmf->address); > > if (unlikely(!folio)) > > return VM_FAULT_FALLBACK; > > - pgtable = pte_alloc_one(vma->vm_mm); > > - if (unlikely(!pgtable)) { > > - ret = VM_FAULT_OOM; > > - goto release; > > + if (arch_needs_pgtable_deposit()) { > > + pgtable = pte_alloc_one(vma->vm_mm); > > + if (unlikely(!pgtable)) { > > + ret = VM_FAULT_OOM; > > + goto release; > > + } > > } > > vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); > > @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) > > if (userfaultfd_missing(vma)) { > > spin_unlock(vmf->ptl); > > folio_put(folio); > > - pte_free(vma->vm_mm, pgtable); > > + if (pgtable) > > + pte_free(vma->vm_mm, pgtable); > > ret = handle_userfault(vmf, VM_UFFD_MISSING); > > VM_BUG_ON(ret & VM_FAULT_FALLBACK); > > return ret; > > } > > - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); > > + if (pgtable) { > > + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, > > + pgtable); > > + mm_inc_nr_ptes(vma->vm_mm); > > + } > > map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr); > > - mm_inc_nr_ptes(vma->vm_mm); > > spin_unlock(vmf->ptl); > > } > > @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm, > > pmd_t entry; > > entry = folio_mk_pmd(zero_folio, vma->vm_page_prot); > > entry = pmd_mkspecial(entry); > > - pgtable_trans_huge_deposit(mm, pmd, pgtable); > > + if (pgtable) { > > + pgtable_trans_huge_deposit(mm, pmd, pgtable); > > + mm_inc_nr_ptes(mm); > > + } > > set_pmd_at(mm, haddr, pmd, entry); > > - mm_inc_nr_ptes(mm); > > } > > vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > > @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct 
vm_fault *vmf) > > if (!(vmf->flags & FAULT_FLAG_WRITE) && > > !mm_forbids_zeropage(vma->vm_mm) && > > transparent_hugepage_use_zero_page()) { > > - pgtable_t pgtable; > > + pgtable_t pgtable = NULL; > > struct folio *zero_folio; > > vm_fault_t ret; > > - pgtable = pte_alloc_one(vma->vm_mm); > > - if (unlikely(!pgtable)) > > - return VM_FAULT_OOM; > > + if (arch_needs_pgtable_deposit()) { > > + pgtable = pte_alloc_one(vma->vm_mm); > > + if (unlikely(!pgtable)) > > + return VM_FAULT_OOM; > > + } > > zero_folio = mm_get_huge_zero_folio(vma->vm_mm); > > if (unlikely(!zero_folio)) { > > - pte_free(vma->vm_mm, pgtable); > > + if (pgtable) > > + pte_free(vma->vm_mm, pgtable); > > count_vm_event(THP_FAULT_FALLBACK); > > return VM_FAULT_FALLBACK; > > } > > @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > > ret = check_stable_address_space(vma->vm_mm); > > if (ret) { > > spin_unlock(vmf->ptl); > > - pte_free(vma->vm_mm, pgtable); > > + if (pgtable) > > + pte_free(vma->vm_mm, pgtable); > > } else if (userfaultfd_missing(vma)) { > > spin_unlock(vmf->ptl); > > - pte_free(vma->vm_mm, pgtable); > > + if (pgtable) > > + pte_free(vma->vm_mm, pgtable); > > ret = handle_userfault(vmf, VM_UFFD_MISSING); > > VM_BUG_ON(ret & VM_FAULT_FALLBACK); > > } else { > > @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > > } > > } else { > > spin_unlock(vmf->ptl); > > - pte_free(vma->vm_mm, pgtable); > > + if (pgtable) > > + pte_free(vma->vm_mm, pgtable); > > } > > return ret; > > } > > @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd( > > } > > add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); > > - mm_inc_nr_ptes(dst_mm); > > - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > > + if (pgtable) { > > + mm_inc_nr_ptes(dst_mm); > > + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > > + } > > if (!userfaultfd_wp(dst_vma)) > > pmd = pmd_swp_clear_uffd_wp(pmd); > > set_pmd_at(dst_mm, addr, dst_pmd, pmd); 
> > @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > if (!vma_is_anonymous(dst_vma)) > > return 0; > > - pgtable = pte_alloc_one(dst_mm); > > - if (unlikely(!pgtable)) > > - goto out; > > + if (arch_needs_pgtable_deposit()) { > > + pgtable = pte_alloc_one(dst_mm); > > + if (unlikely(!pgtable)) > > + goto out; > > + } > > dst_ptl = pmd_lock(dst_mm, dst_pmd); > > src_ptl = pmd_lockptr(src_mm, src_pmd); > > @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > } > > if (unlikely(!pmd_trans_huge(pmd))) { > > - pte_free(dst_mm, pgtable); > > + if (pgtable) > > + pte_free(dst_mm, pgtable); > > goto out_unlock; > > } > > /* > > @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) { > > /* Page maybe pinned: split and retry the fault on PTEs. */ > > folio_put(src_folio); > > - pte_free(dst_mm, pgtable); > > + if (pgtable) > > + pte_free(dst_mm, pgtable); > > spin_unlock(src_ptl); > > spin_unlock(dst_ptl); > > __split_huge_pmd(src_vma, src_pmd, addr, false); > > @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > } > > add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); > > out_zero_page: > > - mm_inc_nr_ptes(dst_mm); > > - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > > + if (pgtable) { > > + mm_inc_nr_ptes(dst_mm); > > + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > > + } > > pmdp_set_wrprotect(src_mm, addr, src_pmd); > > if (!userfaultfd_wp(dst_vma)) > > pmd = pmd_clear_uffd_wp(pmd); > > @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, > > zap_deposited_table(tlb->mm, pmd); > > spin_unlock(ptl); > > } else if (is_huge_zero_pmd(orig_pmd)) { > > - if (!vma_is_dax(vma) || arch_needs_pgtable_deposit()) > > + if (arch_needs_pgtable_deposit()) > > 
zap_deposited_table(tlb->mm, pmd); > > spin_unlock(ptl); > > } else { > > @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, > > } > > if (folio_test_anon(folio)) { > > - zap_deposited_table(tlb->mm, pmd); > > + if (arch_needs_pgtable_deposit()) > > + zap_deposited_table(tlb->mm, pmd); > > add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); > > } else { > > if (arch_needs_pgtable_deposit()) > > @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, > > force_flush = true; > > VM_BUG_ON(!pmd_none(*new_pmd)); > > - if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) { > > + if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) && > > + arch_needs_pgtable_deposit()) { > > pgtable_t pgtable; > > pgtable = pgtable_trans_huge_withdraw(mm, old_pmd); > > pgtable_trans_huge_deposit(mm, new_pmd, pgtable); > > @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm > > } > > set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd); > > - src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd); > > - pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); > > + if (arch_needs_pgtable_deposit()) { > > + src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd); > > + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); > > + } > > unlock_ptls: > > double_pt_unlock(src_ptl, dst_ptl); > > /* unblock rmap walks */ > > @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, > > #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ > > static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > - unsigned long haddr, pmd_t *pmd) > > + unsigned long haddr, pmd_t *pmd, pgtable_t pgtable) > > { > > struct mm_struct *mm = vma->vm_mm; > > - pgtable_t pgtable; > > pmd_t _pmd, old_pmd; > > unsigned long addr; > > pte_t *pte; > > @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > */ > > old_pmd = 
pmdp_huge_clear_flush(vma, haddr, pmd); > > - pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > + if (arch_needs_pgtable_deposit()) { > > + pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > + } else { > > + VM_BUG_ON(!pgtable); > > + /* > > + * Account for the freshly allocated (in __split_huge_pmd) pgtable > > + * being used in mm. > > + */ > > + mm_inc_nr_ptes(mm); > > + } > > pmd_populate(mm, &_pmd, pgtable); > > pte = pte_offset_map(&_pmd, haddr); > > @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > } > > static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > > - unsigned long haddr, bool freeze) > > + unsigned long haddr, bool freeze, pgtable_t pgtable) > > { > > struct mm_struct *mm = vma->vm_mm; > > struct folio *folio; > > struct page *page; > > - pgtable_t pgtable; > > pmd_t old_pmd, _pmd; > > bool soft_dirty, uffd_wp = false, young = false, write = false; > > bool anon_exclusive = false, dirty = false; > > @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > > */ > > if (arch_needs_pgtable_deposit()) > > zap_deposited_table(mm, pmd); > > + if (pgtable) > > + pte_free(mm, pgtable); > > if (!vma_is_dax(vma) && vma_is_special_huge(vma)) > > return; > > if (unlikely(pmd_is_migration_entry(old_pmd))) { > > @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > > * small page also write protected so it does not seems useful > > * to invalidate secondary mmu at this time. > > */ > > - return __split_huge_zero_page_pmd(vma, haddr, pmd); > > + return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable); > > } > > if (pmd_is_migration_entry(*pmd)) { > > @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > > * Withdraw the table only after we mark the pmd entry invalid. > > * This's critical for some architectures (Power). 
> > */ > > - pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > + if (arch_needs_pgtable_deposit()) { > > + pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > + } else { > > + VM_BUG_ON(!pgtable); > > + /* > > + * Account for the freshly allocated (in __split_huge_pmd) pgtable > > + * being used in mm. > > + */ > > + mm_inc_nr_ptes(mm); > > + } > > pmd_populate(mm, &_pmd, pgtable); > > pte = pte_offset_map(&_pmd, haddr); > > @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > > } > > void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, > > - pmd_t *pmd, bool freeze) > > + pmd_t *pmd, bool freeze, pgtable_t pgtable) > > { > > VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE)); > > if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd)) > > - __split_huge_pmd_locked(vma, pmd, address, freeze); > > + __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable); > > + else if (pgtable) > > + pte_free(vma->vm_mm, pgtable); > > } > > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > > @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > > { > > spinlock_t *ptl; > > struct mmu_notifier_range range; > > + pgtable_t pgtable = NULL; > > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, > > address & HPAGE_PMD_MASK, > > (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); > > mmu_notifier_invalidate_range_start(&range); > > + > > + /* allocate pagetable before acquiring pmd lock */ > > + if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) { > > + pgtable = pte_alloc_one(vma->vm_mm); > > + if (!pgtable) { > > + mmu_notifier_invalidate_range_end(&range); > > What I last looked at this, I thought the clean thing to do is to let > __split_huge_pmd() and friends return an error. 
>
> Let's take a look at walk_pmd_range() as one example:
>
> 	if (walk->vma)
> 		split_huge_pmd(walk->vma, pmd, addr);
> 	else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> 		continue;
>
> 	err = walk_pte_range(pmd, addr, next, walk);
>
> Where walk_pte_range() just does a pte_offset_map_lock.
>
> 	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>
> But if that fails (as the remapping failed), we will silently skip this
> range.
>
> I don't think silently skipping is the right thing to do.
>
> So I would think that all splitting functions have to be taught to return an
> error and handle it accordingly. Then we can actually start returning
> errors.

Yeah, I am also confused by silent split PMD failure. It has to be
communicated to the caller cleanly. It is also an opportunity to audit
all callers and check if they can deal with the failure.

-- 
Kiryl Shutsemau / Kirill A. Shutemov
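To illustrate the point about reporting split failure to the caller, here is a minimal userspace sketch in C. It is not code from the patch or from the kernel: model_pmd, model_pte_alloc(), model_split_huge_pmd(), model_walk_pmd_range() and the model_fail_alloc knob are all hypothetical stand-ins, and the real functions take different arguments. It only shows the shape of the idea: the split helper returns -ENOMEM when no PTE table can be allocated, and the walker stops and propagates that error instead of moving on to the next entry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of one PMD entry; not the kernel's pmd_t. */
struct model_pmd {
	bool huge;        /* true while the entry maps a THP */
	void *pte_table;  /* table installed by a successful split */
};

/* Test knob (an assumption of this sketch): simulate order-0 failure. */
static bool model_fail_alloc;

/* Stand-in for allocating one PTE table at split time. */
static void *model_pte_alloc(void)
{
	return model_fail_alloc ? NULL : calloc(512, sizeof(unsigned long));
}

/* Returns 0 if the entry is (now) PTE-mapped, -ENOMEM if the split failed. */
static int model_split_huge_pmd(struct model_pmd *pmd)
{
	void *table;

	assert(pmd != NULL);
	if (!pmd->huge)
		return 0;          /* nothing to split */

	table = model_pte_alloc();
	if (!table)
		return -ENOMEM;    /* the THP mapping stays intact */

	pmd->pte_table = table;
	pmd->huge = false;
	return 0;
}

/* Walker that propagates a split failure instead of skipping the range. */
static int model_walk_pmd_range(struct model_pmd *pmds, size_t n,
				size_t *visited)
{
	size_t i;
	int err;

	*visited = 0;
	for (i = 0; i < n; i++) {
		err = model_split_huge_pmd(&pmds[i]);
		if (err)
			return err;    /* the caller decides: retry or fail */
		(*visited)++;
	}
	return 0;
}
```

With this shape, each caller has to make an explicit choice between retrying and failing, which is the kind of per-caller audit mentioned above.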
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-11 13:35 ` David Hildenbrand (Arm) 2026-02-11 13:46 ` Kiryl Shutsemau @ 2026-02-11 13:47 ` Usama Arif 1 sibling, 0 replies; 22+ messages in thread From: Usama Arif @ 2026-02-11 13:47 UTC (permalink / raw) To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 11/02/2026 13:35, David Hildenbrand (Arm) wrote: > On 2/11/26 13:49, Usama Arif wrote: >> When the kernel creates a PMD-level THP mapping for anonymous pages, >> it pre-allocates a PTE page table and deposits it via >> pgtable_trans_huge_deposit(). This deposited table is withdrawn during >> PMD split or zap. The rationale was that split must not fail—if the >> kernel decides to split a THP, it needs a PTE table to populate. >> >> However, every anon THP wastes 4KB (one page table page) that sits >> unused in the deposit list for the lifetime of the mapping. On systems >> with many THPs, this adds up to significant memory waste. The original >> rationale is also not an issue. It is ok for split to fail, and if the >> kernel can't find an order 0 allocation for split, there are much bigger >> problems. On large servers where you can easily have 100s of GBs of THPs, >> the memory usage for these tables is 200M per 100G. This memory could be >> used for any other usecase, which include allocating the pagetables >> required during split. >> >> This patch removes the pre-deposit for anonymous pages on architectures >> where arch_needs_pgtable_deposit() returns false (every arch apart from >> powerpc, and only when radix hash tables are not enabled) and allocates >> the PTE table lazily—only when a split actually occurs. The split path >> is modified to accept a caller-provided page table. 
>> >> PowerPC exception: >> >> It would have been great if we can completely remove the pagetable >> deposit code and this commit would mostly have been a code cleanup patch, >> unfortunately PowerPC has hash MMU, it stores hash slot information in >> the deposited page table and pre-deposit is necessary. All deposit/ >> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC >> behavior is unchanged with this patch. On a better note, >> arch_needs_pgtable_deposit will always evaluate to false at compile time >> on non PowerPC architectures and the pre-deposit code will not be >> compiled in. >> >> Why Split Failures Are Safe: >> >> If a system is under severe memory pressure that even a 4K allocation >> fails for a PTE table, there are far greater problems than a THP split >> being delayed. The OOM killer will likely intervene before this becomes an >> issue. >> When pte_alloc_one() fails due to not being able to allocate a 4K page, >> the PMD split is aborted and the THP remains intact. I could not get split >> to fail, as its very difficult to make order-0 allocation to fail. >> Code analysis of what would happen if it does: >> >> - mprotect(): If split fails in change_pmd_range, it will fallback >> to change_pte_range, which will return an error which will cause the >> whole function to be retried again. >> >> - munmap() (partial THP range): zap_pte_range() returns early when >> pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--. >> For full THP range, zap_huge_pmd() unmaps the entire PMD without >> split. >> >> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back >> LRU, retried in next reclaim cycle. >> >> - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration >> skips this folio, retried later. >> >> - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried. >> >> - madvise (MADV_COLD/PAGEOUT): split_folio() internally calls >> try_to_migrate() with TTU_SPLIT_HUGE_PMD. 
If PMD split fails, >> try_to_migrate() returns false, split_folio() returns -EAGAIN, >> and madvise returns 0 (success) silently skipping the region. This >> should be fine. madvise is just an advice and can fail for other >> reasons as well. >> >> Suggested-by: David Hildenbrand <david@kernel.org> >> Signed-off-by: Usama Arif <usama.arif@linux.dev> >> --- >> include/linux/huge_mm.h | 4 +- >> mm/huge_memory.c | 144 ++++++++++++++++++++++++++++------------ >> mm/khugepaged.c | 7 +- >> mm/migrate_device.c | 15 +++-- >> mm/rmap.c | 39 ++++++++++- >> 5 files changed, 156 insertions(+), 53 deletions(-) >> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >> index a4d9f964dfdea..b21bb72a298c9 100644 >> --- a/include/linux/huge_mm.h >> +++ b/include/linux/huge_mm.h >> @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void) >> } >> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, >> - pmd_t *pmd, bool freeze); >> + pmd_t *pmd, bool freeze, pgtable_t pgtable); >> bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, >> pmd_t *pmdp, struct folio *folio); >> void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd, >> @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma, >> unsigned long address, bool freeze) {} >> static inline void split_huge_pmd_locked(struct vm_area_struct *vma, >> unsigned long address, pmd_t *pmd, >> - bool freeze) {} >> + bool freeze, pgtable_t pgtable) {} >> static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma, >> unsigned long addr, pmd_t *pmdp, >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 44ff8a648afd5..4c9a8d89fc8aa 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) >> unsigned long haddr = vmf->address & HPAGE_PMD_MASK; >> struct vm_area_struct *vma = vmf->vma; >> struct folio *folio; >> - 
pgtable_t pgtable; >> + pgtable_t pgtable = NULL; >> vm_fault_t ret = 0; >> folio = vma_alloc_anon_folio_pmd(vma, vmf->address); >> if (unlikely(!folio)) >> return VM_FAULT_FALLBACK; >> - pgtable = pte_alloc_one(vma->vm_mm); >> - if (unlikely(!pgtable)) { >> - ret = VM_FAULT_OOM; >> - goto release; >> + if (arch_needs_pgtable_deposit()) { >> + pgtable = pte_alloc_one(vma->vm_mm); >> + if (unlikely(!pgtable)) { >> + ret = VM_FAULT_OOM; >> + goto release; >> + } >> } >> vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); >> @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) >> if (userfaultfd_missing(vma)) { >> spin_unlock(vmf->ptl); >> folio_put(folio); >> - pte_free(vma->vm_mm, pgtable); >> + if (pgtable) >> + pte_free(vma->vm_mm, pgtable); >> ret = handle_userfault(vmf, VM_UFFD_MISSING); >> VM_BUG_ON(ret & VM_FAULT_FALLBACK); >> return ret; >> } >> - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); >> + if (pgtable) { >> + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, >> + pgtable); >> + mm_inc_nr_ptes(vma->vm_mm); >> + } >> map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr); >> - mm_inc_nr_ptes(vma->vm_mm); >> spin_unlock(vmf->ptl); >> } >> @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm, >> pmd_t entry; >> entry = folio_mk_pmd(zero_folio, vma->vm_page_prot); >> entry = pmd_mkspecial(entry); >> - pgtable_trans_huge_deposit(mm, pmd, pgtable); >> + if (pgtable) { >> + pgtable_trans_huge_deposit(mm, pmd, pgtable); >> + mm_inc_nr_ptes(mm); >> + } >> set_pmd_at(mm, haddr, pmd, entry); >> - mm_inc_nr_ptes(mm); >> } >> vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) >> @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) >> if (!(vmf->flags & FAULT_FLAG_WRITE) && >> !mm_forbids_zeropage(vma->vm_mm) && >> transparent_hugepage_use_zero_page()) { >> - pgtable_t pgtable; >> + pgtable_t pgtable = NULL; >> struct folio *zero_folio; >> 
vm_fault_t ret; >> - pgtable = pte_alloc_one(vma->vm_mm); >> - if (unlikely(!pgtable)) >> - return VM_FAULT_OOM; >> + if (arch_needs_pgtable_deposit()) { >> + pgtable = pte_alloc_one(vma->vm_mm); >> + if (unlikely(!pgtable)) >> + return VM_FAULT_OOM; >> + } >> zero_folio = mm_get_huge_zero_folio(vma->vm_mm); >> if (unlikely(!zero_folio)) { >> - pte_free(vma->vm_mm, pgtable); >> + if (pgtable) >> + pte_free(vma->vm_mm, pgtable); >> count_vm_event(THP_FAULT_FALLBACK); >> return VM_FAULT_FALLBACK; >> } >> @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) >> ret = check_stable_address_space(vma->vm_mm); >> if (ret) { >> spin_unlock(vmf->ptl); >> - pte_free(vma->vm_mm, pgtable); >> + if (pgtable) >> + pte_free(vma->vm_mm, pgtable); >> } else if (userfaultfd_missing(vma)) { >> spin_unlock(vmf->ptl); >> - pte_free(vma->vm_mm, pgtable); >> + if (pgtable) >> + pte_free(vma->vm_mm, pgtable); >> ret = handle_userfault(vmf, VM_UFFD_MISSING); >> VM_BUG_ON(ret & VM_FAULT_FALLBACK); >> } else { >> @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) >> } >> } else { >> spin_unlock(vmf->ptl); >> - pte_free(vma->vm_mm, pgtable); >> + if (pgtable) >> + pte_free(vma->vm_mm, pgtable); >> } >> return ret; >> } >> @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd( >> } >> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); >> - mm_inc_nr_ptes(dst_mm); >> - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); >> + if (pgtable) { >> + mm_inc_nr_ptes(dst_mm); >> + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); >> + } >> if (!userfaultfd_wp(dst_vma)) >> pmd = pmd_swp_clear_uffd_wp(pmd); >> set_pmd_at(dst_mm, addr, dst_pmd, pmd); >> @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, >> if (!vma_is_anonymous(dst_vma)) >> return 0; >> - pgtable = pte_alloc_one(dst_mm); >> - if (unlikely(!pgtable)) >> - goto out; >> + if (arch_needs_pgtable_deposit()) { >> + pgtable = 
pte_alloc_one(dst_mm); >> + if (unlikely(!pgtable)) >> + goto out; >> + } >> dst_ptl = pmd_lock(dst_mm, dst_pmd); >> src_ptl = pmd_lockptr(src_mm, src_pmd); >> @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, >> } >> if (unlikely(!pmd_trans_huge(pmd))) { >> - pte_free(dst_mm, pgtable); >> + if (pgtable) >> + pte_free(dst_mm, pgtable); >> goto out_unlock; >> } >> /* >> @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, >> if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) { >> /* Page maybe pinned: split and retry the fault on PTEs. */ >> folio_put(src_folio); >> - pte_free(dst_mm, pgtable); >> + if (pgtable) >> + pte_free(dst_mm, pgtable); >> spin_unlock(src_ptl); >> spin_unlock(dst_ptl); >> __split_huge_pmd(src_vma, src_pmd, addr, false); >> @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, >> } >> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); >> out_zero_page: >> - mm_inc_nr_ptes(dst_mm); >> - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); >> + if (pgtable) { >> + mm_inc_nr_ptes(dst_mm); >> + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); >> + } >> pmdp_set_wrprotect(src_mm, addr, src_pmd); >> if (!userfaultfd_wp(dst_vma)) >> pmd = pmd_clear_uffd_wp(pmd); >> @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, >> zap_deposited_table(tlb->mm, pmd); >> spin_unlock(ptl); >> } else if (is_huge_zero_pmd(orig_pmd)) { >> - if (!vma_is_dax(vma) || arch_needs_pgtable_deposit()) >> + if (arch_needs_pgtable_deposit()) >> zap_deposited_table(tlb->mm, pmd); >> spin_unlock(ptl); >> } else { >> @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, >> } >> if (folio_test_anon(folio)) { >> - zap_deposited_table(tlb->mm, pmd); >> + if (arch_needs_pgtable_deposit()) >> + zap_deposited_table(tlb->mm, pmd); >> add_mm_counter(tlb->mm, 
MM_ANONPAGES, -HPAGE_PMD_NR); >> } else { >> if (arch_needs_pgtable_deposit()) >> @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, >> force_flush = true; >> VM_BUG_ON(!pmd_none(*new_pmd)); >> - if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) { >> + if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) && >> + arch_needs_pgtable_deposit()) { >> pgtable_t pgtable; >> pgtable = pgtable_trans_huge_withdraw(mm, old_pmd); >> pgtable_trans_huge_deposit(mm, new_pmd, pgtable); >> @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm >> } >> set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd); >> - src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd); >> - pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); >> + if (arch_needs_pgtable_deposit()) { >> + src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd); >> + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); >> + } >> unlock_ptls: >> double_pt_unlock(src_ptl, dst_ptl); >> /* unblock rmap walks */ >> @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, >> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ >> static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, >> - unsigned long haddr, pmd_t *pmd) >> + unsigned long haddr, pmd_t *pmd, pgtable_t pgtable) >> { >> struct mm_struct *mm = vma->vm_mm; >> - pgtable_t pgtable; >> pmd_t _pmd, old_pmd; >> unsigned long addr; >> pte_t *pte; >> @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, >> */ >> old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd); >> - pgtable = pgtable_trans_huge_withdraw(mm, pmd); >> + if (arch_needs_pgtable_deposit()) { >> + pgtable = pgtable_trans_huge_withdraw(mm, pmd); >> + } else { >> + VM_BUG_ON(!pgtable); >> + /* >> + * Account for the freshly allocated (in __split_huge_pmd) pgtable >> + * being used in mm. 
>> + */ >> + mm_inc_nr_ptes(mm); >> + } >> pmd_populate(mm, &_pmd, pgtable); >> pte = pte_offset_map(&_pmd, haddr); >> @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, >> } >> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, >> - unsigned long haddr, bool freeze) >> + unsigned long haddr, bool freeze, pgtable_t pgtable) >> { >> struct mm_struct *mm = vma->vm_mm; >> struct folio *folio; >> struct page *page; >> - pgtable_t pgtable; >> pmd_t old_pmd, _pmd; >> bool soft_dirty, uffd_wp = false, young = false, write = false; >> bool anon_exclusive = false, dirty = false; >> @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, >> */ >> if (arch_needs_pgtable_deposit()) >> zap_deposited_table(mm, pmd); >> + if (pgtable) >> + pte_free(mm, pgtable); >> if (!vma_is_dax(vma) && vma_is_special_huge(vma)) >> return; >> if (unlikely(pmd_is_migration_entry(old_pmd))) { >> @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, >> * small page also write protected so it does not seems useful >> * to invalidate secondary mmu at this time. >> */ >> - return __split_huge_zero_page_pmd(vma, haddr, pmd); >> + return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable); >> } >> if (pmd_is_migration_entry(*pmd)) { >> @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, >> * Withdraw the table only after we mark the pmd entry invalid. >> * This's critical for some architectures (Power). >> */ >> - pgtable = pgtable_trans_huge_withdraw(mm, pmd); >> + if (arch_needs_pgtable_deposit()) { >> + pgtable = pgtable_trans_huge_withdraw(mm, pmd); >> + } else { >> + VM_BUG_ON(!pgtable); >> + /* >> + * Account for the freshly allocated (in __split_huge_pmd) pgtable >> + * being used in mm. 
>> + */ >> + mm_inc_nr_ptes(mm); >> + } >> pmd_populate(mm, &_pmd, pgtable); >> pte = pte_offset_map(&_pmd, haddr); >> @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, >> } >> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, >> - pmd_t *pmd, bool freeze) >> + pmd_t *pmd, bool freeze, pgtable_t pgtable) >> { >> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE)); >> if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd)) >> - __split_huge_pmd_locked(vma, pmd, address, freeze); >> + __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable); >> + else if (pgtable) >> + pte_free(vma->vm_mm, pgtable); >> } >> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, >> @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, >> { >> spinlock_t *ptl; >> struct mmu_notifier_range range; >> + pgtable_t pgtable = NULL; >> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, >> address & HPAGE_PMD_MASK, >> (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); >> mmu_notifier_invalidate_range_start(&range); >> + >> + /* allocate pagetable before acquiring pmd lock */ >> + if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) { >> + pgtable = pte_alloc_one(vma->vm_mm); >> + if (!pgtable) { >> + mmu_notifier_invalidate_range_end(&range); > > What I last looked at this, I thought the clean thing to do is to let __split_huge_pmd() and friends return an error. > > Let's take a look at walk_pmd_range() as one example: > > if (walk->vma) > split_huge_pmd(walk->vma, pmd, addr); > else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) > continue; > > err = walk_pte_range(pmd, addr, next, walk); > > Where walk_pte_range() just does a pte_offset_map_lock. > > pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); > > But if that fails (as the remapping failed), we will silently skip this range. > > I don't think silently skipping is the right thing to do. 
> > So I would think that all splitting functions have to be taught to return an error and handle it accordingly. Then we can actually start returning errors. > Ack. This was one of the cases where we would try again if needed. I did manual code analysis which I included at the end of the commit message but agreed, it's best to return an error and handle it accordingly. I will look into doing this for the next revision. ^ permalink raw reply [flat|nested] 22+ messages in thread
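The direction agreed on above (allocate the table before taking the lock, report failure to the caller instead of silently skipping, and free the table if it turns out not to be needed) can be sketched in userspace C. This is only an illustration of the pattern, not the patch or any future revision of it: `struct toy_pmd`, `toy_pte_alloc()`, `toy_split_pmd()` and `toy_fail_next_alloc` are hypothetical names, a pthread mutex stands in for the PMD lock, and `calloc()` stands in for `pte_alloc_one()`.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PMD entry; not a kernel type. */
struct toy_pmd {
	pthread_mutex_t lock;	/* plays the role of the PMD lock */
	bool huge;		/* true while the entry is still a huge mapping */
	void *pte_table;	/* table installed by a successful split */
};

/* Hypothetical hook: when set, the next allocation fails once. */
static bool toy_fail_next_alloc;

/* Stand-in for pte_alloc_one(): returns NULL on (simulated) failure. */
static void *toy_pte_alloc(void)
{
	if (toy_fail_next_alloc) {
		toy_fail_next_alloc = false;
		return NULL;
	}
	return calloc(512, sizeof(unsigned long));
}

/*
 * Returns 0 if the entry is not huge on return (split here, or found
 * already split), and -ENOMEM if no table could be allocated. As in the
 * RFC, the allocation happens before the entry is inspected, so a failed
 * allocation is reported even if no split would have been needed.
 */
static int toy_split_pmd(struct toy_pmd *pmd)
{
	void *table;

	assert(pmd != NULL);

	/* Allocate before taking the lock, since allocation may block. */
	table = toy_pte_alloc();
	if (!table)
		return -ENOMEM;

	pthread_mutex_lock(&pmd->lock);
	if (pmd->huge) {
		pmd->pte_table = table;	/* the entry now owns the table */
		pmd->huge = false;
		table = NULL;
	}
	pthread_mutex_unlock(&pmd->lock);

	/* Not consumed (entry was already split): give the table back. */
	free(table);
	return 0;
}
```

A caller in this model checks the return value and decides whether to retry, skip with an explicit error, or propagate it, which is the handling David asks for in place of a silent skip.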
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif 2026-02-11 13:25 ` David Hildenbrand (Arm) 2026-02-11 13:35 ` David Hildenbrand (Arm) @ 2026-02-11 19:28 ` Matthew Wilcox 2026-02-11 19:55 ` David Hildenbrand (Arm) 2 siblings, 1 reply; 22+ messages in thread From: Matthew Wilcox @ 2026-02-11 19:28 UTC (permalink / raw) To: Usama Arif Cc: Andrew Morton, david, lorenzo.stoakes, linux-mm, fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On Wed, Feb 11, 2026 at 04:49:45AM -0800, Usama Arif wrote: > - Memory reclaim (try_to_unmap()): Returns false, folio rotated back > LRU, retried in next reclaim cycle. I was advised to ask my stupid question ... Why do we still try to split the PMD in reclaim? I understand we're about to swap the folio out and we'll need to put a swap entry in the page table so we can find it again. But can't we now store swap entries at the PMD level, or are we still forced to store 512 entries at the PTE level? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time 2026-02-11 19:28 ` Matthew Wilcox @ 2026-02-11 19:55 ` David Hildenbrand (Arm) 0 siblings, 0 replies; 22+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-11 19:55 UTC (permalink / raw) To: Matthew Wilcox, Usama Arif Cc: Andrew Morton, lorenzo.stoakes, linux-mm, fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 2/11/26 20:28, Matthew Wilcox wrote: > On Wed, Feb 11, 2026 at 04:49:45AM -0800, Usama Arif wrote: >> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back >> LRU, retried in next reclaim cycle. > > I was advised to ask my stupid question ... > > Why do we still try to split the PMD in reclaim? I understand we're > about to swap the folio out and we'll need to put a swap entry in the page > table so we can find it again. But can't we now store swap entries at the > PMD level, or are we still forced to store 512 entries at the PTE level? Yes. We don't support PMD swap entries yet. I don't know all historical details. I suspect there are some rough edges around swapin (assume we cannot swapin a 2M THP), and maybe it was just easier to not deal with splitting of PMD swap entries (which we would similarly have to support). For sure an interesting project to look into. -- Cheers, David ^ permalink raw reply [flat|nested] 22+ messages in thread
* [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif 2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif @ 2026-02-11 12:49 ` Usama Arif 2026-02-11 13:27 ` David Hildenbrand (Arm) ` (2 more replies) 1 sibling, 3 replies; 22+ messages in thread From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw) To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team, Usama Arif Add a vmstat counter to track PTE allocation failures during PMD split. This enables monitoring of split failures due to memory pressure after the lazy PTE page table allocation change. The counter is incremented in three places: - __split_huge_pmd(): Main entry point for splitting a PMD - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP - try_to_migrate_one(): When migration needs to split a PMD-mapped THP Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed. 
Signed-off-by: Usama Arif <usama.arif@linux.dev> --- include/linux/vm_event_item.h | 1 + mm/huge_memory.c | 1 + mm/rmap.c | 3 +++ mm/vmstat.c | 1 + 4 files changed, 6 insertions(+) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 22a139f82d75f..827c9a8c251de 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_DEFERRED_SPLIT_PAGE, THP_UNDERUSED_SPLIT_PAGE, THP_SPLIT_PMD, + THP_SPLIT_PMD_PTE_ALLOC_FAILED, THP_SCAN_EXCEED_NONE_PTE, THP_SCAN_EXCEED_SWAP_PTE, THP_SCAN_EXCEED_SHARED_PTE, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4c9a8d89fc8aa..8d7c9f67f8a1d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3332,6 +3332,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) { pgtable = pte_alloc_one(vma->vm_mm); if (!pgtable) { + count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); mmu_notifier_invalidate_range_end(&range); return; } diff --git a/mm/rmap.c b/mm/rmap.c index c6ff23fc12944..5c4afedb29d5a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2070,8 +2070,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, pgtable_t pgtable = prealloc_pte; prealloc_pte = NULL; + if (!arch_needs_pgtable_deposit() && !pgtable && vma_is_anonymous(vma)) { + count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); page_vma_mapped_walk_done(&pvmw); ret = false; break; @@ -2474,6 +2476,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, prealloc_pte = NULL; if (!arch_needs_pgtable_deposit() && !pgtable && vma_is_anonymous(vma)) { + count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); page_vma_mapped_walk_done(&pvmw); ret = false; break; diff --git a/mm/vmstat.c b/mm/vmstat.c index 99270713e0c13..473edfa624a41 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1408,6 +1408,7 @@ const char * const vmstat_text[] = { 
[I(THP_DEFERRED_SPLIT_PAGE)] = "thp_deferred_split_page", [I(THP_UNDERUSED_SPLIT_PAGE)] = "thp_underused_split_page", [I(THP_SPLIT_PMD)] = "thp_split_pmd", + [I(THP_SPLIT_PMD_PTE_ALLOC_FAILED)] = "thp_split_pmd_pte_alloc_failed", [I(THP_SCAN_EXCEED_NONE_PTE)] = "thp_scan_exceed_none_pte", [I(THP_SCAN_EXCEED_SWAP_PTE)] = "thp_scan_exceed_swap_pte", [I(THP_SCAN_EXCEED_SHARED_PTE)] = "thp_scan_exceed_share_pte", -- 2.47.3 ^ permalink raw reply related [flat|nested] 22+ messages in thread
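For anyone wanting to watch the new counter from userspace, a minimal sketch follows. It assumes only the usual /proc/vmstat layout of one "name value" pair per line. `vmstat_lookup()` and `vmstat_read_counter()` are hypothetical helper names, and `thp_split_pmd_pte_alloc_failed` exists only on a kernel carrying this RFC (the name may change in later revisions), so a lookup failure has to be expected elsewhere.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up NAME in text formatted like /proc/vmstat ("name value" per line).
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long long *val)
{
	size_t len;
	const char *line = text;

	assert(text && name && val);
	len = strlen(name);

	while (line && *line) {
		/* Require the full name followed by the separating space. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			*val = strtoull(line + len + 1, NULL, 10);
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read /proc/vmstat and look up one counter; -1 on I/O error or absent name. */
static int vmstat_read_counter(const char *name, unsigned long long *val)
{
	static char buf[65536];
	FILE *f = fopen("/proc/vmstat", "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return vmstat_lookup(buf, name, val);
}
```

Matching on the name plus the trailing space matters here because `thp_split_pmd` is a prefix of `thp_split_pmd_pte_alloc_failed`; a plain prefix match would return the wrong counter.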
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif @ 2026-02-11 13:27 ` David Hildenbrand (Arm) 2026-02-11 13:31 ` Usama Arif 2026-02-12 21:40 ` kernel test robot 2026-02-12 21:40 ` kernel test robot 2 siblings, 1 reply; 22+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-11 13:27 UTC (permalink / raw) To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 2/11/26 13:49, Usama Arif wrote: > Add a vmstat counter to track PTE allocation failures during PMD split. > This enables monitoring of split failures due to memory pressure after > the lazy PTE page table allocation change. > > The counter is incremented in three places: > - __split_huge_pmd(): Main entry point for splitting a PMD > - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP > - try_to_migrate_one(): When migration needs to split a PMD-mapped THP > > Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed. > > Signed-off-by: Usama Arif <usama.arif@linux.dev> > --- > include/linux/vm_event_item.h | 1 + > mm/huge_memory.c | 1 + > mm/rmap.c | 3 +++ > mm/vmstat.c | 1 + > 4 files changed, 6 insertions(+) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 22a139f82d75f..827c9a8c251de 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > THP_DEFERRED_SPLIT_PAGE, > THP_UNDERUSED_SPLIT_PAGE, > THP_SPLIT_PMD, > + THP_SPLIT_PMD_PTE_ALLOC_FAILED, Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well. It's a shame that we called a remapping a "split" and keep causing confusion. 
-- Cheers, David ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 13:27 ` David Hildenbrand (Arm) @ 2026-02-11 13:31 ` Usama Arif 2026-02-11 13:36 ` David Hildenbrand (Arm) 2026-02-11 13:38 ` David Hildenbrand (Arm) 0 siblings, 2 replies; 22+ messages in thread From: Usama Arif @ 2026-02-11 13:31 UTC (permalink / raw) To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 11/02/2026 13:27, David Hildenbrand (Arm) wrote: > On 2/11/26 13:49, Usama Arif wrote: >> Add a vmstat counter to track PTE allocation failures during PMD split. >> This enables monitoring of split failures due to memory pressure after >> the lazy PTE page table allocation change. >> >> The counter is incremented in three places: >> - __split_huge_pmd(): Main entry point for splitting a PMD >> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP >> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP >> >> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed. >> >> Signed-off-by: Usama Arif <usama.arif@linux.dev> >> --- >> include/linux/vm_event_item.h | 1 + >> mm/huge_memory.c | 1 + >> mm/rmap.c | 3 +++ >> mm/vmstat.c | 1 + >> 4 files changed, 6 insertions(+) >> >> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h >> index 22a139f82d75f..827c9a8c251de 100644 >> --- a/include/linux/vm_event_item.h >> +++ b/include/linux/vm_event_item.h >> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, >> THP_DEFERRED_SPLIT_PAGE, >> THP_UNDERUSED_SPLIT_PAGE, >> THP_SPLIT_PMD, >> + THP_SPLIT_PMD_PTE_ALLOC_FAILED, > > Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well. > Makes sense. This was just a patch I was using for testing and I wanted to share. 
It was always 0 as I couldn't get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED as suggested and we can use it for future split failures (hopefully none). > It's a shame that we called a remapping a "split" and keep causing confusion. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 13:31 ` Usama Arif @ 2026-02-11 13:36 ` David Hildenbrand (Arm) 2026-02-11 13:42 ` Usama Arif 2026-02-11 13:38 ` David Hildenbrand (Arm) 1 sibling, 1 reply; 22+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-11 13:36 UTC (permalink / raw) To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 2/11/26 14:31, Usama Arif wrote: > > > On 11/02/2026 13:27, David Hildenbrand (Arm) wrote: >> On 2/11/26 13:49, Usama Arif wrote: >>> Add a vmstat counter to track PTE allocation failures during PMD split. >>> This enables monitoring of split failures due to memory pressure after >>> the lazy PTE page table allocation change. >>> >>> The counter is incremented in three places: >>> - __split_huge_pmd(): Main entry point for splitting a PMD >>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP >>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP >>> >>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed. >>> >>> Signed-off-by: Usama Arif <usama.arif@linux.dev> >>> --- >>> include/linux/vm_event_item.h | 1 + >>> mm/huge_memory.c | 1 + >>> mm/rmap.c | 3 +++ >>> mm/vmstat.c | 1 + >>> 4 files changed, 6 insertions(+) >>> >>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h >>> index 22a139f82d75f..827c9a8c251de 100644 >>> --- a/include/linux/vm_event_item.h >>> +++ b/include/linux/vm_event_item.h >>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, >>> THP_DEFERRED_SPLIT_PAGE, >>> THP_UNDERUSED_SPLIT_PAGE, >>> THP_SPLIT_PMD, >>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED, >> >> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well. >> > > Makes sense. 
This was just a patch I was using for testing and I wanted to share. > It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED > as suggested and we can use for future split failures (hopefully none). I guess it might be reasonable to have because I am sure it will fail at some point and maybe provoke weird issues we didn't think of. In that case, having an indication that splitting failed at some point might be reasonable. -- Cheers, David ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 13:36 ` David Hildenbrand (Arm) @ 2026-02-11 13:42 ` Usama Arif 0 siblings, 0 replies; 22+ messages in thread From: Usama Arif @ 2026-02-11 13:42 UTC (permalink / raw) To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 11/02/2026 13:36, David Hildenbrand (Arm) wrote: > On 2/11/26 14:31, Usama Arif wrote: >> >> >> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote: >>> On 2/11/26 13:49, Usama Arif wrote: >>>> Add a vmstat counter to track PTE allocation failures during PMD split. >>>> This enables monitoring of split failures due to memory pressure after >>>> the lazy PTE page table allocation change. >>>> >>>> The counter is incremented in three places: >>>> - __split_huge_pmd(): Main entry point for splitting a PMD >>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP >>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP >>>> >>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed. >>>> >>>> Signed-off-by: Usama Arif <usama.arif@linux.dev> >>>> --- >>>> include/linux/vm_event_item.h | 1 + >>>> mm/huge_memory.c | 1 + >>>> mm/rmap.c | 3 +++ >>>> mm/vmstat.c | 1 + >>>> 4 files changed, 6 insertions(+) >>>> >>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h >>>> index 22a139f82d75f..827c9a8c251de 100644 >>>> --- a/include/linux/vm_event_item.h >>>> +++ b/include/linux/vm_event_item.h >>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, >>>> THP_DEFERRED_SPLIT_PAGE, >>>> THP_UNDERUSED_SPLIT_PAGE, >>>> THP_SPLIT_PMD, >>>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED, >>> >>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well. >>> >> >> Makes sense. 
This was just a patch I was using for testing and I wanted to share. >> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED >> as suggested and we can use for future split failures (hopefully none). > > I guess it might be reasonable to have because I am sure it will fail at some point and maybe provoke weird issues we didn't think of. In that case, having an indication that splitting failed at some point might be reasonable. > ack ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 13:31 ` Usama Arif 2026-02-11 13:36 ` David Hildenbrand (Arm) @ 2026-02-11 13:38 ` David Hildenbrand (Arm) 2026-02-11 13:43 ` Usama Arif 1 sibling, 1 reply; 22+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-11 13:38 UTC (permalink / raw) To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 2/11/26 14:31, Usama Arif wrote: > > > On 11/02/2026 13:27, David Hildenbrand (Arm) wrote: >> On 2/11/26 13:49, Usama Arif wrote: >>> Add a vmstat counter to track PTE allocation failures during PMD split. >>> This enables monitoring of split failures due to memory pressure after >>> the lazy PTE page table allocation change. >>> >>> The counter is incremented in three places: >>> - __split_huge_pmd(): Main entry point for splitting a PMD >>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP >>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP >>> >>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed. >>> >>> Signed-off-by: Usama Arif <usama.arif@linux.dev> >>> --- >>> include/linux/vm_event_item.h | 1 + >>> mm/huge_memory.c | 1 + >>> mm/rmap.c | 3 +++ >>> mm/vmstat.c | 1 + >>> 4 files changed, 6 insertions(+) >>> >>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h >>> index 22a139f82d75f..827c9a8c251de 100644 >>> --- a/include/linux/vm_event_item.h >>> +++ b/include/linux/vm_event_item.h >>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, >>> THP_DEFERRED_SPLIT_PAGE, >>> THP_UNDERUSED_SPLIT_PAGE, >>> THP_SPLIT_PMD, >>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED, >> >> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well. >> > > Makes sense. 
This was just a patch I was using for testing and I wanted to share. > It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED > as suggested and we can use for future split failures (hopefully none). Btw, you can use the allocation fault injection framework to find weird issues, if you haven't heard of that yet. -- Cheers, David ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 13:38 ` David Hildenbrand (Arm) @ 2026-02-11 13:43 ` Usama Arif 0 siblings, 0 replies; 22+ messages in thread From: Usama Arif @ 2026-02-11 13:43 UTC (permalink / raw) To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team On 11/02/2026 13:38, David Hildenbrand (Arm) wrote: > On 2/11/26 14:31, Usama Arif wrote: >> >> >> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote: >>> On 2/11/26 13:49, Usama Arif wrote: >>>> Add a vmstat counter to track PTE allocation failures during PMD split. >>>> This enables monitoring of split failures due to memory pressure after >>>> the lazy PTE page table allocation change. >>>> >>>> The counter is incremented in three places: >>>> - __split_huge_pmd(): Main entry point for splitting a PMD >>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP >>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP >>>> >>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed. >>>> >>>> Signed-off-by: Usama Arif <usama.arif@linux.dev> >>>> --- >>>> include/linux/vm_event_item.h | 1 + >>>> mm/huge_memory.c | 1 + >>>> mm/rmap.c | 3 +++ >>>> mm/vmstat.c | 1 + >>>> 4 files changed, 6 insertions(+) >>>> >>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h >>>> index 22a139f82d75f..827c9a8c251de 100644 >>>> --- a/include/linux/vm_event_item.h >>>> +++ b/include/linux/vm_event_item.h >>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, >>>> THP_DEFERRED_SPLIT_PAGE, >>>> THP_UNDERUSED_SPLIT_PAGE, >>>> THP_SPLIT_PMD, >>>> + THP_SPLIT_PMD_PTE_ALLOC_FAILED, >>> >>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well. >>> >> >> Makes sense. 
This was just a patch I was using for testing and I wanted to share. >> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED >> as suggested and we can use for future split failures (hopefully none). > > Btw, you can use the allocation fault injection framework to find weird issues, if you haven't heard of that yet. > This looks very interesting, Thanks! Let me have a look. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif 2026-02-11 13:27 ` David Hildenbrand (Arm) @ 2026-02-12 21:40 ` kernel test robot 2026-02-12 21:40 ` kernel test robot 2 siblings, 0 replies; 22+ messages in thread From: kernel test robot @ 2026-02-12 21:40 UTC (permalink / raw) To: Usama Arif; +Cc: llvm, oe-kbuild-all Hi Usama, [This is a private test report for your RFC patch.] kernel test robot noticed the following build errors: [auto build test ERROR on linus/master] [also build test ERROR on v6.19 next-20260212] [cannot apply to akpm-mm/mm-everything] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-allocate-PTE-page-tables-lazily-at-split-time/20260211-205726 base: linus/master patch link: https://lore.kernel.org/r/20260211125507.4175026-3-usama.arif%40linux.dev patch subject: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20260213/202602130506.7Tm8CJkW-lkp@intel.com/config) compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260213/202602130506.7Tm8CJkW-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202602130506.7Tm8CJkW-lkp@intel.com/ All errors (new ones prefixed by >>): >> mm/rmap.c:1961:21: error: use of undeclared identifier 'THP_SPLIT_PMD_PTE_ALLOC_FAILED' 1961 | count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); | ^ mm/rmap.c:2364:21: error: use of undeclared identifier 'THP_SPLIT_PMD_PTE_ALLOC_FAILED' 2364 | count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); | ^ 2 errors generated. vim +/THP_SPLIT_PMD_PTE_ALLOC_FAILED +1961 mm/rmap.c 1849 1850 /* 1851 * @arg: enum ttu_flags will be passed to this argument 1852 */ 1853 static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, 1854 unsigned long address, void *arg) 1855 { 1856 struct mm_struct *mm = vma->vm_mm; 1857 DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); 1858 bool anon_exclusive, ret = true; 1859 pte_t pteval; 1860 struct page *subpage; 1861 struct mmu_notifier_range range; 1862 enum ttu_flags flags = (enum ttu_flags)(long)arg; 1863 unsigned long nr_pages = 1, end_addr; 1864 unsigned long pfn; 1865 unsigned long hsz = 0; 1866 int ptes = 0; 1867 pgtable_t prealloc_pte = NULL; 1868 1869 /* 1870 * When racing against e.g. zap_pte_range() on another cpu, 1871 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(), 1872 * try_to_unmap() may return before page_mapped() has become false, 1873 * if page table locking is skipped: use TTU_SYNC to wait for that. 1874 */ 1875 if (flags & TTU_SYNC) 1876 pvmw.flags = PVMW_SYNC; 1877 1878 /* 1879 * For THP, we have to assume the worse case ie pmd for invalidation. 1880 * For hugetlb, it could be much worse if we need to do pud 1881 * invalidation in the case of pmd sharing. 1882 * 1883 * Note that the folio can not be freed in this function as call of 1884 * try_to_unmap() must hold a reference on the folio. 
1885 */ 1886 range.end = vma_address_end(&pvmw); 1887 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, 1888 address, range.end); 1889 if (folio_test_hugetlb(folio)) { 1890 /* 1891 * If sharing is possible, start and end will be adjusted 1892 * accordingly. 1893 */ 1894 adjust_range_if_pmd_sharing_possible(vma, &range.start, 1895 &range.end); 1896 1897 /* We need the huge page size for set_huge_pte_at() */ 1898 hsz = huge_page_size(hstate_vma(vma)); 1899 } 1900 mmu_notifier_invalidate_range_start(&range); 1901 1902 if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) && 1903 !arch_needs_pgtable_deposit()) 1904 prealloc_pte = pte_alloc_one(mm); 1905 1906 while (page_vma_mapped_walk(&pvmw)) { 1907 /* 1908 * If the folio is in an mlock()d vma, we must not swap it out. 1909 */ 1910 if (!(flags & TTU_IGNORE_MLOCK) && 1911 (vma->vm_flags & VM_LOCKED)) { 1912 ptes++; 1913 1914 /* 1915 * Set 'ret' to indicate the page cannot be unmapped. 1916 * 1917 * Do not jump to walk_abort immediately as additional 1918 * iteration might be required to detect fully mapped 1919 * folio an mlock it. 1920 */ 1921 ret = false; 1922 1923 /* Only mlock fully mapped pages */ 1924 if (pvmw.pte && ptes != pvmw.nr_pages) 1925 continue; 1926 1927 /* 1928 * All PTEs must be protected by page table lock in 1929 * order to mlock the page. 1930 * 1931 * If page table boundary has been cross, current ptl 1932 * only protect part of ptes. 1933 */ 1934 if (pvmw.flags & PVMW_PGTABLE_CROSSED) 1935 goto walk_done; 1936 1937 /* Restore the mlock which got missed */ 1938 mlock_vma_folio(folio, vma); 1939 goto walk_done; 1940 } 1941 1942 if (!pvmw.pte) { 1943 if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) { 1944 if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio)) 1945 goto walk_done; 1946 /* 1947 * unmap_huge_pmd_locked has either already marked 1948 * the folio as swap-backed or decided to retain it 1949 * due to GUP or speculative references. 
1950 */ 1951 goto walk_abort; 1952 } 1953 1954 if (flags & TTU_SPLIT_HUGE_PMD) { 1955 pgtable_t pgtable = prealloc_pte; 1956 1957 prealloc_pte = NULL; 1958 1959 if (!arch_needs_pgtable_deposit() && !pgtable && 1960 vma_is_anonymous(vma)) { > 1961 count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); 1962 page_vma_mapped_walk_done(&pvmw); 1963 ret = false; 1964 break; 1965 } 1966 /* 1967 * We temporarily have to drop the PTL and 1968 * restart so we can process the PTE-mapped THP. 1969 */ 1970 split_huge_pmd_locked(vma, pvmw.address, 1971 pvmw.pmd, false, pgtable); 1972 flags &= ~TTU_SPLIT_HUGE_PMD; 1973 page_vma_mapped_walk_restart(&pvmw); 1974 continue; 1975 } 1976 } 1977 1978 /* Unexpected PMD-mapped THP? */ 1979 VM_BUG_ON_FOLIO(!pvmw.pte, folio); 1980 1981 /* 1982 * Handle PFN swap PTEs, such as device-exclusive ones, that 1983 * actually map pages. 1984 */ 1985 pteval = ptep_get(pvmw.pte); 1986 if (likely(pte_present(pteval))) { 1987 pfn = pte_pfn(pteval); 1988 } else { 1989 const softleaf_t entry = softleaf_from_pte(pteval); 1990 1991 pfn = softleaf_to_pfn(entry); 1992 VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); 1993 } 1994 1995 subpage = folio_page(folio, pfn - folio_pfn(folio)); 1996 address = pvmw.address; 1997 anon_exclusive = folio_test_anon(folio) && 1998 PageAnonExclusive(subpage); 1999 2000 if (folio_test_hugetlb(folio)) { 2001 bool anon = folio_test_anon(folio); 2002 2003 /* 2004 * The try_to_unmap() is only passed a hugetlb page 2005 * in the case where the hugetlb page is poisoned. 2006 */ 2007 VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage); 2008 /* 2009 * huge_pmd_unshare may unmap an entire PMD page. 2010 * There is no way of knowing exactly which PMDs may 2011 * be cached for this mm, so we must flush them all. 2012 * start/end were already adjusted above to cover this 2013 * range. 2014 */ 2015 flush_cache_range(vma, range.start, range.end); 2016 2017 /* 2018 * To call huge_pmd_unshare, i_mmap_rwsem must be 2019 * held in write mode. 
Caller needs to explicitly 2020 * do this outside rmap routines. 2021 * 2022 * We also must hold hugetlb vma_lock in write mode. 2023 * Lock order dictates acquiring vma_lock BEFORE 2024 * i_mmap_rwsem. We can only try lock here and fail 2025 * if unsuccessful. 2026 */ 2027 if (!anon) { 2028 struct mmu_gather tlb; 2029 2030 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); 2031 if (!hugetlb_vma_trylock_write(vma)) 2032 goto walk_abort; 2033 2034 tlb_gather_mmu_vma(&tlb, vma); 2035 if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) { 2036 hugetlb_vma_unlock_write(vma); 2037 huge_pmd_unshare_flush(&tlb, vma); 2038 tlb_finish_mmu(&tlb); 2039 /* 2040 * The PMD table was unmapped, 2041 * consequently unmapping the folio. 2042 */ 2043 goto walk_done; 2044 } 2045 hugetlb_vma_unlock_write(vma); 2046 tlb_finish_mmu(&tlb); 2047 } 2048 pteval = huge_ptep_clear_flush(vma, address, pvmw.pte); 2049 if (pte_dirty(pteval)) 2050 folio_mark_dirty(folio); 2051 } else if (likely(pte_present(pteval))) { 2052 nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval); 2053 end_addr = address + nr_pages * PAGE_SIZE; 2054 flush_cache_range(vma, address, end_addr); 2055 2056 /* Nuke the page table entry. */ 2057 pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages); 2058 /* 2059 * We clear the PTE but do not flush so potentially 2060 * a remote CPU could still be writing to the folio. 2061 * If the entry was previously clean then the 2062 * architecture must guarantee that a clear->dirty 2063 * transition on a cached TLB entry is written through 2064 * and traps if the PTE is unmapped. 2065 */ 2066 if (should_defer_flush(mm, flags)) 2067 set_tlb_ubc_flush_pending(mm, pteval, address, end_addr); 2068 else 2069 flush_tlb_range(vma, address, end_addr); 2070 if (pte_dirty(pteval)) 2071 folio_mark_dirty(folio); 2072 } else { 2073 pte_clear(mm, address, pvmw.pte); 2074 } 2075 2076 /* 2077 * Now the pte is cleared. 
If this pte was uffd-wp armed, 2078 * we may want to replace a none pte with a marker pte if 2079 * it's file-backed, so we don't lose the tracking info. 2080 */ 2081 pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval); 2082 2083 /* Update high watermark before we lower rss */ 2084 update_hiwater_rss(mm); 2085 2086 if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) { 2087 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); 2088 if (folio_test_hugetlb(folio)) { 2089 hugetlb_count_sub(folio_nr_pages(folio), mm); 2090 set_huge_pte_at(mm, address, pvmw.pte, pteval, 2091 hsz); 2092 } else { 2093 dec_mm_counter(mm, mm_counter(folio)); 2094 set_pte_at(mm, address, pvmw.pte, pteval); 2095 } 2096 } else if (likely(pte_present(pteval)) && pte_unused(pteval) && 2097 !userfaultfd_armed(vma)) { 2098 /* 2099 * The guest indicated that the page content is of no 2100 * interest anymore. Simply discard the pte, vmscan 2101 * will take care of the rest. 2102 * A future reference will then fault in a new zero 2103 * page. When userfaultfd is active, we must not drop 2104 * this page though, as its main user (postcopy 2105 * migration) will not expect userfaults on already 2106 * copied pages. 2107 */ 2108 dec_mm_counter(mm, mm_counter(folio)); 2109 } else if (folio_test_anon(folio)) { 2110 swp_entry_t entry = page_swap_entry(subpage); 2111 pte_t swp_pte; 2112 /* 2113 * Store the swap location in the pte. 2114 * See handle_pte_fault() ... 
2115 */ 2116 if (unlikely(folio_test_swapbacked(folio) != 2117 folio_test_swapcache(folio))) { 2118 WARN_ON_ONCE(1); 2119 goto walk_abort; 2120 } 2121 2122 /* MADV_FREE page check */ 2123 if (!folio_test_swapbacked(folio)) { 2124 int ref_count, map_count; 2125 2126 /* 2127 * Synchronize with gup_pte_range(): 2128 * - clear PTE; barrier; read refcount 2129 * - inc refcount; barrier; read PTE 2130 */ 2131 smp_mb(); 2132 2133 ref_count = folio_ref_count(folio); 2134 map_count = folio_mapcount(folio); 2135 2136 /* 2137 * Order reads for page refcount and dirty flag 2138 * (see comments in __remove_mapping()). 2139 */ 2140 smp_rmb(); 2141 2142 if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) { 2143 /* 2144 * redirtied either using the page table or a previously 2145 * obtained GUP reference. 2146 */ 2147 set_ptes(mm, address, pvmw.pte, pteval, nr_pages); 2148 folio_set_swapbacked(folio); 2149 goto walk_abort; 2150 } else if (ref_count != 1 + map_count) { 2151 /* 2152 * Additional reference. Could be a GUP reference or any 2153 * speculative reference. GUP users must mark the folio 2154 * dirty if there was a modification. This folio cannot be 2155 * reclaimed right now either way, so act just like nothing 2156 * happened. 2157 * We'll come back here later and detect if the folio was 2158 * dirtied when the additional reference is gone. 2159 */ 2160 set_ptes(mm, address, pvmw.pte, pteval, nr_pages); 2161 goto walk_abort; 2162 } 2163 add_mm_counter(mm, MM_ANONPAGES, -nr_pages); 2164 goto discard; 2165 } 2166 2167 if (swap_duplicate(entry) < 0) { 2168 set_pte_at(mm, address, pvmw.pte, pteval); 2169 goto walk_abort; 2170 } 2171 2172 /* 2173 * arch_unmap_one() is expected to be a NOP on 2174 * architectures where we could have PFN swap PTEs, 2175 * so we'll not check/care. 
2176 */ 2177 if (arch_unmap_one(mm, vma, address, pteval) < 0) { 2178 swap_free(entry); 2179 set_pte_at(mm, address, pvmw.pte, pteval); 2180 goto walk_abort; 2181 } 2182 2183 /* See folio_try_share_anon_rmap(): clear PTE first. */ 2184 if (anon_exclusive && 2185 folio_try_share_anon_rmap_pte(folio, subpage)) { 2186 swap_free(entry); 2187 set_pte_at(mm, address, pvmw.pte, pteval); 2188 goto walk_abort; 2189 } 2190 if (list_empty(&mm->mmlist)) { 2191 spin_lock(&mmlist_lock); 2192 if (list_empty(&mm->mmlist)) 2193 list_add(&mm->mmlist, &init_mm.mmlist); 2194 spin_unlock(&mmlist_lock); 2195 } 2196 dec_mm_counter(mm, MM_ANONPAGES); 2197 inc_mm_counter(mm, MM_SWAPENTS); 2198 swp_pte = swp_entry_to_pte(entry); 2199 if (anon_exclusive) 2200 swp_pte = pte_swp_mkexclusive(swp_pte); 2201 if (likely(pte_present(pteval))) { 2202 if (pte_soft_dirty(pteval)) 2203 swp_pte = pte_swp_mksoft_dirty(swp_pte); 2204 if (pte_uffd_wp(pteval)) 2205 swp_pte = pte_swp_mkuffd_wp(swp_pte); 2206 } else { 2207 if (pte_swp_soft_dirty(pteval)) 2208 swp_pte = pte_swp_mksoft_dirty(swp_pte); 2209 if (pte_swp_uffd_wp(pteval)) 2210 swp_pte = pte_swp_mkuffd_wp(swp_pte); 2211 } 2212 set_pte_at(mm, address, pvmw.pte, swp_pte); 2213 } else { 2214 /* 2215 * This is a locked file-backed folio, 2216 * so it cannot be removed from the page 2217 * cache and replaced by a new folio before 2218 * mmu_notifier_invalidate_range_end, so no 2219 * concurrent thread might update its page table 2220 * to point at a new folio while a device is 2221 * still using this folio. 
2222 * 2223 * See Documentation/mm/mmu_notifier.rst 2224 */ 2225 dec_mm_counter(mm, mm_counter_file(folio)); 2226 } 2227 discard: 2228 if (unlikely(folio_test_hugetlb(folio))) { 2229 hugetlb_remove_rmap(folio); 2230 } else { 2231 folio_remove_rmap_ptes(folio, subpage, nr_pages, vma); 2232 } 2233 if (vma->vm_flags & VM_LOCKED) 2234 mlock_drain_local(); 2235 folio_put_refs(folio, nr_pages); 2236 2237 /* 2238 * If we are sure that we batched the entire folio and cleared 2239 * all PTEs, we can just optimize and stop right here. 2240 */ 2241 if (nr_pages == folio_nr_pages(folio)) 2242 goto walk_done; 2243 continue; 2244 walk_abort: 2245 ret = false; 2246 walk_done: 2247 page_vma_mapped_walk_done(&pvmw); 2248 break; 2249 } 2250 2251 if (prealloc_pte) 2252 pte_free(mm, prealloc_pte); 2253 2254 mmu_notifier_invalidate_range_end(&range); 2255 2256 return ret; 2257 } 2258 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 22+ messages in thread
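The listing above preallocates a PTE table before the walk (pte_alloc_one()), hands it to split_huge_pmd_locked() when a split is needed, bails out with ret = false if the allocation failed, and frees the table at the end if it was never consumed. What follows is a minimal userspace sketch of that ownership flow only. It assumes nothing about the real kernel types: struct table, split_with_table() and unmap_one() are hypothetical names invented for illustration, and calloc()/free() merely stand in for pte_alloc_one()/pte_free().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PTE page table; not a kernel type. */
struct table { int entries[4]; };

/* Hypothetical split helper: takes ownership of tbl, which must be non-NULL. */
static void split_with_table(struct table **slot, struct table *tbl)
{
	*slot = tbl;
}

/*
 * Mirrors the control flow only: allocate up front, hand the table over
 * if a split is needed, fail (rather than split) when the allocation
 * failed, and free the table if it was never consumed.
 */
static bool unmap_one(bool need_split, bool alloc_ok, struct table **slot)
{
	struct table *prealloc = alloc_ok ? calloc(1, sizeof(*prealloc)) : NULL;
	bool ret = true;

	if (need_split) {
		struct table *tbl = prealloc;

		prealloc = NULL;	/* ownership moves to the split path */
		if (!tbl)
			ret = false;	/* no table: give up instead of splitting */
		else
			split_with_table(slot, tbl);
	}

	free(prealloc);		/* unused preallocation; free(NULL) is a no-op */
	return ret;
}
```

Clearing the local pointer before the split is what keeps the final free from touching a table that has already been handed over, which is the same reason the listing sets prealloc_pte = NULL before calling split_huge_pmd_locked().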
* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter 2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif 2026-02-11 13:27 ` David Hildenbrand (Arm) 2026-02-12 21:40 ` kernel test robot @ 2026-02-12 21:40 ` kernel test robot 2 siblings, 0 replies; 22+ messages in thread From: kernel test robot @ 2026-02-12 21:40 UTC (permalink / raw) To: Usama Arif; +Cc: oe-kbuild-all Hi Usama, [This is a private test report for your RFC patch.] kernel test robot noticed the following build errors: [auto build test ERROR on linus/master] [also build test ERROR on v6.19 next-20260212] [cannot apply to akpm-mm/mm-everything] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-thp-allocate-PTE-page-tables-lazily-at-split-time/20260211-205726 base: linus/master patch link: https://lore.kernel.org/r/20260211125507.4175026-3-usama.arif%40linux.dev patch subject: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260213/202602130520.mofTmHuk-lkp@intel.com/config) compiler: nios2-linux-gcc (GCC) 11.5.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260213/202602130520.mofTmHuk-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202602130520.mofTmHuk-lkp@intel.com/ All errors (new ones prefixed by >>): mm/rmap.c: In function 'try_to_unmap_one': >> mm/rmap.c:1961:56: error: 'THP_SPLIT_PMD_PTE_ALLOC_FAILED' undeclared (first use in this function) 1961 | count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/rmap.c:1961:56: note: each undeclared identifier is reported only once for each function it appears in mm/rmap.c: In function 'try_to_migrate_one': mm/rmap.c:2364:56: error: 'THP_SPLIT_PMD_PTE_ALLOC_FAILED' undeclared (first use in this function) 2364 | count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ vim +/THP_SPLIT_PMD_PTE_ALLOC_FAILED +1961 mm/rmap.c 1849 1850 /* 1851 * @arg: enum ttu_flags will be passed to this argument 1852 */ 1853 static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, 1854 unsigned long address, void *arg) 1855 { 1856 struct mm_struct *mm = vma->vm_mm; 1857 DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); 1858 bool anon_exclusive, ret = true; 1859 pte_t pteval; 1860 struct page *subpage; 1861 struct mmu_notifier_range range; 1862 enum ttu_flags flags = (enum ttu_flags)(long)arg; 1863 unsigned long nr_pages = 1, end_addr; 1864 unsigned long pfn; 1865 unsigned long hsz = 0; 1866 int ptes = 0; 1867 pgtable_t prealloc_pte = NULL; 1868 1869 /* 1870 * When racing against e.g. zap_pte_range() on another cpu, 1871 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(), 1872 * try_to_unmap() may return before page_mapped() has become false, 1873 * if page table locking is skipped: use TTU_SYNC to wait for that. 1874 */ 1875 if (flags & TTU_SYNC) 1876 pvmw.flags = PVMW_SYNC; 1877 1878 /* 1879 * For THP, we have to assume the worse case ie pmd for invalidation. 
1880 * For hugetlb, it could be much worse if we need to do pud 1881 * invalidation in the case of pmd sharing. 1882 * 1883 * Note that the folio can not be freed in this function as call of 1884 * try_to_unmap() must hold a reference on the folio. 1885 */ 1886 range.end = vma_address_end(&pvmw); 1887 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, 1888 address, range.end); 1889 if (folio_test_hugetlb(folio)) { 1890 /* 1891 * If sharing is possible, start and end will be adjusted 1892 * accordingly. 1893 */ 1894 adjust_range_if_pmd_sharing_possible(vma, &range.start, 1895 &range.end); 1896 1897 /* We need the huge page size for set_huge_pte_at() */ 1898 hsz = huge_page_size(hstate_vma(vma)); 1899 } 1900 mmu_notifier_invalidate_range_start(&range); 1901 1902 if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) && 1903 !arch_needs_pgtable_deposit()) 1904 prealloc_pte = pte_alloc_one(mm); 1905 1906 while (page_vma_mapped_walk(&pvmw)) { 1907 /* 1908 * If the folio is in an mlock()d vma, we must not swap it out. 1909 */ 1910 if (!(flags & TTU_IGNORE_MLOCK) && 1911 (vma->vm_flags & VM_LOCKED)) { 1912 ptes++; 1913 1914 /* 1915 * Set 'ret' to indicate the page cannot be unmapped. 1916 * 1917 * Do not jump to walk_abort immediately as additional 1918 * iteration might be required to detect fully mapped 1919 * folio an mlock it. 1920 */ 1921 ret = false; 1922 1923 /* Only mlock fully mapped pages */ 1924 if (pvmw.pte && ptes != pvmw.nr_pages) 1925 continue; 1926 1927 /* 1928 * All PTEs must be protected by page table lock in 1929 * order to mlock the page. 1930 * 1931 * If page table boundary has been cross, current ptl 1932 * only protect part of ptes. 
1933 */ 1934 if (pvmw.flags & PVMW_PGTABLE_CROSSED) 1935 goto walk_done; 1936 1937 /* Restore the mlock which got missed */ 1938 mlock_vma_folio(folio, vma); 1939 goto walk_done; 1940 } 1941 1942 if (!pvmw.pte) { 1943 if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) { 1944 if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio)) 1945 goto walk_done; 1946 /* 1947 * unmap_huge_pmd_locked has either already marked 1948 * the folio as swap-backed or decided to retain it 1949 * due to GUP or speculative references. 1950 */ 1951 goto walk_abort; 1952 } 1953 1954 if (flags & TTU_SPLIT_HUGE_PMD) { 1955 pgtable_t pgtable = prealloc_pte; 1956 1957 prealloc_pte = NULL; 1958 1959 if (!arch_needs_pgtable_deposit() && !pgtable && 1960 vma_is_anonymous(vma)) { > 1961 count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED); 1962 page_vma_mapped_walk_done(&pvmw); 1963 ret = false; 1964 break; 1965 } 1966 /* 1967 * We temporarily have to drop the PTL and 1968 * restart so we can process the PTE-mapped THP. 1969 */ 1970 split_huge_pmd_locked(vma, pvmw.address, 1971 pvmw.pmd, false, pgtable); 1972 flags &= ~TTU_SPLIT_HUGE_PMD; 1973 page_vma_mapped_walk_restart(&pvmw); 1974 continue; 1975 } 1976 } 1977 1978 /* Unexpected PMD-mapped THP? */ 1979 VM_BUG_ON_FOLIO(!pvmw.pte, folio); 1980 1981 /* 1982 * Handle PFN swap PTEs, such as device-exclusive ones, that 1983 * actually map pages. 
1984 */ 1985 pteval = ptep_get(pvmw.pte); 1986 if (likely(pte_present(pteval))) { 1987 pfn = pte_pfn(pteval); 1988 } else { 1989 const softleaf_t entry = softleaf_from_pte(pteval); 1990 1991 pfn = softleaf_to_pfn(entry); 1992 VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); 1993 } 1994 1995 subpage = folio_page(folio, pfn - folio_pfn(folio)); 1996 address = pvmw.address; 1997 anon_exclusive = folio_test_anon(folio) && 1998 PageAnonExclusive(subpage); 1999 2000 if (folio_test_hugetlb(folio)) { 2001 bool anon = folio_test_anon(folio); 2002 2003 /* 2004 * The try_to_unmap() is only passed a hugetlb page 2005 * in the case where the hugetlb page is poisoned. 2006 */ 2007 VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage); 2008 /* 2009 * huge_pmd_unshare may unmap an entire PMD page. 2010 * There is no way of knowing exactly which PMDs may 2011 * be cached for this mm, so we must flush them all. 2012 * start/end were already adjusted above to cover this 2013 * range. 2014 */ 2015 flush_cache_range(vma, range.start, range.end); 2016 2017 /* 2018 * To call huge_pmd_unshare, i_mmap_rwsem must be 2019 * held in write mode. Caller needs to explicitly 2020 * do this outside rmap routines. 2021 * 2022 * We also must hold hugetlb vma_lock in write mode. 2023 * Lock order dictates acquiring vma_lock BEFORE 2024 * i_mmap_rwsem. We can only try lock here and fail 2025 * if unsuccessful. 2026 */ 2027 if (!anon) { 2028 struct mmu_gather tlb; 2029 2030 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); 2031 if (!hugetlb_vma_trylock_write(vma)) 2032 goto walk_abort; 2033 2034 tlb_gather_mmu_vma(&tlb, vma); 2035 if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) { 2036 hugetlb_vma_unlock_write(vma); 2037 huge_pmd_unshare_flush(&tlb, vma); 2038 tlb_finish_mmu(&tlb); 2039 /* 2040 * The PMD table was unmapped, 2041 * consequently unmapping the folio. 
2042 */ 2043 goto walk_done; 2044 } 2045 hugetlb_vma_unlock_write(vma); 2046 tlb_finish_mmu(&tlb); 2047 } 2048 pteval = huge_ptep_clear_flush(vma, address, pvmw.pte); 2049 if (pte_dirty(pteval)) 2050 folio_mark_dirty(folio); 2051 } else if (likely(pte_present(pteval))) { 2052 nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval); 2053 end_addr = address + nr_pages * PAGE_SIZE; 2054 flush_cache_range(vma, address, end_addr); 2055 2056 /* Nuke the page table entry. */ 2057 pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages); 2058 /* 2059 * We clear the PTE but do not flush so potentially 2060 * a remote CPU could still be writing to the folio. 2061 * If the entry was previously clean then the 2062 * architecture must guarantee that a clear->dirty 2063 * transition on a cached TLB entry is written through 2064 * and traps if the PTE is unmapped. 2065 */ 2066 if (should_defer_flush(mm, flags)) 2067 set_tlb_ubc_flush_pending(mm, pteval, address, end_addr); 2068 else 2069 flush_tlb_range(vma, address, end_addr); 2070 if (pte_dirty(pteval)) 2071 folio_mark_dirty(folio); 2072 } else { 2073 pte_clear(mm, address, pvmw.pte); 2074 } 2075 2076 /* 2077 * Now the pte is cleared. If this pte was uffd-wp armed, 2078 * we may want to replace a none pte with a marker pte if 2079 * it's file-backed, so we don't lose the tracking info. 
2080 */ 2081 pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval); 2082 2083 /* Update high watermark before we lower rss */ 2084 update_hiwater_rss(mm); 2085 2086 if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) { 2087 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); 2088 if (folio_test_hugetlb(folio)) { 2089 hugetlb_count_sub(folio_nr_pages(folio), mm); 2090 set_huge_pte_at(mm, address, pvmw.pte, pteval, 2091 hsz); 2092 } else { 2093 dec_mm_counter(mm, mm_counter(folio)); 2094 set_pte_at(mm, address, pvmw.pte, pteval); 2095 } 2096 } else if (likely(pte_present(pteval)) && pte_unused(pteval) && 2097 !userfaultfd_armed(vma)) { 2098 /* 2099 * The guest indicated that the page content is of no 2100 * interest anymore. Simply discard the pte, vmscan 2101 * will take care of the rest. 2102 * A future reference will then fault in a new zero 2103 * page. When userfaultfd is active, we must not drop 2104 * this page though, as its main user (postcopy 2105 * migration) will not expect userfaults on already 2106 * copied pages. 2107 */ 2108 dec_mm_counter(mm, mm_counter(folio)); 2109 } else if (folio_test_anon(folio)) { 2110 swp_entry_t entry = page_swap_entry(subpage); 2111 pte_t swp_pte; 2112 /* 2113 * Store the swap location in the pte. 2114 * See handle_pte_fault() ... 2115 */ 2116 if (unlikely(folio_test_swapbacked(folio) != 2117 folio_test_swapcache(folio))) { 2118 WARN_ON_ONCE(1); 2119 goto walk_abort; 2120 } 2121 2122 /* MADV_FREE page check */ 2123 if (!folio_test_swapbacked(folio)) { 2124 int ref_count, map_count; 2125 2126 /* 2127 * Synchronize with gup_pte_range(): 2128 * - clear PTE; barrier; read refcount 2129 * - inc refcount; barrier; read PTE 2130 */ 2131 smp_mb(); 2132 2133 ref_count = folio_ref_count(folio); 2134 map_count = folio_mapcount(folio); 2135 2136 /* 2137 * Order reads for page refcount and dirty flag 2138 * (see comments in __remove_mapping()). 
2139 */ 2140 smp_rmb(); 2141 2142 if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) { 2143 /* 2144 * redirtied either using the page table or a previously 2145 * obtained GUP reference. 2146 */ 2147 set_ptes(mm, address, pvmw.pte, pteval, nr_pages); 2148 folio_set_swapbacked(folio); 2149 goto walk_abort; 2150 } else if (ref_count != 1 + map_count) { 2151 /* 2152 * Additional reference. Could be a GUP reference or any 2153 * speculative reference. GUP users must mark the folio 2154 * dirty if there was a modification. This folio cannot be 2155 * reclaimed right now either way, so act just like nothing 2156 * happened. 2157 * We'll come back here later and detect if the folio was 2158 * dirtied when the additional reference is gone. 2159 */ 2160 set_ptes(mm, address, pvmw.pte, pteval, nr_pages); 2161 goto walk_abort; 2162 } 2163 add_mm_counter(mm, MM_ANONPAGES, -nr_pages); 2164 goto discard; 2165 } 2166 2167 if (swap_duplicate(entry) < 0) { 2168 set_pte_at(mm, address, pvmw.pte, pteval); 2169 goto walk_abort; 2170 } 2171 2172 /* 2173 * arch_unmap_one() is expected to be a NOP on 2174 * architectures where we could have PFN swap PTEs, 2175 * so we'll not check/care. 2176 */ 2177 if (arch_unmap_one(mm, vma, address, pteval) < 0) { 2178 swap_free(entry); 2179 set_pte_at(mm, address, pvmw.pte, pteval); 2180 goto walk_abort; 2181 } 2182 2183 /* See folio_try_share_anon_rmap(): clear PTE first. 
*/ 2184 if (anon_exclusive && 2185 folio_try_share_anon_rmap_pte(folio, subpage)) { 2186 swap_free(entry); 2187 set_pte_at(mm, address, pvmw.pte, pteval); 2188 goto walk_abort; 2189 } 2190 if (list_empty(&mm->mmlist)) { 2191 spin_lock(&mmlist_lock); 2192 if (list_empty(&mm->mmlist)) 2193 list_add(&mm->mmlist, &init_mm.mmlist); 2194 spin_unlock(&mmlist_lock); 2195 } 2196 dec_mm_counter(mm, MM_ANONPAGES); 2197 inc_mm_counter(mm, MM_SWAPENTS); 2198 swp_pte = swp_entry_to_pte(entry); 2199 if (anon_exclusive) 2200 swp_pte = pte_swp_mkexclusive(swp_pte); 2201 if (likely(pte_present(pteval))) { 2202 if (pte_soft_dirty(pteval)) 2203 swp_pte = pte_swp_mksoft_dirty(swp_pte); 2204 if (pte_uffd_wp(pteval)) 2205 swp_pte = pte_swp_mkuffd_wp(swp_pte); 2206 } else { 2207 if (pte_swp_soft_dirty(pteval)) 2208 swp_pte = pte_swp_mksoft_dirty(swp_pte); 2209 if (pte_swp_uffd_wp(pteval)) 2210 swp_pte = pte_swp_mkuffd_wp(swp_pte); 2211 } 2212 set_pte_at(mm, address, pvmw.pte, swp_pte); 2213 } else { 2214 /* 2215 * This is a locked file-backed folio, 2216 * so it cannot be removed from the page 2217 * cache and replaced by a new folio before 2218 * mmu_notifier_invalidate_range_end, so no 2219 * concurrent thread might update its page table 2220 * to point at a new folio while a device is 2221 * still using this folio. 2222 * 2223 * See Documentation/mm/mmu_notifier.rst 2224 */ 2225 dec_mm_counter(mm, mm_counter_file(folio)); 2226 } 2227 discard: 2228 if (unlikely(folio_test_hugetlb(folio))) { 2229 hugetlb_remove_rmap(folio); 2230 } else { 2231 folio_remove_rmap_ptes(folio, subpage, nr_pages, vma); 2232 } 2233 if (vma->vm_flags & VM_LOCKED) 2234 mlock_drain_local(); 2235 folio_put_refs(folio, nr_pages); 2236 2237 /* 2238 * If we are sure that we batched the entire folio and cleared 2239 * all PTEs, we can just optimize and stop right here. 
2240 */ 2241 if (nr_pages == folio_nr_pages(folio)) 2242 goto walk_done; 2243 continue; 2244 walk_abort: 2245 ret = false; 2246 walk_done: 2247 page_vma_mapped_walk_done(&pvmw); 2248 break; 2249 } 2250 2251 if (prealloc_pte) 2252 pte_free(mm, prealloc_pte); 2253 2254 mmu_notifier_invalidate_range_end(&range); 2255 2256 return ret; 2257 } 2258 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 22+ messages in thread
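One plausible reading of the report above, stated as an assumption rather than a confirmed diagnosis: the new counter is declared alongside the other THP events under a config guard in include/linux/vm_event_item.h, while mm/rmap.c references it unconditionally, so a config without THP (such as nios2-allnoconfig) never sees the identifier. The sketch below reproduces that shape in plain C with hypothetical names (DEMO_CONFIG_THP, enum demo_event, the demo_* helpers); it is not the kernel's vmstat API.

```c
#include <assert.h>

/* Hypothetical config switch; left undefined to model a config without THP. */
/* #define DEMO_CONFIG_THP 1 */

enum demo_event {
	DEMO_PGFAULT,
#ifdef DEMO_CONFIG_THP
	DEMO_SPLIT_ALLOC_FAILED,	/* only exists when the option is set */
#endif
	DEMO_NR_EVENTS
};

static unsigned long demo_events[DEMO_NR_EVENTS];

static void demo_count_event(enum demo_event e)
{
	demo_events[e]++;
}

static void demo_note_fault(void)
{
	demo_count_event(DEMO_PGFAULT);
}

/*
 * The use site is guarded the same way as the declaration. Dropping this
 * #ifdef while DEMO_CONFIG_THP is undefined gives an "undeclared
 * identifier" error of the kind quoted in the report.
 */
static void demo_note_split_alloc_failure(void)
{
#ifdef DEMO_CONFIG_THP
	demo_count_event(DEMO_SPLIT_ALLOC_FAILED);
#endif
}
```

Guarding the use site is only one way out. The cover letter already says the counter will not be carried into future revisions, which would make the question moot.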
end of thread, other threads:[~2026-02-12 21:40 UTC | newest]

Thread overview: 22+ messages
2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-11 13:25 ` David Hildenbrand (Arm)
2026-02-11 13:38 ` Usama Arif
2026-02-12 12:13 ` Ritesh Harjani
2026-02-12 15:25 ` Usama Arif
2026-02-12 15:39 ` David Hildenbrand (Arm)
2026-02-12 16:46 ` Ritesh Harjani
2026-02-11 13:35 ` David Hildenbrand (Arm)
2026-02-11 13:46 ` Kiryl Shutsemau
2026-02-11 13:47 ` Usama Arif
2026-02-11 19:28 ` Matthew Wilcox
2026-02-11 19:55 ` David Hildenbrand (Arm)
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
2026-02-11 13:27 ` David Hildenbrand (Arm)
2026-02-11 13:31 ` Usama Arif
2026-02-11 13:36 ` David Hildenbrand (Arm)
2026-02-11 13:42 ` Usama Arif
2026-02-11 13:38 ` David Hildenbrand (Arm)
2026-02-11 13:43 ` Usama Arif
2026-02-12 21:40 ` kernel test robot
2026-02-12 21:40 ` kernel test robot