[RFC 00/12] mm: PUD (1GB) THP implementation

* [RFC 00/12] mm: PUD (1GB) THP implementation
@ 2026-02-02  0:50 Usama Arif
  2026-02-02  0:50 ` [RFC 01/12] mm: add PUD THP ptdesc and rmap support Usama Arif
                   ` (15 more replies)
  0 siblings, 16 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

This is an RFC series to implement 1GB PUD-level THPs, allowing
applications to benefit from reduced TLB pressure without requiring
hugetlbfs. The patches are based on top of
f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).

Motivation: Why 1GB THP over hugetlbfs?
=======================================

While hugetlbfs provides 1GB huge pages today, it has significant limitations
that make it unsuitable for many workloads:

1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
   or runtime, taking that memory away from the rest of the system. This
   requires capacity planning and administrative overhead, and makes workload
   orchestration much more complex, especially when colocating with workloads
   that don't use hugetlbfs.

2. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
   rather than falling back to smaller pages. This makes it fragile under
   memory pressure.

3. No Splitting: hugetlbfs pages cannot be split when only partial access
   is needed, leading to memory waste and preventing partial reclaim.

4. Memory Accounting: hugetlbfs memory is accounted separately and cannot
   be easily shared with regular memory pools.

PUD THP solves these limitations by integrating 1GB pages into the existing
THP infrastructure.

Performance Results
===================

Benchmark results of these patches on Intel Xeon Platinum 8321HC:

Test: True Random Memory Access [1] on a 4GB memory region with a
pointer-chasing workload (4M random pointer dereferences through memory):

| Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
|-------------------|---------------|---------------|--------------|
| Memory access     | 88 ms         | 134 ms        | 34% faster   |
| Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |

Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
For long-running workloads this will be a one-off cost, and the 34%
improvement in access latency provides significant benefit.
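
As a rough illustration of what a pointer-chasing access pattern looks like,
here is a minimal user-space sketch. It is not the benchmark from [1]; the
helper names (rng_next, build_cycle, chase) and the choice of Sattolo's
algorithm are assumptions made for this example. Each load depends on the
previous one, so the hardware cannot overlap TLB misses, which is why page
size shows up so clearly in the access time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift64 PRNG (hypothetical helper, not from the series). */
static uint64_t rng_next(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Fill next[] with one random cycle covering all n slots (Sattolo's
 * algorithm), so a chase visits every slot exactly once per lap.
 */
static void build_cycle(size_t *next, size_t n, uint64_t seed)
{
	uint64_t state = seed ? seed : 1;	/* xorshift state must be non-zero */
	size_t i;

	assert(next != NULL);
	if (n == 0)
		return;

	for (i = 0; i < n; i++)
		next[i] = i;

	for (i = n - 1; i > 0; i--) {
		size_t j = (size_t)(rng_next(&state) % i);	/* j in [0, i) */
		size_t tmp = next[i];

		next[i] = next[j];
		next[j] = tmp;
	}
}

/* Follow 'steps' dependent loads starting from slot 0. */
static size_t chase(const size_t *next, size_t steps)
{
	size_t pos = 0;

	while (steps--)
		pos = next[pos];
	return pos;
}
```

To approximate the test above, the array would be placed in a 4GB anonymous
mapping and chase() timed for 4M steps, once with PMD THP and once with PUD
THP enabled.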

ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a CPU-bound
workload running on a large number of ARM servers (256G). I enabled the 512M
THP settings to always for 100 servers in production (didn't really have high
expectations :)). The average memory used for the workload increased from
217G to 233G. The amount of memory backed by 512M pages was 68G! The dTLB
misses went down by 26% and the PID multiplier increased input by 5.9% (this
is a very significant improvement in workload performance). A significant
number of these THPs were faulted in at application start and were present
across different VMAs. Of course, getting these 512M pages is easier on ARM
due to the bigger PAGE_SIZE and pageblock order.

I am hoping that these patches for 1G THP can be used to provide similar
benefits for x86. I expect workloads to fault them in at start time when there
is plenty of free memory available.

Previous attempt by Zi Yan
==========================

Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
significant changes in the kernel since then, including the folio conversion,
the mTHP framework, ptdesc, rmap changes, etc. I found it easier to use the
current PMD code as a reference for making 1G PUD THP work. I am hoping Zi can
provide guidance on these patches!

Major Design Decisions
======================

1. No shared 1G zero page: The memory cost would be quite significant!

2. Page Table Pre-deposit Strategy
   PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
   page tables (one for each potential PMD entry after split).
   We allocate a PMD page table and use its pmd_huge_pte list to store
   the deposited PTE tables. This ensures split operations don't fail due
   to page table allocation failures (at the cost of 2M per PUD THP).

3. Split to Base Pages
   When a PUD THP must be split (COW, partial unmap, mprotect), we split
   directly to base pages (262,144 PTEs). The ideal thing would be to split
   to 2M pages and then to 4K pages if needed. However, this would require
   significant rmap and mapcount tracking changes.

4. COW and fork handling via split
   Copy-on-write and fork for PUD THP trigger a split to base pages, then
   use the existing PTE-level COW infrastructure. Getting another 1G region is
   hard and could fail. If only a single 4K page is written, copying 1G is a
   waste. Probably this should only be done on CoW and not fork?

5. Migration via split
   Split PUD to PTEs and migrate individual pages. It is going to be difficult
   to find 1G of contiguous memory to migrate to. Maybe it's better to not
   allow migration of PUDs at all? I am more tempted to not allow migration,
   but have kept splitting in this RFC.
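
The numbers quoted in decisions 2 and 3 can be checked with a few lines of
arithmetic. This is only a sketch that assumes x86-64 geometry (4K base pages,
512 entries per page table); the SK_* names and the struct are made up for the
example and do not exist in the kernel.

```c
#include <stdint.h>

/* Assumed x86-64 geometry: 4K base pages, 512 entries per page table. */
#define SK_PAGE_SIZE		4096ULL
#define SK_PTRS_PER_TABLE	512ULL

struct sk_pud_thp_cost {
	uint64_t pud_size;		/* bytes mapped by one PUD entry */
	uint64_t ptes_after_split;	/* PTEs needed after a full split */
	uint64_t pte_tables;		/* one PTE table per PMD entry */
	uint64_t deposit_tables;	/* PTE tables plus the PMD table */
	uint64_t pte_table_bytes;	/* memory held by the PTE tables */
	uint64_t deposit_bytes;		/* memory held by all deposited tables */
};

static struct sk_pud_thp_cost sk_compute_cost(void)
{
	struct sk_pud_thp_cost c;

	c.pte_tables = SK_PTRS_PER_TABLE;
	c.ptes_after_split = SK_PTRS_PER_TABLE * SK_PTRS_PER_TABLE;
	c.pud_size = c.ptes_after_split * SK_PAGE_SIZE;
	c.deposit_tables = c.pte_tables + 1;
	c.pte_table_bytes = c.pte_tables * SK_PAGE_SIZE;
	c.deposit_bytes = c.deposit_tables * SK_PAGE_SIZE;
	return c;
}
```

Under these assumptions the 512 PTE tables account for the 2M figure, the PMD
table adds one more 4K page, and a full split yields 262,144 PTEs.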

Reviewers guide
===============

Most of the code is written by adapting the PMD code. For example, the PUD
page fault path is very similar to the PMD one. The differences are the lack
of a shared zero page and the page table deposit strategy. I think the easiest
way to review this series is to compare it with the PMD code.

Test results
============

  1..7
  # Starting 7 tests from 1 test cases.
  #  RUN           pud_thp.basic_allocation ...
  # pud_thp_test.c:169:basic_allocation:PUD THP allocated (anon_fault_alloc: 0 -> 1)
  #            OK  pud_thp.basic_allocation
  ok 1 pud_thp.basic_allocation
  #  RUN           pud_thp.read_write_access ...
  #            OK  pud_thp.read_write_access
  ok 2 pud_thp.read_write_access
  #  RUN           pud_thp.fork_cow ...
  # pud_thp_test.c:236:fork_cow:Fork COW completed (thp_split_pud: 0 -> 1)
  #            OK  pud_thp.fork_cow
  ok 3 pud_thp.fork_cow
  #  RUN           pud_thp.partial_munmap ...
  # pud_thp_test.c:267:partial_munmap:Partial munmap completed (thp_split_pud: 1 -> 2)
  #            OK  pud_thp.partial_munmap
  ok 4 pud_thp.partial_munmap
  #  RUN           pud_thp.mprotect_split ...
  # pud_thp_test.c:293:mprotect_split:mprotect split completed (thp_split_pud: 2 -> 3)
  #            OK  pud_thp.mprotect_split
  ok 5 pud_thp.mprotect_split
  #  RUN           pud_thp.reclaim_pageout ...
  # pud_thp_test.c:322:reclaim_pageout:Reclaim completed (thp_split_pud: 3 -> 4)
  #            OK  pud_thp.reclaim_pageout
  ok 6 pud_thp.reclaim_pageout
  #  RUN           pud_thp.migration_mbind ...
  # pud_thp_test.c:356:migration_mbind:Migration completed (thp_split_pud: 4 -> 5)
  #            OK  pud_thp.migration_mbind
  ok 7 pud_thp.migration_mbind
  # PASSED: 7 / 7 tests passed.
  # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0

[1] https://gist.github.com/uarif1/bf279b2a01a536cda945ff9f40196a26
[2] https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/

Signed-off-by: Usama Arif <usamaarif642@gmail.com>

Usama Arif (12):
  mm: add PUD THP ptdesc and rmap support
  mm/thp: add mTHP stats infrastructure for PUD THP
  mm: thp: add PUD THP allocation and fault handling
  mm: thp: implement PUD THP split to PTE level
  mm: thp: add reclaim and migration support for PUD THP
  selftests/mm: add PUD THP basic allocation test
  selftests/mm: add PUD THP read/write access test
  selftests/mm: add PUD THP fork COW test
  selftests/mm: add PUD THP partial munmap test
  selftests/mm: add PUD THP mprotect split test
  selftests/mm: add PUD THP reclaim test
  selftests/mm: add PUD THP migration test

 include/linux/huge_mm.h                   |  60 ++-
 include/linux/mm.h                        |  19 +
 include/linux/mm_types.h                  |   5 +-
 include/linux/pgtable.h                   |   8 +
 include/linux/rmap.h                      |   7 +-
 mm/huge_memory.c                          | 535 +++++++++++++++++++++-
 mm/internal.h                             |   3 +
 mm/memory.c                               |   8 +-
 mm/migrate.c                              |  17 +
 mm/page_vma_mapped.c                      |  35 ++
 mm/pgtable-generic.c                      |  83 ++++
 mm/rmap.c                                 |  96 +++-
 mm/vmscan.c                               |   2 +
 tools/testing/selftests/mm/Makefile       |   1 +
 tools/testing/selftests/mm/pud_thp_test.c | 360 +++++++++++++++
 15 files changed, 1197 insertions(+), 42 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pud_thp_test.c

-- 
2.47.3

* [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  3:10   ` kernel test robot
                     ` (2 more replies)
  2026-02-02  0:50 ` [RFC 02/12] mm/thp: add mTHP stats infrastructure for PUD THP Usama Arif
                   ` (14 subsequent siblings)
  15 siblings, 3 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

For page table management, PUD THPs need to pre-deposit page tables
that will be used when the huge page is later split. When a PUD THP
is allocated, we cannot know in advance when or why it might need to
be split (COW, partial unmap, reclaim), but we need page tables ready
for that eventuality. Similar to how PMD THPs deposit a single PTE
table, PUD THPs deposit a PMD table which itself contains deposited
PTE tables - a two-level deposit. This commit adds the deposit/withdraw
infrastructure and a new pud_huge_pmd field in ptdesc to store the
deposited PMD.

The deposited PMD tables are stored as a singly-linked stack using only
page->lru.next as the link pointer. A doubly-linked list using the
standard list_head mechanism would cause memory corruption: list_del()
poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
tables have their own deposited PTE tables stored in pmd_huge_pte,
poisoning lru.prev would corrupt the PTE table list and cause crashes
when withdrawing PTE tables during split. PMD THPs don't have this
problem because their deposited PTE tables don't have sub-deposits.
Using only lru.next avoids the overlap entirely.
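
The overlap described above can be modelled in user space. The sketch below is
not kernel code: struct sk_table is a simplified stand-in for the
page/ptdesc layout, and the offsets 8 and 16 assume an LP64 target. It shows a
push/pop pair that only touches lru.next, so the field overlaying lru.prev
survives.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct list_head. */
struct sk_list_head {
	struct sk_list_head *next, *prev;
};

/* Simplified model of the overlapping page/ptdesc fields (assumption). */
struct sk_table {
	unsigned long flags;			/* offset 0 */
	union {
		struct sk_list_head lru;	/* next at 8, prev at 16 (LP64) */
		struct {
			unsigned long pad;		/* overlays lru.next */
			struct sk_table *huge_pte;	/* overlays lru.prev */
		};
	};
};

/* Push onto the stack using only lru.next as the link. */
static void sk_push(struct sk_table **head, struct sk_table *t)
{
	t->lru.next = (struct sk_list_head *)*head;
	*head = t;
}

/* Pop from the stack; returns NULL when the stack is empty. */
static struct sk_table *sk_pop(struct sk_table **head)
{
	struct sk_table *t = *head;

	if (!t)
		return NULL;
	*head = (struct sk_table *)t->lru.next;
	return t;
}
```

Because neither helper writes lru.prev, a table's huge_pte pointer is the same
after a pop as it was before the push, which is the property a list_del()
based list would break.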

For reverse mapping, PUD THPs need the same rmap support that PMD THPs
have. The page_vma_mapped_walk() function is extended to recognize and
handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
flag tells the unmap path to split PUD THPs before proceeding, since
there is no PUD-level migration entry format - the split converts the
single PUD mapping into individual PTE mappings that can be migrated
or swapped normally.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h  |  5 +++
 include/linux/mm.h       | 19 ++++++++
 include/linux/mm_types.h |  5 ++-
 include/linux/pgtable.h  |  8 ++++
 include/linux/rmap.h     |  7 ++-
 mm/huge_memory.c         |  8 ++++
 mm/internal.h            |  3 ++
 mm/page_vma_mapped.c     | 35 +++++++++++++++
 mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
 mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
 10 files changed, 260 insertions(+), 9 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfdea..e672e45bb9cc7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long address);
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
+			   unsigned long address);
 int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
 		    unsigned long cp_flags);
 #else
+static inline void
+split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
+		      unsigned long address) {}
 static inline int
 change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		pud_t *pudp, unsigned long addr, pgprot_t newprot,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ab2e7e30aef96..a15e18df0f771 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
  * considered ready to switch to split PUD locks yet; there may be places
  * which need to be converted from page_table_lock.
  */
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline struct page *pud_pgtable_page(pud_t *pud)
+{
+	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
+
+	return virt_to_page((void *)((unsigned long)pud & mask));
+}
+
+static inline struct ptdesc *pud_ptdesc(pud_t *pud)
+{
+	return page_ptdesc(pud_pgtable_page(pud));
+}
+
+#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
+#endif
+
 static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
 {
 	return &mm->page_table_lock;
@@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
 {
 	__pagetable_ctor(ptdesc);
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	ptdesc->pud_huge_pmd = NULL;
+#endif
 }
 
 static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 78950eb8926dc..26a38490ae2e1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -577,7 +577,10 @@ struct ptdesc {
 		struct list_head pt_list;
 		struct {
 			unsigned long _pt_pad_1;
-			pgtable_t pmd_huge_pte;
+			union {
+				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
+				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
+			};
 		};
 	};
 	unsigned long __page_mapping;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2f0dd3a4ace1a..3ce733c1d71a2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #define arch_needs_pgtable_deposit() (false)
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+					   pmd_t *pmd_table);
+extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
+extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
+extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
+#endif
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
  * This is an implementation of pmdp_establish() that is only suitable for an
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index daa92a58585d9..08cd0a0eb8763 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -101,6 +101,7 @@ enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
+	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
 };
 
 #ifdef CONFIG_MMU
@@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
 	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
 void folio_add_anon_rmap_pmd(struct folio *, struct page *,
 		struct vm_area_struct *, unsigned long address, rmap_t flags);
+void folio_add_anon_rmap_pud(struct folio *, struct page *,
+		struct vm_area_struct *, unsigned long address, rmap_t flags);
 void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
 void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
@@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
 	pgoff_t pgoff;
 	struct vm_area_struct *vma;
 	unsigned long address;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
@@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 static inline void
 page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
 {
-	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
+	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
 
 	if (likely(pvmw->ptl))
 		spin_unlock(pvmw->ptl);
@@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
 		WARN_ON_ONCE(1);
 
 	pvmw->ptl = NULL;
+	pvmw->pud = NULL;
 	pvmw->pmd = NULL;
 	pvmw->pte = NULL;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21a..3128b3beedb0a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 }
+
+void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
+			   unsigned long address)
+{
+	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
+	if (pud_trans_huge(*pud))
+		__split_huge_pud_locked(vma, pud, address);
+}
 #else
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long address)
diff --git a/mm/internal.h b/mm/internal.h
index 9ee336aa03656..21d5c00f638dc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
  * in mm/rmap.c:
  */
 pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
+#endif
 
 /*
  * in mm/page_alloc.c
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index b38a1d00c971b..d31eafba38041 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
 	return true;
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+/* Returns true if the two ranges overlap.  Careful to not overflow. */
+static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
+{
+	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
+		return false;
+	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
+		return false;
+	return true;
+}
+#endif
+
 static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
 {
 	pvmw->address = (pvmw->address + size) & ~(size - 1);
@@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pud_t *pud;
 	pmd_t pmde;
 
+	/* The only possible pud mapping has been handled on last iteration */
+	if (pvmw->pud && !pvmw->pmd)
+		return not_found(pvmw);
+
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
@@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			continue;
 		}
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+		/* Check for PUD-mapped THP */
+		if (pud_trans_huge(*pud)) {
+			pvmw->pud = pud;
+			pvmw->ptl = pud_lock(mm, pud);
+			if (likely(pud_trans_huge(*pud))) {
+				if (pvmw->flags & PVMW_MIGRATION)
+					return not_found(pvmw);
+				if (!check_pud(pud_pfn(*pud), pvmw))
+					return not_found(pvmw);
+				return true;
+			}
+			/* PUD was split under us, retry at PMD level */
+			spin_unlock(pvmw->ptl);
+			pvmw->ptl = NULL;
+			pvmw->pud = NULL;
+		}
+#endif
+
 		pvmw->pmd = pmd_offset(pud, pvmw->address);
 		/*
 		 * Make sure the pmd value isn't cached in a register by the
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d3aec7a9926ad..2047558ddcd79 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 }
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+/*
+ * Deposit page tables for PUD THP.
+ * Called with PUD lock held. Stores PMD tables in a singly-linked stack
+ * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
+ *
+ * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
+ * list_head. This is because lru.prev (offset 16) overlaps with
+ * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
+ * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
+ *
+ * PTE tables should be deposited into the PMD using pud_deposit_pte().
+ */
+void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+				    pmd_t *pmd_table)
+{
+	pgtable_t pmd_page = virt_to_page(pmd_table);
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	/* Push onto stack using only lru.next as the link */
+	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
+	pud_huge_pmd(pudp) = pmd_page;
+}
+
+/*
+ * Withdraw the deposited PMD table for PUD THP split or zap.
+ * Called with PUD lock held.
+ * Returns NULL if no more PMD tables are deposited.
+ */
+pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
+{
+	pgtable_t pmd_page;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	pmd_page = pud_huge_pmd(pudp);
+	if (!pmd_page)
+		return NULL;
+
+	/* Pop from stack - lru.next points to next PMD page (or NULL) */
+	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
+
+	return page_address(pmd_page);
+}
+
+/*
+ * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
+ * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
+ * No lock assertion since the PMD isn't visible yet.
+ */
+void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
+{
+	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
+
+	/* FIFO - add to front of list */
+	if (!ptdesc->pmd_huge_pte)
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
+	ptdesc->pmd_huge_pte = pgtable;
+}
+
+/*
+ * Withdraw a PTE table from a standalone PMD table.
+ * Returns NULL if no more PTE tables are deposited.
+ */
+pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
+{
+	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
+	pgtable_t pgtable;
+
+	pgtable = ptdesc->pmd_huge_pte;
+	if (!pgtable)
+		return NULL;
+	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
+							struct page, lru);
+	if (ptdesc->pmd_huge_pte)
+		list_del(&pgtable->lru);
+	return pgtable;
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
diff --git a/mm/rmap.c b/mm/rmap.c
index 7b9879ef442d9..69acabd763da4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+/*
+ * Returns the actual pud_t* where we expect 'address' to be mapped from, or
+ * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
+ * represents.
+ */
+pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud = NULL;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		goto out;
+
+	pud = pud_offset(p4d, address);
+out:
+	return pud;
+}
+#endif
+
 struct folio_referenced_arg {
 	int mapcount;
 	int referenced;
@@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 			SetPageAnonExclusive(page);
 			break;
 		case PGTABLE_LEVEL_PUD:
-			/*
-			 * Keep the compiler happy, we don't support anonymous
-			 * PUD mappings.
-			 */
-			WARN_ON_ONCE(1);
+			SetPageAnonExclusive(page);
 			break;
 		default:
 			BUILD_BUG();
@@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
 #endif
 }
 
+/**
+ * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
+ * @folio:	The folio to add the mapping to
+ * @page:	The first page to add
+ * @vma:	The vm area in which the mapping is added
+ * @address:	The user virtual address of the first page to map
+ * @flags:	The rmap flags
+ *
+ * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
+ *
+ * The caller needs to hold the page table lock, and the page must be locked in
+ * the anon_vma case: to serialize mapping,index checking after setting.
+ */
+void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
+		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
+{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
+			      PGTABLE_LEVEL_PUD);
+#else
+	WARN_ON_ONCE(true);
+#endif
+}
+
 /**
  * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
  * @folio:	The folio to add the mapping to.
@@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		if (!pvmw.pte) {
+			/*
+			 * Check for PUD-mapped THP first.
+			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
+			 * split the PUD to PMD level and restart the walk.
+			 */
+			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
+				if (flags & TTU_SPLIT_HUGE_PUD) {
+					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
+					flags &= ~TTU_SPLIT_HUGE_PUD;
+					page_vma_mapped_walk_restart(&pvmw);
+					continue;
+				}
+			}
+
 			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
 				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
 					goto walk_done;
@@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
+		/* Handle PUD-mapped THP first */
+		if (!pvmw.pte && !pvmw.pmd) {
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+			/*
+			 * PUD-mapped THP: skip migration to preserve the huge
+			 * page. Splitting would defeat the purpose of PUD THPs.
+			 * Return false to indicate migration failure, which
+			 * will cause alloc_contig_range() to try a different
+			 * memory region.
+			 */
+			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
+				page_vma_mapped_walk_done(&pvmw);
+				ret = false;
+				break;
+			}
+#endif
+			/* Unexpected state: !pte && !pmd but not a PUD THP */
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
 			__maybe_unused unsigned long pfn;
@@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 
 	/*
 	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
-	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
+	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
 	 */
 	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
-					TTU_SYNC | TTU_BATCH_FLUSH)))
+					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
 		return;
 
 	if (folio_is_zone_device(folio) &&
-- 
2.47.3



* [RFC 02/12] mm/thp: add mTHP stats infrastructure for PUD THP
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
  2026-02-02  0:50 ` [RFC 01/12] mm: add PUD THP ptdesc and rmap support Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02 11:56   ` Lorenzo Stoakes
  2026-02-02  0:50 ` [RFC 03/12] mm: thp: add PUD THP allocation and fault handling Usama Arif
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Extend the mTHP (multi-size THP) statistics infrastructure to support
PUD-sized transparent huge pages.

The mTHP framework tracks statistics for each supported THP size through
per-order counters exposed via sysfs. To add PUD THP support, PUD_ORDER
must be included in the set of tracked orders.

With this change, PUD THP events (allocations, faults, splits, swaps)
are tracked and exposed through the existing sysfs interface at
/sys/kernel/mm/transparent_hugepage/hugepages-1048576kB/stats/. This
provides visibility into PUD THP behavior for debugging and performance
analysis.
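
To make the order-to-slot mapping concrete, here is a stand-alone sketch. It
assumes x86-64 with 4K pages (PMD order 9, PUD order 18); the SK_* constants
and function names are invented for the example, and the -1 return for
untracked orders is a convention of this sketch rather than of the patch.

```c
/* Assumed x86-64 values with 4K base pages. */
#define SK_PAGE_SHIFT		12
#define SK_PMD_ORDER		9
#define SK_PUD_ORDER		18
#define SK_STAT_COUNT		(SK_PMD_ORDER + 2)
#define SK_STAT_PUD_INDEX	(SK_PMD_ORDER + 1)	/* PUD uses the last slot */

/*
 * Map a folio order to a slot in a stats array of SK_STAT_COUNT entries.
 * Orders up to the PMD order index the array directly; the PUD order is
 * folded into the one extra slot. Returns -1 for orders that are not tracked.
 */
static int sk_order_to_index(int order)
{
	if (order <= 0)
		return -1;
	if (order == SK_PUD_ORDER)
		return SK_STAT_PUD_INDEX;
	if (order > SK_PMD_ORDER)
		return -1;
	return order;
}

/* Size in kB used in the hugepages-<size>kB sysfs directory name. */
static unsigned long sk_order_to_kb(int order)
{
	return (1UL << (order + SK_PAGE_SHIFT)) >> 10;
}
```

With these values the PUD order maps to slot 10 of an 11-entry array, and
sk_order_to_kb(18) gives the 1048576 seen in the sysfs path. Folding the PUD
order into a single extra slot avoids sizing the array for every order between
PMD and PUD.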

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h | 42 +++++++++++++++++++++++++++++++++++++----
 mm/huge_memory.c        |  3 ++-
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e672e45bb9cc7..5509ba8555b6e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -76,7 +76,13 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
  * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
  * (which is a limitation of the THP implementation).
  */
-#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#define THP_ORDERS_ALL_ANON_PUD		BIT(PUD_ORDER)
+#else
+#define THP_ORDERS_ALL_ANON_PUD		0
+#endif
+#define THP_ORDERS_ALL_ANON	(((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1))) | \
+				 THP_ORDERS_ALL_ANON_PUD)
 
 /*
  * Mask of all large folio orders supported for file THP. Folios in a DAX
@@ -146,18 +152,46 @@ enum mthp_stat_item {
 };
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS)
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#define MTHP_STAT_COUNT		(PMD_ORDER + 2)
+#define MTHP_STAT_PUD_INDEX	(PMD_ORDER + 1)  /* PUD uses last index */
+#else
+#define MTHP_STAT_COUNT		(PMD_ORDER + 1)
+#endif
+
 struct mthp_stat {
-	unsigned long stats[ilog2(MAX_PTRS_PER_PTE) + 1][__MTHP_STAT_COUNT];
+	unsigned long stats[MTHP_STAT_COUNT][__MTHP_STAT_COUNT];
 };
 
 DECLARE_PER_CPU(struct mthp_stat, mthp_stats);
 
+static inline int mthp_stat_order_to_index(int order)
+{
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	if (order == PUD_ORDER)
+		return MTHP_STAT_PUD_INDEX;
+#endif
+	return order;
+}
+
 static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta)
 {
-	if (order <= 0 || order > PMD_ORDER)
+	int index;
+
+	if (order <= 0)
+		return;
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	if (order != PUD_ORDER && order > PMD_ORDER)
 		return;
+#else
+	if (order > PMD_ORDER)
+		return;
+#endif
 
-	this_cpu_add(mthp_stats.stats[order][item], delta);
+	index = mthp_stat_order_to_index(order);
+	this_cpu_add(mthp_stats.stats[index][item], delta);
 }
 
 static inline void count_mthp_stat(int order, enum mthp_stat_item item)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3128b3beedb0a..d033624d7e1f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -598,11 +598,12 @@ static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
 {
 	unsigned long sum = 0;
 	int cpu;
+	int index = mthp_stat_order_to_index(order);
 
 	for_each_possible_cpu(cpu) {
 		struct mthp_stat *this = &per_cpu(mthp_stats, cpu);
 
-		sum += this->stats[order][item];
+		sum += this->stats[index][item];
 	}
 
 	return sum;
-- 
2.47.3



* [RFC 03/12] mm: thp: add PUD THP allocation and fault handling
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
  2026-02-02  0:50 ` [RFC 01/12] mm: add PUD THP ptdesc and rmap support Usama Arif
  2026-02-02  0:50 ` [RFC 02/12] mm/thp: add mTHP stats infrastructure for PUD THP Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  0:50 ` [RFC 04/12] mm: thp: implement PUD THP split to PTE level Usama Arif
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Add the page fault handling path for anonymous PUD THPs, following the
same design as the existing PMD THP fault handlers.

When a process accesses memory in an anonymous VMA that is PUD-aligned
and large enough, the fault handler checks whether PUD THP is enabled and
attempts to allocate a 1GB folio with folio_alloc_gigantic(). If the
allocation succeeds, the folio is mapped at the faulting PUD entry.

Before installing the PUD mapping, page tables are pre-deposited for
future use. A PUD THP will eventually need to be split - whether due
to copy-on-write after fork, partial munmap, mprotect on a subregion,
or memory reclaim. At split time, we need 512 PTE tables (one for each
PMD entry) plus the PMD table itself. Allocating 513 page tables during
split could fail, leaving the system unable to proceed. By depositing
them at fault time, when memory pressure is typically lower, we guarantee
that the split never has to allocate page tables.
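
The cost of the pre-deposit can be estimated with a small userspace
sketch. It is not part of the patch: the sk_* names are hypothetical, and
the geometry (512 PMD entries per PUD entry, 4KB page tables) is the
x86-64 4KB-page case assumed here rather than read from kernel headers.

```c
#include <assert.h>

/*
 * Illustrative only: sk_* helpers are hypothetical, not kernel API.
 * One PTE table is needed per PMD entry, plus the PMD table itself.
 */
static unsigned long sk_tables_per_pud(unsigned long ptrs_per_pmd)
{
	assert(ptrs_per_pmd > 0);
	return ptrs_per_pmd + 1;
}

/* Bytes of page-table memory held in reserve for one PUD THP. */
static unsigned long sk_deposit_bytes(unsigned long ptrs_per_pmd,
				      unsigned long table_size)
{
	return sk_tables_per_pud(ptrs_per_pmd) * table_size;
}
```

Under those assumptions, sk_deposit_bytes(512, 4096) gives 513 tables of
4KB each, i.e. roughly 2MB held per 1GB THP (about 0.2% of the mapping).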

The write-protect fault handler triggers when a process tries to write
to a PUD THP that is mapped read-only (typically after fork). Rather
than implementing PUD-level COW which would require copying 1GB of data,
the handler splits the PUD to PTE level and retries the fault. The
retry then handles COW at PTE level, copying only the single 4KB page
being written.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h |   2 +
 mm/huge_memory.c        | 260 ++++++++++++++++++++++++++++++++++++++--
 mm/memory.c             |   8 +-
 3 files changed, 258 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5509ba8555b6e..a292035c0270f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -8,6 +8,7 @@
 #include <linux/kobject.h>
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pud_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
@@ -25,6 +26,7 @@ static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 #endif
 
 vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf);
 bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			   pmd_t *pmd, unsigned long addr, unsigned long next);
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d033624d7e1f2..7613caf1e7c30 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1294,6 +1294,70 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 	return folio;
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static struct folio *vma_alloc_anon_folio_pud(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	gfp_t gfp = vma_thp_gfp_mask(vma);
+	const int order = HPAGE_PUD_ORDER;
+	struct folio *folio = NULL;
+	/*
+	 * Contiguous allocation via alloc_contig_range() migrates existing
+	 * pages out of the target range. The THP gfp mask includes
+	 * __GFP_NOMEMALLOC, which forbids use of memory reserves: THP is an
+	 * optional performance optimization and should not deplete reserves
+	 * that may be needed for critical allocations. That flag must be
+	 * cleared here because alloc_contig_range_noprof()
+	 * (__alloc_contig_verify_gfp_mask) fails the request when it is set.
+	 */
+	gfp_t contig_gfp = gfp & ~__GFP_NOMEMALLOC;
+
+	folio = folio_alloc_gigantic(order, contig_gfp, numa_node_id(), NULL);
+
+	if (unlikely(!folio)) {
+		count_vm_event(THP_FAULT_FALLBACK);
+		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
+		return NULL;
+	}
+
+	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
+	if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
+		folio_put(folio);
+		count_vm_event(THP_FAULT_FALLBACK);
+		count_vm_event(THP_FAULT_FALLBACK_CHARGE);
+		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
+		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+		return NULL;
+	}
+	folio_throttle_swaprate(folio, gfp);
+
+	/*
+	 * When a folio is not zeroed during allocation (__GFP_ZERO not used)
+	 * or user folios require special handling, folio_zero_user() is used to
+	 * make sure that the page corresponding to the faulting address will be
+	 * hot in the cache after zeroing.
+	 */
+	if (user_alloc_needs_zeroing())
+		folio_zero_user(folio, addr);
+	/*
+	 * The memory barrier inside __folio_mark_uptodate makes sure that
+	 * folio_zero_user writes become visible before the set_pud_at()
+	 * write.
+	 */
+	__folio_mark_uptodate(folio);
+
+	/*
+	 * Set the large_rmappable flag so that the folio can be properly
+	 * removed from the deferred_split list when freed.
+	 * folio_alloc_gigantic() doesn't set this flag (unlike __folio_alloc),
+	 * so we must set it explicitly.
+	 */
+	folio_set_large_rmappable(folio);
+
+	return folio;
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
 		struct vm_area_struct *vma, unsigned long haddr)
 {
@@ -1318,6 +1382,40 @@ static void map_anon_folio_pmd_pf(struct folio *folio, pmd_t *pmd,
 	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pud = pud_mkwrite(pud);
+	return pud;
+}
+
+static void map_anon_folio_pud_nopf(struct folio *folio, pud_t *pud,
+		struct vm_area_struct *vma, unsigned long haddr)
+{
+	pud_t entry;
+
+	entry = folio_mk_pud(folio, vma->vm_page_prot);
+	entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+	folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE);
+	folio_add_lru_vma(folio, vma);
+	set_pud_at(vma->vm_mm, haddr, pud, entry);
+	update_mmu_cache_pud(vma, haddr, pud);
+	deferred_split_folio(folio, false);
+}
+
+
+static void map_anon_folio_pud_pf(struct folio *folio, pud_t *pud,
+		struct vm_area_struct *vma, unsigned long haddr)
+{
+	map_anon_folio_pud_nopf(folio, pud, vma, haddr);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+	count_vm_event(THP_FAULT_ALLOC);
+	count_mthp_stat(HPAGE_PUD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
+	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 {
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
@@ -1513,6 +1611,161 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	return __do_huge_pmd_anonymous_page(vmf);
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+/* Number of PTE tables needed for PUD THP split: 512 */
+#define NR_PTE_TABLES_FOR_PUD (HPAGE_PUD_NR / HPAGE_PMD_NR)
+
+/*
+ * Allocate page tables for PUD THP pre-deposit.
+ */
+static bool alloc_pud_predeposit_ptables(struct mm_struct *mm,
+					 unsigned long haddr,
+					 pmd_t **pmd_table_out,
+					 int *nr_pte_deposited)
+{
+	pmd_t *pmd_table;
+	pgtable_t pte_table;
+	struct ptdesc *pmd_ptdesc;
+	int i;
+
+	*pmd_table_out = NULL;
+	*nr_pte_deposited = 0;
+
+	pmd_table = pmd_alloc_one(mm, haddr);
+	if (!pmd_table)
+		return false;
+
+	/* Initialize the pmd_huge_pte field for PTE table storage */
+	pmd_ptdesc = virt_to_ptdesc(pmd_table);
+	pmd_ptdesc->pmd_huge_pte = NULL;
+
+	/* Allocate and deposit 512 PTE tables into the PMD table */
+	for (i = 0; i < NR_PTE_TABLES_FOR_PUD; i++) {
+		pte_table = pte_alloc_one(mm);
+		if (!pte_table)
+			goto fail;
+		pud_deposit_pte(pmd_table, pte_table);
+		(*nr_pte_deposited)++;
+	}
+
+	*pmd_table_out = pmd_table;
+	return true;
+
+fail:
+	/* Free any PTE tables we deposited */
+	while ((pte_table = pud_withdraw_pte(pmd_table)) != NULL)
+		pte_free(mm, pte_table);
+	pmd_free(mm, pmd_table);
+	return false;
+}
+
+/*
+ * Free pre-allocated page tables if the PUD THP fault fails.
+ */
+static void free_pud_predeposit_ptables(struct mm_struct *mm,
+					pmd_t *pmd_table)
+{
+	pgtable_t pte_table;
+
+	if (!pmd_table)
+		return;
+
+	while ((pte_table = pud_withdraw_pte(pmd_table)) != NULL)
+		pte_free(mm, pte_table);
+	pmd_free(mm, pmd_table);
+}
+
+vm_fault_t do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+	struct folio *folio;
+	pmd_t *pmd_table = NULL;
+	int nr_pte_deposited = 0;
+	vm_fault_t ret = 0;
+	int i;
+
+	/* Check VMA bounds and alignment */
+	if (!thp_vma_suitable_order(vma, haddr, PUD_ORDER))
+		return VM_FAULT_FALLBACK;
+
+	ret = vmf_anon_prepare(vmf);
+	if (ret)
+		return ret;
+
+	folio = vma_alloc_anon_folio_pud(vma, vmf->address);
+	if (unlikely(!folio))
+		return VM_FAULT_FALLBACK;
+
+	/*
+	 * Pre-allocate page tables for future PUD split.
+	 * We need 1 PMD table and 512 PTE tables.
+	 */
+	if (!alloc_pud_predeposit_ptables(vma->vm_mm, haddr,
+					  &pmd_table, &nr_pte_deposited)) {
+		folio_put(folio);
+		return VM_FAULT_FALLBACK;
+	}
+
+	vmf->ptl = pud_lock(vma->vm_mm, vmf->pud);
+	if (unlikely(!pud_none(*vmf->pud)))
+		goto release;
+
+	ret = check_stable_address_space(vma->vm_mm);
+	if (ret)
+		goto release;
+
+	/* Deliver the page fault to userland */
+	if (userfaultfd_missing(vma)) {
+		spin_unlock(vmf->ptl);
+		folio_put(folio);
+		free_pud_predeposit_ptables(vma->vm_mm, pmd_table);
+		ret = handle_userfault(vmf, VM_UFFD_MISSING);
+		VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+		return ret;
+	}
+
+	/* Deposit page tables for future PUD split */
+	pgtable_trans_huge_pud_deposit(vma->vm_mm, vmf->pud, pmd_table);
+	map_anon_folio_pud_pf(folio, vmf->pud, vma, haddr);
+	mm_inc_nr_pmds(vma->vm_mm);
+	for (i = 0; i < nr_pte_deposited; i++)
+		mm_inc_nr_ptes(vma->vm_mm);
+	spin_unlock(vmf->ptl);
+
+	return 0;
+release:
+	spin_unlock(vmf->ptl);
+	folio_put(folio);
+	free_pud_predeposit_ptables(vma->vm_mm, pmd_table);
+	return ret;
+}
+#else
+vm_fault_t do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+
+	/*
+	 * For now, split PUD to PTE level on write fault.
+	 * This is the simplest approach for COW handling.
+	 */
+	__split_huge_pud(vma, vmf->pud, vmf->address);
+	return VM_FAULT_FALLBACK;
+}
+#else
+vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 struct folio_or_pfn {
 	union {
 		struct folio *folio;
@@ -1646,13 +1899,6 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
 EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pud = pud_mkwrite(pud);
-	return pud;
-}
-
 static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 87cf4e1a6f866..e5f86c1d2aded 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6142,9 +6142,9 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&			\
 	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 	struct vm_area_struct *vma = vmf->vma;
-	/* No support for anonymous transparent PUD pages yet */
+
 	if (vma_is_anonymous(vma))
-		return VM_FAULT_FALLBACK;
+		return do_huge_pud_anonymous_page(vmf);
 	if (vma->vm_ops->huge_fault)
 		return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -6158,9 +6158,8 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
-	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vma))
-		goto split;
+		return do_huge_pud_wp_page(vmf);
 	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
 		if (vma->vm_ops->huge_fault) {
 			ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER);
@@ -6168,7 +6167,6 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 				return ret;
 		}
 	}
-split:
 	/* COW or write-notify not handled on PUD level: split pud.*/
 	__split_huge_pud(vma, vmf->pud, vmf->address);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-- 
2.47.3




* [RFC 04/12] mm: thp: implement PUD THP split to PTE level
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (2 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 03/12] mm: thp: add PUD THP allocation and fault handling Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  0:50 ` [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP Usama Arif
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Implement the split operation that converts a PUD THP mapping into
individual PTE mappings.

A PUD THP maps 1GB of memory with a single page table entry. When the
mapping needs to be broken - for COW, partial unmap, permission changes,
or reclaim - it must be split into smaller mappings. Unlike PMD THPs
which split into 512 PTEs in a single level, PUD THPs require a two-level
split: the single PUD entry becomes 512 PMD entries, each pointing to a
PTE table containing 512 PTEs, for a total of 262144 page table entries.

The split uses page tables that were pre-deposited when the PUD THP was
first allocated, so it never has to allocate page tables itself. This is
critical since splits often happen under memory pressure during reclaim.
Note that the split path still allocates a temporary array of PTE table
pointers with GFP_ATOMIC, so it is not entirely allocation-free. The
deposited PMD table is installed in the PUD entry, and each PMD slot
receives one of the 512 deposited PTE tables.

Each PTE is populated to map one 4KB page of the original 1GB folio.
The state bits of the original PUD entry (dirty, accessed, writable,
soft-dirty) are propagated to each PTE so that no information is lost.
The rmap is updated to remove the single PUD-level mapping entry and
add 262144 PTE-level mapping entries.
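
The two-level layout described above can be illustrated with a userspace
sketch that maps a subpage number within the 1GB folio to its PMD slot
and PTE slot. This is not the patch code: the sk_* names are hypothetical
and the constants assume 512 entries per table and 4KB pages.

```c
#include <assert.h>

/* Assumed geometry (x86-64, 4KB pages); not taken from kernel headers. */
#define SK_PTRS_PER_TABLE	512UL
#define SK_PAGE_SHIFT		12

/* Hypothetical helper type: position of one subpage after the split. */
struct sk_slot {
	unsigned long pmd_idx;	/* which PMD entry (which PTE table) */
	unsigned long pte_idx;	/* which PTE inside that table */
};

/* Map subpage number n (0..262143) of the folio to its slot. */
static struct sk_slot sk_subpage_slot(unsigned long n)
{
	struct sk_slot s;

	assert(n < SK_PTRS_PER_TABLE * SK_PTRS_PER_TABLE);
	s.pmd_idx = n / SK_PTRS_PER_TABLE;
	s.pte_idx = n % SK_PTRS_PER_TABLE;
	return s;
}

/* Virtual address covered by a slot, given the PUD-aligned base haddr. */
static unsigned long sk_slot_addr(unsigned long haddr, struct sk_slot s)
{
	unsigned long n = s.pmd_idx * SK_PTRS_PER_TABLE + s.pte_idx;

	return haddr + (n << SK_PAGE_SHIFT);
}
```

Subpage 512, for example, is the first PTE of the second PTE table and
sits 2MB above the base address, which mirrors the nested i/j loops in
__split_huge_pud_locked().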

The split goes directly to PTE level rather than stopping at PMD level.
This is because the kernel's rmap infrastructure assumes that PMD-level
mappings are for PMD-sized folios. If we mapped a PUD-sized folio at
PMD level (512 PMD entries for one folio), the rmap accounting would
break - it would see 512 "large" mappings for a folio that should have
far more. Going to PTE level avoids this problem entirely.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 mm/huge_memory.c | 181 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 173 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7613caf1e7c30..39b8212b5abd4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3129,12 +3129,82 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return 1;
 }
 
+/*
+ * Structure to hold page tables for PUD split.
+ * Tables are withdrawn from the pre-deposit made at fault time.
+ */
+struct pud_split_ptables {
+	pmd_t *pmd_table;
+	pgtable_t *pte_tables;  /* Array of 512 PTE tables */
+	int nr_pte_tables;      /* Number of PTE tables in array */
+};
+
+/*
+ * Withdraw pre-deposited page tables from PUD THP.
+ * Tables are always deposited at fault time in do_huge_pud_anonymous_page().
+ * Returns true if successful, false if no tables deposited.
+ */
+static bool withdraw_pud_split_ptables(struct mm_struct *mm, pud_t *pud,
+				       struct pud_split_ptables *tables)
+{
+	pmd_t *pmd_table;
+	pgtable_t pte_table;
+	int i;
+
+	tables->pmd_table = NULL;
+	tables->pte_tables = NULL;
+	tables->nr_pte_tables = 0;
+
+	/* Try to withdraw the deposited PMD table */
+	pmd_table = pgtable_trans_huge_pud_withdraw(mm, pud);
+	if (!pmd_table)
+		return false;
+
+	tables->pmd_table = pmd_table;
+
+	/* Allocate array to hold PTE table pointers */
+	tables->pte_tables = kmalloc_array(NR_PTE_TABLES_FOR_PUD,
+					   sizeof(pgtable_t), GFP_ATOMIC);
+	if (!tables->pte_tables)
+		goto fail;
+
+	/* Withdraw PTE tables from the PMD table */
+	for (i = 0; i < NR_PTE_TABLES_FOR_PUD; i++) {
+		pte_table = pud_withdraw_pte(pmd_table);
+		if (!pte_table)
+			goto fail;
+		tables->pte_tables[i] = pte_table;
+		tables->nr_pte_tables++;
+	}
+
+	return true;
+
+fail:
+	/* Put back any tables we withdrew */
+	for (i = 0; i < tables->nr_pte_tables; i++)
+		pud_deposit_pte(pmd_table, tables->pte_tables[i]);
+	kfree(tables->pte_tables);
+	pgtable_trans_huge_pud_deposit(mm, pud, pmd_table);
+	tables->pmd_table = NULL;
+	tables->pte_tables = NULL;
+	tables->nr_pte_tables = 0;
+	return false;
+}
+
 static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long haddr)
 {
+	bool dirty = false, young = false, write = false;
+	struct pud_split_ptables tables = { 0 };
+	struct mm_struct *mm = vma->vm_mm;
+	rmap_t rmap_flags = RMAP_NONE;
+	bool anon_exclusive = false;
+	bool soft_dirty = false;
 	struct folio *folio;
+	unsigned long addr;
 	struct page *page;
 	pud_t old_pud;
+	int i, j;
 
 	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
@@ -3145,20 +3215,115 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 
 	old_pud = pudp_huge_clear_flush(vma, haddr, pud);
 
-	if (!vma_is_dax(vma))
+	if (!vma_is_anonymous(vma)) {
+		if (!vma_is_dax(vma))
+			return;
+
+		page = pud_page(old_pud);
+		folio = page_folio(page);
+
+		if (!folio_test_dirty(folio) && pud_dirty(old_pud))
+			folio_mark_dirty(folio);
+		if (!folio_test_referenced(folio) && pud_young(old_pud))
+			folio_set_referenced(folio);
+		folio_remove_rmap_pud(folio, page, vma);
+		folio_put(folio);
+		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PUD_NR);
 		return;
+	}
+
+	/*
+	 * Anonymous PUD split: split directly to PTE level.
+	 *
+	 * We cannot create PMD huge entries pointing to portions of a larger
+	 * folio because the kernel's rmap infrastructure assumes PMD mappings
+	 * are for PMD-sized folios only (see __folio_rmap_sanity_checks).
+	 * Instead, we create a PMD table with 512 entries, each pointing to
+	 * a PTE table with 512 PTEs.
+	 *
+	 * Tables are always deposited at fault time in do_huge_pud_anonymous_page().
+	 */
+	if (!withdraw_pud_split_ptables(mm, pud, &tables)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
 
 	page = pud_page(old_pud);
 	folio = page_folio(page);
 
-	if (!folio_test_dirty(folio) && pud_dirty(old_pud))
-		folio_mark_dirty(folio);
-	if (!folio_test_referenced(folio) && pud_young(old_pud))
-		folio_set_referenced(folio);
+	dirty = pud_dirty(old_pud);
+	write = pud_write(old_pud);
+	young = pud_young(old_pud);
+	soft_dirty = pud_soft_dirty(old_pud);
+	anon_exclusive = PageAnonExclusive(page);
+
+	if (dirty)
+		folio_set_dirty(folio);
+
+	/*
+	 * Add references for each page that will have its own PTE.
+	 * Original folio has 1 reference. After split, each of 262144 PTEs
+	 * will eventually be unmapped, each calling folio_put().
+	 */
+	folio_ref_add(folio, HPAGE_PUD_NR - 1);
+
+	/*
+	 * Add PTE-level rmap for all pages at once.
+	 */
+	if (anon_exclusive)
+		rmap_flags |= RMAP_EXCLUSIVE;
+	folio_add_anon_rmap_ptes(folio, page, HPAGE_PUD_NR,
+				 vma, haddr, rmap_flags);
+
+	/* Remove PUD-level rmap */
 	folio_remove_rmap_pud(folio, page, vma);
-	folio_put(folio);
-	add_mm_counter(vma->vm_mm, mm_counter_file(folio),
-		-HPAGE_PUD_NR);
+
+	/*
+	 * Create 512 PMD entries, each pointing to a PTE table.
+	 * Each PTE table has 512 PTEs pointing to individual pages.
+	 */
+	addr = haddr;
+	for (i = 0; i < (HPAGE_PUD_NR / HPAGE_PMD_NR); i++) {
+		pmd_t *pmd_entry = tables.pmd_table + i;
+		pgtable_t pte_table = tables.pte_tables[i];
+		pte_t *pte;
+		struct page *subpage_base = page + i * HPAGE_PMD_NR;
+
+		/* Populate the PTE table */
+		pte = page_address(pte_table);
+		for (j = 0; j < HPAGE_PMD_NR; j++) {
+			struct page *subpage = subpage_base + j;
+			pte_t entry;
+
+			entry = mk_pte(subpage, vma->vm_page_prot);
+			if (write)
+				entry = pte_mkwrite(entry, vma);
+			if (dirty)
+				entry = pte_mkdirty(entry);
+			if (young)
+				entry = pte_mkyoung(entry);
+			if (soft_dirty)
+				entry = pte_mksoft_dirty(entry);
+
+			set_pte_at(mm, addr + j * PAGE_SIZE, pte + j, entry);
+		}
+
+		/* Set PMD to point to PTE table */
+		pmd_populate(mm, pmd_entry, pte_table);
+		addr += HPAGE_PMD_SIZE;
+	}
+
+	/*
+	 * Memory barrier ensures all PMD entries are visible before
+	 * installing the PMD table in the PUD.
+	 */
+	smp_wmb();
+
+	/* Install the PMD table in the PUD */
+	pud_populate(mm, pud, tables.pmd_table);
+
+	/* Free the temporary array holding PTE table pointers */
+	kfree(tables.pte_tables);
 }
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-- 
2.47.3




* [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (3 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 04/12] mm: thp: implement PUD THP split to PTE level Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  4:44   ` kernel test robot
  2026-02-02  9:12   ` kernel test robot
  2026-02-02  0:50 ` [RFC 06/12] selftests/mm: add PUD THP basic allocation test Usama Arif
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Enable the memory reclaim and migration paths to handle PUD THPs
correctly by splitting them before proceeding.

Memory reclaim needs to unmap pages before they can be reclaimed. For
PUD THPs, the unmap path now passes TTU_SPLIT_HUGE_PUD when unmapping
PUD-sized folios. This triggers the PUD split during the unmap phase,
converting the single PUD mapping into 262144 PTE mappings. Reclaim
then proceeds normally with the individual pages. This follows the same
pattern used for PMD THPs with TTU_SPLIT_HUGE_PMD.
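
As a rough model of the flag selection in unmap_folio() and
shrink_folio_list(), the sketch below shows that a PUD-order folio ends
up with both split flags, because the two order checks are independent.
The flag values and the orders (9 and 18, the x86-64 4KB-page case) are
assumptions for illustration, and the sk_* names are hypothetical.

```c
#include <assert.h>

/* Assumed orders for x86-64 with 4KB pages. */
#define SK_PMD_ORDER		9
#define SK_PUD_ORDER		18

/* Placeholder bit values; the real TTU_* values differ. */
#define SK_TTU_SPLIT_HUGE_PMD	0x1u
#define SK_TTU_SPLIT_HUGE_PUD	0x2u

/* Pick split flags from the folio order, as the two checks in the patch do. */
static unsigned int sk_split_flags(unsigned int order)
{
	unsigned int flags = 0;

	assert(order < 32);	/* sanity bound for this sketch only */
	if (order >= SK_PUD_ORDER)
		flags |= SK_TTU_SPLIT_HUGE_PUD;
	if (order >= SK_PMD_ORDER)
		flags |= SK_TTU_SPLIT_HUGE_PMD;
	return flags;
}
```

With these placeholder values, an order-9 folio gets only the PMD flag
while an order-18 folio gets both.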

When migration encounters a PUD-sized folio, it now splits the folio
first using the standard folio split mechanism. The resulting smaller
folios (or individual pages) can then be migrated normally. This matches
how PMD THPs are handled when PMD migration is not supported on a given
architecture.

The split-before-migrate approach means PUD THPs will be broken up
during NUMA balancing or memory compaction. While this loses the TLB
benefit of the large mapping, it allows these memory management
operations to proceed. Future work could add PUD-level migration
entries to preserve the mapping through migration.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h | 11 ++++++
 mm/huge_memory.c        | 83 +++++++++++++++++++++++++++++++++++++----
 mm/migrate.c            | 17 +++++++++
 mm/vmscan.c             |  2 +
 4 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a292035c0270f..8b2bffda4b4f3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -559,6 +559,17 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 	return folio_order(folio) >= HPAGE_PMD_ORDER;
 }
 
+/**
+ * folio_test_pud_mappable - Can we map this folio with a PUD?
+ * @folio: The folio to test
+ *
+ * Return: true - @folio can be PUD-mapped, false - @folio cannot be PUD-mapped.
+ */
+static inline bool folio_test_pud_mappable(struct folio *folio)
+{
+	return folio_order(folio) >= HPAGE_PUD_ORDER;
+}
+
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
 vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 39b8212b5abd4..87b2c21df4a49 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2228,9 +2228,17 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 
 	/*
-	 * TODO: once we support anonymous pages, use
-	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
+	 * For anonymous pages, split to PTE level.
+	 * This simplifies fork handling - we don't need to duplicate
+	 * the complex anon rmap at PUD level.
 	 */
+	if (vma_is_anonymous(vma)) {
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
+		__split_huge_pud(vma, src_pud, addr);
+		return -EAGAIN;
+	}
+
 	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
 		pudp_set_wrprotect(src_mm, addr, src_pud);
 		pud = pud_wrprotect(pud);
@@ -3099,11 +3107,29 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	spinlock_t *ptl;
 	pud_t orig_pud;
+	pmd_t *pmd_table;
+	pgtable_t pte_table;
+	int nr_pte_tables = 0;
 
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
 		return 0;
 
+	/*
+	 * Withdraw any deposited page tables before clearing the PUD.
+	 * These need to be freed and their counters decremented.
+	 */
+	pmd_table = pgtable_trans_huge_pud_withdraw(tlb->mm, pud);
+	if (pmd_table) {
+		while ((pte_table = pud_withdraw_pte(pmd_table)) != NULL) {
+			pte_free(tlb->mm, pte_table);
+			mm_dec_nr_ptes(tlb->mm);
+			nr_pte_tables++;
+		}
+		pmd_free(tlb->mm, pmd_table);
+		mm_dec_nr_pmds(tlb->mm);
+	}
+
 	orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
 	arch_check_zapped_pud(vma, orig_pud);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
@@ -3114,14 +3140,15 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct page *page = NULL;
 		struct folio *folio;
 
-		/* No support for anonymous PUD pages or migration yet */
-		VM_WARN_ON_ONCE(vma_is_anonymous(vma) ||
-				!pud_present(orig_pud));
+		VM_WARN_ON_ONCE(!pud_present(orig_pud));
 
 		page = pud_page(orig_pud);
 		folio = page_folio(page);
 		folio_remove_rmap_pud(folio, page, vma);
-		add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
+		if (vma_is_anonymous(vma))
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PUD_NR);
+		else
+			add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
 
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
@@ -3729,15 +3756,53 @@ static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned
 		split_huge_pmd_address(vma, address, false);
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static void split_huge_pud_address(struct vm_area_struct *vma, unsigned long address)
+{
+	pud_t *pud = mm_find_pud(vma->vm_mm, address);
+
+	if (!pud)
+		return;
+
+	__split_huge_pud(vma, pud, address);
+}
+
+static inline void split_huge_pud_if_needed(struct vm_area_struct *vma, unsigned long address)
+{
+	/*
+	 * If the new address isn't PUD-aligned and it could previously
+	 * contain a PUD huge page: check if we need to split it.
+	 */
+	if (!IS_ALIGNED(address, HPAGE_PUD_SIZE) &&
+	    range_in_vma(vma, ALIGN_DOWN(address, HPAGE_PUD_SIZE),
+			 ALIGN(address, HPAGE_PUD_SIZE)))
+		split_huge_pud_address(vma, address);
+}
+#else
+static inline void split_huge_pud_if_needed(struct vm_area_struct *vma, unsigned long address)
+{
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 void vma_adjust_trans_huge(struct vm_area_struct *vma,
 			   unsigned long start,
 			   unsigned long end,
 			   struct vm_area_struct *next)
 {
-	/* Check if we need to split start first. */
+	/* Check if we need to split PUD THP at start first. */
+	split_huge_pud_if_needed(vma, start);
+
+	/* Check if we need to split PUD THP at end. */
+	split_huge_pud_if_needed(vma, end);
+
+	/* If we're incrementing next->vm_start, we might need to split it. */
+	if (next)
+		split_huge_pud_if_needed(next, end);
+
+	/* Check if we need to split PMD THP at start. */
 	split_huge_pmd_if_needed(vma, start);
 
-	/* Check if we need to split end next. */
+	/* Check if we need to split PMD THP at end. */
 	split_huge_pmd_if_needed(vma, end);
 
 	/* If we're incrementing next->vm_start, we might need to split it. */
@@ -3752,6 +3817,8 @@ static void unmap_folio(struct folio *folio)
 
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
 
+	if (folio_test_pud_mappable(folio))
+		ttu_flags |= TTU_SPLIT_HUGE_PUD;
 	if (folio_test_pmd_mappable(folio))
 		ttu_flags |= TTU_SPLIT_HUGE_PMD;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 4688b9e38cd2f..2d3d2f5585d14 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1859,6 +1859,23 @@ static int migrate_pages_batch(struct list_head *from,
 			 * we will migrate them after the rest of the
 			 * list is processed.
 			 */
+			/*
+			 * PUD-sized folios cannot be migrated directly,
+			 * but can be split. Try to split them first and
+			 * migrate the resulting smaller folios.
+			 */
+			if (folio_test_pud_mappable(folio)) {
+				nr_failed++;
+				stats->nr_thp_failed++;
+				if (!try_split_folio(folio, split_folios, mode)) {
+					stats->nr_thp_split++;
+					stats->nr_split++;
+					continue;
+				}
+				stats->nr_failed_pages += nr_pages;
+				list_move_tail(&folio->lru, ret_folios);
+				continue;
+			}
 			if (!thp_migration_supported() && is_thp) {
 				nr_failed++;
 				stats->nr_thp_failed++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 619691aa43938..868514a770bf2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1348,6 +1348,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);
 
+			if (folio_test_pud_mappable(folio))
+				flags |= TTU_SPLIT_HUGE_PUD;
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
 			/*
-- 
2.47.3




* [RFC 06/12] selftests/mm: add PUD THP basic allocation test
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (4 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  0:50 ` [RFC 07/12] selftests/mm: add PUD THP read/write access test Usama Arif
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Add a selftest for PUD-level THPs (1GB THPs) with test infrastructure
and a basic allocation test.

The test uses the kselftest harness FIXTURE/TEST_F framework. A shared
fixture allocates a 2GB anonymous mapping and computes a PUD-aligned
address within it. Helper functions read THP counters from /proc/vmstat
and mTHP statistics from sysfs.

The basic allocation test verifies the fundamental PUD THP allocation
path by touching a PUD-aligned region and checking that the mTHP
anon_fault_alloc counter increments, confirming a 1GB folio was
allocated.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
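Note on the 2GB mapping size: mmap() only guarantees page alignment,
so a 1GB mapping would rarely start on a PUD boundary. Rounding the
base up moves it by at most PUD_SIZE - 1 bytes, so a 2 * PUD_SIZE
mapping always contains one fully aligned 1GB window. A minimal sketch
of that arithmetic on plain integers rather than a real mapping;
pud_align_up() and window_fits() are illustrative names and are not
part of this patch:

```c
#include <assert.h>
#include <stdint.h>

#define PUD_SIZE (1ULL << 30) /* 1GB, same value the selftest uses */

/* Round addr up to the next PUD_SIZE boundary (no-op if already aligned). */
static inline uint64_t pud_align_up(uint64_t addr)
{
	return (addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1);
}

/*
 * Return 1 if a PUD_SIZE window starting at the aligned address lies
 * entirely inside the mapping [base, base + len), 0 otherwise.
 */
static inline int window_fits(uint64_t base, uint64_t len)
{
	uint64_t aligned = pud_align_up(base);

	assert(len <= UINT64_MAX - base); /* caller passes a sane range */
	return aligned + PUD_SIZE <= base + len;
}
```

With a page-aligned but otherwise arbitrary base, window_fits() holds
for len = 2 * PUD_SIZE and generally fails for len = PUD_SIZE, which is
why the fixture maps twice the size it intends to touch.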
 tools/testing/selftests/mm/Makefile       |   1 +
 tools/testing/selftests/mm/pud_thp_test.c | 161 ++++++++++++++++++++++
 2 files changed, 162 insertions(+)
 create mode 100644 tools/testing/selftests/mm/pud_thp_test.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index eaf9312097f7b..ab79f1693941a 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -88,6 +88,7 @@ TEST_GEN_FILES += pagemap_ioctl
 TEST_GEN_FILES += pfnmap
 TEST_GEN_FILES += process_madv
 TEST_GEN_FILES += prctl_thp_disable
+TEST_GEN_FILES += pud_thp_test
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += uffd-stress
diff --git a/tools/testing/selftests/mm/pud_thp_test.c b/tools/testing/selftests/mm/pud_thp_test.c
new file mode 100644
index 0000000000000..6f0c02c6afd3a
--- /dev/null
+++ b/tools/testing/selftests/mm/pud_thp_test.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test program for PUD-level Transparent Huge Pages (1GB anonymous THP)
+ *
+ * Prerequisites:
+ * - Kernel with PUD THP support (CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+ * - THP enabled: echo always > /sys/kernel/mm/transparent_hugepage/enabled
+ * - PUD THP enabled: echo always > /sys/kernel/mm/transparent_hugepage/hugepages-1048576kB/enabled
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdint.h>
+#include <sys/syscall.h>
+
+#include "kselftest_harness.h"
+
+#define PUD_SIZE	(1UL << 30)	/* 1GB */
+#define PMD_SIZE	(1UL << 21)	/* 2MB */
+#define PAGE_SIZE	(1UL << 12)	/* 4KB */
+
+#define TEST_REGION_SIZE	(2 * PUD_SIZE)	/* 2GB to ensure PUD alignment */
+
+/* Get PUD-aligned address within a region */
+static inline void *pud_align(void *addr)
+{
+	return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
+}
+
+/* Read vmstat counter */
+static unsigned long read_vmstat(const char *name)
+{
+	FILE *fp;
+	char line[256];
+	unsigned long value = 0;
+
+	fp = fopen("/proc/vmstat", "r");
+	if (!fp)
+		return 0;
+
+	while (fgets(line, sizeof(line), fp)) {
+		if (strncmp(line, name, strlen(name)) == 0 &&
+		    line[strlen(name)] == ' ') {
+			sscanf(line + strlen(name), " %lu", &value);
+			break;
+		}
+	}
+	fclose(fp);
+	return value;
+}
+
+/* Read mTHP stats for PUD order (1GB = 1048576kB) */
+static unsigned long read_mthp_stat(const char *stat_name)
+{
+	char path[256];
+	char buf[64];
+	int fd;
+	ssize_t ret;
+	unsigned long value = 0;
+
+	snprintf(path, sizeof(path),
+		 "/sys/kernel/mm/transparent_hugepage/hugepages-1048576kB/stats/%s",
+		 stat_name);
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return 0;
+	ret = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (ret <= 0)
+		return 0;
+	buf[ret] = '\0';
+	sscanf(buf, "%lu", &value);
+	return value;
+}
+
+/* Check if PUD THP is enabled */
+static int pud_thp_enabled(void)
+{
+	char buf[64];
+	int fd;
+	ssize_t ret;
+
+	fd = open("/sys/kernel/mm/transparent_hugepage/hugepages-1048576kB/enabled", O_RDONLY);
+	if (fd < 0)
+		return 0;
+	ret = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (ret <= 0)
+		return 0;
+	buf[ret] = '\0';
+
+	/* Check if [always] or [madvise] is set */
+	if (strstr(buf, "[always]") || strstr(buf, "[madvise]"))
+		return 1;
+	return 0;
+}
+
+/*
+ * Main fixture for PUD THP tests
+ * Allocates a 2GB region and provides a PUD-aligned pointer within it
+ */
+FIXTURE(pud_thp)
+{
+	void *mem;		/* Base mmap allocation */
+	void *aligned;		/* PUD-aligned pointer within mem */
+	unsigned long mthp_alloc_before;
+	unsigned long split_before;
+};
+
+FIXTURE_SETUP(pud_thp)
+{
+	if (!pud_thp_enabled())
+		SKIP(return, "PUD THP not enabled in sysfs");
+
+	self->mthp_alloc_before = read_mthp_stat("anon_fault_alloc");
+	self->split_before = read_vmstat("thp_split_pud");
+
+	self->mem = mmap(NULL, TEST_REGION_SIZE, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(self->mem, MAP_FAILED);
+
+	self->aligned = pud_align(self->mem);
+}
+
+FIXTURE_TEARDOWN(pud_thp)
+{
+	if (self->mem && self->mem != MAP_FAILED)
+		munmap(self->mem, TEST_REGION_SIZE);
+}
+
+/*
+ * Test: Basic PUD THP allocation
+ * Verifies that touching a PUD-aligned region allocates a PUD THP
+ */
+TEST_F(pud_thp, basic_allocation)
+{
+	unsigned long mthp_alloc_after;
+
+	/* Touch memory to trigger page fault and PUD THP allocation */
+	memset(self->aligned, 0xAB, PUD_SIZE);
+
+	mthp_alloc_after = read_mthp_stat("anon_fault_alloc");
+
+	/*
+	 * If mTHP allocation counter increased, a PUD THP was allocated.
+	 */
+	if (mthp_alloc_after <= self->mthp_alloc_before)
+		SKIP(return, "PUD THP not allocated");
+
+	TH_LOG("PUD THP allocated (anon_fault_alloc: %lu -> %lu)",
+	       self->mthp_alloc_before, mthp_alloc_after);
+}
+
+TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC 07/12] selftests/mm: add PUD THP read/write access test
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (5 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 06/12] selftests/mm: add PUD THP basic allocation test Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  0:50 ` [RFC 08/12] selftests/mm: add PUD THP fork COW test Usama Arif
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Add a test that verifies data integrity across a 1GB PUD THP region
by writing patterns at page boundaries and reading them back.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
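Note on the sampling stride: the test touches one unsigned long per
4KB page, i.e. 262144 samples across the 1GB region, rather than every
word. A minimal sketch of the same write-then-verify pattern on an
ordinary buffer; write_samples() and count_bad_samples() are
illustrative names and are not part of this patch:

```c
#include <stddef.h>

/* Store a position-dependent tag in every stride_words-th word. */
static void write_samples(unsigned long *buf, size_t nwords,
			  size_t stride_words)
{
	size_t i;

	if (stride_words == 0)
		return; /* a zero stride would never terminate */

	for (i = 0; i < nwords; i += stride_words)
		buf[i] = i ^ 0xDEADBEEFUL;
}

/* Re-read the sampled words and count those that lost their tag. */
static int count_bad_samples(const unsigned long *buf, size_t nwords,
			     size_t stride_words)
{
	size_t i;
	int errors = 0;

	if (stride_words == 0)
		return -1;

	for (i = 0; i < nwords; i += stride_words)
		if (buf[i] != (i ^ 0xDEADBEEFUL))
			errors++;

	return errors;
}
```

Because the tag depends on the index, a sample that ends up at the
wrong offset is detected as well as one that was overwritten.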
 tools/testing/selftests/mm/pud_thp_test.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/mm/pud_thp_test.c b/tools/testing/selftests/mm/pud_thp_test.c
index 6f0c02c6afd3a..7a1f0b0f81468 100644
--- a/tools/testing/selftests/mm/pud_thp_test.c
+++ b/tools/testing/selftests/mm/pud_thp_test.c
@@ -158,4 +158,27 @@ TEST_F(pud_thp, basic_allocation)
 	       self->mthp_alloc_before, mthp_alloc_after);
 }
 
+/*
+ * Test: Read/write access patterns
+ * Verifies data integrity across the entire 1GB region

+ */
+TEST_F(pud_thp, read_write_access)
+{
+	unsigned long *ptr = (unsigned long *)self->aligned;
+	size_t i;
+	int errors = 0;
+
+	/* Write pattern - sample every page to reduce test time */
+	for (i = 0; i < PUD_SIZE / sizeof(unsigned long); i += PAGE_SIZE / sizeof(unsigned long))
+		ptr[i] = i ^ 0xDEADBEEFUL;
+
+	/* Verify pattern */
+	for (i = 0; i < PUD_SIZE / sizeof(unsigned long); i += PAGE_SIZE / sizeof(unsigned long)) {
+		if (ptr[i] != (i ^ 0xDEADBEEFUL))
+			errors++;
+	}
+
+	ASSERT_EQ(errors, 0);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC 08/12] selftests/mm: add PUD THP fork COW test
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (6 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 07/12] selftests/mm: add PUD THP read/write access test Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  0:50 ` [RFC 09/12] selftests/mm: add PUD THP partial munmap test Usama Arif
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Add a test that allocates a PUD THP, forks a child process, and has
the child write to the memory it still shares copy-on-write with the
parent. This triggers the copy-on-write path, which must split the
PUD THP. The test verifies that both parent and child see correct
data after the split.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
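Note on what is being checked: the contract is the ordinary
MAP_PRIVATE one, namely that a write in the child must never become
visible to the parent. A minimal sketch of that check on a single
base page, so it exercises only the generic COW contract and not the
PUD split itself; cow_isolates_parent() is an illustrative name and is
not part of this patch:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child's write stayed private to the child,
 * 0 if isolation failed, -1 on setup errors.
 */
static int cow_isolates_parent(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p;
	pid_t pid;
	int status, ok;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xCC, len);

	pid = fork();
	if (pid < 0) {
		munmap(p, len);
		return -1;
	}
	if (pid == 0) {
		p[0] = 0x12;	/* write fault: child gets its own copy */
		_exit(p[0] == 0x12 && p[1] == 0xCC ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) != pid) {
		munmap(p, len);
		return -1;
	}

	/* Child saw its write; parent must still see the old byte. */
	ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 && p[0] == 0xCC;
	munmap(p, len);
	return ok;
}
```

Unlike the test in this patch, the sketch also checks the return value
of waitpid() before looking at the exit status.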
 tools/testing/selftests/mm/pud_thp_test.c | 44 +++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/tools/testing/selftests/mm/pud_thp_test.c b/tools/testing/selftests/mm/pud_thp_test.c
index 7a1f0b0f81468..27a509cd477d5 100644
--- a/tools/testing/selftests/mm/pud_thp_test.c
+++ b/tools/testing/selftests/mm/pud_thp_test.c
@@ -181,4 +181,48 @@ TEST_F(pud_thp, read_write_access)
 	ASSERT_EQ(errors, 0);
 }
 
+/*
+ * Test: Fork and copy-on-write
+ * Verifies that COW correctly splits the PUD THP and isolates parent/child
+ */
+TEST_F(pud_thp, fork_cow)
+{
+	unsigned long *ptr = (unsigned long *)self->aligned;
+	unsigned char *bytes = (unsigned char *)self->aligned;
+	pid_t pid;
+	int status;
+	unsigned long split_after;
+
+	/* Initialize memory with known pattern */
+	memset(self->aligned, 0xCC, PUD_SIZE);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Child: write to trigger COW */
+		ptr[0] = 0x12345678UL;
+
+		/* Verify write succeeded and rest of memory unchanged */
+		if (ptr[0] != 0x12345678UL)
+			_exit(1);
+		if (bytes[PAGE_SIZE] != 0xCC)
+			_exit(2);
+
+		_exit(0);
+	}
+
+	/* Parent: wait for child */
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	/* Verify parent memory unchanged (COW should have given child a copy) */
+	ASSERT_EQ(bytes[0], 0xCC);
+
+	split_after = read_vmstat("thp_split_pud");
+	TH_LOG("Fork COW completed (thp_split_pud: %lu -> %lu)",
+	       self->split_before, split_after);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC 09/12] selftests/mm: add PUD THP partial munmap test
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (7 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 08/12] selftests/mm: add PUD THP fork COW test Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  0:50 ` [RFC 10/12] selftests/mm: add PUD THP mprotect split test Usama Arif
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Add a test that allocates a PUD THP and unmaps a 2MB region from the
middle. Since the PUD can no longer cover the entire region, it must
be split. The test verifies that memory before and after the hole
remains accessible with correct data.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
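Note on the expected semantics: unmapping the middle of a mapping
leaves two independent mappings on either side, and their contents
must be unaffected. A minimal sketch of the same hole-punching pattern
with base pages instead of a 1GB region; punch_hole_keeps_edges() is
an illustrative name and is not part of this patch:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map three pages, unmap the middle one, and check that the first and
 * last page still hold their data. Returns 1 on success, 0 if data was
 * lost, -1 on setup errors.
 */
static int punch_hole_keeps_edges(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int ok;

	p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xDD, 3 * page);

	if (munmap(p + page, page) != 0) {
		munmap(p, 3 * page);
		return -1;
	}

	/* Only touch bytes outside the hole: [0, page) and [2*page, 3*page). */
	ok = p[0] == 0xDD && p[page - 1] == 0xDD &&
	     p[2 * page] == 0xDD && p[3 * page - 1] == 0xDD;

	munmap(p, page);
	munmap(p + 2 * page, page);
	return ok;
}
```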
 tools/testing/selftests/mm/pud_thp_test.c | 31 +++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/testing/selftests/mm/pud_thp_test.c b/tools/testing/selftests/mm/pud_thp_test.c
index 27a509cd477d5..8d4cb0e60f7f7 100644
--- a/tools/testing/selftests/mm/pud_thp_test.c
+++ b/tools/testing/selftests/mm/pud_thp_test.c
@@ -225,4 +225,35 @@ TEST_F(pud_thp, fork_cow)
 	       self->split_before, split_after);
 }
 
+/*
+ * Test: Partial munmap triggers split
+ * Verifies that unmapping part of a PUD THP splits it correctly
+ */
+TEST_F(pud_thp, partial_munmap)
+{
+	unsigned long *ptr = (unsigned long *)self->aligned;
+	unsigned long *after_hole;
+	unsigned long split_after;
+	int ret;
+
+	/* Touch memory to allocate PUD THP */
+	memset(self->aligned, 0xDD, PUD_SIZE);
+
+	/* Unmap a 2MB region in the middle - should trigger PUD split */
+	ret = munmap((char *)self->aligned + PUD_SIZE / 2, PMD_SIZE);
+	ASSERT_EQ(ret, 0);
+
+	split_after = read_vmstat("thp_split_pud");
+
+	/* Verify memory before the hole is still accessible and correct */
+	ASSERT_EQ(ptr[0], 0xDDDDDDDDDDDDDDDDUL);
+
+	/* Verify memory after the hole is still accessible and correct */
+	after_hole = (unsigned long *)((char *)self->aligned + PUD_SIZE / 2 + PMD_SIZE);
+	ASSERT_EQ(*after_hole, 0xDDDDDDDDDDDDDDDDUL);
+
+	TH_LOG("Partial munmap completed (thp_split_pud: %lu -> %lu)",
+	       self->split_before, split_after);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC 10/12] selftests/mm: add PUD THP mprotect split test
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (8 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 09/12] selftests/mm: add PUD THP partial munmap test Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  0:50 ` [RFC 11/12] selftests/mm: add PUD THP reclaim test Usama Arif
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Add a test that changes permissions on a portion of a PUD THP using
mprotect. Since different parts now have different permissions, the
PUD must be split. The test verifies that the region is still readable
with its original contents after the permission change.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
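Note on a possible stronger check: besides reading the region back,
one could confirm that the new protection actually applies to the
changed range and only to it. A minimal sketch with base pages, using
a forked child to absorb the expected fault; middle_page_is_readonly()
is an illustrative name and is not part of this patch:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the middle page became read-only while its neighbours
 * stayed writable, 0 otherwise, -1 on setup errors.
 */
static int middle_page_is_readonly(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	volatile unsigned char *p;
	void *map;
	pid_t pid;
	int status, ok;

	map = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;
	memset(map, 0xEE, 3 * page);
	p = map;

	if (mprotect((char *)map + page, page, PROT_READ) != 0) {
		munmap(map, 3 * page);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		munmap(map, 3 * page);
		return -1;
	}
	if (pid == 0) {
		p[page] = 0;	/* expected to fault: page is read-only */
		_exit(0);	/* reached only if the write was allowed */
	}
	if (waitpid(pid, &status, 0) != pid) {
		munmap(map, 3 * page);
		return -1;
	}

	/* Child must not have exited cleanly; parent still reads 0xEE. */
	ok = !(WIFEXITED(status) && WEXITSTATUS(status) == 0) &&
	     p[page] == 0xEE;

	/* Neighbouring pages keep their read/write protection. */
	p[0] = 0x11;
	p[2 * page] = 0x22;
	ok = ok && p[0] == 0x11 && p[2 * page] == 0x22;

	munmap(map, 3 * page);
	return ok;
}
```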
 tools/testing/selftests/mm/pud_thp_test.c | 26 +++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tools/testing/selftests/mm/pud_thp_test.c b/tools/testing/selftests/mm/pud_thp_test.c
index 8d4cb0e60f7f7..b59eb470adbba 100644
--- a/tools/testing/selftests/mm/pud_thp_test.c
+++ b/tools/testing/selftests/mm/pud_thp_test.c
@@ -256,4 +256,30 @@ TEST_F(pud_thp, partial_munmap)
 	       self->split_before, split_after);
 }
 
+/*
+ * Test: mprotect triggers split
+ * Verifies that changing protection on part of a PUD THP splits it
+ */
+TEST_F(pud_thp, mprotect_split)
+{
+	volatile unsigned char *p = (unsigned char *)self->aligned;
+	unsigned long split_after;
+	int ret;
+
+	/* Touch memory to allocate PUD THP */
+	memset(self->aligned, 0xEE, PUD_SIZE);
+
+	/* Change protection on a 2MB region - should trigger PUD split */
+	ret = mprotect((char *)self->aligned + PMD_SIZE, PMD_SIZE, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	split_after = read_vmstat("thp_split_pud");
+
+	/* Verify memory still readable */
+	ASSERT_EQ(*p, 0xEE);
+
+	TH_LOG("mprotect split completed (thp_split_pud: %lu -> %lu)",
+	       self->split_before, split_after);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC 11/12] selftests/mm: add PUD THP reclaim test
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (9 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 10/12] selftests/mm: add PUD THP mprotect split test Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  0:50 ` [RFC 12/12] selftests/mm: add PUD THP migration test Usama Arif
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Add a test that uses MADV_PAGEOUT to advise the kernel to page out
the PUD THP memory. This exercises the reclaim path which must split
the PUD THP before reclaiming the individual pages.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
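Note on MADV_PAGEOUT semantics: the call is advisory, so the kernel
may reclaim all, some, or none of the range, and the contents must be
intact afterwards in every case. A minimal sketch of a data check
around the call on a small mapping; pageout_preserves_data() is an
illustrative name and is not part of this patch, and the fallback
value 21 is the one this patch defines:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/*
 * Returns 1 if every byte survives the MADV_PAGEOUT request (whether
 * or not the kernel honoured it), 0 on corruption, -1 on setup errors.
 */
static int pageout_preserves_data(void)
{
	size_t len = 16 * (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t i;
	int ok = 1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xAA, len);

	/* Advice only: a failure here is tolerated, not an error. */
	(void)madvise(p, len, MADV_PAGEOUT);

	/* Reading faults back any page that was actually swapped out. */
	for (i = 0; i < len; i++) {
		if (p[i] != 0xAA) {
			ok = 0;
			break;
		}
	}

	munmap(p, len);
	return ok;
}
```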
 tools/testing/selftests/mm/pud_thp_test.c | 33 +++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/tools/testing/selftests/mm/pud_thp_test.c b/tools/testing/selftests/mm/pud_thp_test.c
index b59eb470adbba..961fdc489d8a2 100644
--- a/tools/testing/selftests/mm/pud_thp_test.c
+++ b/tools/testing/selftests/mm/pud_thp_test.c
@@ -28,6 +28,10 @@
 
 #define TEST_REGION_SIZE	(2 * PUD_SIZE)	/* 2GB to ensure PUD alignment */
 
+#ifndef MADV_PAGEOUT
+#define MADV_PAGEOUT	21
+#endif
+
 /* Get PUD-aligned address within a region */
 static inline void *pud_align(void *addr)
 {
@@ -282,4 +286,33 @@ TEST_F(pud_thp, mprotect_split)
 	       self->split_before, split_after);
 }
 
+/*
+ * Test: Reclaim via MADV_PAGEOUT
+ * Verifies that reclaim path correctly handles PUD THPs
+ */
+TEST_F(pud_thp, reclaim_pageout)
+{
+	volatile unsigned char *p;
+	unsigned long split_after;
+	int ret;
+
+	/* Touch memory to allocate PUD THP */
+	memset(self->aligned, 0xAA, PUD_SIZE);
+
+	/* Try to reclaim the pages */
+	ret = madvise(self->aligned, PUD_SIZE, MADV_PAGEOUT);
+	if (ret < 0 && errno == EINVAL)
+		SKIP(return, "MADV_PAGEOUT not supported");
+	ASSERT_EQ(ret, 0);
+
+	split_after = read_vmstat("thp_split_pud");
+
+	/* Touch memory again to verify it's still accessible */
+	p = (unsigned char *)self->aligned;
+	(void)*p;  /* Read to bring pages back if swapped */
+
+	TH_LOG("Reclaim completed (thp_split_pud: %lu -> %lu)",
+	       self->split_before, split_after);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC 12/12] selftests/mm: add PUD THP migration test
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (10 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 11/12] selftests/mm: add PUD THP reclaim test Usama Arif
@ 2026-02-02  0:50 ` Usama Arif
  2026-02-02  2:44 ` [RFC 00/12] mm: PUD (1GB) THP implementation Rik van Riel
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-02  0:50 UTC (permalink / raw)
  To: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm
  Cc: hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team, Usama Arif

Add a test that uses mbind() to change the NUMA memory policy, which
triggers migration. The kernel must split PUD THPs before migration
since there is no PUD-level migration entry support. The test verifies
data integrity after the migration attempt.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
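Note on the nodemask argument: the test hard-codes nodemask = 1UL,
which is bit 0 and therefore node 0. A minimal sketch of building the
mask for an arbitrary node, assuming the bitmap layout documented for
mbind(2), where bit N of the unsigned long array selects node N;
nodemask_set_only() is an illustrative name and is not part of this
patch:

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Clear the mask and set only the bit for `node`.
 * Returns 0 on success, -1 if the node does not fit in nwords longs.
 */
static int nodemask_set_only(unsigned long *mask, size_t nwords,
			     unsigned int node)
{
	if (node >= nwords * BITS_PER_ULONG)
		return -1;

	memset(mask, 0, nwords * sizeof(*mask));
	mask[node / BITS_PER_ULONG] = 1UL << (node % BITS_PER_ULONG);
	return 0;
}
```

A single-word mask built this way for node 0 equals the 1UL used in
the test.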
 tools/testing/selftests/mm/pud_thp_test.c | 42 +++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/tools/testing/selftests/mm/pud_thp_test.c b/tools/testing/selftests/mm/pud_thp_test.c
index 961fdc489d8a2..7e227f29e69fb 100644
--- a/tools/testing/selftests/mm/pud_thp_test.c
+++ b/tools/testing/selftests/mm/pud_thp_test.c
@@ -32,6 +32,14 @@
 #define MADV_PAGEOUT	21
 #endif
 
+#ifndef MPOL_BIND
+#define MPOL_BIND	2
+#endif
+
+#ifndef MPOL_MF_MOVE
+#define MPOL_MF_MOVE	(1 << 1)
+#endif
+
 /* Get PUD-aligned address within a region */
 static inline void *pud_align(void *addr)
 {
@@ -315,4 +323,38 @@ TEST_F(pud_thp, reclaim_pageout)
 	       self->split_before, split_after);
 }
 
+/*
+ * Test: Migration via mbind
+ * Verifies that migration path correctly handles PUD THPs by splitting
+ */
+TEST_F(pud_thp, migration_mbind)
+{
+	unsigned char *bytes = (unsigned char *)self->aligned;
+	unsigned long nodemask = 1UL;  /* Node 0 */
+	unsigned long split_after;
+	int ret;
+
+	/* Touch memory to allocate PUD THP */
+	memset(self->aligned, 0xBB, PUD_SIZE);
+
+	/* Try to migrate by changing NUMA policy */
+	ret = syscall(__NR_mbind, self->aligned, PUD_SIZE, MPOL_BIND, &nodemask,
+		      sizeof(nodemask) * 8, MPOL_MF_MOVE);
+	/*
+	 * mbind may fail with EINVAL (single node) or EIO (migration failed),
+	 * which is acceptable - we just want to exercise the migration path.
+	 */
+	if (ret < 0 && errno != EINVAL && errno != EIO)
+		TH_LOG("mbind returned unexpected error: %s", strerror(errno));
+
+	split_after = read_vmstat("thp_split_pud");
+
+	/* Verify data integrity */
+	ASSERT_EQ(bytes[0], 0xBB);
+	ASSERT_EQ(bytes[PUD_SIZE - 1], 0xBB);
+
+	TH_LOG("Migration completed (thp_split_pud: %lu -> %lu)",
+	       self->split_before, split_after);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (11 preceding siblings ...)
  2026-02-02  0:50 ` [RFC 12/12] selftests/mm: add PUD THP migration test Usama Arif
@ 2026-02-02  2:44 ` Rik van Riel
  2026-02-02 11:30   ` Lorenzo Stoakes
  2026-02-02  4:00 ` Matthew Wilcox
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2026-02-02  2:44 UTC (permalink / raw)
  To: Usama Arif, ziy, Andrew Morton, David Hildenbrand,
	lorenzo.stoakes, linux-mm
  Cc: hannes, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team, Frank van der Linden

On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
> 
> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>    or runtime, taking memory away. This requires capacity planning,
>    administrative overhead, and makes workload orchestration much much more
>    complex, especially colocating with workloads that don't use hugetlbfs.
> 
To address the obvious objection "but how could we
possibly allocate 1GB huge pages while the workload
is running?", I am planning to pick up the CMA balancing
patch series (thank you, Frank) and get that in an
upstream-ready shape soon.

https://lkml.org/2025/9/15/1735

That patch set looks like another case where no
amount of internal testing will find every single
corner case, and we'll probably just want to
merge it upstream, deploy it experimentally, and
aggressively deal with anything that might pop up.

With CMA balancing, it would be possible to just
have half (or even more) of system memory for
movable allocations only, which would make it possible
to allocate 1GB huge pages dynamically.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-02  0:50 ` [RFC 01/12] mm: add PUD THP ptdesc and rmap support Usama Arif
@ 2026-02-02  3:10   ` kernel test robot
  2026-02-02 10:44   ` Kiryl Shutsemau
  2026-02-02 12:15   ` Lorenzo Stoakes
  2 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2026-02-02  3:10 UTC (permalink / raw)
  To: Usama Arif; +Cc: oe-kbuild-all

Hi Usama,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-add-PUD-THP-ptdesc-and-rmap-support/20260202-085725
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260202005451.774496-2-usamaarif642%40gmail.com
patch subject: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260202/202602021158.RvuFv8Nm-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260202/202602021158.RvuFv8Nm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602021158.RvuFv8Nm-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:2106:41: error: implicit declaration of function 'split_huge_pud_locked'; did you mean 'split_huge_pmd_locked'? [-Werror=implicit-function-declaration]
    2106 |                                         split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
         |                                         ^~~~~~~~~~~~~~~~~~~~~
         |                                         split_huge_pmd_locked
   cc1: some warnings being treated as errors


vim +2106 mm/rmap.c

  2010	
  2011	/*
  2012	 * @arg: enum ttu_flags will be passed to this argument
  2013	 */
  2014	static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  2015			     unsigned long address, void *arg)
  2016	{
  2017		struct mm_struct *mm = vma->vm_mm;
  2018		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
  2019		bool anon_exclusive, ret = true;
  2020		pte_t pteval;
  2021		struct page *subpage;
  2022		struct mmu_notifier_range range;
  2023		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  2024		unsigned long nr_pages = 1, end_addr;
  2025		unsigned long pfn;
  2026		unsigned long hsz = 0;
  2027		int ptes = 0;
  2028	
  2029		/*
  2030		 * When racing against e.g. zap_pte_range() on another cpu,
  2031		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  2032		 * try_to_unmap() may return before page_mapped() has become false,
  2033		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  2034		 */
  2035		if (flags & TTU_SYNC)
  2036			pvmw.flags = PVMW_SYNC;
  2037	
  2038		/*
  2039		 * For THP, we have to assume the worse case ie pmd for invalidation.
  2040		 * For hugetlb, it could be much worse if we need to do pud
  2041		 * invalidation in the case of pmd sharing.
  2042		 *
  2043		 * Note that the folio can not be freed in this function as call of
  2044		 * try_to_unmap() must hold a reference on the folio.
  2045		 */
  2046		range.end = vma_address_end(&pvmw);
  2047		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  2048					address, range.end);
  2049		if (folio_test_hugetlb(folio)) {
  2050			/*
  2051			 * If sharing is possible, start and end will be adjusted
  2052			 * accordingly.
  2053			 */
  2054			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  2055							     &range.end);
  2056	
  2057			/* We need the huge page size for set_huge_pte_at() */
  2058			hsz = huge_page_size(hstate_vma(vma));
  2059		}
  2060		mmu_notifier_invalidate_range_start(&range);
  2061	
  2062		while (page_vma_mapped_walk(&pvmw)) {
  2063			/*
  2064			 * If the folio is in an mlock()d vma, we must not swap it out.
  2065			 */
  2066			if (!(flags & TTU_IGNORE_MLOCK) &&
  2067			    (vma->vm_flags & VM_LOCKED)) {
  2068				ptes++;
  2069	
  2070				/*
  2071				 * Set 'ret' to indicate the page cannot be unmapped.
  2072				 *
  2073				 * Do not jump to walk_abort immediately as additional
  2074				 * iteration might be required to detect fully mapped
  2075				 * folio an mlock it.
  2076				 */
  2077				ret = false;
  2078	
  2079				/* Only mlock fully mapped pages */
  2080				if (pvmw.pte && ptes != pvmw.nr_pages)
  2081					continue;
  2082	
  2083				/*
  2084				 * All PTEs must be protected by page table lock in
  2085				 * order to mlock the page.
  2086				 *
  2087				 * If page table boundary has been cross, current ptl
  2088				 * only protect part of ptes.
  2089				 */
  2090				if (pvmw.flags & PVMW_PGTABLE_CROSSED)
  2091					goto walk_done;
  2092	
  2093				/* Restore the mlock which got missed */
  2094				mlock_vma_folio(folio, vma);
  2095				goto walk_done;
  2096			}
  2097	
  2098			if (!pvmw.pte) {
  2099				/*
  2100				 * Check for PUD-mapped THP first.
  2101				 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
  2102				 * split the PUD to PMD level and restart the walk.
  2103				 */
  2104				if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
  2105					if (flags & TTU_SPLIT_HUGE_PUD) {
> 2106						split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
  2107						flags &= ~TTU_SPLIT_HUGE_PUD;
  2108						page_vma_mapped_walk_restart(&pvmw);
  2109						continue;
  2110					}
  2111				}
  2112	
  2113				if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
  2114					if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
  2115						goto walk_done;
  2116					/*
  2117					 * unmap_huge_pmd_locked has either already marked
  2118					 * the folio as swap-backed or decided to retain it
  2119					 * due to GUP or speculative references.
  2120					 */
  2121					goto walk_abort;
  2122				}
  2123	
  2124				if (flags & TTU_SPLIT_HUGE_PMD) {
  2125					/*
  2126					 * We temporarily have to drop the PTL and
  2127					 * restart so we can process the PTE-mapped THP.
  2128					 */
  2129					split_huge_pmd_locked(vma, pvmw.address,
  2130							      pvmw.pmd, false);
  2131					flags &= ~TTU_SPLIT_HUGE_PMD;
  2132					page_vma_mapped_walk_restart(&pvmw);
  2133					continue;
  2134				}
  2135			}
  2136	
  2137			/* Unexpected PMD-mapped THP? */
  2138			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2139	
  2140			/*
  2141			 * Handle PFN swap PTEs, such as device-exclusive ones, that
  2142			 * actually map pages.
  2143			 */
  2144			pteval = ptep_get(pvmw.pte);
  2145			if (likely(pte_present(pteval))) {
  2146				pfn = pte_pfn(pteval);
  2147			} else {
  2148				const softleaf_t entry = softleaf_from_pte(pteval);
  2149	
  2150				pfn = softleaf_to_pfn(entry);
  2151				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  2152			}
  2153	
  2154			subpage = folio_page(folio, pfn - folio_pfn(folio));
  2155			address = pvmw.address;
  2156			anon_exclusive = folio_test_anon(folio) &&
  2157					 PageAnonExclusive(subpage);
  2158	
  2159			if (folio_test_hugetlb(folio)) {
  2160				bool anon = folio_test_anon(folio);
  2161	
  2162				/*
  2163				 * The try_to_unmap() is only passed a hugetlb page
  2164				 * in the case where the hugetlb page is poisoned.
  2165				 */
  2166				VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
  2167				/*
  2168				 * huge_pmd_unshare may unmap an entire PMD page.
  2169				 * There is no way of knowing exactly which PMDs may
  2170				 * be cached for this mm, so we must flush them all.
  2171				 * start/end were already adjusted above to cover this
  2172				 * range.
  2173				 */
  2174				flush_cache_range(vma, range.start, range.end);
  2175	
  2176				/*
  2177				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2178				 * held in write mode.  Caller needs to explicitly
  2179				 * do this outside rmap routines.
  2180				 *
  2181				 * We also must hold hugetlb vma_lock in write mode.
  2182				 * Lock order dictates acquiring vma_lock BEFORE
  2183				 * i_mmap_rwsem.  We can only try lock here and fail
  2184				 * if unsuccessful.
  2185				 */
  2186				if (!anon) {
  2187					struct mmu_gather tlb;
  2188	
  2189					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2190					if (!hugetlb_vma_trylock_write(vma))
  2191						goto walk_abort;
  2192	
  2193					tlb_gather_mmu_vma(&tlb, vma);
  2194					if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
  2195						hugetlb_vma_unlock_write(vma);
  2196						huge_pmd_unshare_flush(&tlb, vma);
  2197						tlb_finish_mmu(&tlb);
  2198						/*
  2199						 * The PMD table was unmapped,
  2200						 * consequently unmapping the folio.
  2201						 */
  2202						goto walk_done;
  2203					}
  2204					hugetlb_vma_unlock_write(vma);
  2205					tlb_finish_mmu(&tlb);
  2206				}
  2207				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2208				if (pte_dirty(pteval))
  2209					folio_mark_dirty(folio);
  2210			} else if (likely(pte_present(pteval))) {
  2211				nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
  2212				end_addr = address + nr_pages * PAGE_SIZE;
  2213				flush_cache_range(vma, address, end_addr);
  2214	
  2215				/* Nuke the page table entry. */
  2216				pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
  2217				/*
  2218				 * We clear the PTE but do not flush so potentially
  2219				 * a remote CPU could still be writing to the folio.
  2220				 * If the entry was previously clean then the
  2221				 * architecture must guarantee that a clear->dirty
  2222				 * transition on a cached TLB entry is written through
  2223				 * and traps if the PTE is unmapped.
  2224				 */
  2225				if (should_defer_flush(mm, flags))
  2226					set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
  2227				else
  2228					flush_tlb_range(vma, address, end_addr);
  2229				if (pte_dirty(pteval))
  2230					folio_mark_dirty(folio);
  2231			} else {
  2232				pte_clear(mm, address, pvmw.pte);
  2233			}
  2234	
  2235			/*
  2236			 * Now the pte is cleared. If this pte was uffd-wp armed,
  2237			 * we may want to replace a none pte with a marker pte if
  2238			 * it's file-backed, so we don't lose the tracking info.
  2239			 */
  2240			pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
  2241	
  2242			/* Update high watermark before we lower rss */
  2243			update_hiwater_rss(mm);
  2244	
  2245			if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
  2246				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2247				if (folio_test_hugetlb(folio)) {
  2248					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2249					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2250							hsz);
  2251				} else {
  2252					dec_mm_counter(mm, mm_counter(folio));
  2253					set_pte_at(mm, address, pvmw.pte, pteval);
  2254				}
  2255			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2256				   !userfaultfd_armed(vma)) {
  2257				/*
  2258				 * The guest indicated that the page content is of no
  2259				 * interest anymore. Simply discard the pte, vmscan
  2260				 * will take care of the rest.
  2261				 * A future reference will then fault in a new zero
  2262				 * page. When userfaultfd is active, we must not drop
  2263				 * this page though, as its main user (postcopy
  2264				 * migration) will not expect userfaults on already
  2265				 * copied pages.
  2266				 */
  2267				dec_mm_counter(mm, mm_counter(folio));
  2268			} else if (folio_test_anon(folio)) {
  2269				swp_entry_t entry = page_swap_entry(subpage);
  2270				pte_t swp_pte;
  2271				/*
  2272				 * Store the swap location in the pte.
  2273				 * See handle_pte_fault() ...
  2274				 */
  2275				if (unlikely(folio_test_swapbacked(folio) !=
  2276						folio_test_swapcache(folio))) {
  2277					WARN_ON_ONCE(1);
  2278					goto walk_abort;
  2279				}
  2280	
  2281				/* MADV_FREE page check */
  2282				if (!folio_test_swapbacked(folio)) {
  2283					int ref_count, map_count;
  2284	
  2285					/*
  2286					 * Synchronize with gup_pte_range():
  2287					 * - clear PTE; barrier; read refcount
  2288					 * - inc refcount; barrier; read PTE
  2289					 */
  2290					smp_mb();
  2291	
  2292					ref_count = folio_ref_count(folio);
  2293					map_count = folio_mapcount(folio);
  2294	
  2295					/*
  2296					 * Order reads for page refcount and dirty flag
  2297					 * (see comments in __remove_mapping()).
  2298					 */
  2299					smp_rmb();
  2300	
  2301					if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
  2302						/*
  2303						 * redirtied either using the page table or a previously
  2304						 * obtained GUP reference.
  2305						 */
  2306						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2307						folio_set_swapbacked(folio);
  2308						goto walk_abort;
  2309					} else if (ref_count != 1 + map_count) {
  2310						/*
  2311						 * Additional reference. Could be a GUP reference or any
  2312						 * speculative reference. GUP users must mark the folio
  2313						 * dirty if there was a modification. This folio cannot be
  2314						 * reclaimed right now either way, so act just like nothing
  2315						 * happened.
  2316						 * We'll come back here later and detect if the folio was
  2317						 * dirtied when the additional reference is gone.
  2318						 */
  2319						set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
  2320						goto walk_abort;
  2321					}
  2322					add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
  2323					goto discard;
  2324				}
  2325	
  2326				if (folio_dup_swap(folio, subpage) < 0) {
  2327					set_pte_at(mm, address, pvmw.pte, pteval);
  2328					goto walk_abort;
  2329				}
  2330	
  2331				/*
  2332				 * arch_unmap_one() is expected to be a NOP on
  2333				 * architectures where we could have PFN swap PTEs,
  2334				 * so we'll not check/care.
  2335				 */
  2336				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2337					folio_put_swap(folio, subpage);
  2338					set_pte_at(mm, address, pvmw.pte, pteval);
  2339					goto walk_abort;
  2340				}
  2341	
  2342				/* See folio_try_share_anon_rmap(): clear PTE first. */
  2343				if (anon_exclusive &&
  2344				    folio_try_share_anon_rmap_pte(folio, subpage)) {
  2345					folio_put_swap(folio, subpage);
  2346					set_pte_at(mm, address, pvmw.pte, pteval);
  2347					goto walk_abort;
  2348				}
  2349				if (list_empty(&mm->mmlist)) {
  2350					spin_lock(&mmlist_lock);
  2351					if (list_empty(&mm->mmlist))
  2352						list_add(&mm->mmlist, &init_mm.mmlist);
  2353					spin_unlock(&mmlist_lock);
  2354				}
  2355				dec_mm_counter(mm, MM_ANONPAGES);
  2356				inc_mm_counter(mm, MM_SWAPENTS);
  2357				swp_pte = swp_entry_to_pte(entry);
  2358				if (anon_exclusive)
  2359					swp_pte = pte_swp_mkexclusive(swp_pte);
  2360				if (likely(pte_present(pteval))) {
  2361					if (pte_soft_dirty(pteval))
  2362						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2363					if (pte_uffd_wp(pteval))
  2364						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2365				} else {
  2366					if (pte_swp_soft_dirty(pteval))
  2367						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2368					if (pte_swp_uffd_wp(pteval))
  2369						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2370				}
  2371				set_pte_at(mm, address, pvmw.pte, swp_pte);
  2372			} else {
  2373				/*
  2374				 * This is a locked file-backed folio,
  2375				 * so it cannot be removed from the page
  2376				 * cache and replaced by a new folio before
  2377				 * mmu_notifier_invalidate_range_end, so no
  2378				 * concurrent thread might update its page table
  2379				 * to point at a new folio while a device is
  2380				 * still using this folio.
  2381				 *
  2382				 * See Documentation/mm/mmu_notifier.rst
  2383				 */
  2384				add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
  2385			}
  2386	discard:
  2387			if (unlikely(folio_test_hugetlb(folio))) {
  2388				hugetlb_remove_rmap(folio);
  2389			} else {
  2390				folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
  2391			}
  2392			if (vma->vm_flags & VM_LOCKED)
  2393				mlock_drain_local();
  2394			folio_put_refs(folio, nr_pages);
  2395	
  2396			/*
  2397			 * If we are sure that we batched the entire folio and cleared
  2398			 * all PTEs, we can just optimize and stop right here.
  2399			 */
  2400			if (nr_pages == folio_nr_pages(folio))
  2401				goto walk_done;
  2402			continue;
  2403	walk_abort:
  2404			ret = false;
  2405	walk_done:
  2406			page_vma_mapped_walk_done(&pvmw);
  2407			break;
  2408		}
  2409	
  2410		mmu_notifier_invalidate_range_end(&range);
  2411	
  2412		return ret;
  2413	}
  2414	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (12 preceding siblings ...)
  2026-02-02  2:44 ` [RFC 00/12] mm: PUD (1GB) THP implementation Rik van Riel
@ 2026-02-02  4:00 ` Matthew Wilcox
  2026-02-02  9:06   ` David Hildenbrand (arm)
  2026-02-02 11:20 ` Lorenzo Stoakes
  2026-02-02 16:24 ` Zi Yan
  15 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2026-02-02  4:00 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team

On Sun, Feb 01, 2026 at 04:50:17PM -0800, Usama Arif wrote:
> This is an RFC series to implement 1GB PUD-level THPs, allowing
> applications to benefit from reduced TLB pressure without requiring
> hugetlbfs. The patches are based on top of
> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).

I suggest this has not had enough testing.  There are dozens of places
in the MM which assume that if a folio is at least PMD size then it is
exactly PMD size.  Everywhere that calls folio_test_pmd_mappable() needs
to be audited to make sure that it will work properly if the folio is
larger than PMD size.

zap_pmd_range() for example.  Or finish_fault():

                page = vmf->page;
(can be any page within the folio)
        folio = page_folio(page);
        if (pmd_none(*vmf->pmd)) {
                if (!needs_fallback && folio_test_pmd_mappable(folio)) {
                        ret = do_set_pmd(vmf, folio, page);

then do_set_pmd() does:

        if (folio_order(folio) != HPAGE_PMD_ORDER)
                return ret;
        page = &folio->page;

so that check needs to be changed, and then we need to select the
appropriate page within the folio rather than just the first page
of the folio.  And then after the call:

        entry = folio_mk_pmd(folio, vma->vm_page_prot);

we need to adjust entry to point to the appropriate PMD-sized range
within the folio.
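
The index arithmetic behind that adjustment can be sketched in userspace C.
This is an illustration only, not the kernel change: pmd_chunk_first_page(),
pmd_chunks_in_folio() and FOLIO_PUD_ORDER are hypothetical names, the orders
assume 4KB base pages with x86-64 style geometry (PMD order 9, PUD order 18),
and a folio is modelled as nothing more than a range of page indices.

```c
#include <assert.h>

/* Assumed geometry with 4KB base pages; illustrative values only. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1UL << HPAGE_PMD_ORDER)	/* 512 pages = 2MB */
#define FOLIO_PUD_ORDER 18				/* hypothetical: 1GB folio */

/*
 * Hypothetical helper: given the index (within the folio) of the page
 * that faulted, return the index of the first page of the PMD-sized,
 * PMD-aligned chunk containing it. For a folio of exactly PMD order the
 * result is always 0, which is why using the folio's first page happens
 * to work today.
 */
static unsigned long pmd_chunk_first_page(unsigned long page_idx)
{
	return page_idx & ~(HPAGE_PMD_NR - 1);
}

/* Number of PMD-sized chunks in a folio of the given order. */
static unsigned long pmd_chunks_in_folio(unsigned int order)
{
	assert(order >= HPAGE_PMD_ORDER);
	return 1UL << (order - HPAGE_PMD_ORDER);
}
```

With these assumptions, a fault on page index 1000 of a 1GB folio falls in
the chunk starting at page 512, so the PMD entry would have to be built from
that page rather than from the head of the folio.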



* Re: [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP
  2026-02-02  0:50 ` [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP Usama Arif
@ 2026-02-02  4:44   ` kernel test robot
  2026-02-02  9:12   ` kernel test robot
  1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2026-02-02  4:44 UTC (permalink / raw)
  To: Usama Arif; +Cc: oe-kbuild-all

Hi Usama,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-add-PUD-THP-ptdesc-and-rmap-support/20260202-085725
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260202005451.774496-6-usamaarif642%40gmail.com
patch subject: [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260202/202602021257.pz4igUG7-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260202/202602021257.pz4igUG7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602021257.pz4igUG7-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/vmscan.c: In function 'shrink_folio_list':
>> mm/vmscan.c:1358:29: error: implicit declaration of function 'folio_test_pud_mappable'; did you mean 'folio_test_pmd_mappable'? [-Werror=implicit-function-declaration]
    1358 |                         if (folio_test_pud_mappable(folio))
         |                             ^~~~~~~~~~~~~~~~~~~~~~~
         |                             folio_test_pmd_mappable
   cc1: some warnings being treated as errors
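
The error most likely means that this config builds mm/vmscan.c with no
declaration of folio_test_pud_mappable() in scope. A common pattern for this
class of failure is a static inline fallback that returns false when the
feature is compiled out. The sketch below is a userspace mock of that
pattern, not the fix for this series: struct folio, the MOCK_* macros and
mock_unmap_flags() are stand-ins invented for illustration, and the flag
values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct folio; illustration only. */
struct folio {
	unsigned int order;
};

/* Define MOCK_PUD_THP to model a config where PUD THP is built in. */
#ifdef MOCK_PUD_THP
#define MOCK_HPAGE_PUD_ORDER 18	/* assumed: 1GB with 4KB base pages */
static inline bool folio_test_pud_mappable(const struct folio *folio)
{
	return folio->order >= MOCK_HPAGE_PUD_ORDER;
}
#else
/* Fallback stub: callers still compile and the branch folds away. */
static inline bool folio_test_pud_mappable(const struct folio *folio)
{
	(void)folio;
	return false;
}
#endif

/* Hypothetical flag values, chosen only for this sketch. */
#define MOCK_TTU_SPLIT_HUGE_PMD 0x1u
#define MOCK_TTU_SPLIT_HUGE_PUD 0x2u

/* Mirrors the shape of the flag selection at mm/vmscan.c:1358. */
static unsigned int mock_unmap_flags(const struct folio *folio)
{
	unsigned int flags = 0;

	assert(folio);
	if (folio_test_pud_mappable(folio))
		flags |= MOCK_TTU_SPLIT_HUGE_PUD;
	if (folio->order >= 9)	/* stands in for folio_test_pmd_mappable() */
		flags |= MOCK_TTU_SPLIT_HUGE_PMD;
	return flags;
}
```

Built without MOCK_PUD_THP, the caller compiles unchanged and never sets the
PUD flag, which is the behaviour a config without PUD THP would want.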


vim +1358 mm/vmscan.c

  1079	
  1080	/*
  1081	 * shrink_folio_list() returns the number of reclaimed pages
  1082	 */
  1083	static unsigned int shrink_folio_list(struct list_head *folio_list,
  1084			struct pglist_data *pgdat, struct scan_control *sc,
  1085			struct reclaim_stat *stat, bool ignore_references,
  1086			struct mem_cgroup *memcg)
  1087	{
  1088		struct folio_batch free_folios;
  1089		LIST_HEAD(ret_folios);
  1090		LIST_HEAD(demote_folios);
  1091		unsigned int nr_reclaimed = 0, nr_demoted = 0;
  1092		unsigned int pgactivate = 0;
  1093		bool do_demote_pass;
  1094		struct swap_iocb *plug = NULL;
  1095	
  1096		folio_batch_init(&free_folios);
  1097		memset(stat, 0, sizeof(*stat));
  1098		cond_resched();
  1099		do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
  1100	
  1101	retry:
  1102		while (!list_empty(folio_list)) {
  1103			struct address_space *mapping;
  1104			struct folio *folio;
  1105			enum folio_references references = FOLIOREF_RECLAIM;
  1106			bool dirty, writeback;
  1107			unsigned int nr_pages;
  1108	
  1109			cond_resched();
  1110	
  1111			folio = lru_to_folio(folio_list);
  1112			list_del(&folio->lru);
  1113	
  1114			if (!folio_trylock(folio))
  1115				goto keep;
  1116	
  1117			if (folio_contain_hwpoisoned_page(folio)) {
  1118				/*
  1119				 * unmap_poisoned_folio() can't handle large
  1120				 * folio, just skip it. memory_failure() will
  1121				 * handle it if the UCE is triggered again.
  1122				 */
  1123				if (folio_test_large(folio))
  1124					goto keep_locked;
  1125	
  1126				unmap_poisoned_folio(folio, folio_pfn(folio), false);
  1127				folio_unlock(folio);
  1128				folio_put(folio);
  1129				continue;
  1130			}
  1131	
  1132			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
  1133	
  1134			nr_pages = folio_nr_pages(folio);
  1135	
  1136			/* Account the number of base pages */
  1137			sc->nr_scanned += nr_pages;
  1138	
  1139			if (unlikely(!folio_evictable(folio)))
  1140				goto activate_locked;
  1141	
  1142			if (!sc->may_unmap && folio_mapped(folio))
  1143				goto keep_locked;
  1144	
  1145			/*
  1146			 * The number of dirty pages determines if a node is marked
  1147			 * reclaim_congested. kswapd will stall and start writing
  1148			 * folios if the tail of the LRU is all dirty unqueued folios.
  1149			 */
  1150			folio_check_dirty_writeback(folio, &dirty, &writeback);
  1151			if (dirty || writeback)
  1152				stat->nr_dirty += nr_pages;
  1153	
  1154			if (dirty && !writeback)
  1155				stat->nr_unqueued_dirty += nr_pages;
  1156	
  1157			/*
  1158			 * Treat this folio as congested if folios are cycling
  1159			 * through the LRU so quickly that the folios marked
  1160			 * for immediate reclaim are making it to the end of
  1161			 * the LRU a second time.
  1162			 */
  1163			if (writeback && folio_test_reclaim(folio))
  1164				stat->nr_congested += nr_pages;
  1165	
  1166			/*
  1167			 * If a folio at the tail of the LRU is under writeback, there
  1168			 * are three cases to consider.
  1169			 *
  1170			 * 1) If reclaim is encountering an excessive number
  1171			 *    of folios under writeback and this folio has both
  1172			 *    the writeback and reclaim flags set, then it
  1173			 *    indicates that folios are being queued for I/O but
  1174			 *    are being recycled through the LRU before the I/O
  1175			 *    can complete. Waiting on the folio itself risks an
  1176			 *    indefinite stall if it is impossible to writeback
  1177			 *    the folio due to I/O error or disconnected storage
  1178			 *    so instead note that the LRU is being scanned too
  1179			 *    quickly and the caller can stall after the folio
  1180			 *    list has been processed.
  1181			 *
  1182			 * 2) Global or new memcg reclaim encounters a folio that is
  1183			 *    not marked for immediate reclaim, or the caller does not
  1184			 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
  1185			 *    not to fs), or the folio belongs to a mapping where
  1186			 *    waiting on writeback during reclaim may lead to a deadlock.
  1187			 *    In this case mark the folio for immediate reclaim and
  1188			 *    continue scanning.
  1189			 *
  1190			 *    Require may_enter_fs() because we would wait on fs, which
  1191			 *    may not have submitted I/O yet. And the loop driver might
  1192			 *    enter reclaim, and deadlock if it waits on a folio for
  1193			 *    which it is needed to do the write (loop masks off
  1194			 *    __GFP_IO|__GFP_FS for this reason); but more thought
  1195			 *    would probably show more reasons.
  1196			 *
  1197			 * 3) Legacy memcg encounters a folio that already has the
  1198			 *    reclaim flag set. memcg does not have any dirty folio
  1199			 *    throttling so we could easily OOM just because too many
  1200			 *    folios are in writeback and there is nothing else to
  1201			 *    reclaim. Wait for the writeback to complete.
  1202			 *
  1203			 * In cases 1) and 2) we activate the folios to get them out of
  1204			 * the way while we continue scanning for clean folios on the
  1205			 * inactive list and refilling from the active list. The
  1206			 * observation here is that waiting for disk writes is more
  1207			 * expensive than potentially causing reloads down the line.
  1208			 * Since they're marked for immediate reclaim, they won't put
  1209			 * memory pressure on the cache working set any longer than it
  1210			 * takes to write them to disk.
  1211			 */
  1212			if (folio_test_writeback(folio)) {
  1213				mapping = folio_mapping(folio);
  1214	
  1215				/* Case 1 above */
  1216				if (current_is_kswapd() &&
  1217				    folio_test_reclaim(folio) &&
  1218				    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
  1219					stat->nr_immediate += nr_pages;
  1220					goto activate_locked;
  1221	
  1222				/* Case 2 above */
  1223				} else if (writeback_throttling_sane(sc) ||
  1224				    !folio_test_reclaim(folio) ||
  1225				    !may_enter_fs(folio, sc->gfp_mask) ||
  1226				    (mapping &&
  1227				     mapping_writeback_may_deadlock_on_reclaim(mapping))) {
  1228					/*
  1229					 * This is slightly racy -
  1230					 * folio_end_writeback() might have
  1231					 * just cleared the reclaim flag, then
  1232					 * setting the reclaim flag here ends up
  1233					 * interpreted as the readahead flag - but
  1234					 * that does not matter enough to care.
  1235					 * What we do want is for this folio to
  1236					 * have the reclaim flag set next time
  1237					 * memcg reclaim reaches the tests above,
  1238					 * so it will then wait for writeback to
  1239					 * avoid OOM; and it's also appropriate
  1240					 * in global reclaim.
  1241					 */
  1242					folio_set_reclaim(folio);
  1243					stat->nr_writeback += nr_pages;
  1244					goto activate_locked;
  1245	
  1246				/* Case 3 above */
  1247				} else {
  1248					folio_unlock(folio);
  1249					folio_wait_writeback(folio);
  1250					/* then go back and try same folio again */
  1251					list_add_tail(&folio->lru, folio_list);
  1252					continue;
  1253				}
  1254			}
  1255	
  1256			if (!ignore_references)
  1257				references = folio_check_references(folio, sc);
  1258	
  1259			switch (references) {
  1260			case FOLIOREF_ACTIVATE:
  1261				goto activate_locked;
  1262			case FOLIOREF_KEEP:
  1263				stat->nr_ref_keep += nr_pages;
  1264				goto keep_locked;
  1265			case FOLIOREF_RECLAIM:
  1266			case FOLIOREF_RECLAIM_CLEAN:
  1267				; /* try to reclaim the folio below */
  1268			}
  1269	
  1270			/*
  1271			 * Before reclaiming the folio, try to relocate
  1272			 * its contents to another node.
  1273			 */
  1274			if (do_demote_pass &&
  1275			    (thp_migration_supported() || !folio_test_large(folio))) {
  1276				list_add(&folio->lru, &demote_folios);
  1277				folio_unlock(folio);
  1278				continue;
  1279			}
  1280	
  1281			/*
  1282			 * Anonymous process memory has backing store?
  1283			 * Try to allocate it some swap space here.
  1284			 * Lazyfree folio could be freed directly
  1285			 */
  1286			if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
  1287					!folio_test_swapcache(folio)) {
  1288				if (!(sc->gfp_mask & __GFP_IO))
  1289					goto keep_locked;
  1290				if (folio_maybe_dma_pinned(folio))
  1291					goto keep_locked;
  1292				if (folio_test_large(folio)) {
  1293					/* cannot split folio, skip it */
  1294					if (folio_expected_ref_count(folio) !=
  1295					    folio_ref_count(folio) - 1)
  1296						goto activate_locked;
  1297					/*
  1298					 * Split partially mapped folios right away.
  1299					 * We can free the unmapped pages without IO.
  1300					 */
  1301					if (data_race(!list_empty(&folio->_deferred_list) &&
  1302					    folio_test_partially_mapped(folio)) &&
  1303					    split_folio_to_list(folio, folio_list))
  1304						goto activate_locked;
  1305				}
  1306				if (folio_alloc_swap(folio)) {
  1307					int __maybe_unused order = folio_order(folio);
  1308	
  1309					if (!folio_test_large(folio))
  1310						goto activate_locked_split;
  1311					/* Fallback to swap normal pages */
  1312					if (split_folio_to_list(folio, folio_list))
  1313						goto activate_locked;
  1314	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
  1315					if (nr_pages >= HPAGE_PMD_NR) {
  1316						count_memcg_folio_events(folio,
  1317							THP_SWPOUT_FALLBACK, 1);
  1318						count_vm_event(THP_SWPOUT_FALLBACK);
  1319					}
  1320	#endif
  1321					count_mthp_stat(order, MTHP_STAT_SWPOUT_FALLBACK);
  1322					if (folio_alloc_swap(folio))
  1323						goto activate_locked_split;
  1324				}
  1325				/*
  1326				 * Normally the folio will be dirtied in unmap because
  1327				 * its pte should be dirty. A special case is MADV_FREE
  1328				 * page. The page's pte could have dirty bit cleared but
  1329				 * the folio's SwapBacked flag is still set because
  1330				 * clearing the dirty bit and SwapBacked flag has no
  1331				 * lock protected. For such folio, unmap will not set
  1332				 * dirty bit for it, so folio reclaim will not write the
  1333				 * folio out. This can cause data corruption when the
  1334				 * folio is swapped in later. Always setting the dirty
  1335				 * flag for the folio solves the problem.
  1336				 */
  1337				folio_mark_dirty(folio);
  1338			}
  1339	
  1340			/*
  1341			 * If the folio was split above, the tail pages will make
  1342			 * their own pass through this function and be accounted
  1343			 * then.
  1344			 */
  1345			if ((nr_pages > 1) && !folio_test_large(folio)) {
  1346				sc->nr_scanned -= (nr_pages - 1);
  1347				nr_pages = 1;
  1348			}
  1349	
  1350			/*
  1351			 * The folio is mapped into the page tables of one or more
  1352			 * processes. Try to unmap it here.
  1353			 */
  1354			if (folio_mapped(folio)) {
  1355				enum ttu_flags flags = TTU_BATCH_FLUSH;
  1356				bool was_swapbacked = folio_test_swapbacked(folio);
  1357	
> 1358				if (folio_test_pud_mappable(folio))
  1359					flags |= TTU_SPLIT_HUGE_PUD;
  1360				if (folio_test_pmd_mappable(folio))
  1361					flags |= TTU_SPLIT_HUGE_PMD;
  1362				/*
  1363				 * Without TTU_SYNC, try_to_unmap will only begin to
  1364				 * hold PTL from the first present PTE within a large
  1365				 * folio. Some initial PTEs might be skipped due to
  1366				 * races with parallel PTE writes in which PTEs can be
  1367				 * cleared temporarily before being written new present
  1368				 * values. This will lead to a large folio is still
  1369				 * mapped while some subpages have been partially
  1370				 * unmapped after try_to_unmap; TTU_SYNC helps
  1371				 * try_to_unmap acquire PTL from the first PTE,
  1372				 * eliminating the influence of temporary PTE values.
  1373				 */
  1374				if (folio_test_large(folio))
  1375					flags |= TTU_SYNC;
  1376	
  1377				try_to_unmap(folio, flags);
  1378				if (folio_mapped(folio)) {
  1379					stat->nr_unmap_fail += nr_pages;
  1380					if (!was_swapbacked &&
  1381					    folio_test_swapbacked(folio))
  1382						stat->nr_lazyfree_fail += nr_pages;
  1383					goto activate_locked;
  1384				}
  1385			}
  1386	
  1387			/*
  1388			 * Folio is unmapped now so it cannot be newly pinned anymore.
  1389			 * No point in trying to reclaim folio if it is pinned.
  1390			 * Furthermore we don't want to reclaim underlying fs metadata
  1391			 * if the folio is pinned and thus potentially modified by the
  1392			 * pinning process as that may upset the filesystem.
  1393			 */
  1394			if (folio_maybe_dma_pinned(folio))
  1395				goto activate_locked;
  1396	
  1397			mapping = folio_mapping(folio);
  1398			if (folio_test_dirty(folio)) {
  1399				if (folio_is_file_lru(folio)) {
  1400					/*
  1401					 * Immediately reclaim when written back.
  1402					 * Similar in principle to folio_deactivate()
  1403					 * except we already have the folio isolated
  1404					 * and know it's dirty
  1405					 */
  1406					node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE,
  1407							nr_pages);
  1408					if (!folio_test_reclaim(folio))
  1409						folio_set_reclaim(folio);
  1410	
  1411					goto activate_locked;
  1412				}
  1413	
  1414				if (references == FOLIOREF_RECLAIM_CLEAN)
  1415					goto keep_locked;
  1416				if (!may_enter_fs(folio, sc->gfp_mask))
  1417					goto keep_locked;
  1418				if (!sc->may_writepage)
  1419					goto keep_locked;
  1420	
  1421				/*
  1422				 * Folio is dirty. Flush the TLB if a writable entry
  1423				 * potentially exists to avoid CPU writes after I/O
  1424				 * starts and then write it out here.
  1425				 */
  1426				try_to_unmap_flush_dirty();
  1427				switch (pageout(folio, mapping, &plug, folio_list)) {
  1428				case PAGE_KEEP:
  1429					goto keep_locked;
  1430				case PAGE_ACTIVATE:
  1431					/*
  1432					 * If shmem folio is split when writeback to swap,
  1433					 * the tail pages will make their own pass through
  1434					 * this function and be accounted then.
  1435					 */
  1436					if (nr_pages > 1 && !folio_test_large(folio)) {
  1437						sc->nr_scanned -= (nr_pages - 1);
  1438						nr_pages = 1;
  1439					}
  1440					goto activate_locked;
  1441				case PAGE_SUCCESS:
  1442					if (nr_pages > 1 && !folio_test_large(folio)) {
  1443						sc->nr_scanned -= (nr_pages - 1);
  1444						nr_pages = 1;
  1445					}
  1446					stat->nr_pageout += nr_pages;
  1447	
  1448					if (folio_test_writeback(folio))
  1449						goto keep;
  1450					if (folio_test_dirty(folio))
  1451						goto keep;
  1452	
  1453					/*
  1454					 * A synchronous write - probably a ramdisk.  Go
  1455					 * ahead and try to reclaim the folio.
  1456					 */
  1457					if (!folio_trylock(folio))
  1458						goto keep;
  1459					if (folio_test_dirty(folio) ||
  1460					    folio_test_writeback(folio))
  1461						goto keep_locked;
  1462					mapping = folio_mapping(folio);
  1463					fallthrough;
  1464				case PAGE_CLEAN:
  1465					; /* try to free the folio below */
  1466				}
  1467			}
  1468	
  1469			/*
  1470			 * If the folio has buffers, try to free the buffer
  1471			 * mappings associated with this folio. If we succeed
  1472			 * we try to free the folio as well.
  1473			 *
  1474			 * We do this even if the folio is dirty.
  1475			 * filemap_release_folio() does not perform I/O, but it
  1476			 * is possible for a folio to have the dirty flag set,
  1477			 * but it is actually clean (all its buffers are clean).
  1478			 * This happens if the buffers were written out directly,
  1479			 * with submit_bh(). ext3 will do this, as well as
  1480			 * the blockdev mapping.  filemap_release_folio() will
  1481			 * discover that cleanness and will drop the buffers
  1482			 * and mark the folio clean - it can be freed.
  1483			 *
  1484			 * Rarely, folios can have buffers and no ->mapping.
  1485			 * These are the folios which were not successfully
  1486			 * invalidated in truncate_cleanup_folio().  We try to
  1487			 * drop those buffers here and if that worked, and the
  1488			 * folio is no longer mapped into process address space
  1489			 * (refcount == 1) it can be freed.  Otherwise, leave
  1490			 * the folio on the LRU so it is swappable.
  1491			 */
  1492			if (folio_needs_release(folio)) {
  1493				if (!filemap_release_folio(folio, sc->gfp_mask))
  1494					goto activate_locked;
  1495				if (!mapping && folio_ref_count(folio) == 1) {
  1496					folio_unlock(folio);
  1497					if (folio_put_testzero(folio))
  1498						goto free_it;
  1499					else {
  1500						/*
  1501						 * rare race with speculative reference.
  1502						 * the speculative reference will free
  1503						 * this folio shortly, so we may
  1504						 * increment nr_reclaimed here (and
  1505						 * leave it off the LRU).
  1506						 */
  1507						nr_reclaimed += nr_pages;
  1508						continue;
  1509					}
  1510				}
  1511			}
  1512	
  1513			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
  1514				/* follow __remove_mapping for reference */
  1515				if (!folio_ref_freeze(folio, 1))
  1516					goto keep_locked;
  1517				/*
  1518				 * The folio has only one reference left, which is
  1519				 * from the isolation. After the caller puts the
  1520				 * folio back on the lru and drops the reference, the
  1521				 * folio will be freed anyway. It doesn't matter
  1522				 * which lru it goes on. So we don't bother checking
  1523				 * the dirty flag here.
  1524				 */
  1525				count_vm_events(PGLAZYFREED, nr_pages);
  1526				count_memcg_folio_events(folio, PGLAZYFREED, nr_pages);
  1527			} else if (!mapping || !__remove_mapping(mapping, folio, true,
  1528								 sc->target_mem_cgroup))
  1529				goto keep_locked;
  1530	
  1531			folio_unlock(folio);
  1532	free_it:
  1533			/*
  1534			 * Folio may get swapped out as a whole, need to account
  1535			 * all pages in it.
  1536			 */
  1537			nr_reclaimed += nr_pages;
  1538	
  1539			folio_unqueue_deferred_split(folio);
  1540			if (folio_batch_add(&free_folios, folio) == 0) {
  1541				mem_cgroup_uncharge_folios(&free_folios);
  1542				try_to_unmap_flush();
  1543				free_unref_folios(&free_folios);
  1544			}
  1545			continue;
  1546	
  1547	activate_locked_split:
  1548			/*
  1549			 * The tail pages that are failed to add into swap cache
  1550			 * reach here.  Fixup nr_scanned and nr_pages.
  1551			 */
  1552			if (nr_pages > 1) {
  1553				sc->nr_scanned -= (nr_pages - 1);
  1554				nr_pages = 1;
  1555			}
  1556	activate_locked:
  1557			/* Not a candidate for swapping, so reclaim swap space. */
  1558			if (folio_test_swapcache(folio) &&
  1559			    (mem_cgroup_swap_full(folio) || folio_test_mlocked(folio)))
  1560				folio_free_swap(folio);
  1561			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
  1562			if (!folio_test_mlocked(folio)) {
  1563				int type = folio_is_file_lru(folio);
  1564				folio_set_active(folio);
  1565				stat->nr_activate[type] += nr_pages;
  1566				count_memcg_folio_events(folio, PGACTIVATE, nr_pages);
  1567			}
  1568	keep_locked:
  1569			folio_unlock(folio);
  1570	keep:
  1571			list_add(&folio->lru, &ret_folios);
  1572			VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
  1573					folio_test_unevictable(folio), folio);
  1574		}
  1575		/* 'folio_list' is always empty here */
  1576	
  1577		/* Migrate folios selected for demotion */
  1578		nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
  1579		nr_reclaimed += nr_demoted;
  1580		stat->nr_demoted += nr_demoted;
  1581		/* Folios that could not be demoted are still in @demote_folios */
  1582		if (!list_empty(&demote_folios)) {
  1583			/* Folios which weren't demoted go back on @folio_list */
  1584			list_splice_init(&demote_folios, folio_list);
  1585	
  1586			/*
  1587			 * goto retry to reclaim the undemoted folios in folio_list if
  1588			 * desired.
  1589			 *
  1590			 * Reclaiming directly from top tier nodes is not often desired
  1591			 * due to it breaking the LRU ordering: in general memory
  1592			 * should be reclaimed from lower tier nodes and demoted from
  1593			 * top tier nodes.
  1594			 *
  1595			 * However, disabling reclaim from top tier nodes entirely
  1596			 * would cause ooms in edge scenarios where lower tier memory
  1597			 * is unreclaimable for whatever reason, eg memory being
  1598			 * mlocked or too hot to reclaim. We can disable reclaim
  1599			 * from top tier nodes in proactive reclaim though as that is
  1600			 * not real memory pressure.
  1601			 */
  1602			if (!sc->proactive) {
  1603				do_demote_pass = false;
  1604				goto retry;
  1605			}
  1606		}
  1607	
  1608		pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
  1609	
  1610		mem_cgroup_uncharge_folios(&free_folios);
  1611		try_to_unmap_flush();
  1612		free_unref_folios(&free_folios);
  1613	
  1614		list_splice(&ret_folios, folio_list);
  1615		count_vm_events(PGACTIVATE, pgactivate);
  1616	
  1617		if (plug)
  1618			swap_write_unplug(plug);
  1619		return nr_reclaimed;
  1620	}
  1621	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02  4:00 ` Matthew Wilcox
@ 2026-02-02  9:06   ` David Hildenbrand (arm)
  2026-02-03 21:11     ` Usama Arif
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-02  9:06 UTC (permalink / raw)
  To: Matthew Wilcox, Usama Arif
  Cc: ziy, Andrew Morton, lorenzo.stoakes, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On 2/2/26 05:00, Matthew Wilcox wrote:
> On Sun, Feb 01, 2026 at 04:50:17PM -0800, Usama Arif wrote:
>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>> applications to benefit from reduced TLB pressure without requiring
>> hugetlbfs. The patches are based on top of
>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
> 
> I suggest this has not had enough testing.  There are dozens of places
> in the MM which assume that if a folio is at least PMD size then it is
> exactly PMD size.  Everywhere that calls folio_test_pmd_mappable() needs
> to be audited to make sure that it will work properly if the folio is
> larger than PMD size.

I think the hack (ehm trick) in this patch set is to do it just like dax 
PUDs: only map through a PUD or through PTEs, not through PMDs.

That also avoids dealing with mapcounts until I have sorted that out.

-- 
Cheers

David


* Re: [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP
  2026-02-02  0:50 ` [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP Usama Arif
  2026-02-02  4:44   ` kernel test robot
@ 2026-02-02  9:12   ` kernel test robot
  1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2026-02-02  9:12 UTC (permalink / raw)
  To: Usama Arif; +Cc: llvm, oe-kbuild-all

Hi Usama,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-add-PUD-THP-ptdesc-and-rmap-support/20260202-085725
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260202005451.774496-6-usamaarif642%40gmail.com
patch subject: [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP
config: powerpc-randconfig-002-20260202 (https://download.01.org/0day-ci/archive/20260202/202602021716.8bd0cxap-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9b8addffa70cee5b2acc5454712d9cf78ce45710)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260202/202602021716.8bd0cxap-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602021716.8bd0cxap-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/migrate.c:1867:8: error: call to undeclared function 'folio_test_pud_mappable'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1867 |                         if (folio_test_pud_mappable(folio)) {
         |                             ^
   mm/migrate.c:1867:8: note: did you mean 'folio_test_pmd_mappable'?
   include/linux/huge_mm.h:625:20: note: 'folio_test_pmd_mappable' declared here
     625 | static inline bool folio_test_pmd_mappable(struct folio *folio)
         |                    ^
   1 error generated.
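
For illustration only: one common way this class of build error is avoided is to give the helper a fallback definition for configurations where the feature is compiled out, so every caller still sees a declaration. The sketch below is a user-space model under that assumption. `toy_folio`, `TOY_CONFIG_PUD_THP`, `TOY_PUD_ORDER` and both `toy_*` functions are hypothetical names, not the series' actual code, and the order of 18 assumes 1GB huge pages on 4KB base pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct folio: only the order matters here. */
struct toy_folio {
	unsigned int order;
};

/* Assumed for illustration: 1GB / 4KB = 2^18 base pages. */
#define TOY_PUD_ORDER 18

#ifdef TOY_CONFIG_PUD_THP
/* Feature enabled: report whether the folio is at least PUD-sized. */
static inline bool toy_folio_test_pud_mappable(const struct toy_folio *folio)
{
	return folio->order >= TOY_PUD_ORDER;
}
#else
/* Feature disabled: the stub keeps callers compiling and is always false. */
static inline bool toy_folio_test_pud_mappable(const struct toy_folio *folio)
{
	(void)folio;
	return false;
}
#endif

/* A caller that builds in both configurations without its own #ifdef. */
static bool toy_should_split_first(const struct toy_folio *folio)
{
	assert(folio != NULL);
	return toy_folio_test_pud_mappable(folio);
}
```

With `TOY_CONFIG_PUD_THP` undefined, as in this sketch, `toy_should_split_first()` returns false for every folio, so the caller's PUD branch becomes dead code rather than a compile error.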


vim +/folio_test_pud_mappable +1867 mm/migrate.c

  1773	
  1774	/*
  1775	 * migrate_pages_batch() first unmaps folios in the from list as many as
  1776	 * possible, then move the unmapped folios.
  1777	 *
  1778	 * We only batch migration if mode == MIGRATE_ASYNC to avoid to wait a
  1779	 * lock or bit when we have locked more than one folio.  Which may cause
  1780	 * deadlock (e.g., for loop device).  So, if mode != MIGRATE_ASYNC, the
  1781	 * length of the from list must be <= 1.
  1782	 */
  1783	static int migrate_pages_batch(struct list_head *from,
  1784			new_folio_t get_new_folio, free_folio_t put_new_folio,
  1785			unsigned long private, enum migrate_mode mode, int reason,
  1786			struct list_head *ret_folios, struct list_head *split_folios,
  1787			struct migrate_pages_stats *stats, int nr_pass)
  1788	{
  1789		int retry = 1;
  1790		int thp_retry = 1;
  1791		int nr_failed = 0;
  1792		int nr_retry_pages = 0;
  1793		int pass = 0;
  1794		bool is_thp = false;
  1795		bool is_large = false;
  1796		struct folio *folio, *folio2, *dst = NULL;
  1797		int rc, rc_saved = 0, nr_pages;
  1798		LIST_HEAD(unmap_folios);
  1799		LIST_HEAD(dst_folios);
  1800		bool nosplit = (reason == MR_NUMA_MISPLACED);
  1801	
  1802		VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
  1803				!list_empty(from) && !list_is_singular(from));
  1804	
  1805		for (pass = 0; pass < nr_pass && retry; pass++) {
  1806			retry = 0;
  1807			thp_retry = 0;
  1808			nr_retry_pages = 0;
  1809	
  1810			list_for_each_entry_safe(folio, folio2, from, lru) {
  1811				is_large = folio_test_large(folio);
  1812				is_thp = folio_test_pmd_mappable(folio);
  1813				nr_pages = folio_nr_pages(folio);
  1814	
  1815				cond_resched();
  1816	
  1817				/*
  1818				 * The rare folio on the deferred split list should
  1819				 * be split now. It should not count as a failure:
  1820				 * but increment nr_failed because, without doing so,
  1821				 * migrate_pages() may report success with (split but
  1822				 * unmigrated) pages still on its fromlist; whereas it
  1823				 * always reports success when its fromlist is empty.
  1824				 * stats->nr_thp_failed should be increased too,
  1825				 * otherwise stats inconsistency will happen when
  1826				 * migrate_pages_batch is called via migrate_pages()
  1827				 * with MIGRATE_SYNC and MIGRATE_ASYNC.
  1828				 *
  1829				 * Only check it without removing it from the list.
  1830				 * Since the folio can be on deferred_split_scan()
  1831				 * local list and removing it can cause the local list
  1832				 * corruption. Folio split process below can handle it
  1833				 * with the help of folio_ref_freeze().
  1834				 *
  1835				 * nr_pages > 2 is needed to avoid checking order-1
  1836				 * page cache folios. They exist, in contrast to
  1837				 * non-existent order-1 anonymous folios, and do not
  1838				 * use _deferred_list.
  1839				 */
  1840				if (nr_pages > 2 &&
  1841				   !list_empty(&folio->_deferred_list) &&
  1842				   folio_test_partially_mapped(folio)) {
  1843					if (!try_split_folio(folio, split_folios, mode)) {
  1844						nr_failed++;
  1845						stats->nr_thp_failed += is_thp;
  1846						stats->nr_thp_split += is_thp;
  1847						stats->nr_split++;
  1848						continue;
  1849					}
  1850				}
  1851	
  1852				/*
  1853				 * Large folio migration might be unsupported or
  1854				 * the allocation might be failed so we should retry
  1855				 * on the same folio with the large folio split
  1856				 * to normal folios.
  1857				 *
  1858				 * Split folios are put in split_folios, and
  1859				 * we will migrate them after the rest of the
  1860				 * list is processed.
  1861				 */
  1862				/*
  1863				 * PUD-sized folios cannot be migrated directly,
  1864				 * but can be split. Try to split them first and
  1865				 * migrate the resulting smaller folios.
  1866				 */
> 1867				if (folio_test_pud_mappable(folio)) {
  1868					nr_failed++;
  1869					stats->nr_thp_failed++;
  1870					if (!try_split_folio(folio, split_folios, mode)) {
  1871						stats->nr_thp_split++;
  1872						stats->nr_split++;
  1873						continue;
  1874					}
  1875					stats->nr_failed_pages += nr_pages;
  1876					list_move_tail(&folio->lru, ret_folios);
  1877					continue;
  1878				}
  1879				if (!thp_migration_supported() && is_thp) {
  1880					nr_failed++;
  1881					stats->nr_thp_failed++;
  1882					if (!try_split_folio(folio, split_folios, mode)) {
  1883						stats->nr_thp_split++;
  1884						stats->nr_split++;
  1885						continue;
  1886					}
  1887					stats->nr_failed_pages += nr_pages;
  1888					list_move_tail(&folio->lru, ret_folios);
  1889					continue;
  1890				}
  1891	
  1892				/*
  1893				 * If we are holding the last folio reference, the folio
  1894				 * was freed from under us, so just drop our reference.
  1895				 */
  1896				if (likely(!page_has_movable_ops(&folio->page)) &&
  1897				    folio_ref_count(folio) == 1) {
  1898					folio_clear_active(folio);
  1899					folio_clear_unevictable(folio);
  1900					list_del(&folio->lru);
  1901					migrate_folio_done(folio, reason);
  1902					stats->nr_succeeded += nr_pages;
  1903					stats->nr_thp_succeeded += is_thp;
  1904					continue;
  1905				}
  1906	
  1907				rc = migrate_folio_unmap(get_new_folio, put_new_folio,
  1908						private, folio, &dst, mode, ret_folios);
  1909				/*
  1910				 * The rules are:
  1911				 *	0: folio will be put on unmap_folios list,
  1912				 *	   dst folio put on dst_folios list
  1913				 *	-EAGAIN: stay on the from list
  1914				 *	-ENOMEM: stay on the from list
  1915				 *	Other errno: put on ret_folios list
  1916				 */
  1917				switch(rc) {
  1918				case -ENOMEM:
  1919					/*
  1920					 * When memory is low, don't bother to try to migrate
  1921					 * other folios, move unmapped folios, then exit.
  1922					 */
  1923					nr_failed++;
  1924					stats->nr_thp_failed += is_thp;
  1925					/* Large folio NUMA faulting doesn't split to retry. */
  1926					if (is_large && !nosplit) {
  1927						int ret = try_split_folio(folio, split_folios, mode);
  1928	
  1929						if (!ret) {
  1930							stats->nr_thp_split += is_thp;
  1931							stats->nr_split++;
  1932							break;
  1933						} else if (reason == MR_LONGTERM_PIN &&
  1934							   ret == -EAGAIN) {
  1935							/*
  1936							 * Try again to split large folio to
  1937							 * mitigate the failure of longterm pinning.
  1938							 */
  1939							retry++;
  1940							thp_retry += is_thp;
  1941							nr_retry_pages += nr_pages;
  1942							/* Undo duplicated failure counting. */
  1943							nr_failed--;
  1944							stats->nr_thp_failed -= is_thp;
  1945							break;
  1946						}
  1947					}
  1948	
  1949					stats->nr_failed_pages += nr_pages + nr_retry_pages;
  1950					/* nr_failed isn't updated for not used */
  1951					stats->nr_thp_failed += thp_retry;
  1952					rc_saved = rc;
  1953					if (list_empty(&unmap_folios))
  1954						goto out;
  1955					else
  1956						goto move;
  1957				case -EAGAIN:
  1958					retry++;
  1959					thp_retry += is_thp;
  1960					nr_retry_pages += nr_pages;
  1961					break;
  1962				case 0:
  1963					list_move_tail(&folio->lru, &unmap_folios);
  1964					list_add_tail(&dst->lru, &dst_folios);
  1965					break;
  1966				default:
  1967					/*
  1968					 * Permanent failure (-EBUSY, etc.):
  1969					 * unlike -EAGAIN case, the failed folio is
  1970					 * removed from migration folio list and not
  1971					 * retried in the next outer loop.
  1972					 */
  1973					nr_failed++;
  1974					stats->nr_thp_failed += is_thp;
  1975					stats->nr_failed_pages += nr_pages;
  1976					break;
  1977				}
  1978			}
  1979		}
  1980		nr_failed += retry;
  1981		stats->nr_thp_failed += thp_retry;
  1982		stats->nr_failed_pages += nr_retry_pages;
  1983	move:
  1984		/* Flush TLBs for all unmapped folios */
  1985		try_to_unmap_flush();
  1986	
  1987		retry = 1;
  1988		for (pass = 0; pass < nr_pass && retry; pass++) {
  1989			retry = 0;
  1990			thp_retry = 0;
  1991			nr_retry_pages = 0;
  1992	
  1993			/* Move the unmapped folios */
  1994			migrate_folios_move(&unmap_folios, &dst_folios,
  1995					put_new_folio, private, mode, reason,
  1996					ret_folios, stats, &retry, &thp_retry,
  1997					&nr_failed, &nr_retry_pages);
  1998		}
  1999		nr_failed += retry;
  2000		stats->nr_thp_failed += thp_retry;
  2001		stats->nr_failed_pages += nr_retry_pages;
  2002	
  2003		rc = rc_saved ? : nr_failed;
  2004	out:
  2005		/* Cleanup remaining folios */
  2006		migrate_folios_undo(&unmap_folios, &dst_folios,
  2007				put_new_folio, private, ret_folios);
  2008	
  2009		return rc;
  2010	}
  2011	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-02  0:50 ` [RFC 01/12] mm: add PUD THP ptdesc and rmap support Usama Arif
  2026-02-02  3:10   ` kernel test robot
@ 2026-02-02 10:44   ` Kiryl Shutsemau
  2026-02-02 16:01     ` Zi Yan
  2026-02-02 12:15   ` Lorenzo Stoakes
  2 siblings, 1 reply; 52+ messages in thread
From: Kiryl Shutsemau @ 2026-02-02 10:44 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm,
	hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
> For page table management, PUD THPs need to pre-deposit page tables
> that will be used when the huge page is later split. When a PUD THP
> is allocated, we cannot know in advance when or why it might need to
> be split (COW, partial unmap, reclaim), but we need page tables ready
> for that eventuality. Similar to how PMD THPs deposit a single PTE
> table, PUD THPs deposit a PMD table which itself contains deposited
> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
> infrastructure and a new pud_huge_pmd field in ptdesc to store the
> deposited PMD.
> 
> The deposited PMD tables are stored as a singly-linked stack using only
> page->lru.next as the link pointer. A doubly-linked list using the
> standard list_head mechanism would cause memory corruption: list_del()
> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
> tables have their own deposited PTE tables stored in pmd_huge_pte,
> poisoning lru.prev would corrupt the PTE table list and cause crashes
> when withdrawing PTE tables during split. PMD THPs don't have this
> problem because their deposited PTE tables don't have sub-deposits.
> Using only lru.next avoids the overlap entirely.
> 
> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
> have. The page_vma_mapped_walk() function is extended to recognize and
> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
> flag tells the unmap path to split PUD THPs before proceeding, since
> there is no PUD-level migration entry format - the split converts the
> single PUD mapping into individual PTE mappings that can be migrated
> or swapped normally.
> 
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/huge_mm.h  |  5 +++
>  include/linux/mm.h       | 19 ++++++++
>  include/linux/mm_types.h |  5 ++-
>  include/linux/pgtable.h  |  8 ++++
>  include/linux/rmap.h     |  7 ++-
>  mm/huge_memory.c         |  8 ++++
>  mm/internal.h            |  3 ++
>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>  10 files changed, 260 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a4d9f964dfdea..e672e45bb9cc7 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  		unsigned long address);
>  
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +			   unsigned long address);
>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
>  		    unsigned long cp_flags);
>  #else
> +static inline void
> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +		      unsigned long address) {}
>  static inline int
>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ab2e7e30aef96..a15e18df0f771 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>   * considered ready to switch to split PUD locks yet; there may be places
>   * which need to be converted from page_table_lock.
>   */
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +static inline struct page *pud_pgtable_page(pud_t *pud)
> +{
> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
> +
> +	return virt_to_page((void *)((unsigned long)pud & mask));
> +}
> +
> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
> +{
> +	return page_ptdesc(pud_pgtable_page(pud));
> +}
> +
> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
> +#endif
> +
>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>  {
>  	return &mm->page_table_lock;
> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>  {
>  	__pagetable_ctor(ptdesc);
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +	ptdesc->pud_huge_pmd = NULL;
> +#endif
>  }
>  
>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 78950eb8926dc..26a38490ae2e1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -577,7 +577,10 @@ struct ptdesc {
>  		struct list_head pt_list;
>  		struct {
>  			unsigned long _pt_pad_1;
> -			pgtable_t pmd_huge_pte;
> +			union {
> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
> +			};
>  		};
>  	};
>  	unsigned long __page_mapping;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>  #define arch_needs_pgtable_deposit() (false)
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> +					   pmd_t *pmd_table);
> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
> +#endif
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /*
>   * This is an implementation of pmdp_establish() that is only suitable for an
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index daa92a58585d9..08cd0a0eb8763 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -101,6 +101,7 @@ enum ttu_flags {
>  					 * do a final flush if necessary */
>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>  					 * caller holds it */
> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
>  };
>  
>  #ifdef CONFIG_MMU
> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>  		unsigned long address, rmap_t flags);
>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>  	pgoff_t pgoff;
>  	struct vm_area_struct *vma;
>  	unsigned long address;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>  static inline void
>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>  {
> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>  
>  	if (likely(pvmw->ptl))
>  		spin_unlock(pvmw->ptl);
> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>  		WARN_ON_ONCE(1);
>  
>  	pvmw->ptl = NULL;
> +	pvmw->pud = NULL;
>  	pvmw->pmd = NULL;
>  	pvmw->pte = NULL;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 40cf59301c21a..3128b3beedb0a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  	spin_unlock(ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  }
> +
> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +			   unsigned long address)
> +{
> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
> +	if (pud_trans_huge(*pud))
> +		__split_huge_pud_locked(vma, pud, address);
> +}
>  #else
>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  		unsigned long address)
> diff --git a/mm/internal.h b/mm/internal.h
> index 9ee336aa03656..21d5c00f638dc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>   * in mm/rmap.c:
>   */
>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
> +#endif
>  
>  /*
>   * in mm/page_alloc.c
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index b38a1d00c971b..d31eafba38041 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>  	return true;
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> +{
> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
> +		return false;
> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
> +		return false;
> +	return true;
> +}
> +#endif
> +
>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>  {
>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	pud_t *pud;
>  	pmd_t pmde;
>  
> +	/* The only possible pud mapping has been handled on last iteration */
> +	if (pvmw->pud && !pvmw->pmd)
> +		return not_found(pvmw);
> +
>  	/* The only possible pmd mapping has been handled on last iteration */
>  	if (pvmw->pmd && !pvmw->pte)
>  		return not_found(pvmw);
> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  			continue;
>  		}
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +		/* Check for PUD-mapped THP */
> +		if (pud_trans_huge(*pud)) {
> +			pvmw->pud = pud;
> +			pvmw->ptl = pud_lock(mm, pud);
> +			if (likely(pud_trans_huge(*pud))) {
> +				if (pvmw->flags & PVMW_MIGRATION)
> +					return not_found(pvmw);
> +				if (!check_pud(pud_pfn(*pud), pvmw))
> +					return not_found(pvmw);
> +				return true;
> +			}
> +			/* PUD was split under us, retry at PMD level */
> +			spin_unlock(pvmw->ptl);
> +			pvmw->ptl = NULL;
> +			pvmw->pud = NULL;
> +		}
> +#endif
> +
>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
>  		/*
>  		 * Make sure the pmd value isn't cached in a register by the
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index d3aec7a9926ad..2047558ddcd79 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  }
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/*
> + * Deposit page tables for PUD THP.
> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
> + *
> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
> + * list_head. This is because lru.prev (offset 16) overlaps with
> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.

This is ugly.

Sounds like you want to use llist_node/head instead of list_head for this.

You might be able to avoid taking the lock in some cases. Note that
pud_lockptr() is mm->page_table_lock as of now.
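
To illustrate the shape of that suggestion, here is a minimal user-space sketch of a stack whose nodes carry a single link word, in the style of llist_node/llist_head. All names (`stack_node`, `stack_head`, `fake_pgtable`, `stack_push`, `stack_pop`, `to_pgtable`) are hypothetical, and unlike the kernel's llist helpers this model is not atomic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for llist_node and llist_head. */
struct stack_node {
	struct stack_node *next;
};

struct stack_head {
	struct stack_node *first;
};

/* Stand-in for a page table descriptor that embeds one link word. */
struct fake_pgtable {
	struct stack_node link;
	int id;
};

/* Push: only the single 'next' word of the node is written. */
static void stack_push(struct stack_head *head, struct stack_node *node)
{
	assert(node != NULL);
	node->next = head->first;
	head->first = node;
}

/* Pop: returns NULL when the stack is empty; no poisoning of other fields. */
static struct stack_node *stack_pop(struct stack_head *head)
{
	struct stack_node *node = head->first;

	if (node)
		head->first = node->next;
	return node;
}

/* Recover the containing object from its embedded link. */
static struct fake_pgtable *to_pgtable(struct stack_node *node)
{
	if (!node)
		return NULL;
	return (struct fake_pgtable *)((char *)node -
				       offsetof(struct fake_pgtable, link));
}
```

Because a node has no prev pointer, nothing in push or pop can touch a neighbouring field, which is the property the open-coded lru.next trick is trying to get.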

> + *
> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
> + */
> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> +				    pmd_t *pmd_table)
> +{
> +	pgtable_t pmd_page = virt_to_page(pmd_table);
> +
> +	assert_spin_locked(pud_lockptr(mm, pudp));
> +
> +	/* Push onto stack using only lru.next as the link */
> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
> +	pud_huge_pmd(pudp) = pmd_page;
> +}
> +
> +/*
> + * Withdraw the deposited PMD table for PUD THP split or zap.
> + * Called with PUD lock held.
> + * Returns NULL if no more PMD tables are deposited.
> + */
> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
> +{
> +	pgtable_t pmd_page;
> +
> +	assert_spin_locked(pud_lockptr(mm, pudp));
> +
> +	pmd_page = pud_huge_pmd(pudp);
> +	if (!pmd_page)
> +		return NULL;
> +
> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
> +
> +	return page_address(pmd_page);
> +}
> +
> +/*
> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
> + * No lock assertion since the PMD isn't visible yet.
> + */
> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
> +{
> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> +
> +	/* FIFO - add to front of list */
> +	if (!ptdesc->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
> +	ptdesc->pmd_huge_pte = pgtable;
> +}
> +
> +/*
> + * Withdraw a PTE table from a standalone PMD table.
> + * Returns NULL if no more PTE tables are deposited.
> + */
> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
> +{
> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> +	pgtable_t pgtable;
> +
> +	pgtable = ptdesc->pmd_huge_pte;
> +	if (!pgtable)
> +		return NULL;
> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
> +							struct page, lru);
> +	if (ptdesc->pmd_huge_pte)
> +		list_del(&pgtable->lru);
> +	return pgtable;
> +}
> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> +
>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  		     pmd_t *pmdp)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7b9879ef442d9..69acabd763da4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  	return pmd;
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/*
> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
> + * represents.
> + */
> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)

Remove the ifdef and make mm_find_pmd() call it.

And in general, try to avoid ifdeffery where possible.
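
A rough user-space sketch of that layering, using a toy two-level table. `toy_mm`, `toy_pud`, `toy_pmd`, `TOY_ENTRIES` and both lookup functions are hypothetical names, and the index arithmetic is invented for the example. It only shows the PMD lookup built on top of an unconditional PUD lookup.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ENTRIES 4

/* Toy page table levels; a NULL table pointer means "not present". */
struct toy_pmd {
	int value;
};

struct toy_pud {
	struct toy_pmd *pmd_table;
};

struct toy_mm {
	struct toy_pud *pud_table;
};

/* Always built: returns the PUD slot for addr, or NULL if no PUD table. */
static struct toy_pud *toy_find_pud(struct toy_mm *mm, unsigned long addr)
{
	assert(mm != NULL);
	if (!mm->pud_table)
		return NULL;
	return &mm->pud_table[(addr / TOY_ENTRIES) % TOY_ENTRIES];
}

/* The PMD lookup reuses the PUD lookup instead of repeating the walk. */
static struct toy_pmd *toy_find_pmd(struct toy_mm *mm, unsigned long addr)
{
	struct toy_pud *pud = toy_find_pud(mm, addr);

	if (!pud || !pud->pmd_table)
		return NULL;
	return &pud->pmd_table[addr % TOY_ENTRIES];
}
```

With one helper layered on the other, the upper-level walk exists in a single place and neither function needs a config guard.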

> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud = NULL;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	p4d = p4d_offset(pgd, address);
> +	if (!p4d_present(*p4d))
> +		goto out;
> +
> +	pud = pud_offset(p4d, address);
> +out:
> +	return pud;
> +}
> +#endif
> +
>  struct folio_referenced_arg {
>  	int mapcount;
>  	int referenced;
> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>  			SetPageAnonExclusive(page);
>  			break;
>  		case PGTABLE_LEVEL_PUD:
> -			/*
> -			 * Keep the compiler happy, we don't support anonymous
> -			 * PUD mappings.
> -			 */
> -			WARN_ON_ONCE(1);
> +			SetPageAnonExclusive(page);
>  			break;
>  		default:
>  			BUILD_BUG();
> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>  #endif
>  }
>  
> +/**
> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
> + * @folio:	The folio to add the mapping to
> + * @page:	The first page to add
> + * @vma:	The vm area in which the mapping is added
> + * @address:	The user virtual address of the first page to map
> + * @flags:	The rmap flags
> + *
> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
> + *
> + * The caller needs to hold the page table lock, and the page must be locked in
> + * the anon_vma case: to serialize mapping,index checking after setting.
> + */
> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
> +{
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
> +			      PGTABLE_LEVEL_PUD);
> +#else
> +	WARN_ON_ONCE(true);
> +#endif
> +}
> +
>  /**
>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>   * @folio:	The folio to add the mapping to.
> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		}
>  
>  		if (!pvmw.pte) {
> +			/*
> +			 * Check for PUD-mapped THP first.
> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
> +			 * split the PUD to PMD level and restart the walk.
> +			 */
> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> +				if (flags & TTU_SPLIT_HUGE_PUD) {
> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
> +					flags &= ~TTU_SPLIT_HUGE_PUD;
> +					page_vma_mapped_walk_restart(&pvmw);
> +					continue;
> +				}
> +			}
> +
>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>  					goto walk_done;
> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  	mmu_notifier_invalidate_range_start(&range);
>  
>  	while (page_vma_mapped_walk(&pvmw)) {
> +		/* Handle PUD-mapped THP first */
> +		if (!pvmw.pte && !pvmw.pmd) {
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +			/*
> +			 * PUD-mapped THP: skip migration to preserve the huge
> +			 * page. Splitting would defeat the purpose of PUD THPs.
> +			 * Return false to indicate migration failure, which
> +			 * will cause alloc_contig_range() to try a different
> +			 * memory region.
> +			 */
> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> +				page_vma_mapped_walk_done(&pvmw);
> +				ret = false;
> +				break;
> +			}
> +#endif
> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
> +
>  		/* PMD-mapped THP migration entry */
>  		if (!pvmw.pte) {
>  			__maybe_unused unsigned long pfn;
> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>  
>  	/*
>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>  	 */
>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
> -					TTU_SYNC | TTU_BATCH_FLUSH)))
> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>  		return;
>  
>  	if (folio_is_zone_device(folio) &&
> -- 
> 2.47.3
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (13 preceding siblings ...)
  2026-02-02  4:00 ` Matthew Wilcox
@ 2026-02-02 11:20 ` Lorenzo Stoakes
  2026-02-04  1:00   ` Usama Arif
  2026-02-02 16:24 ` Zi Yan
  15 siblings, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-02-02 11:20 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

OK so this is somewhat unexpected :)

It would have been nice to discuss it in the THP cabal or at a conference
etc. so we could discuss approaches ahead of time. Communication is important,
especially with major changes like this.

And PUD THP is especially problematic in that it requires pages that the page
allocator can't give us, presumably you're doing something with CMA and... it's
a whole kettle of fish.

It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
but it's kinda a weird sorta special case that we need to keep supporting.

There are questions about how this will interact with khugepaged, MADV_COLLAPSE,
mTHP (and really I want to see Nico's series land before we really consider
this).

So overall, I want to be very cautious and SLOW here. So let's please not drop
the RFC tag until David and I are ok with that?

Also the THP code base is in _dire_ need of rework, and I don't really want to
add major new features without us paying down some technical debt, to be honest.

So let's proceed with caution, and treat this as a very early bit of
experimental code.

Thanks, Lorenzo


* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02  2:44 ` [RFC 00/12] mm: PUD (1GB) THP implementation Rik van Riel
@ 2026-02-02 11:30   ` Lorenzo Stoakes
  2026-02-02 15:50     ` Zi Yan
  0 siblings, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-02-02 11:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Usama Arif, ziy, Andrew Morton, David Hildenbrand, linux-mm,
	hannes, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team, Frank van der Linden

On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
> On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
> >
> > 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
> >    or runtime, taking memory away. This requires capacity planning,
> >    administrative overhead, and makes workload orchestration much much more
> >    complex, especially colocating with workloads that don't use hugetlbfs.
> >
> To address the obvious objection "but how could we
> possibly allocate 1GB huge pages while the workload
> is running?", I am planning to pick up the CMA balancing
> patch series (thank you, Frank) and get that in an
> upstream ready shape soon.
>
> https://lkml.org/2025/9/15/1735

That link doesn't work?

Did a quick search for CMA balancing on lore, couldn't find anything, could you
provide a lore link?

>
> That patch set looks like another case where no
> amount of internal testing will find every single
> corner case, and we'll probably just want to
> merge it upstream, deploy it experimentally, and
> aggressively deal with anything that might pop up.

I'm not really in favour of this kind of approach. There's plenty of things that
were considered 'temporary' upstream that became rather permanent :)

Maybe we can't cover all corner-cases, but we need to make sure whatever we do
send upstream is maintainable, conceptually sensible and doesn't paint us into
any corners, etc.

>
> With CMA balancing, it would be possible to just
> have half (or even more) of system memory for
> movable allocations only, which would make it possible
> to allocate 1GB huge pages dynamically.

Could you expand on that?

>
> --
> All Rights Reversed.

Thanks, Lorenzo



* Re: [RFC 02/12] mm/thp: add mTHP stats infrastructure for PUD THP
  2026-02-02  0:50 ` [RFC 02/12] mm/thp: add mTHP stats infrastructure for PUD THP Usama Arif
@ 2026-02-02 11:56   ` Lorenzo Stoakes
  2026-02-05  5:53     ` Usama Arif
  0 siblings, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-02-02 11:56 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On Sun, Feb 01, 2026 at 04:50:19PM -0800, Usama Arif wrote:
> Extend the mTHP (multi-size THP) statistics infrastructure to support
> PUD-sized transparent huge pages.
>
> The mTHP framework tracks statistics for each supported THP size through
> per-order counters exposed via sysfs. To add PUD THP support, PUD_ORDER
> must be included in the set of tracked orders.
>
> With this change, PUD THP events (allocations, faults, splits, swaps)
> are tracked and exposed through the existing sysfs interface at
> /sys/kernel/mm/transparent_hugepage/hugepages-1048576kB/stats/. This
> provides visibility into PUD THP behavior for debugging and performance
> analysis.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

Yeah we really need to be basing this on mm-unstable once Nico's series has
landed.

I think it's quite important as well for you to check that khugepaged mTHP works
with all of this.

> ---
>  include/linux/huge_mm.h | 42 +++++++++++++++++++++++++++++++++++++----
>  mm/huge_memory.c        |  3 ++-
>  2 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e672e45bb9cc7..5509ba8555b6e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -76,7 +76,13 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>   * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>   * (which is a limitation of the THP implementation).
>   */
> -#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +#define THP_ORDERS_ALL_ANON_PUD		BIT(PUD_ORDER)
> +#else
> +#define THP_ORDERS_ALL_ANON_PUD		0
> +#endif
> +#define THP_ORDERS_ALL_ANON	(((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1))) | \
> +				 THP_ORDERS_ALL_ANON_PUD)

Err what is this change doing in a 'stats' change? This quietly updates
__thp_vma_allowable_orders() to also support PUD order for anon memory... Can we
put this in the right place?

>
>  /*
>   * Mask of all large folio orders supported for file THP. Folios in a DAX
> @@ -146,18 +152,46 @@ enum mthp_stat_item {
>  };
>
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS)
> +
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD

By the way I'm not a fan of us treating an 'arch has' as a 'will use'.

> +#define MTHP_STAT_COUNT		(PMD_ORDER + 2)

Yeah I hate this. This is just 'one more thing to remember'.

> +#define MTHP_STAT_PUD_INDEX	(PMD_ORDER + 1)  /* PUD uses last index */
> +#else
> +#define MTHP_STAT_COUNT		(PMD_ORDER + 1)
> +#endif
> +
>  struct mthp_stat {
> -	unsigned long stats[ilog2(MAX_PTRS_PER_PTE) + 1][__MTHP_STAT_COUNT];
> +	unsigned long stats[MTHP_STAT_COUNT][__MTHP_STAT_COUNT];
>  };
>
>  DECLARE_PER_CPU(struct mthp_stat, mthp_stats);
>
> +static inline int mthp_stat_order_to_index(int order)
> +{
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +	if (order == PUD_ORDER)
> +		return MTHP_STAT_PUD_INDEX;

This seems like a hack again.

> +#endif
> +	return order;
> +}
> +
>  static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta)
>  {
> -	if (order <= 0 || order > PMD_ORDER)
> +	int index;
> +
> +	if (order <= 0)
> +		return;
> +
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +	if (order != PUD_ORDER && order > PMD_ORDER)
>  		return;
> +#else
> +	if (order > PMD_ORDER)
> +		return;
> +#endif

Or we could actually define a max order... except now the hack contorts this
code.

Is it really that bad to just take up memory for the orders between PMD_ORDER and
PUD_ORDER? ~72 bytes * cores and we avoid having to do this silly dance.
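
To illustrate the shape being suggested, here is a rough userspace sketch,
not kernel code. Everything prefixed SKETCH_/sketch_ is made up for
illustration, the order values are the usual x86-64 ones, and the item
count is a placeholder for __MTHP_STAT_COUNT. The array is sized by a single
max order, so the order is the index and one bounds check suffices:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed x86-64 values, for illustration only. */
#define SKETCH_PMD_ORDER   9
#define SKETCH_PUD_ORDER   18
#define SKETCH_MAX_ORDER   SKETCH_PUD_ORDER
#define SKETCH_STAT_ITEMS  4	/* placeholder for __MTHP_STAT_COUNT */

/* Dense table: one row per order up to and including the max order. */
struct sketch_mthp_stat {
	unsigned long stats[SKETCH_MAX_ORDER + 1][SKETCH_STAT_ITEMS];
};

/* One bounds check; the order is used directly as the row index. */
static inline void sketch_mod_stat(struct sketch_mthp_stat *s, int order,
				   int item, long delta)
{
	if (order <= 0 || order > SKETCH_MAX_ORDER)
		return;
	s->stats[order][item] += delta;
}
```

The rows for the orders strictly between PMD_ORDER and PUD_ORDER simply stay
at zero, and neither an order-to-index helper nor an #ifdef in the bounds
check is needed.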

>
> -	this_cpu_add(mthp_stats.stats[order][item], delta);
> +	index = mthp_stat_order_to_index(order);
> +	this_cpu_add(mthp_stats.stats[index][item], delta);
>  }
>
>  static inline void count_mthp_stat(int order, enum mthp_stat_item item)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3128b3beedb0a..d033624d7e1f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -598,11 +598,12 @@ static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
>  {
>  	unsigned long sum = 0;
>  	int cpu;
> +	int index = mthp_stat_order_to_index(order);
>
>  	for_each_possible_cpu(cpu) {
>  		struct mthp_stat *this = &per_cpu(mthp_stats, cpu);
>
> -		sum += this->stats[order][item];
> +		sum += this->stats[index][item];
>  	}
>
>  	return sum;
> --
> 2.47.3
>



* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-02  0:50 ` [RFC 01/12] mm: add PUD THP ptdesc and rmap support Usama Arif
  2026-02-02  3:10   ` kernel test robot
  2026-02-02 10:44   ` Kiryl Shutsemau
@ 2026-02-02 12:15   ` Lorenzo Stoakes
  2026-02-04  7:38     ` Usama Arif
  2 siblings, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-02-02 12:15 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

I think I'm going to have to do several passes on this, so this is just a
first one :)

On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
> For page table management, PUD THPs need to pre-deposit page tables
> that will be used when the huge page is later split. When a PUD THP
> is allocated, we cannot know in advance when or why it might need to
> be split (COW, partial unmap, reclaim), but we need page tables ready
> for that eventuality. Similar to how PMD THPs deposit a single PTE
> table, PUD THPs deposit a PMD table which itself contains deposited
> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
> infrastructure and a new pud_huge_pmd field in ptdesc to store the
> deposited PMD.

This feels like you're hacking this support in, honestly. The list_head
abuse only adds to that feeling.

And are we now not required to store rather a lot of memory to keep all of
this coherent?

>
> The deposited PMD tables are stored as a singly-linked stack using only
> page->lru.next as the link pointer. A doubly-linked list using the
> standard list_head mechanism would cause memory corruption: list_del()
> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
> tables have their own deposited PTE tables stored in pmd_huge_pte,
> poisoning lru.prev would corrupt the PTE table list and cause crashes
> when withdrawing PTE tables during split. PMD THPs don't have this
> problem because their deposited PTE tables don't have sub-deposits.
> Using only lru.next avoids the overlap entirely.

Yeah this is horrendous and a hack, I don't consider this at all
upstreamable.

You need to completely rework this.
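
For anyone following along, the overlap described in the commit message can
be modelled in userspace roughly as below. This is a sketch only: the
sketch_ types are hypothetical stand-ins for the start of struct page /
struct ptdesc, and the poison value merely resembles LIST_POISON2.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

/* Hypothetical stand-in for the first words of struct page / ptdesc. */
struct sketch_ptdesc {
	unsigned long flags;
	union {
		struct sketch_list_head lru;	/* .next, then .prev */
		struct {
			unsigned long _pt_pad_1;	/* aliases lru.next */
			void *pmd_huge_pte;		/* aliases lru.prev */
		};
	};
};

/* The aliasing the commit message relies on, checked at compile time. */
_Static_assert(offsetof(struct sketch_ptdesc, lru.prev) ==
	       offsetof(struct sketch_ptdesc, pmd_huge_pte),
	       "lru.prev aliases pmd_huge_pte");

/* Mimics what poisoning lru.prev would do to the aliased deposit pointer. */
static inline int sketch_poison_clobbers_deposit(void)
{
	struct sketch_ptdesc pt = { 0 };
	int dummy;

	pt.pmd_huge_pte = &dummy;	/* pretend a PTE table is deposited */
	pt.lru.prev = (struct sketch_list_head *)0x122;	/* poison-like value */
	return pt.pmd_huge_pte != (void *)&dummy;	/* 1: pointer was lost */
}
```

So the hazard is real, but that only explains why the workaround exists; it
does not make the workaround any less fragile.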

>
> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
> have. The page_vma_mapped_walk() function is extended to recognize and
> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
> flag tells the unmap path to split PUD THPs before proceeding, since
> there is no PUD-level migration entry format - the split converts the
> single PUD mapping into individual PTE mappings that can be migrated
> or swapped normally.

Individual PTE... mappings? You need to be a lot clearer here, page tables
are naturally confusing with entries vs. tables.

Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
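
The arithmetic behind those numbers, as a small userspace sketch. It assumes
x86-64 with 4 KiB base pages and 512 entries per table; the SKETCH_/sketch_
names are invented for illustration and are not kernel symbols:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 geometry: 4 KiB base pages, 512 entries per table. */
#define SKETCH_PAGE_SIZE       4096ULL
#define SKETCH_PTRS_PER_TABLE  512ULL

/* Bytes mapped by one PUD entry: 512 PMD entries * 512 PTEs * 4 KiB. */
static inline uint64_t sketch_pud_size(void)
{
	return SKETCH_PTRS_PER_TABLE * SKETCH_PTRS_PER_TABLE * SKETCH_PAGE_SIZE;
}

/* PTE entries needed to map one PUD-sized range with base pages. */
static inline uint64_t sketch_pte_entries(void)
{
	return sketch_pud_size() / SKETCH_PAGE_SIZE;
}

/* Tables to pre-deposit: 1 PMD table plus one PTE table per PMD entry. */
static inline uint64_t sketch_deposited_tables(void)
{
	return 1 + SKETCH_PTRS_PER_TABLE;
}

/* Memory held by the deposit, assuming each table occupies one page. */
static inline uint64_t sketch_deposit_bytes(void)
{
	return sketch_deposited_tables() * SKETCH_PAGE_SIZE;
}
```

Under those assumptions that is 513 page-table pages, a little over 2 MiB,
held in reserve for every PUD THP that is mapped.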

>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

How does this change interact with existing DAX/VFIO code, which now it
seems will be subject to the mechanisms you introduce here?

Right now DAX/VFIO PUD THP is only obtainable via a specially THP-aligned
get_unmapped_area(), and then only at fault time.

Is that the intent here also?

What is your intent - that khugepaged do this, or on alloc? How does it
interact with MADV_COLLAPSE?

As I noted on the 2nd patch, you're changing THP_ORDERS_ALL_ANON, which
alters __thp_vma_allowable_orders() behaviour; that change belongs here...


> ---
>  include/linux/huge_mm.h  |  5 +++
>  include/linux/mm.h       | 19 ++++++++
>  include/linux/mm_types.h |  5 ++-
>  include/linux/pgtable.h  |  8 ++++
>  include/linux/rmap.h     |  7 ++-
>  mm/huge_memory.c         |  8 ++++
>  mm/internal.h            |  3 ++
>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>  10 files changed, 260 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a4d9f964dfdea..e672e45bb9cc7 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  		unsigned long address);
>
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +			   unsigned long address);
>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
>  		    unsigned long cp_flags);
>  #else
> +static inline void
> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +		      unsigned long address) {}
>  static inline int
>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ab2e7e30aef96..a15e18df0f771 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>   * considered ready to switch to split PUD locks yet; there may be places
>   * which need to be converted from page_table_lock.
>   */
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +static inline struct page *pud_pgtable_page(pud_t *pud)
> +{
> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
> +
> +	return virt_to_page((void *)((unsigned long)pud & mask));
> +}
> +
> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
> +{
> +	return page_ptdesc(pud_pgtable_page(pud));
> +}
> +
> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
> +#endif
> +
>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>  {
>  	return &mm->page_table_lock;
> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>  {
>  	__pagetable_ctor(ptdesc);
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +	ptdesc->pud_huge_pmd = NULL;
> +#endif
>  }
>
>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 78950eb8926dc..26a38490ae2e1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -577,7 +577,10 @@ struct ptdesc {
>  		struct list_head pt_list;
>  		struct {
>  			unsigned long _pt_pad_1;
> -			pgtable_t pmd_huge_pte;
> +			union {
> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
> +			};
>  		};
>  	};
>  	unsigned long __page_mapping;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>  #define arch_needs_pgtable_deposit() (false)
>  #endif
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> +					   pmd_t *pmd_table);
> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);

These are useless externs.

> +#endif
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /*
>   * This is an implementation of pmdp_establish() that is only suitable for an
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index daa92a58585d9..08cd0a0eb8763 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -101,6 +101,7 @@ enum ttu_flags {
>  					 * do a final flush if necessary */
>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>  					 * caller holds it */
> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
>  };
>
>  #ifdef CONFIG_MMU
> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>  		unsigned long address, rmap_t flags);
>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>  	pgoff_t pgoff;
>  	struct vm_area_struct *vma;
>  	unsigned long address;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>  static inline void
>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>  {
> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>
>  	if (likely(pvmw->ptl))
>  		spin_unlock(pvmw->ptl);
> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>  		WARN_ON_ONCE(1);
>
>  	pvmw->ptl = NULL;
> +	pvmw->pud = NULL;
>  	pvmw->pmd = NULL;
>  	pvmw->pte = NULL;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 40cf59301c21a..3128b3beedb0a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  	spin_unlock(ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  }
> +
> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +			   unsigned long address)
> +{
> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
> +	if (pud_trans_huge(*pud))
> +		__split_huge_pud_locked(vma, pud, address);
> +}
>  #else
>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  		unsigned long address)
> diff --git a/mm/internal.h b/mm/internal.h
> index 9ee336aa03656..21d5c00f638dc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>   * in mm/rmap.c:
>   */
>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
> +#endif
>
>  /*
>   * in mm/page_alloc.c
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index b38a1d00c971b..d31eafba38041 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>  	return true;
>  }
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> +{
> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
> +		return false;
> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
> +		return false;
> +	return true;
> +}
> +#endif
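
For reference, this is a closed-interval overlap test. A userspace sketch of
the same logic follows; the SKETCH_/sketch_ names and the x86-64 value used
for HPAGE_PUD_NR are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed x86-64 value: 1 GiB / 4 KiB base pages. */
#define SKETCH_HPAGE_PUD_NR 262144UL

/*
 * Does the PUD-mapped range starting at map_pfn overlap the target
 * range [pfn, pfn + nr_pages - 1]?  Comparing last pfns (the "- 1")
 * rather than one-past-the-end values keeps a range that ends at the
 * very top of the pfn space from wrapping around to 0.
 */
static inline bool sketch_check_pud(unsigned long map_pfn,
				    unsigned long pfn, unsigned long nr_pages)
{
	if (map_pfn + SKETCH_HPAGE_PUD_NR - 1 < pfn)
		return false;	/* mapping ends before the target starts */
	if (map_pfn > pfn + nr_pages - 1)
		return false;	/* mapping starts after the target ends */
	return true;
}
```
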
> +
>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>  {
>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	pud_t *pud;
>  	pmd_t pmde;
>
> +	/* The only possible pud mapping has been handled on last iteration */
> +	if (pvmw->pud && !pvmw->pmd)
> +		return not_found(pvmw);
> +
>  	/* The only possible pmd mapping has been handled on last iteration */
>  	if (pvmw->pmd && !pvmw->pte)
>  		return not_found(pvmw);
> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  			continue;
>  		}
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD

Said it elsewhere, but it's really weird to treat an arch having the
ability to do something as a go ahead for doing it.

> +		/* Check for PUD-mapped THP */
> +		if (pud_trans_huge(*pud)) {
> +			pvmw->pud = pud;
> +			pvmw->ptl = pud_lock(mm, pud);
> +			if (likely(pud_trans_huge(*pud))) {
> +				if (pvmw->flags & PVMW_MIGRATION)
> +					return not_found(pvmw);
> +				if (!check_pud(pud_pfn(*pud), pvmw))
> +					return not_found(pvmw);
> +				return true;
> +			}
> +			/* PUD was split under us, retry at PMD level */
> +			spin_unlock(pvmw->ptl);
> +			pvmw->ptl = NULL;
> +			pvmw->pud = NULL;
> +		}
> +#endif
> +

Yeah, as I said elsewhere, we've got to be refactoring, not copy/pasting with
modifications :)


>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
>  		/*
>  		 * Make sure the pmd value isn't cached in a register by the
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index d3aec7a9926ad..2047558ddcd79 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  }
>  #endif
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/*
> + * Deposit page tables for PUD THP.
> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
> + *
> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
> + * list_head. This is because lru.prev (offset 16) overlaps with
> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.

This is horrible and feels like a hack? Treating a doubly-linked list as a
singly-linked one like this is not upstreamable.

> + *
> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
> + */
> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> +				    pmd_t *pmd_table)

This is horrid: you're depositing the PMD using the... questionable
list_head abuse, but then also have pud_deposit_pte()... But here we're
depositing a PMD, so shouldn't the name reflect that?

> +{
> +	pgtable_t pmd_page = virt_to_page(pmd_table);
> +
> +	assert_spin_locked(pud_lockptr(mm, pudp));
> +
> +	/* Push onto stack using only lru.next as the link */
> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);

Yikes...

> +	pud_huge_pmd(pudp) = pmd_page;
> +}
> +
> +/*
> + * Withdraw the deposited PMD table for PUD THP split or zap.
> + * Called with PUD lock held.
> + * Returns NULL if no more PMD tables are deposited.
> + */
> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
> +{
> +	pgtable_t pmd_page;
> +
> +	assert_spin_locked(pud_lockptr(mm, pudp));
> +
> +	pmd_page = pud_huge_pmd(pudp);
> +	if (!pmd_page)
> +		return NULL;
> +
> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;

Where's the popping? You're just assigning here.

> +
> +	return page_address(pmd_page);
> +}
> +
> +/*
> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
> + * No lock assertion since the PMD isn't visible yet.
> + */
> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
> +{
> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> +
> +	/* FIFO - add to front of list */
> +	if (!ptdesc->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
> +	ptdesc->pmd_huge_pte = pgtable;
> +}
> +
> +/*
> + * Withdraw a PTE table from a standalone PMD table.
> + * Returns NULL if no more PTE tables are deposited.
> + */
> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
> +{
> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> +	pgtable_t pgtable;
> +
> +	pgtable = ptdesc->pmd_huge_pte;
> +	if (!pgtable)
> +		return NULL;
> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
> +							struct page, lru);
> +	if (ptdesc->pmd_huge_pte)
> +		list_del(&pgtable->lru);
> +	return pgtable;
> +}
> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> +
>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  		     pmd_t *pmdp)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7b9879ef442d9..69acabd763da4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  	return pmd;
>  }
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/*
> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
> + * represents.
> + */
> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)

This series seems to be full of copy/paste.

It's just not acceptable given the state of THP code as I said in reply to
the cover letter - you need to _refactor_ the code.

The code is bug-prone and difficult to maintain as-is, your series has to
improve the technical debt, not add to it.

> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud = NULL;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	p4d = p4d_offset(pgd, address);
> +	if (!p4d_present(*p4d))
> +		goto out;
> +
> +	pud = pud_offset(p4d, address);
> +out:
> +	return pud;
> +}
> +#endif
> +
>  struct folio_referenced_arg {
>  	int mapcount;
>  	int referenced;
> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>  			SetPageAnonExclusive(page);
>  			break;
>  		case PGTABLE_LEVEL_PUD:
> -			/*
> -			 * Keep the compiler happy, we don't support anonymous
> -			 * PUD mappings.
> -			 */
> -			WARN_ON_ONCE(1);
> +			SetPageAnonExclusive(page);
>  			break;
>  		default:
>  			BUILD_BUG();
> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>  #endif
>  }
>
> +/**
> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
> + * @folio:	The folio to add the mapping to
> + * @page:	The first page to add
> + * @vma:	The vm area in which the mapping is added
> + * @address:	The user virtual address of the first page to map
> + * @flags:	The rmap flags
> + *
> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
> + *
> + * The caller needs to hold the page table lock, and the page must be locked in
> + * the anon_vma case: to serialize mapping,index checking after setting.
> + */
> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
> +{
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
> +			      PGTABLE_LEVEL_PUD);
> +#else
> +	WARN_ON_ONCE(true);
> +#endif
> +}

More copy/paste... Maybe unavoidable in this case, but it would be good to try.

> +
>  /**
>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>   * @folio:	The folio to add the mapping to.
> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		}
>
>  		if (!pvmw.pte) {
> +			/*
> +			 * Check for PUD-mapped THP first.
> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
> +			 * split the PUD to PMD level and restart the walk.
> +			 */

This is literally describing the code below, it's not useful.

> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> +				if (flags & TTU_SPLIT_HUGE_PUD) {
> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
> +					flags &= ~TTU_SPLIT_HUGE_PUD;
> +					page_vma_mapped_walk_restart(&pvmw);
> +					continue;
> +				}
> +			}
> +
>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>  					goto walk_done;
> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  	mmu_notifier_invalidate_range_start(&range);
>
>  	while (page_vma_mapped_walk(&pvmw)) {
> +		/* Handle PUD-mapped THP first */

How did/will this interact with DAX, VFIO PUD THP?

> +		if (!pvmw.pte && !pvmw.pmd) {
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD

Won't pud_trans_huge() imply this...

> +			/*
> +			 * PUD-mapped THP: skip migration to preserve the huge
> +			 * page. Splitting would defeat the purpose of PUD THPs.
> +			 * Return false to indicate migration failure, which
> +			 * will cause alloc_contig_range() to try a different
> +			 * memory region.
> +			 */
> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> +				page_vma_mapped_walk_done(&pvmw);
> +				ret = false;
> +				break;
> +			}
> +#endif
> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
> +
>  		/* PMD-mapped THP migration entry */
>  		if (!pvmw.pte) {
>  			__maybe_unused unsigned long pfn;
> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>
>  	/*
>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>  	 */
>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
> -					TTU_SYNC | TTU_BATCH_FLUSH)))
> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>  		return;
>
>  	if (folio_is_zone_device(folio) &&
> --
> 2.47.3
>

This isn't a final review, I'll have to look more thoroughly through here
over time and you're going to have to be patient in general :)

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02 11:30   ` Lorenzo Stoakes
@ 2026-02-02 15:50     ` Zi Yan
  2026-02-04 10:56       ` Lorenzo Stoakes
  2026-02-05 11:22       ` David Hildenbrand (arm)
  0 siblings, 2 replies; 52+ messages in thread
From: Zi Yan @ 2026-02-02 15:50 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand
  Cc: Rik van Riel, Usama Arif, Andrew Morton, linux-mm, hannes,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team, Frank van der Linden

On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:

> On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
>> On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
>>>
>>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages
>>> at boot
>>>    or runtime, taking memory away. This requires capacity planning,
>>>    administrative overhead, and makes workload orchestration much
>>> much more
>>>    complex, especially colocating with workloads that don't use
>>> hugetlbfs.
>>>
>> To address the obvious objection "but how could we
>> possibly allocate 1GB huge pages while the workload
>> is running?", I am planning to pick up the CMA balancing
>> patch series (thank you, Frank) and get that in an
>> upstream ready shape soon.
>>
>> https://lkml.org/2025/9/15/1735
>
> That link doesn't work?
>
> Did a quick search for CMA balancing on lore, couldn't find anything, could you
> provide a lore link?

https://lwn.net/Articles/1038263/

>
>>
>> That patch set looks like another case where no
>> amount of internal testing will find every single
>> corner case, and we'll probably just want to
>> merge it upstream, deploy it experimentally, and
>> aggressively deal with anything that might pop up.
>
> I'm not really in favour of this kind of approach. There's plenty of things that
> were considered 'temporary' upstream that became rather permanent :)
>
> Maybe we can't cover all corner-cases, but we need to make sure whatever we do
> send upstream is maintainable, conceptually sensible and doesn't paint us into
> any corners, etc.
>
>>
>> With CMA balancing, it would be possible to just
>> have half (or even more) of system memory for
>> movable allocations only, which would make it possible
>> to allocate 1GB huge pages dynamically.
>
> Could you expand on that?

I also would like to hear David’s opinion on using CMA for 1GB THP.
He did not like it[1] when I posted my patch back in 2020, but it has
been more than 5 years. :)

The other direction I explored is to get 1GB THP from the buddy allocator.
That means we need to:
1. bump MAX_PAGE_ORDER to 18 or make it a runtime variable so that only 1GB
   THP users need to bump it,
2. handle cross-memory-section PFN merging in the buddy allocator,
3. improve the anti-fragmentation mechanism for 1GB range compaction.

1 is easier-ish[2]. I have not looked into 2 and 3 much yet.
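
As a quick check of the order mentioned in item 1, here is a minimal sketch,
assuming 4K base pages on x86-64. order_for_size() is a hypothetical helper
written for this note, not a kernel function. A 1GB folio spans 2^18 base
pages, which is where the value 18 comes from.

```c
/* Smallest order such that (page_size << order) >= size.
 * Hypothetical helper for illustration only. */
static unsigned int order_for_size(unsigned long long size,
				   unsigned long long page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < size)
		order++;
	return order;
}
```

With the same helper, a 2MB PMD THP on 4K pages comes out as order 9, and a
512MB PMD THP on 64K pages as order 13.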

[1] https://lore.kernel.org/all/52bc2d5d-eb8a-83de-1c93-abd329132d58@redhat.com/
[2] https://lore.kernel.org/all/20210805190253.2795604-1-zi.yan@sent.com/


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-02 10:44   ` Kiryl Shutsemau
@ 2026-02-02 16:01     ` Zi Yan
  2026-02-03 22:07       ` Usama Arif
  0 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2026-02-02 16:01 UTC (permalink / raw)
  To: Usama Arif, Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm,
	hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On 2 Feb 2026, at 5:44, Kiryl Shutsemau wrote:

> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>> For page table management, PUD THPs need to pre-deposit page tables
>> that will be used when the huge page is later split. When a PUD THP
>> is allocated, we cannot know in advance when or why it might need to
>> be split (COW, partial unmap, reclaim), but we need page tables ready
>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>> table, PUD THPs deposit a PMD table which itself contains deposited
>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>> deposited PMD.
>>
>> The deposited PMD tables are stored as a singly-linked stack using only
>> page->lru.next as the link pointer. A doubly-linked list using the
>> standard list_head mechanism would cause memory corruption: list_del()
>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>> when withdrawing PTE tables during split. PMD THPs don't have this
>> problem because their deposited PTE tables don't have sub-deposits.
>> Using only lru.next avoids the overlap entirely.
>>
>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>> have. The page_vma_mapped_walk() function is extended to recognize and
>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>> flag tells the unmap path to split PUD THPs before proceeding, since
>> there is no PUD-level migration entry format - the split converts the
>> single PUD mapping into individual PTE mappings that can be migrated
>> or swapped normally.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>  include/linux/huge_mm.h  |  5 +++
>>  include/linux/mm.h       | 19 ++++++++
>>  include/linux/mm_types.h |  5 ++-
>>  include/linux/pgtable.h  |  8 ++++
>>  include/linux/rmap.h     |  7 ++-
>>  mm/huge_memory.c         |  8 ++++
>>  mm/internal.h            |  3 ++
>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>

<snip>

>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index d3aec7a9926ad..2047558ddcd79 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +/*
>> + * Deposit page tables for PUD THP.
>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>> + *
>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>> + * list_head. This is because lru.prev (offset 16) overlaps with
>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>
> This is ugly.
>
> Sounds like you want to use llist_node/head instead of list_head for this.
>
> You might be able to avoid taking the lock in some cases. Note that
> pud_lockptr() is mm->page_table_lock as of now.

I agree. I used llist_node/head in my implementation[1] and it works.
I have an illustration at[2] to show the concept. Feel free to reuse the code.


[1] https://lore.kernel.org/all/20200928193428.GB30994@casper.infradead.org/
[2] https://normal.zone/blog/2021-01-04-linux-1gb-thp-2/#new-mechanism
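
To make the llist idea concrete, here is a userspace sketch of a
singly-linked stack that touches only one link word per node. The names
(snode, shead, fake_table, stack_push, stack_pop) are hypothetical stand-ins
and not the kernel API, and the sketch is deliberately non-atomic, on the
assumption that callers hold a lock. The point it illustrates is that a pop
never writes to the field sitting next to the link, unlike list_del(), which
poisons both next and prev.

```c
#include <stddef.h>

/* Userspace stand-ins for llist_node/llist_head (hypothetical names). */
struct snode { struct snode *next; };
struct shead { struct snode *first; };

/* Mimics a page-table descriptor: one link word followed by a field that
 * must survive push/pop, like pmd_huge_pte in the discussion above. */
struct fake_table {
	struct snode link;
	void *deposited_pte;
};

/* Push: only n->next and the head pointer are written. */
static void stack_push(struct shead *h, struct snode *n)
{
	n->next = h->first;
	h->first = n;
}

/* Pop: returns NULL when empty; the popped node is not poisoned. */
static struct snode *stack_pop(struct shead *h)
{
	struct snode *n = h->first;

	if (n)
		h->first = n->next;
	return n;
}
```

The kernel's llist_add() and llist_del_first() provide the same stack
behaviour with atomic operations on top.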

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
                   ` (14 preceding siblings ...)
  2026-02-02 11:20 ` Lorenzo Stoakes
@ 2026-02-02 16:24 ` Zi Yan
  2026-02-03 23:29   ` Usama Arif
  15 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2026-02-02 16:24 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team

On 1 Feb 2026, at 19:50, Usama Arif wrote:

> This is an RFC series to implement 1GB PUD-level THPs, allowing
> applications to benefit from reduced TLB pressure without requiring
> hugetlbfs. The patches are based on top of
> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).

It is nice to see you are working on 1GB THP.

>
> Motivation: Why 1GB THP over hugetlbfs?
> =======================================
>
> While hugetlbfs provides 1GB huge pages today, it has significant limitations
> that make it unsuitable for many workloads:
>
> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>    or runtime, taking memory away. This requires capacity planning,
>    administrative overhead, and makes workload orchestration much much more
>    complex, especially colocating with workloads that don't use hugetlbfs.

But you are using CMA, the same allocation mechanism as hugetlb_cma. What
is the difference?

>
> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>    rather than falling back to smaller pages. This makes it fragile under
>    memory pressure.

True.

>
> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>    is needed, leading to memory waste and preventing partial reclaim.

Since you have a PUD THP implementation, have you run any workload on it?
How often do you see a PUD THP split?

Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
any split stats to show the necessity of THP split?

>
> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>    be easily shared with regular memory pools.

True.

>
> PUD THP solves these limitations by integrating 1GB pages into the existing
> THP infrastructure.

The main advantage of PUD THP over hugetlb is that it can be split and mapped
at sub-folio level. Do you have any data to support the necessity of them?
I wonder if it would be easier to just support 1GB folio in core-mm first
and we can add 1GB THP split and sub-folio mapping later. With that, we
can move hugetlb users to 1GB folio.

BTW, without split support, you can apply HVO (HugeTLB Vmemmap Optimization)
to a 1GB folio to save memory. Not being able to do that is a disadvantage of
PUD THP. Have you taken that into consideration? Basically, switching from
hugetlb to PUD THP, you will lose memory due to vmemmap usage.

>
> Performance Results
> ===================
>
> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>
> Test: True Random Memory Access [1] test of 4GB memory region with pointer
> chasing workload (4M random pointer dereferences through memory):
>
> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
> |-------------------|---------------|---------------|--------------|
> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>
> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
> For long-running workloads this will be a one-off cost, and the 34%
> improvement in access latency provides significant benefit.
>
> ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a CPU
> bound workload running on a large number of ARM servers (256G). I enabled
> the 512M THP settings to always for 100 servers in production (didn't
> really have high expectations :)). The average memory used for the workload
> increased from 217G to 233G. The amount of memory backed by 512M pages was
> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
> by 5.9% (This is a very significant improvement in workload performance).
> A significant number of these THPs were faulted in at application start when
> were present across different VMAs. Of course getting these 512M pages is
> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>
> I am hoping that these patches for 1G THP can be used to provide similar
> benefits for x86. I expect workloads to fault them in at start time when there
> is plenty of free memory available.
>
>
> Previous attempt by Zi Yan
> ==========================
>
> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
> significant changes in kernel since then, including folio conversion, mTHP
> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
> code as reference for making 1G PUD THP work. I am hoping Zi can provide
> guidance on these patches!

I am more than happy to help you. :)

>
> Major Design Decisions
> ======================
>
> 1. No shared 1G zero page: The memory cost would be quite significant!
>
> 2. Page Table Pre-deposit Strategy
>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>    page tables (one for each potential PMD entry after split).
>    We allocate a PMD page table and use its pmd_huge_pte list to store
>    the deposited PTE tables. This ensures split operations don't fail due
>    to page table allocation failures (at the cost of 2M per PUD THP)
>
> 3. Split to Base Pages
>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>    to 2M pages and then to 4K pages if needed. However, this would require
>    significant rmap and mapcount tracking changes.
>
> 4. COW and fork handling via split
>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>    Probably this should only be done on CoW and not fork?
>
> 5. Migration via split
>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>    to find 1G of contiguous memory to migrate to. Maybe it's better to not
>    allow migration of PUDs at all? I am more tempted to not allow migration,
>    but have kept splitting in this RFC.

Without migration, PUD THP loses its flexibility and transparency. But with
its 1GB size, I also wonder what the purpose of PUD THP migration can be.
It does not create memory fragmentation, since it is the largest folio size
we have and contiguous. NUMA balancing 1GB THP seems too much work.

BTW, I posted many questions, but that does not mean I object to the patchset.
I just want to understand your use case better, reduce unnecessary
code changes, and hopefully get it upstreamed this time. :)

Thank you for the work.

>
>
> Reviewers guide
> ===============
>
> Most of the code is written by adapting from PMD code. For e.g. the PUD page
> fault path is very similar to PMD. The difference is no shared zero page and
> the page table deposit strategy. I think the easiest way to review this series
> is to compare with PMD code.
>
> Test results
> ============
>
>   1..7
>   # Starting 7 tests from 1 test cases.
>   #  RUN           pud_thp.basic_allocation ...
>   # pud_thp_test.c:169:basic_allocation:PUD THP allocated (anon_fault_alloc: 0 -> 1)
>   #            OK  pud_thp.basic_allocation
>   ok 1 pud_thp.basic_allocation
>   #  RUN           pud_thp.read_write_access ...
>   #            OK  pud_thp.read_write_access
>   ok 2 pud_thp.read_write_access
>   #  RUN           pud_thp.fork_cow ...
>   # pud_thp_test.c:236:fork_cow:Fork COW completed (thp_split_pud: 0 -> 1)
>   #            OK  pud_thp.fork_cow
>   ok 3 pud_thp.fork_cow
>   #  RUN           pud_thp.partial_munmap ...
>   # pud_thp_test.c:267:partial_munmap:Partial munmap completed (thp_split_pud: 1 -> 2)
>   #            OK  pud_thp.partial_munmap
>   ok 4 pud_thp.partial_munmap
>   #  RUN           pud_thp.mprotect_split ...
>   # pud_thp_test.c:293:mprotect_split:mprotect split completed (thp_split_pud: 2 -> 3)
>   #            OK  pud_thp.mprotect_split
>   ok 5 pud_thp.mprotect_split
>   #  RUN           pud_thp.reclaim_pageout ...
>   # pud_thp_test.c:322:reclaim_pageout:Reclaim completed (thp_split_pud: 3 -> 4)
>   #            OK  pud_thp.reclaim_pageout
>   ok 6 pud_thp.reclaim_pageout
>   #  RUN           pud_thp.migration_mbind ...
>   # pud_thp_test.c:356:migration_mbind:Migration completed (thp_split_pud: 4 -> 5)
>   #            OK  pud_thp.migration_mbind
>   ok 7 pud_thp.migration_mbind
>   # PASSED: 7 / 7 tests passed.
>   # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> [1] https://gist.github.com/uarif1/bf279b2a01a536cda945ff9f40196a26
> [2] https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> Usama Arif (12):
>   mm: add PUD THP ptdesc and rmap support
>   mm/thp: add mTHP stats infrastructure for PUD THP
>   mm: thp: add PUD THP allocation and fault handling
>   mm: thp: implement PUD THP split to PTE level
>   mm: thp: add reclaim and migration support for PUD THP
>   selftests/mm: add PUD THP basic allocation test
>   selftests/mm: add PUD THP read/write access test
>   selftests/mm: add PUD THP fork COW test
>   selftests/mm: add PUD THP partial munmap test
>   selftests/mm: add PUD THP mprotect split test
>   selftests/mm: add PUD THP reclaim test
>   selftests/mm: add PUD THP migration test
>
>  include/linux/huge_mm.h                   |  60 ++-
>  include/linux/mm.h                        |  19 +
>  include/linux/mm_types.h                  |   5 +-
>  include/linux/pgtable.h                   |   8 +
>  include/linux/rmap.h                      |   7 +-
>  mm/huge_memory.c                          | 535 +++++++++++++++++++++-
>  mm/internal.h                             |   3 +
>  mm/memory.c                               |   8 +-
>  mm/migrate.c                              |  17 +
>  mm/page_vma_mapped.c                      |  35 ++
>  mm/pgtable-generic.c                      |  83 ++++
>  mm/rmap.c                                 |  96 +++-
>  mm/vmscan.c                               |   2 +
>  tools/testing/selftests/mm/Makefile       |   1 +
>  tools/testing/selftests/mm/pud_thp_test.c | 360 +++++++++++++++
>  15 files changed, 1197 insertions(+), 42 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/pud_thp_test.c
>
> -- 
> 2.47.3


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02  9:06   ` David Hildenbrand (arm)
@ 2026-02-03 21:11     ` Usama Arif
  0 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-03 21:11 UTC (permalink / raw)
  To: David Hildenbrand (arm), Matthew Wilcox
  Cc: ziy, Andrew Morton, lorenzo.stoakes, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On 02/02/2026 01:06, David Hildenbrand (arm) wrote:
> On 2/2/26 05:00, Matthew Wilcox wrote:
>> On Sun, Feb 01, 2026 at 04:50:17PM -0800, Usama Arif wrote:
>>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>>> applications to benefit from reduced TLB pressure without requiring
>>> hugetlbfs. The patches are based on top of
>>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
>>
>> I suggest this has not had enough testing.  There are dozens of places
>> in the MM which assume that if a folio is at least PMD size then it is
>> exactly PMD size.  Everywhere that calls folio_test_pmd_mappable() needs
>> to be audited to make sure that it will work properly if the folio is
>> larger than PMD size.
> 
> I think the hack (ehm trick) in this patch set is to do it just like dax PUDs: only map through a PUD or through PTEs, not through PMDs.
> 
> That also avoids dealing with mapcounts until I sorted that out.
> 

Hello!

Thanks for the review! So it's as David said: currently, for the PUD THP case,
we won't run into those paths.
PUD is split via TTU_SPLIT_HUGE_PUD which calls __split_huge_pud_locked().
This splits PUD to PTE directly (not PMD), so we never have a PUD folio
going through do_set_pmd(). The anonymous fault path uses
do_huge_pud_anonymous_page() so we won't go to finish_fault().

When I started working on this, I was really hoping that we could split PUDs to PMDs,
but very quickly realised that's a separate and much more complicated mapcount problem
(which is probably why David is dealing with it as he mentioned in the reply :P)
and should not be dealt with in this series.
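
For scale, the direct PUD-to-PTE split works out as follows. This is a
back-of-the-envelope sketch, assuming x86-64 with 4K pages and 512 entries
per page table; the macro and function names are hypothetical and only serve
the arithmetic. The figures agree with the 262,144 PTEs and the roughly 2M
deposit per PUD THP quoted in the cover letter.

```c
#define ENTRIES_PER_TABLE 512ULL	/* assumed: x86-64, 4K pages */
#define TABLE_BYTES       4096ULL	/* one page per page table */

/* PTEs needed to map one PUD-sized (1GB) region with base pages. */
static unsigned long long ptes_after_pud_split(void)
{
	return ENTRIES_PER_TABLE * ENTRIES_PER_TABLE;
}

/* Memory held by the 512 pre-deposited PTE tables. */
static unsigned long long deposited_pte_table_bytes(void)
{
	return ENTRIES_PER_TABLE * TABLE_BYTES;
}

/* PTE tables plus the single PMD table that holds them. */
static unsigned long long deposited_total_bytes(void)
{
	return deposited_pte_table_bytes() + TABLE_BYTES;
}
```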

In terms of more testing, I would definitely like to add more.
I have added selftests for allocation, memory integrity, fork, partial munmap, mprotect,
reclaim and migration, and am running them with DEBUG_VM to make sure we don't get the VM
bugs/warnings, but I am sure I am missing paths. I will try to think of more
but please let me know if there are more cases we can come up with.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-02 16:01     ` Zi Yan
@ 2026-02-03 22:07       ` Usama Arif
  2026-02-05  4:17         ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Usama Arif @ 2026-02-03 22:07 UTC (permalink / raw)
  To: Zi Yan, Kiryl Shutsemau, lorenzo.stoakes
  Cc: Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, baohua, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team



On 02/02/2026 08:01, Zi Yan wrote:
> On 2 Feb 2026, at 5:44, Kiryl Shutsemau wrote:
> 
>> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>>> For page table management, PUD THPs need to pre-deposit page tables
>>> that will be used when the huge page is later split. When a PUD THP
>>> is allocated, we cannot know in advance when or why it might need to
>>> be split (COW, partial unmap, reclaim), but we need page tables ready
>>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>>> table, PUD THPs deposit a PMD table which itself contains deposited
>>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>>> deposited PMD.
>>>
>>> The deposited PMD tables are stored as a singly-linked stack using only
>>> page->lru.next as the link pointer. A doubly-linked list using the
>>> standard list_head mechanism would cause memory corruption: list_del()
>>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>>> when withdrawing PTE tables during split. PMD THPs don't have this
>>> problem because their deposited PTE tables don't have sub-deposits.
>>> Using only lru.next avoids the overlap entirely.
>>>
>>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>>> have. The page_vma_mapped_walk() function is extended to recognize and
>>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>>> flag tells the unmap path to split PUD THPs before proceeding, since
>>> there is no PUD-level migration entry format - the split converts the
>>> single PUD mapping into individual PTE mappings that can be migrated
>>> or swapped normally.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>>  include/linux/huge_mm.h  |  5 +++
>>>  include/linux/mm.h       | 19 ++++++++
>>>  include/linux/mm_types.h |  5 ++-
>>>  include/linux/pgtable.h  |  8 ++++
>>>  include/linux/rmap.h     |  7 ++-
>>>  mm/huge_memory.c         |  8 ++++
>>>  mm/internal.h            |  3 ++
>>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>>
> 
> <snip>
> 
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index d3aec7a9926ad..2047558ddcd79 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>  }
>>>  #endif
>>>
>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>> +/*
>>> + * Deposit page tables for PUD THP.
>>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>>> + *
>>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>>> + * list_head. This is because lru.prev (offset 16) overlaps with
>>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>>
>> This is ugly.
>>
>> Sounds like you want to use llist_node/head instead of list_head for this.
>>
>> You might be able to avoid taking the lock in some cases. Note that
>> pud_lockptr() is mm->page_table_lock as of now.
> 
> I agree. I used llist_node/head in my implementation[1] and it works.
> I have an illustration at[2] to show the concept. Feel free to reuse the code.
> 
> 
> [1] https://lore.kernel.org/all/20200928193428.GB30994@casper.infradead.org/
> [2] https://normal.zone/blog/2021-01-04-linux-1gb-thp-2/#new-mechanism
> 
> Best Regards,
> Yan, Zi



Ah I should have looked at your patches more! I started working by just using lru
and was using list_add/list_del which was of course corrupting the list and took me
way more time than I would like to admit to debug what was going on! The diagrams
in your 2nd link are really useful. I ended up drawing those by hand to debug
the corruption issue. I will point to that link in the next series :)

How about something like the below diff over this patch? (Not including the comment
changes that I will make everywhere.)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 26a38490ae2e1..3653e24ce97d7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,9 @@ struct page {
                                struct list_head buddy_list;
                                struct list_head pcp_list;
                                struct llist_node pcp_llist;
+
+                               /* PMD pagetable deposit head */
+                               struct llist_node pgtable_deposit_head;
                        };
                        struct address_space *mapping;
                        union {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 2047558ddcd79..764f14d0afcbb 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -215,9 +215,7 @@ void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
 
        assert_spin_locked(pud_lockptr(mm, pudp));
 
-       /* Push onto stack using only lru.next as the link */
-       pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
-       pud_huge_pmd(pudp) = pmd_page;
+       llist_add(&pmd_page->pgtable_deposit_head, (struct llist_head *)&pud_huge_pmd(pudp));
 }
 
 /*
@@ -227,16 +225,16 @@ void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
  */
 pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
 {
+       struct llist_node *node;
        pgtable_t pmd_page;
 
        assert_spin_locked(pud_lockptr(mm, pudp));
 
-       pmd_page = pud_huge_pmd(pudp);
-       if (!pmd_page)
+       node = llist_del_first((struct llist_head *)&pud_huge_pmd(pudp));
+       if (!node)
                return NULL;
 
-       /* Pop from stack - lru.next points to next PMD page (or NULL) */
-       pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
+       pmd_page = llist_entry(node, struct page, pgtable_deposit_head);
 
        return page_address(pmd_page);
 }

Also, Zi, is it ok if I add your Co-developed-by on this patch in future revisions?
I didn't want to do that without your explicit approval.



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02 16:24 ` Zi Yan
@ 2026-02-03 23:29   ` Usama Arif
  2026-02-04  0:08     ` Frank van der Linden
  2026-02-05 18:07     ` Zi Yan
  0 siblings, 2 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-03 23:29 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team



On 02/02/2026 08:24, Zi Yan wrote:
> On 1 Feb 2026, at 19:50, Usama Arif wrote:
> 
>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>> applications to benefit from reduced TLB pressure without requiring
>> hugetlbfs. The patches are based on top of
>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
> 
> It is nice to see you are working on 1GB THP.
> 
>>
>> Motivation: Why 1GB THP over hugetlbfs?
>> =======================================
>>
>> While hugetlbfs provides 1GB huge pages today, it has significant limitations
>> that make it unsuitable for many workloads:
>>
>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>>    or runtime, taking memory away. This requires capacity planning,
>>    administrative overhead, and makes workload orchestration much much more
>>    complex, especially colocating with workloads that don't use hugetlbfs.
> 
> But you are using CMA, the same allocation mechanism as hugetlb_cma. What
> is the difference?
> 

So we don't really need to use CMA. CMA can help a lot of course, but we don't *need* it.
E.g. I can run the very simple case [1] of trying to get 1G pages in the upstream
kernel without CMA on my server and it works. The server has been up for more than a week
(so pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
and I tried to get 20x1G pages on it and it worked.
It uses folio_alloc_gigantic, which is exactly what this series uses:

$ uptime -p
up 1 week, 3 days, 5 hours, 7 minutes
$ cat /proc/meminfo | grep -i cma
CmaTotal:              0 kB
CmaFree:               0 kB
$ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
20
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
20
20
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
Swap:          129Gi       3.5Gi       126Gi
$ ./map_1g_hugepages 
Mapping 20 x 1GB huge pages (20 GB total)
Mapped at 0x7f43c0000000
Touched page 0 at 0x7f43c0000000
Touched page 1 at 0x7f4400000000
Touched page 2 at 0x7f4440000000
Touched page 3 at 0x7f4480000000
Touched page 4 at 0x7f44c0000000
Touched page 5 at 0x7f4500000000
Touched page 6 at 0x7f4540000000
Touched page 7 at 0x7f4580000000
Touched page 8 at 0x7f45c0000000
Touched page 9 at 0x7f4600000000
Touched page 10 at 0x7f4640000000
Touched page 11 at 0x7f4680000000
Touched page 12 at 0x7f46c0000000
Touched page 13 at 0x7f4700000000
Touched page 14 at 0x7f4740000000
Touched page 15 at 0x7f4780000000
Touched page 16 at 0x7f47c0000000
Touched page 17 at 0x7f4800000000
Touched page 18 at 0x7f4840000000
Touched page 19 at 0x7f4880000000
Unmapped successfully
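
For anyone who doesn't want to open [1]: a test like this presumably boils
down to a MAP_HUGETLB | MAP_HUGE_1GB mmap followed by one write per 1G page.
A minimal sketch under that assumption (the helper names are made up for
illustration, and this is not the code from the gist):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers; values match the Linux UAPI encoding. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

#define GIGANTIC_PAGE_SIZE (1UL << 30)

/* Map nr_pages 1G hugetlb pages; returns MAP_FAILED if mmap() rejects it. */
static void *map_1g_pages(size_t nr_pages)
{
	return mmap(NULL, nr_pages * GIGANTIC_PAGE_SIZE,
		    PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		    -1, 0);
}

/* Write one byte per 1G page so that every page is actually faulted in. */
static size_t touch_1g_pages(void *base, size_t nr_pages)
{
	volatile char *p = base;
	size_t i;

	for (i = 0; i < nr_pages; i++)
		p[i * GIGANTIC_PAGE_SIZE] = 1;
	return i;
}
```

A caller would check the result of map_1g_pages() against MAP_FAILED, call
touch_1g_pages() on it, and munmap() the range when done.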

>>
>> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>>    rather than falling back to smaller pages. This makes it fragile under
>>    memory pressure.
> 
> True.
> 
>>
>> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>>    is needed, leading to memory waste and preventing partial reclaim.
> 
> Since you have PUD THP implementation, have you run any workload on it?
> How often you see a PUD THP split?
> 

Ah so running non-upstream kernels in production is a bit more difficult
(and also risky). I was trying to use the 512M experiment on arm as a comparison,
although I know it's not the same thing given the different PAGE_SIZE and pageblock order.

I can try some other upstream benchmarks if it helps? Although I will need to find
ones that create VMAs > 1G.

> Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
> any split stats to show the necessity of THP split?
> 
>>
>> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>>    be easily shared with regular memory pools.
> 
> True.
> 
>>
>> PUD THP solves these limitations by integrating 1GB pages into the existing
>> THP infrastructure.
> 
> The main advantage of PUD THP over hugetlb is that it can be split and mapped
> at sub-folio level. Do you have any data to support the necessity of them?
> I wonder if it would be easier to just support 1GB folio in core-mm first
> and we can add 1GB THP split and sub-folio mapping later. With that, we
> can move hugetlb users to 1GB folio.
> 

I would say it's not the main advantage? But it's definitely one of them.
The 2 main areas where split would be helpful are munmap of a partial
range and reclaim (MADV_PAGEOUT). E.g. jemalloc/tcmalloc can now start
taking advantage of 1G pages. My knowledge is not that great when it comes
to memory allocators, but I believe they track how long certain areas
have been cold and can trigger reclaim, as an example. Then split will be useful.
Having memory allocators use hugetlb is probably going to be a no?


> BTW, without split support, you can apply HVO to 1GB folio to save memory.
> That is a disadvantage of PUD THP. Have you taken that into consideration?
> Basically, switching from hugetlb to PUD THP, you will lose memory due
> to vmemmap usage.
> 

Yeah so HVO saves 16M per 1G, and the page table deposit mechanism adds ~2M per 1G.
We have HVO enabled in the Meta fleet. I think we should not only think of PUD THP
as a replacement for hugetlb, but also as a way to enable further use cases where hugetlb
would not be feasible.

After the basic infrastructure for 1G is there, we can work on optimizing. I think
there would be a lot of interesting work we can do. HVO for 1G THP would be one
of them?
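
To spell out where the 16M comes from, assuming 4K base pages and a 64-byte
struct page (typical for x86-64, but an assumption here): a 1G folio is
described by 262,144 struct pages, i.e. 16M of vmemmap. A back-of-the-envelope
sketch with illustrative names:

```c
#include <stddef.h>

/* Assumed values: 4K base pages, 1G folio, 64-byte struct page. */
#define SZ_4K_ASSUMED            4096UL
#define SZ_1G_ASSUMED            (1UL << 30)
#define STRUCT_PAGE_SIZE_ASSUMED 64UL

/* Number of base pages (and therefore struct pages) backing one folio. */
static unsigned long nr_base_pages(unsigned long folio_size)
{
	return folio_size / SZ_4K_ASSUMED;
}

/* Bytes of vmemmap needed to describe one folio when HVO is not applied. */
static unsigned long vmemmap_bytes(unsigned long folio_size)
{
	return nr_base_pages(folio_size) * STRUCT_PAGE_SIZE_ASSUMED;
}
```

HVO still keeps a small number of vmemmap pages per folio, so the real saving
is slightly below what vmemmap_bytes(SZ_1G_ASSUMED) returns.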

>>
>> Performance Results
>> ===================
>>
>> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>>
>> Test: True Random Memory Access [1] test of 4GB memory region with pointer
>> chasing workload (4M random pointer dereferences through memory):
>>
>> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
>> |-------------------|---------------|---------------|--------------|
>> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
>> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>>
>> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
>> For long-running workloads this will be a one-off cost, and the 34%
>> improvement in access latency provides significant benefit.
>>
>> ARM with 64K PAGE_SZIE supports 512M PMD THPs. In meta, we have a CPU
>> bound workload running on a large number of ARM servers (256G). I enabled
>> the 512M THP settings to always for a 100 servers in production (didn't
>> really have high expectations :)). The average memory used for the workload
>> increased from 217G to 233G. The amount of memory backed by 512M pages was
>> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
>> by 5.9% (This is a very significant improvment in workload performance).
>> A significant number of these THPs were faulted in at application start when
>> were present across different VMAs. Ofcourse getting these 512M pages is
>> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>>
>> I am hoping that these patches for 1G THP can be used to provide similar
>> benefits for x86. I expect workloads to fault them in at start time when there
>> is plenty of free memory available.
>>
>>
>> Previous attempt by Zi Yan
>> ==========================
>>
>> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
>> significant changes in kernel since then, including folio conversion, mTHP
>> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
>> code as reference for making 1G PUD THP work. I am hoping Zi can provide
>> guidance on these patches!
> 
> I am more than happy to help you. :)
> 

Thanks!!!

>>
>> Major Design Decisions
>> ======================
>>
>> 1. No shared 1G zero page: The memory cost would be quite significant!
>>
>> 2. Page Table Pre-deposit Strategy
>>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>>    page tables (one for each potential PMD entry after split).
>>    We allocate a PMD page table and use its pmd_huge_pte list to store
>>    the deposited PTE tables. This ensures split operations don't fail due
>>    to page table allocation failures (at the cost of 2M per PUD THP)
>>
>> 3. Split to Base Pages
>>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>>    to 2M pages and then to 4K pages if needed. However, this would require
>>    significant rmap and mapcount tracking changes.
>>
>> 4. COW and fork handling via split
>>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>>    Probably this should only be done on CoW and not fork?
>>
>> 5. Migration via split
>>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>>    to find a 1G continguous memory to migrate to. Maybe its better to not
>>    allow migration of PUDs at all? I am more tempted to not allow migration,
>>    but have kept splitting in this RFC.
> 
> Without migration, PUD THP loses its flexibility and transparency. But with
> its 1GB size, I also wonder what the purpose of PUD THP migration can be.
> It does not create memory fragmentation, since it is the largest folio size
> we have and contiguous. NUMA balancing 1GB THP seems too much work.

Yeah this is exactly what I was thinking as well. It is going to be expensive
and difficult to migrate 1G pages, and I am not sure if what we get out of it
is worth it? I kept the splitting code in this RFC as I wanted to show that
it's possible to split and migrate, and the code to reject migration is a lot simpler.

> 
> BTW, I posted many questions, but that does not mean I object the patchset.
> I just want to understand your use case better, reduce unnecessary
> code changes, and hopefully get it upstreamed this time. :)
> 
> Thank you for the work.
> 

Ah no this is awesome! Thanks for the questions! It's basically the discussion I
wanted to start with the RFC.


[1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991




* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-03 23:29   ` Usama Arif
@ 2026-02-04  0:08     ` Frank van der Linden
  2026-02-05  5:46       ` Usama Arif
  2026-02-05 18:07     ` Zi Yan
  1 sibling, 1 reply; 52+ messages in thread
From: Frank van der Linden @ 2026-02-04  0:08 UTC (permalink / raw)
  To: Usama Arif
  Cc: Zi Yan, Andrew Morton, David Hildenbrand, lorenzo.stoakes,
	linux-mm, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On Tue, Feb 3, 2026 at 3:29 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 02/02/2026 08:24, Zi Yan wrote:
> > On 1 Feb 2026, at 19:50, Usama Arif wrote:
> >
> >> This is an RFC series to implement 1GB PUD-level THPs, allowing
> >> applications to benefit from reduced TLB pressure without requiring
> >> hugetlbfs. The patches are based on top of
> >> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
> >
> > It is nice to see you are working on 1GB THP.
> >
> >>
> >> Motivation: Why 1GB THP over hugetlbfs?
> >> =======================================
> >>
> >> While hugetlbfs provides 1GB huge pages today, it has significant limitations
> >> that make it unsuitable for many workloads:
> >>
> >> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
> >>    or runtime, taking memory away. This requires capacity planning,
> >>    administrative overhead, and makes workload orchastration much much more
> >>    complex, especially colocating with workloads that don't use hugetlbfs.
> >
> > But you are using CMA, the same allocation mechanism as hugetlb_cma. What
> > is the difference?
> >
>
> So we dont really need to use CMA. CMA can help a lot ofcourse, but we dont *need* it.
> For e.g. I can run the very simple case [1] of trying to get 1G pages in the upstream
> kernel without CMA on my server and it works. The server has been up for more than a week
> (so pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
> and I tried to get 20x1G pages on it and it worked.
> It uses folio_alloc_gigantic, which is exactly what this series uses:
>
> $ uptime -p
> up 1 week, 3 days, 5 hours, 7 minutes
> $ cat /proc/meminfo | grep -i cma
> CmaTotal:              0 kB
> CmaFree:               0 kB
> $ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 20
> $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 20
> $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
> Swap:          129Gi       3.5Gi       126Gi
> $ ./map_1g_hugepages
> Mapping 20 x 1GB huge pages (20 GB total)
> Mapped at 0x7f43c0000000
> Touched page 0 at 0x7f43c0000000
> Touched page 1 at 0x7f4400000000
> Touched page 2 at 0x7f4440000000
> Touched page 3 at 0x7f4480000000
> Touched page 4 at 0x7f44c0000000
> Touched page 5 at 0x7f4500000000
> Touched page 6 at 0x7f4540000000
> Touched page 7 at 0x7f4580000000
> Touched page 8 at 0x7f45c0000000
> Touched page 9 at 0x7f4600000000
> Touched page 10 at 0x7f4640000000
> Touched page 11 at 0x7f4680000000
> Touched page 12 at 0x7f46c0000000
> Touched page 13 at 0x7f4700000000
> Touched page 14 at 0x7f4740000000
> Touched page 15 at 0x7f4780000000
> Touched page 16 at 0x7f47c0000000
> Touched page 17 at 0x7f4800000000
> Touched page 18 at 0x7f4840000000
> Touched page 19 at 0x7f4880000000
> Unmapped successfully
>
>
>
>
> >>
> >> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
> >>    rather than falling back to smaller pages. This makes it fragile under
> >>    memory pressure.
> >
> > True.
> >
> >>
> >> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
> >>    is needed, leading to memory waste and preventing partial reclaim.
> >
> > Since you have PUD THP implementation, have you run any workload on it?
> > How often you see a PUD THP split?
> >
>
> Ah so running non upstream kernels in production is a bit more difficult
> (and also risky). I was trying to use the 512M experiment on arm as a comparison,
> although I know its not the same thing with PAGE_SIZE and pageblock order.
>
> I can try some other upstream benchmarks if it helps? Although will need to find
> ones that create VMA > 1G.
>
> > Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
> > any split stats to show the necessity of THP split?
> >
> >>
> >> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
> >>    be easily shared with regular memory pools.
> >
> > True.
> >
> >>
> >> PUD THP solves these limitations by integrating 1GB pages into the existing
> >> THP infrastructure.
> >
> > The main advantage of PUD THP over hugetlb is that it can be split and mapped
> > at sub-folio level. Do you have any data to support the necessity of them?
> > I wonder if it would be easier to just support 1GB folio in core-mm first
> > and we can add 1GB THP split and sub-folio mapping later. With that, we
> > can move hugetlb users to 1GB folio.
> >
>
> I would say its not the main advantage? But its definitely one of them.
> The 2 main areas where split would be helpful is munmap partial
> range and reclaim (MADV_PAGEOUT). For e.g. jemalloc/tcmalloc can now start
> taking advantge of 1G pages. My knowledge is not that great when it comes
> to memory allocators, but I believe they track for how long certain areas
> have been cold and can trigger reclaim as an example. Then split will be useful.
> Having memory allocators use hugetlb is probably going to be a no?
>
>
> > BTW, without split support, you can apply HVO to 1GB folio to save memory.
> > That is a disadvantage of PUD THP. Have you taken that into consideration?
> > Basically, switching from hugetlb to PUD THP, you will lose memory due
> > to vmemmap usage.
> >
>
> Yeah so HVO saves 16M per 1G, and the page depost mechanism adds ~2M as per 1G.
> We have HVO enabled in the meta fleet. I think we should not only think of PUD THP
> as a replacement for hugetlb, but to also enable further usescases where hugetlb
> would not be feasible.
>
> Ater the basic infrastructure for 1G is there, we can work on optimizing, I think
> there would be a a lot of interesting work we can do. HVO for 1G THP would be one
> of them?
>
> >>
> >> Performance Results
> >> ===================
> >>
> >> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
> >>
> >> Test: True Random Memory Access [1] test of 4GB memory region with pointer
> >> chasing workload (4M random pointer dereferences through memory):
> >>
> >> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
> >> |-------------------|---------------|---------------|--------------|
> >> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
> >> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
> >>
> >> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
> >> For long-running workloads this will be a one-off cost, and the 34%
> >> improvement in access latency provides significant benefit.
> >>
> >> ARM with 64K PAGE_SZIE supports 512M PMD THPs. In meta, we have a CPU
> >> bound workload running on a large number of ARM servers (256G). I enabled
> >> the 512M THP settings to always for a 100 servers in production (didn't
> >> really have high expectations :)). The average memory used for the workload
> >> increased from 217G to 233G. The amount of memory backed by 512M pages was
> >> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
> >> by 5.9% (This is a very significant improvment in workload performance).
> >> A significant number of these THPs were faulted in at application start when
> >> were present across different VMAs. Ofcourse getting these 512M pages is
> >> easier on ARM due to bigger PAGE_SIZE and pageblock order.
> >>
> >> I am hoping that these patches for 1G THP can be used to provide similar
> >> benefits for x86. I expect workloads to fault them in at start time when there
> >> is plenty of free memory available.
> >>
> >>
> >> Previous attempt by Zi Yan
> >> ==========================
> >>
> >> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
> >> significant changes in kernel since then, including folio conversion, mTHP
> >> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
> >> code as reference for making 1G PUD THP work. I am hoping Zi can provide
> >> guidance on these patches!
> >
> > I am more than happy to help you. :)
> >
>
> Thanks!!!
>
> >>
> >> Major Design Decisions
> >> ======================
> >>
> >> 1. No shared 1G zero page: The memory cost would be quite significant!
> >>
> >> 2. Page Table Pre-deposit Strategy
> >>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
> >>    page tables (one for each potential PMD entry after split).
> >>    We allocate a PMD page table and use its pmd_huge_pte list to store
> >>    the deposited PTE tables. This ensures split operations don't fail due
> >>    to page table allocation failures (at the cost of 2M per PUD THP)
> >>
> >> 3. Split to Base Pages
> >>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
> >>    directly to base pages (262,144 PTEs). The ideal thing would be to split
> >>    to 2M pages and then to 4K pages if needed. However, this would require
> >>    significant rmap and mapcount tracking changes.
> >>
> >> 4. COW and fork handling via split
> >>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
> >>    uses existing PTE-level COW infrastructure. Getting another 1G region is
> >>    hard and could fail. If only a 4K is written, copying 1G is a waste.
> >>    Probably this should only be done on CoW and not fork?
> >>
> >> 5. Migration via split
> >>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
> >>    to find a 1G continguous memory to migrate to. Maybe its better to not
> >>    allow migration of PUDs at all? I am more tempted to not allow migration,
> >>    but have kept splitting in this RFC.
> >
> > Without migration, PUD THP loses its flexibility and transparency. But with
> > its 1GB size, I also wonder what the purpose of PUD THP migration can be.
> > It does not create memory fragmentation, since it is the largest folio size
> > we have and contiguous. NUMA balancing 1GB THP seems too much work.
>
> Yeah this is exactly what I was thinking as well. It is going to be expensive
> and difficult to migrate 1G pages, and I am not sure if what we get out of it
> is worth it? I kept the splitting code in this RFC as I wanted to show that
> its possible to split and migrate and the rejecting migration code is a lot easier.
>
> >
> > BTW, I posted many questions, but that does not mean I object the patchset.
> > I just want to understand your use case better, reduce unnecessary
> > code changes, and hopefully get it upstreamed this time. :)
> >
> > Thank you for the work.
> >
>
> Ah no this is awesome! Thanks for the questions! Its basically the discussion I
> wanted to start with the RFC.
>
>
> [1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991
>
>

It looks like the scenario you're going for is an application that
allocates a sizeable chunk of memory upfront, and would like it to be
1G pages as much as possible, right?

You can do that with 1G THPs, the advantage being that any failures to
get 1G pages are not explicit, so you're not left with having to grow
the number of hugetlb pages yourself, and see how many you can use.

1G THPs seem useful for that. I don't recall all of the discussion
here, but I assume that hooking 1G THP support into khugepaged is
quite something else - the potential churn to get a 1G page could
well cause more system interference than you'd like.

The CMA scenario Rik was talking about is similar: you set
hugetlb_cma=NG, and then, when you need 1G pages, you grow the hugetlb
pool and use them. Disadvantage: you have to do it explicitly.

However, hugetlb_cma does give you a much larger chance of getting
those 1G pages. The example you give, 20 1G pages on a 1T system where
there is 292G free, isn't much of a problem in my experience. You
should have no problem getting that amount of 1G pages. Things get
more difficult when most of your memory is taken - hugetlb_cma really
helps there. E.g. we have systems that have 90% hugetlb_cma, and there
is a pretty good success rate converting back and forth between
hugetlb and normal page allocator pages with hugetlb_cma, while
operating close to that 90% hugetlb coverage. Without CMA, the success
rate drops quite a bit at that level.

CMA balancing is a related issue, for hugetlb. It fixes a problem that
has been known for years: the more memory you set aside for movable
only allocations (e.g. hugetlb_cma), the less breathing room you have
for unmovable allocations. So you risk the 'false OOM' scenario, where
the kernel can't make an unmovable allocation, even though there is
enough memory available, even outside of CMA. It's just that those
MOVABLE pageblocks were used for movable allocations. So ideally, you
would migrate those movable allocations to CMA under those
circumstances. Which is what CMA balancing does. It's worked out very
well for us in the scenario I list above (most memory being
hugetlb_cma).

Anyway, I'm rambling on a bit. Let's see if I got this right:

1G THP
  - advantage: transparent interface
  - disadvantage: no HVO, lower success rate under higher memory
    pressure than hugetlb_cma

hugetlb_cma
  - disadvantage: explicit interface, for higher values needs 'false
    OOM' avoidance
  - advantage: better success rate under pressure

I think 1G THPs are a good solution for "nice to have" scenarios, but
there will still be use cases where a higher success rate is preferred
and HugeTLB is preferred.

Lastly, there's also the ZONE_MOVABLE story. I think 1G THPs and
ZONE_MOVABLE could work well together, improving the success rate. But
then the issue of pinning rears its head again, and whether that
should be allowed or configurable per zone..

- Frank



* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02 11:20 ` Lorenzo Stoakes
@ 2026-02-04  1:00   ` Usama Arif
  2026-02-04 11:08     ` Lorenzo Stoakes
  0 siblings, 1 reply; 52+ messages in thread
From: Usama Arif @ 2026-02-04  1:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On 02/02/2026 03:20, Lorenzo Stoakes wrote:
> OK so this is somewhat unexpected :)
> 
> It would have been nice to discuss it in the THP cabal or at a conference
> etc. so we could discuss approaches ahead of time. Communication is important,
> especially with major changes like this.

Makes sense!

> 
> And PUD THP is especially problematic in that it requires pages that the page
> allocator can't give us, presumably you're doing something with CMA and... it's
> a whole kettle of fish.

So we don't need CMA. It helps of course, but we don't *need* it.
It's summarized in the first reply I gave to Zi in [1].

> 
> It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
> but it's kinda a weird sorta special case that we need to keep supporting.
> 
> There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
> mTHP (and really I want to see Nico's series land before we really consider
> this).

So I have numbers and experiments for page faults which are in the cover letter,
but not for khugepaged. I would be very surprised (although pleasantly :)) if
khugepaged by some magic finds 262144 pages that meet all the khugepaged requirements
to collapse the page. In the basic infrastructure support which this series is adding,
I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
approach that was taken for other mTHP sizes. We should go slow with 1G THPs.

> 
> So overall, I want to be very cautious and SLOW here. So let's please not drop
> the RFC tag until David and I are ok with that?
> 
> Also the THP code base is in _dire_ need of rework, and I don't really want to
> add major new features without us paying down some technical debt, to be honest.
> 
> So let's proceed with caution, and treat this as a very early bit of
> experimental code.
> 
> Thanks, Lorenzo

Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
I have a kernel with selftests for allocation, memory integrity, fork, partial munmap,
mprotect, reclaim and migration passing, and am running them with DEBUG_VM to make sure
we don't get VM bugs/warnings. The numbers are good, so I just wanted to share it
upstream and get your opinions! Basically to try and trigger a discussion similar to what
Zi asked in [2]! And also to see if someone could point out if there is something fundamental
we are missing in this series.

Thanks for the reviews! I really do appreciate it!

[1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
[2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/


* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-02 12:15   ` Lorenzo Stoakes
@ 2026-02-04  7:38     ` Usama Arif
  2026-02-04 12:55       ` Lorenzo Stoakes
  0 siblings, 1 reply; 52+ messages in thread
From: Usama Arif @ 2026-02-04  7:38 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team



On 02/02/2026 04:15, Lorenzo Stoakes wrote:
> I think I'm going to have to do several passes on this, so this is just a
> first one :)
> 

Thanks! Really appreciate the reviews!

One thing over here is the higher level design decision when it comes to migration
of 1G pages. As Zi said in [1]:
"I also wonder what the purpose of PUD THP migration can be.
It does not create memory fragmentation, since it is the largest folio size
we have and contiguous. NUMA balancing 1GB THP seems too much work."

> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>> For page table management, PUD THPs need to pre-deposit page tables
>> that will be used when the huge page is later split. When a PUD THP
>> is allocated, we cannot know in advance when or why it might need to
>> be split (COW, partial unmap, reclaim), but we need page tables ready
>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>> table, PUD THPs deposit a PMD table which itself contains deposited
>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>> deposited PMD.
> 
> This feels like you're hacking this support in, honestly. The list_head
> abuse only adds to that feeling.
> 

Yeah so I hope turning it into something like [2] is the way forward.

> And are we now not required to store rather a lot of memory to keep all of
> this coherent?

PMD THP allocates one 4K page (pte_alloc_one) at fault time so that split
doesn't fail.

For PUD we allocate 2M worth of PTE page tables and one 4K PMD table at fault
time so that split doesn't fail due to there not being enough memory.
It's not great, but it's not bad either.
The alternative is to allocate these at split time, so that we are not
pre-reserving them. But then there is a chance that the allocation, and therefore
the split, fails, so the tradeoff is some memory vs reliability. This patch favours
reliability.

Let's say a user gets 100x1G THPs. They would end up using ~200M for it.
I think that is OK-ish. If the user has 100G, 200M might not be an issue
for them :)
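
Rough arithmetic behind the ~200M figure, assuming 4K page tables with 512
entries each (x86-64 with 4K pages); the names below are illustrative only
and don't exist in the series:

```c
#include <stddef.h>

/* Assumed geometry: 4K page-table pages holding 512 entries each. */
#define PT_PAGE_SIZE_ASSUMED      4096UL
#define ENTRIES_PER_TABLE_ASSUMED 512UL

/* Bytes pre-deposited for nr_pud_thps PUD THPs: 512 PTE tables + 1 PMD table each. */
static unsigned long pud_thp_deposit_bytes(unsigned long nr_pud_thps)
{
	unsigned long pte_tables = ENTRIES_PER_TABLE_ASSUMED;
	unsigned long pmd_tables = 1;

	return nr_pud_thps * (pte_tables + pmd_tables) * PT_PAGE_SIZE_ASSUMED;
}

/* PTE entries needed to map one PUD THP after a full split to base pages. */
static unsigned long pud_thp_pte_entries(void)
{
	return ENTRIES_PER_TABLE_ASSUMED * ENTRIES_PER_TABLE_ASSUMED;
}
```

With these assumptions one PUD THP deposits 513 page-table pages (2M + 4K),
and 100 of them come to a bit over 200M.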

> 
>>
>> The deposited PMD tables are stored as a singly-linked stack using only
>> page->lru.next as the link pointer. A doubly-linked list using the
>> standard list_head mechanism would cause memory corruption: list_del()
>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>> when withdrawing PTE tables during split. PMD THPs don't have this
>> problem because their deposited PTE tables don't have sub-deposits.
>> Using only lru.next avoids the overlap entirely.
> 
> Yeah this is horrendous and a hack, I don't consider this at all
> upstreamable.
> 
> You need to completely rework this.

Hopefully [2] is the path forward!
> 
>>
>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>> have. The page_vma_mapped_walk() function is extended to recognize and
>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>> flag tells the unmap path to split PUD THPs before proceeding, since
>> there is no PUD-level migration entry format - the split converts the
>> single PUD mapping into individual PTE mappings that can be migrated
>> or swapped normally.
> 
> Individual PTE... mappings? You need to be a lot clearer here, page tables
> are naturally confusing with entries vs. tables.
> 
> Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
> tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
> 

Yes, that is correct, thanks! I will change the commit message in the next revision
to what you have written: 1 PMD table and 512 PTE tables reserved, spanning
1 PUD entry and 262,144 PTE entries.

>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> 
> How does this change interact with existing DAX/VFIO code, which now it
> seems will be subject to the mechanisms you introduce here?

I think what you mean here is the change in try_to_migrate_one?


So one 

> 
> Right now DAX/VFIO is only obtainable via a specially THP-aligned
> get_unmapped_area() + then can only be obtained at fault time.
> 
> Is that the intent here also?
> 

Ah thanks for pointing this out. This is something the series is missing.

What I did in the selftest and benchmark was fault on an address that was already aligned,
i.e. basically call the below function on the address before faulting in.

static inline void *pud_align(void *addr)
{
	return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
}
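
In case it helps anyone reproduce this: a userspace version needs its own
PUD_SIZE and some over-allocation, so that the aligned address still falls
inside the mapping. A sketch assuming x86-64 with 4K pages (1G PUD);
PUD_SIZE_ASSUMED, pud_align_up() and reserve_pud_aligned() are illustrative
names, not what the selftests in this series use:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: one PUD entry covers 1G (x86-64, 4K base pages). */
#define PUD_SIZE_ASSUMED (1UL << 30)

/* Round addr up to the next 1G boundary (no-op if already aligned). */
static void *pud_align_up(void *addr)
{
	uintptr_t a = (uintptr_t)addr;

	return (void *)((a + PUD_SIZE_ASSUMED - 1) & ~(PUD_SIZE_ASSUMED - 1));
}

/*
 * Map len + 1G of anonymous memory so that a 1G-aligned window of len
 * bytes is guaranteed to fit inside it. The raw mapping is handed back
 * through raw/raw_len so the caller can munmap() it later.
 * Returns NULL if the mmap() fails.
 */
static void *reserve_pud_aligned(size_t len, void **raw, size_t *raw_len)
{
	size_t padded = len + PUD_SIZE_ASSUMED;
	void *p = mmap(NULL, padded, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	*raw = p;
	*raw_len = padded;
	return pud_align_up(p);
}
```

The padding costs only virtual address space until the pages are touched, and
the unaligned head and tail can be unmapped separately if that matters.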


What I think you are suggesting this series is missing is the below diff? (It's untested.)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87b2c21df4a49..461158a0840db 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1236,6 +1236,12 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
        unsigned long ret;
        loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
+       if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && len >= PUD_SIZE) {
+               ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PUD_SIZE, vm_flags);
+               if (ret)
+                       return ret;
+       }
+


> What is your intent - that khugepaged do this, or on alloc? How does it
> interact with MADV_COLLAPSE?
> 

Ah basically what I mentioned in [3]: we want to go slow and only enable PUD THP
page faults at the start. If there is data supporting that khugepaged will work,
then we can do it, but until then we keep it disabled.

> I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
> alters __thp_vma_allowable_orders() behaviour, that change belongs here...
> 
> 

Thanks for this! I only tried to split this code into logical commits
after the whole thing was working. Some things are tightly coupled
and I would need to move them to the right commit.

>> ---
>>  include/linux/huge_mm.h  |  5 +++
>>  include/linux/mm.h       | 19 ++++++++
>>  include/linux/mm_types.h |  5 ++-
>>  include/linux/pgtable.h  |  8 ++++
>>  include/linux/rmap.h     |  7 ++-
>>  mm/huge_memory.c         |  8 ++++
>>  mm/internal.h            |  3 ++
>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index a4d9f964dfdea..e672e45bb9cc7 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>  		unsigned long address);
>>
>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>> +			   unsigned long address);
>>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
>>  		    unsigned long cp_flags);
>>  #else
>> +static inline void
>> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>> +		      unsigned long address) {}
>>  static inline int
>>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index ab2e7e30aef96..a15e18df0f771 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>>   * considered ready to switch to split PUD locks yet; there may be places
>>   * which need to be converted from page_table_lock.
>>   */
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +static inline struct page *pud_pgtable_page(pud_t *pud)
>> +{
>> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
>> +
>> +	return virt_to_page((void *)((unsigned long)pud & mask));
>> +}
>> +
>> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
>> +{
>> +	return page_ptdesc(pud_pgtable_page(pud));
>> +}
>> +
>> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
>> +#endif
>> +
>>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>>  {
>>  	return &mm->page_table_lock;
>> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>>  {
>>  	__pagetable_ctor(ptdesc);
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +	ptdesc->pud_huge_pmd = NULL;
>> +#endif
>>  }
>>
>>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 78950eb8926dc..26a38490ae2e1 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -577,7 +577,10 @@ struct ptdesc {
>>  		struct list_head pt_list;
>>  		struct {
>>  			unsigned long _pt_pad_1;
>> -			pgtable_t pmd_huge_pte;
>> +			union {
>> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
>> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
>> +			};
>>  		};
>>  	};
>>  	unsigned long __page_mapping;
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>>  #define arch_needs_pgtable_deposit() (false)
>>  #endif
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>> +					   pmd_t *pmd_table);
>> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
>> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
>> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
> 
> These are useless extern's.
> 


ack

These are coming from the existing functions from the file:
extern void pgtable_trans_huge_deposit
extern pgtable_t pgtable_trans_huge_withdraw

I think the externs can be removed from these as well? We can
fix those in a separate patch.


>> +#endif
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  /*
>>   * This is an implementation of pmdp_establish() that is only suitable for an
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index daa92a58585d9..08cd0a0eb8763 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -101,6 +101,7 @@ enum ttu_flags {
>>  					 * do a final flush if necessary */
>>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>>  					 * caller holds it */
>> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
>>  };
>>
>>  #ifdef CONFIG_MMU
>> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
>> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
>> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
>>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>  		unsigned long address, rmap_t flags);
>>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>>  	pgoff_t pgoff;
>>  	struct vm_area_struct *vma;
>>  	unsigned long address;
>> +	pud_t *pud;
>>  	pmd_t *pmd;
>>  	pte_t *pte;
>>  	spinlock_t *ptl;
>> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>>  static inline void
>>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>  {
>> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
>> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>>
>>  	if (likely(pvmw->ptl))
>>  		spin_unlock(pvmw->ptl);
>> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>  		WARN_ON_ONCE(1);
>>
>>  	pvmw->ptl = NULL;
>> +	pvmw->pud = NULL;
>>  	pvmw->pmd = NULL;
>>  	pvmw->pte = NULL;
>>  }
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 40cf59301c21a..3128b3beedb0a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>  	spin_unlock(ptl);
>>  	mmu_notifier_invalidate_range_end(&range);
>>  }
>> +
>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>> +			   unsigned long address)
>> +{
>> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
>> +	if (pud_trans_huge(*pud))
>> +		__split_huge_pud_locked(vma, pud, address);
>> +}
>>  #else
>>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>  		unsigned long address)
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 9ee336aa03656..21d5c00f638dc 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>>   * in mm/rmap.c:
>>   */
>>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
>> +#endif
>>
>>  /*
>>   * in mm/page_alloc.c
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index b38a1d00c971b..d31eafba38041 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>  	return true;
>>  }
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
>> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>> +{
>> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
>> +		return false;
>> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
>> +		return false;
>> +	return true;
>> +}
>> +#endif
>> +
>>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>>  {
>>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
>> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>  	pud_t *pud;
>>  	pmd_t pmde;
>>
>> +	/* The only possible pud mapping has been handled on last iteration */
>> +	if (pvmw->pud && !pvmw->pmd)
>> +		return not_found(pvmw);
>> +
>>  	/* The only possible pmd mapping has been handled on last iteration */
>>  	if (pvmw->pmd && !pvmw->pte)
>>  		return not_found(pvmw);
>> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>  			continue;
>>  		}
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> 
> Said it elsewhere, but it's really weird to treat an arch having the
> ability to do something as a go ahead for doing it.
> 
>> +		/* Check for PUD-mapped THP */
>> +		if (pud_trans_huge(*pud)) {
>> +			pvmw->pud = pud;
>> +			pvmw->ptl = pud_lock(mm, pud);
>> +			if (likely(pud_trans_huge(*pud))) {
>> +				if (pvmw->flags & PVMW_MIGRATION)
>> +					return not_found(pvmw);
>> +				if (!check_pud(pud_pfn(*pud), pvmw))
>> +					return not_found(pvmw);
>> +				return true;
>> +			}
>> +			/* PUD was split under us, retry at PMD level */
>> +			spin_unlock(pvmw->ptl);
>> +			pvmw->ptl = NULL;
>> +			pvmw->pud = NULL;
>> +		}
>> +#endif
>> +
> 
> Yeah, as I said elsewhere, we got to be refactoring not copy/pasting with
> modifications :)
> 

Yeah there is repeated code in multiple places, where all I did was take
what was done for PMD and replicate it for PUD. In a lot of places, it's actually
difficult to not repeat the code (unless we want function macros, which is much worse
IMO).
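
For the overlap test in particular, the PMD and PUD variants differ only in
how many pages the mapping covers. A rough userspace sketch of what a shared
helper could look like follows. The ex_* names and constants are hypothetical
and nothing here is from the series; whether this shape actually suits
mm/page_vma_mapped.c is untested.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_HPAGE_PMD_NR	(1UL << 9)	/* assumed: 512 base pages per PMD */
#define EX_HPAGE_PUD_NR	(1UL << 18)	/* assumed: 262,144 base pages per PUD */

/*
 * Hypothetical helper: does the mapping [map_pfn, map_pfn + map_nr) overlap
 * the range [pfn, pfn + nr)? Both nr values must be >= 1. Like check_pmd()
 * and check_pud(), it compares against the last PFN of each range, so the
 * one-past-the-end PFN is never computed.
 */
static bool ex_pfn_ranges_overlap(unsigned long map_pfn, unsigned long map_nr,
				  unsigned long pfn, unsigned long nr)
{
	assert(map_nr >= 1 && nr >= 1);

	if (map_pfn + map_nr - 1 < pfn)
		return false;
	if (map_pfn > pfn + nr - 1)
		return false;
	return true;
}

/* The per-level checks then become one-line wrappers. */
static bool ex_check_pmd(unsigned long map_pfn, unsigned long pfn,
			 unsigned long nr)
{
	return ex_pfn_ranges_overlap(map_pfn, EX_HPAGE_PMD_NR, pfn, nr);
}

static bool ex_check_pud(unsigned long map_pfn, unsigned long pfn,
			 unsigned long nr)
{
	return ex_pfn_ranges_overlap(map_pfn, EX_HPAGE_PUD_NR, pfn, nr);
}
```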
 
> 
>>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
>>  		/*
>>  		 * Make sure the pmd value isn't cached in a register by the
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index d3aec7a9926ad..2047558ddcd79 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +/*
>> + * Deposit page tables for PUD THP.
>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>> + *
>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>> + * list_head. This is because lru.prev (offset 16) overlaps with
>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
> 
> This is horrible and feels like a hack? Treating a doubly-linked list as a
> singly-linked one like this is not upstreamable.
> 
>> + *
>> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
>> + */
>> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>> +				    pmd_t *pmd_table)
> 
> This is a horrid, you're depositing the PMD using the... questionable
> list_head abuse, but then also have pud_deposit_pte()... But here we're
> depositing a PMD shouldn't the name reflect that?
> 
>> +{
>> +	pgtable_t pmd_page = virt_to_page(pmd_table);
>> +
>> +	assert_spin_locked(pud_lockptr(mm, pudp));
>> +
>> +	/* Push onto stack using only lru.next as the link */
>> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
> 
> Yikes...
> 
>> +	pud_huge_pmd(pudp) = pmd_page;
>> +}
>> +
>> +/*
>> + * Withdraw the deposited PMD table for PUD THP split or zap.
>> + * Called with PUD lock held.
>> + * Returns NULL if no more PMD tables are deposited.
>> + */
>> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
>> +{
>> +	pgtable_t pmd_page;
>> +
>> +	assert_spin_locked(pud_lockptr(mm, pudp));
>> +
>> +	pmd_page = pud_huge_pmd(pudp);
>> +	if (!pmd_page)
>> +		return NULL;
>> +
>> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
>> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
> 
> Where's the popping? You're just assigning here.


Ack on all of the above. Hopefully [1] is better.
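
On the "where's the popping" question: the quoted deposit/withdraw pair is an
intrusive LIFO stack, where reassigning the head pointer is the pop. A minimal
userspace sketch of that pattern is below, with hypothetical ex_* names and a
plain pointer standing in for lru.next. It only illustrates the approach being
discussed in this RFC patch, not the replacement described in [1].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a deposited PMD table page. */
struct ex_table {
	struct ex_table *next;	/* single link, plays the role of lru.next */
	int id;
};

/* Push: the new node points at the old head, then becomes the head. */
static void ex_deposit(struct ex_table **head, struct ex_table *t)
{
	assert(t != NULL);
	t->next = *head;
	*head = t;
}

/* Pop: return the current head and advance it; NULL if the stack is empty. */
static struct ex_table *ex_withdraw(struct ex_table **head)
{
	struct ex_table *t = *head;

	if (!t)
		return NULL;
	*head = t->next;	/* this assignment is what unlinks t */
	t->next = NULL;
	return t;
}
```

Tables come back in reverse order of deposit, which is fine for this use
since the deposited tables are interchangeable.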
> 
>> +
>> +	return page_address(pmd_page);
>> +}
>> +
>> +/*
>> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
>> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
>> + * No lock assertion since the PMD isn't visible yet.
>> + */
>> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
>> +{
>> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>> +
>> +	/* FIFO - add to front of list */
>> +	if (!ptdesc->pmd_huge_pte)
>> +		INIT_LIST_HEAD(&pgtable->lru);
>> +	else
>> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
>> +	ptdesc->pmd_huge_pte = pgtable;
>> +}
>> +
>> +/*
>> + * Withdraw a PTE table from a standalone PMD table.
>> + * Returns NULL if no more PTE tables are deposited.
>> + */
>> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
>> +{
>> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>> +	pgtable_t pgtable;
>> +
>> +	pgtable = ptdesc->pmd_huge_pte;
>> +	if (!pgtable)
>> +		return NULL;
>> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
>> +							struct page, lru);
>> +	if (ptdesc->pmd_huge_pte)
>> +		list_del(&pgtable->lru);
>> +	return pgtable;
>> +}
>> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>> +
>>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  		     pmd_t *pmdp)
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 7b9879ef442d9..69acabd763da4 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>>  	return pmd;
>>  }
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +/*
>> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
>> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
>> + * represents.
>> + */
>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
> 
> This series seems to be full of copy/paste.
> 
> It's just not acceptable given the state of THP code as I said in reply to
> the cover letter - you need to _refactor_ the code.
> 
> The code is bug-prone and difficult to maintain as-is, your series has to
> improve the technical debt, not add to it.
> 

In some cases we might not be able to avoid the copy, but this is definitely
a place where we don't need to. I will change it here. Thanks!

>> +{
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +	pud_t *pud = NULL;
>> +
>> +	pgd = pgd_offset(mm, address);
>> +	if (!pgd_present(*pgd))
>> +		goto out;
>> +
>> +	p4d = p4d_offset(pgd, address);
>> +	if (!p4d_present(*p4d))
>> +		goto out;
>> +
>> +	pud = pud_offset(p4d, address);
>> +out:
>> +	return pud;
>> +}
>> +#endif
>> +
>>  struct folio_referenced_arg {
>>  	int mapcount;
>>  	int referenced;
>> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>>  			SetPageAnonExclusive(page);
>>  			break;
>>  		case PGTABLE_LEVEL_PUD:
>> -			/*
>> -			 * Keep the compiler happy, we don't support anonymous
>> -			 * PUD mappings.
>> -			 */
>> -			WARN_ON_ONCE(1);
>> +			SetPageAnonExclusive(page);
>>  			break;
>>  		default:
>>  			BUILD_BUG();
>> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>>  #endif
>>  }
>>
>> +/**
>> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
>> + * @folio:	The folio to add the mapping to
>> + * @page:	The first page to add
>> + * @vma:	The vm area in which the mapping is added
>> + * @address:	The user virtual address of the first page to map
>> + * @flags:	The rmap flags
>> + *
>> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
>> + *
>> + * The caller needs to hold the page table lock, and the page must be locked in
>> + * the anon_vma case: to serialize mapping,index checking after setting.
>> + */
>> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
>> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
>> +{
>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
>> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
>> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
>> +			      PGTABLE_LEVEL_PUD);
>> +#else
>> +	WARN_ON_ONCE(true);
>> +#endif
>> +}
> 
> More copy/paste... Maybe unavoidable in this case, but be good to try.
> 
>> +
>>  /**
>>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>>   * @folio:	The folio to add the mapping to.
>> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>  		}
>>
>>  		if (!pvmw.pte) {
>> +			/*
>> +			 * Check for PUD-mapped THP first.
>> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
>> +			 * split the PUD to PMD level and restart the walk.
>> +			 */
> 
> This is literally describing the code below, it's not useful.

Ack, will remove this comment, thanks!
> 
>> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>> +				if (flags & TTU_SPLIT_HUGE_PUD) {
>> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
>> +					flags &= ~TTU_SPLIT_HUGE_PUD;
>> +					page_vma_mapped_walk_restart(&pvmw);
>> +					continue;
>> +				}
>> +			}
>> +
>>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>>  					goto walk_done;
>> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>  	mmu_notifier_invalidate_range_start(&range);
>>
>>  	while (page_vma_mapped_walk(&pvmw)) {
>> +		/* Handle PUD-mapped THP first */
> 
> How did/will this interact with DAX, VFIO PUD THP?

It won't interact with DAX. try_to_migrate() does the check below and just returns:

	if (folio_is_zone_device(folio) &&
	    (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
		return;

so DAX would never reach here.

I think VFIO pages are pinned and therefore can't be migrated? (I have
not looked at the VFIO code, I will try to get a better understanding tomorrow,
but please let me know if that sounds wrong.)


> 
>> +		if (!pvmw.pte && !pvmw.pmd) {
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> 
> Won't pud_trans_huge() imply this...
> 

Agreed, I think it should cover it.

>> +			/*
>> +			 * PUD-mapped THP: skip migration to preserve the huge
>> +			 * page. Splitting would defeat the purpose of PUD THPs.
>> +			 * Return false to indicate migration failure, which
>> +			 * will cause alloc_contig_range() to try a different
>> +			 * memory region.
>> +			 */
>> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>> +				page_vma_mapped_walk_done(&pvmw);
>> +				ret = false;
>> +				break;
>> +			}
>> +#endif
>> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
>> +			page_vma_mapped_walk_done(&pvmw);
>> +			break;
>> +		}
>> +
>>  		/* PMD-mapped THP migration entry */
>>  		if (!pvmw.pte) {
>>  			__maybe_unused unsigned long pfn;
>> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>>
>>  	/*
>>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
>> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>>  	 */
>>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>> -					TTU_SYNC | TTU_BATCH_FLUSH)))
>> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>>  		return;
>>
>>  	if (folio_is_zone_device(folio) &&
>> --
>> 2.47.3
>>
> 
> This isn't a final review, I'll have to look more thoroughly through here
> over time and you're going to have to be patient in general :)
> 
> Cheers, Lorenzo


Thanks for the review, this is awesome!


[1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/
[2] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
[3] https://lore.kernel.org/all/2efaa5ed-bd09-41f0-9c07-5cd6cccc4595@gmail.com/






* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02 15:50     ` Zi Yan
@ 2026-02-04 10:56       ` Lorenzo Stoakes
  2026-02-05 11:29         ` David Hildenbrand (arm)
  2026-02-05 11:22       ` David Hildenbrand (arm)
  1 sibling, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-02-04 10:56 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Rik van Riel, Usama Arif, Andrew Morton,
	linux-mm, hannes, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Frank van der Linden

On Mon, Feb 02, 2026 at 10:50:35AM -0500, Zi Yan wrote:
> On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:
>
> > On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
> >> On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
> >>>
> >>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages
> >>> at boot
> >>>    or runtime, taking memory away. This requires capacity planning,
> > >>>    administrative overhead, and makes workload orchestration much
> >>> much more
> >>>    complex, especially colocating with workloads that don't use
> >>> hugetlbfs.
> >>>
> >> To address the obvious objection "but how could we
> >> possibly allocate 1GB huge pages while the workload
> >> is running?", I am planning to pick up the CMA balancing
> >> patch series (thank you, Frank) and get that in an
> >> upstream ready shape soon.
> >>
> >> https://lkml.org/2025/9/15/1735
> >
> > That link doesn't work?
> >
> > Did a quick search for CMA balancing on lore, couldn't find anything, could you
> > provide a lore link?
>
> https://lwn.net/Articles/1038263/
>
> >
> >>
> >> That patch set looks like another case where no
> >> amount of internal testing will find every single
> >> corner case, and we'll probably just want to
> >> merge it upstream, deploy it experimentally, and
> >> aggressively deal with anything that might pop up.
> >
> > I'm not really in favour of this kind of approach. There's plenty of things that
> > were considered 'temporary' upstream that became rather permanent :)
> >
> > Maybe we can't cover all corner-cases, but we need to make sure whatever we do
> > send upstream is maintainable, conceptually sensible and doesn't paint us into
> > any corners, etc.
> >
> >>
> >> With CMA balancing, it would be possible to just
> >> have half (or even more) of system memory for
> >> movable allocations only, which would make it possible
> >> to allocate 1GB huge pages dynamically.
> >
> > Could you expand on that?
>
> I also would like to hear David’s opinion on using CMA for 1GB THP.
> He did not like it[1] when I posted my patch back in 2020, but it has
> been more than 5 years. :)

Yes please David :)

I find the idea of using the CMA for this a bit gross. And I fear we're
essentially expanding the hacks for DAX to everyone.

Again I really feel that we should be tackling technical debt here, rather
than adding features on shaky foundations and just making things worse.

We are inundated with series-after-series for THP trying to add features
but really not very many that are tackling this debt, and I think it's time
to get firmer about that.

>
> The other direction I explored is to get 1GB THP from buddy allocator.
> That means we need to:
> 1. bump MAX_PAGE_ORDER to 18 or make it a runtime variable so that only 1GB
>    THP users need to bump it,

Would we need to bump the page block size too to stand more of a chance of
avoiding fragmentation?

Doing that though would result in reserves being way higher and thus more
memory used and we'd be in the territory of the unresolved issues with 64
KB page size kernels :)

> 2. handle cross memory section PFN merge in buddy allocator,

Ugh god...

> 3. improve anti-fragmentation mechanism for 1GB range compaction.

I think we'd really need something like this. Obviously there's the series
Rik refers to.

I mean CMA itself feels like a hack, though efforts are being made to at
least make it more robust (series mentioned, also the guaranteed CMA stuff
from Suren).

>
> 1 is easier-ish[2]. I have not looked into 2 and 3 much yet.
>
> [1] https://lore.kernel.org/all/52bc2d5d-eb8a-83de-1c93-abd329132d58@redhat.com/
> [2] https://lore.kernel.org/all/20210805190253.2795604-1-zi.yan@sent.com/
>
>
> Best Regards,
> Yan, Zi

Cheers, Lorenzo



* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-04  1:00   ` Usama Arif
@ 2026-02-04 11:08     ` Lorenzo Stoakes
  2026-02-04 11:50       ` Dev Jain
  2026-02-05  6:08       ` Usama Arif
  0 siblings, 2 replies; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-02-04 11:08 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On Tue, Feb 03, 2026 at 05:00:10PM -0800, Usama Arif wrote:
>
>
> On 02/02/2026 03:20, Lorenzo Stoakes wrote:
> > OK so this is somewhat unexpected :)
> >
> > It would have been nice to discuss it in the THP cabal or at a conference
> > etc. so we could discuss approaches ahead of time. Communication is important,
> > especially with major changes like this.
>
> Makes sense!
>
> >
> > And PUD THP is especially problematic in that it requires pages that the page
> > allocator can't give us, presumably you're doing something with CMA and... it's
> > a whole kettle of fish.
>
> So we don't need CMA. It helps of course, but we don't *need* it.
> It's summarized in the first reply I gave to Zi in [1]:
>
> >
> > It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
> > but it's kinda a weird sorta special case that we need to keep supporting.
> >
> > There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
> > mTHP (and really I want to see Nico's series land before we really consider
> > this).
>
>
> So I have numbers and experiments for page faults which are in the cover letter,
> but not for khugepaged. I would be very surprised (although pleasantly :)) if
> khugepaged by some magic finds 262144 pages that meet all the khugepaged requirements
> to collapse the page. In the basic infrastructure support which this series is adding,
> I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
> approach that was taken in other mTHP sizes. We should go slow with 1G THPs.

Yes we definitely want to limit to page faults for now.

But keep in mind for that to be viable you'd surely need to update who gets
appropriate alignment in __get_unmapped_area()... not read through series far
enough to see so not sure if you update that though!

I guess that'd be the sanest place to start: if an allocation _size_ is aligned
to 1 GB, then align the unmapped area _address_ to 1 GB for maximum chance of 1 GB
fault-in.

Oh by the way I made some rough THP notes at
https://publish.obsidian.md/mm/Transparent+Huge+Pages+(THP) which are helpful
for reminding me about what does what where, useful for a top-down view of how
things are now.

>
> >
> > So overall, I want to be very cautious and SLOW here. So let's please not drop
> > the RFC tag until David and I are ok with that?
> >
> > Also the THP code base is in _dire_ need of rework, and I don't really want to
> > add major new features without us paying down some technical debt, to be honest.
> >
> > So let's proceed with caution, and treat this as a very early bit of
> > experimental code.
> >
> > Thanks, Lorenzo
>
> Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
> I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
> mprotect, reclaim and migration passing and am running them with DEBUG_VM to make sure
> we don't get the VM bugs/warnings and the numbers are good, so just wanted to share it
> upstream and get your opinions! Basically try and trigger a discussion similar to what
> Zi asked in [2]! And also if someone could point out if there is something fundamental
> we are missing in this series.

Well that's fair enough :)

But do come to a THP cabal so we can chat, face-to-face (ok, digital face to
digital face ;). It's usually a force-multiplier I find, esp. if multiple people
have input which I think is the case here. We're friendly :)

In any case, conversations are already kicking off so that's definitely positive!

I think we will definitely get there with this at _some point_ but I would urge
patience and also I really want to underline my desire for us in THP to start
paying down some of this technical debt.

I know people are already making efforts (Vernon, Luiz), and sorry that I've not
been great at review recently (should be gradually increasing over time), but I
feel that for large features to be added like this now we really do require some
refactoring work before we take it.

We definitely need to rebase this once Nico's series lands (should do next
cycle) and think about how it plays with this, I'm not sure if arm64 supports
mTHP between PMD and PUD size (Dev? Do you know?) so maybe that one is moot, but
in general want to make sure it plays nice.

>
> Thanks for the reviews! Really do appreciate it!

No worries! :)

>
> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
> [2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/

Cheers, Lorenzo


* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-04 11:08     ` Lorenzo Stoakes
@ 2026-02-04 11:50       ` Dev Jain
  2026-02-04 12:01         ` Dev Jain
  2026-02-05  6:08       ` Usama Arif
  1 sibling, 1 reply; 52+ messages in thread
From: Dev Jain @ 2026-02-04 11:50 UTC (permalink / raw)
  To: Lorenzo Stoakes, Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team


On 04/02/26 4:38 pm, Lorenzo Stoakes wrote:
> On Tue, Feb 03, 2026 at 05:00:10PM -0800, Usama Arif wrote:
>>
>> On 02/02/2026 03:20, Lorenzo Stoakes wrote:
>>> OK so this is somewhat unexpected :)
>>>
>>> It would have been nice to discuss it in the THP cabal or at a conference
>>> etc. so we could discuss approaches ahead of time. Communication is important,
>>> especially with major changes like this.
>> Makes sense!
>>
>>> And PUD THP is especially problematic in that it requires pages that the page
>>> allocator can't give us, presumably you're doing something with CMA and... it's
>>> a whole kettle of fish.
>> So we don't need CMA. It helps of course, but we don't *need* it.
>> It's summarized in the first reply I gave to Zi in [1]:
>>
>>> It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
>>> but it's kinda a weird sorta special case that we need to keep supporting.
>>>
>>> There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
>>> mTHP (and really I want to see Nico's series land before we really consider
>>> this).
>>
>> So I have numbers and experiments for page faults which are in the cover letter,
>> but not for khugepaged. I would be very surprised (although pleasantly :)) if
>> khugepaged by some magic finds 262144 pages that meet all the khugepaged requirements
>> to collapse the page. In the basic infrastructure support which this series is adding,
>> I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
>> approach that was taken in other mTHP sizes. We should go slow with 1G THPs.
> Yes we definitely want to limit to page faults for now.
>
> But keep in mind for that to be viable you'd surely need to update who gets
> appropriate alignment in __get_unmapped_area()... not read through series far
> enough to see so not sure if you update that though!
>
> I guess that'd be the sanest place to start, if an allocation _size_ is aligned
> 1 GB, then align the unmapped area _address_ to 1 GB for maximum chance of 1 GB
> fault-in.
>
> Oh by the way I made some rough THP notes at
> https://publish.obsidian.md/mm/Transparent+Huge+Pages+(THP) which are helpful
> for reminding me about what does what where, useful for a top-down view of how
> things are now.
>
>>> So overall, I want to be very cautious and SLOW here. So let's please not drop
>>> the RFC tag until David and I are ok with that?
>>>
>>> Also the THP code base is in _dire_ need of rework, and I don't really want to
>>> add major new features without us paying down some technical debt, to be honest.
>>>
>>> So let's proceed with caution, and treat this as a very early bit of
>>> experimental code.
>>>
>>> Thanks, Lorenzo
>> Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
>> I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
>> mprotect, reclaim and migration passing and am running them with DEBUG_VM to make sure
>> we dont get the VM bugs/warnings and the numbers are good, so just wanted to share it
>> upstream and get your opinions! Basically try and trigger a discussion similar to what
>> Zi asked in [2]! And also if someone could point out if there is something fundamental
>> we are missing in this series.
> Well that's fair enough :)
>
> But do come to a THP cabal so we can chat, face-to-face (ok, digital face to
> digital face ;). It's usually a force-multiplier I find, esp. if multiple people
> have input which I think is the case here. We're friendly :)
>
> In any case, conversations are already kicking off so that's definitely positive!
>
> I think we will definitely get there with this at _some point_ but I would urge
> patience and also I really want to underline my desire for us in THP to start
> paying down some of this technical debt.
>
> I know people are already making efforts (Vernon, Luiz), and sorry that I've not
> been great at review recently (should be gradually increasing over time), but I
> feel that for large features to be added like this now we really do require some
> refactoring work before we take it.
>
> We definitely need to rebase this once Nico's series lands (should do next
> cycle) and think about how it plays with this, I'm not sure if arm64 supports
> mTHP between PMD and PUD size (Dev? Do you know?) so maybe that one is moot, but

arm64 does support cont mappings at the PMD level. Currently, they are supported
for kernel page tables and hugetlb pages; you can search for "CONT_PMD" in
the codebase. So cont PMD is only supported in the "static" case: there is
no dynamic folding/unfolding of the cont bit at the PMD level, which mTHP requires.

I see that this patchset splits the PUD all the way down to PTEs. If we were to
split it down to PMDs, and add arm64 support for dynamic cont mappings at the PMD
level, it would be nicer. But I guess there is some mapcount/rmap stuff involved
here stopping us from doing that :(

> in general want to make sure it plays nice.
>
>> Thanks for the reviews! Really do appreciate it!
> No worries! :)
>
>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
>> [2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/
> Cheers, Lorenzo
>


* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-04 11:50       ` Dev Jain
@ 2026-02-04 12:01         ` Dev Jain
  0 siblings, 0 replies; 52+ messages in thread
From: Dev Jain @ 2026-02-04 12:01 UTC (permalink / raw)
  To: Lorenzo Stoakes, Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, lance.yang, linux-kernel, kernel-team


On 04/02/26 5:20 pm, Dev Jain wrote:
> On 04/02/26 4:38 pm, Lorenzo Stoakes wrote:
>> On Tue, Feb 03, 2026 at 05:00:10PM -0800, Usama Arif wrote:
>>> On 02/02/2026 03:20, Lorenzo Stoakes wrote:
>>>> OK so this is somewhat unexpected :)
>>>>
>>>> It would have been nice to discuss it in the THP cabal or at a conference
>>>> etc. so we could discuss approaches ahead of time. Communication is important,
>>>> especially with major changes like this.
>>> Makes sense!
>>>
>>>> And PUD THP is especially problematic in that it requires pages that the page
>>>> allocator can't give us, presumably you're doing something with CMA and... it's
>>>> a whole kettle of fish.
>>> So we don't need CMA. It helps of course, but we don't *need* it.
>>> It's summarized in the first reply I gave to Zi in [1]:
>>>
>>>> It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
>>>> but it's kinda a weird sorta special case that we need to keep supporting.
>>>>
>>>> There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
>>>> mTHP (and really I want to see Nico's series land before we really consider
>>>> this).
>>> So I have numbers and experiments for page faults which are in the cover letter,
>>> but not for khugepaged. I would be very surprised (although pleasantly :)) if
>>> khugepaged by some magic finds 262144 pages that meet all the khugepaged requirements
>>> to collapse the page. In the basic infrastructure support which this series is adding,
>>> I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
>>> approach that was taken in other mTHP sizes. We should go slow with 1G THPs.
>> Yes we definitely want to limit to page faults for now.
>>
>> But keep in mind for that to be viable you'd surely need to update who gets
>> appropriate alignment in __get_unmapped_area()... not read through series far
>> enough to see so not sure if you update that though!
>>
>> I guess that'd be the sanest place to start, if an allocation _size_ is aligned
>> 1 GB, then align the unmapped area _address_ to 1 GB for maximum chance of 1 GB
>> fault-in.
>>
>> Oh by the way I made some rough THP notes at
>> https://publish.obsidian.md/mm/Transparent+Huge+Pages+(THP) which are helpful
>> for reminding me about what does what where, useful for a top-down view of how
>> things are now.
>>
>>>> So overall, I want to be very cautious and SLOW here. So let's please not drop
>>>> the RFC tag until David and I are ok with that?
>>>>
>>>> Also the THP code base is in _dire_ need of rework, and I don't really want to
>>>> add major new features without us paying down some technical debt, to be honest.
>>>>
>>>> So let's proceed with caution, and treat this as a very early bit of
>>>> experimental code.
>>>>
>>>> Thanks, Lorenzo
>>> Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
>>> I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
>>> mprotect, reclaim and migration passing and am running them with DEBUG_VM to make sure
>>> we dont get the VM bugs/warnings and the numbers are good, so just wanted to share it
>>> upstream and get your opinions! Basically try and trigger a discussion similar to what
>>> Zi asked in [2]! And also if someone could point out if there is something fundamental
>>> we are missing in this series.
>> Well that's fair enough :)
>>
>> But do come to a THP cabal so we can chat, face-to-face (ok, digital face to
>> digital face ;). It's usually a force-multiplier I find, esp. if multiple people
>> have input which I think is the case here. We're friendly :)
>>
>> In any case, conversations are already kicking off so that's definitely positive!
>>
>> I think we will definitely get there with this at _some point_ but I would urge
>> patience and also I really want to underline my desire for us in THP to start
>> paying down some of this technical debt.
>>
>> I know people are already making efforts (Vernon, Luiz), and sorry that I've not
>> been great at review recently (should be gradually increasing over time), but I
>> feel that for large features to be added like this now we really do require some
>> refactoring work before we take it.
>>
>> We definitely need to rebase this once Nico's series lands (should do next
>> cycle) and think about how it plays with this, I'm not sure if arm64 supports
>> mTHP between PMD and PUD size (Dev? Do you know?) so maybe that one is moot, but
> arm64 does support cont mappings at the PMD level. Currently, they are supported
> for kernel pagetables, and hugetlbpages. You may search around for "CONT_PMD" in
> the codebase. Hence it only supports cont PMD in the "static" case, there is
> no dynamic folding/unfolding of the cont bit at the PMD level, which mTHP requires.
>
> I see that this patchset splits PUD all the way down to PTEs. If we were to split
> it down to PMD, and add arm64 support for dynamic cont mappings at the PMD level,
> it will be nicer. But I guess there is some mapcount/rmap stuff involved
> here stopping us from doing that :(

Hmm, this won't make a difference w.r.t. cont PMD. If we were to split the PUD
folio down to PMD folios, we wouldn't get cont PMD. But yes, in general PMD
mappings are nicer.

>
>> in general want to make sure it plays nice.
>>
>>> Thanks for the reviews! Really do appreciate it!
>> No worries! :)
>>
>>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
>>> [2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/
>> Cheers, Lorenzo
>>


* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-04  7:38     ` Usama Arif
@ 2026-02-04 12:55       ` Lorenzo Stoakes
  2026-02-05  6:40         ` Usama Arif
  0 siblings, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-02-04 12:55 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On Tue, Feb 03, 2026 at 11:38:02PM -0800, Usama Arif wrote:
>
>
> On 02/02/2026 04:15, Lorenzo Stoakes wrote:
> > I think I'm going to have to do several passes on this, so this is just a
> > first one :)
> >
>
> Thanks! Really appreciate the reviews!

No worries!

>
> One thing over here is the higher level design decision when it comes to migration
> of 1G pages. As Zi said in [1]:
> "I also wonder what the purpose of PUD THP migration can be.
> It does not create memory fragmentation, since it is the largest folio size
> we have and contiguous. NUMA balancing 1GB THP seems too much work."
>
> > On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
> >> For page table management, PUD THPs need to pre-deposit page tables
> >> that will be used when the huge page is later split. When a PUD THP
> >> is allocated, we cannot know in advance when or why it might need to
> >> be split (COW, partial unmap, reclaim), but we need page tables ready
> >> for that eventuality. Similar to how PMD THPs deposit a single PTE
> >> table, PUD THPs deposit a PMD table which itself contains deposited
> >> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
> >> infrastructure and a new pud_huge_pmd field in ptdesc to store the
> >> deposited PMD.
> >
> > This feels like you're hacking this support in, honestly. The list_head
> > abuse only adds to that feeling.
> >
>
> Yeah so I hope turning it to something like [2] is the way forward.

Right, that's one option, though David suggested avoiding this altogether by
only pre-allocating PTEs?

>
> > And are we now not required to store rather a lot of memory to keep all of
> > this coherent?
>
> PMD THP allocates 1 4K page (pte_alloc_one) at fault time so that split
> doesn't fail.
>
> For PUD we allocate 2M worth of PTE page tables and 1 4K PMD table at fault
> time so that split doesn't fail due to there not being enough memory.
> It's not great, but it's not bad either.
> The alternative is to allocate this at split time, so we are not
> pre-reserving them. Then there is a chance that the allocation, and therefore
> the split, fails, so the tradeoff is some memory vs reliability. This patch
> favours reliability.

That's a significant amount of unmovable, unreclaimable memory though. Going
from 4K to 2M is a pretty huge uptick.

>
> Lets say a user gets 100x1G THPs. They would end up using ~200M for it.
> I think that is okish. If the user has 100G, 200M might not be an issue
> for them :)

But there's more than one user on boxes big enough for this, so this makes me
think we want this to be somehow opt-in right?

And that means we're incurring an unmovable memory penalty, the kind which we're
trying to avoid in general elsewhere in the kernel.
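
For reference, the deposit arithmetic under discussion can be sketched in plain
userspace C. This is only an illustration, assuming x86-64 with 4K base pages and
512 entries per page table; the sketch_* names are made up for this example and
are not from the series.

```c
/* Assumed geometry: x86-64, 4K base pages, 512 entries per page table. */
#define SKETCH_PAGE_SIZE      4096UL
#define SKETCH_PTRS_PER_TABLE 512UL

/* PTE entries spanned by one PUD entry: 512 PMD entries x 512 PTE entries. */
unsigned long sketch_ptes_per_pud(void)
{
	return SKETCH_PTRS_PER_TABLE * SKETCH_PTRS_PER_TABLE;
}

/* Bytes deposited per PUD THP: 1 PMD table plus 512 PTE tables, 4K each. */
unsigned long sketch_deposit_bytes_per_pud_thp(void)
{
	return (1UL + SKETCH_PTRS_PER_TABLE) * SKETCH_PAGE_SIZE;
}

/* Total deposited bytes for nr_pud_thps PUD THPs. */
unsigned long sketch_deposit_bytes(unsigned long nr_pud_thps)
{
	return nr_pud_thps * sketch_deposit_bytes_per_pud_thp();
}
```

With these assumptions one PUD THP pins 513 page-table pages (2M + 4K), so 100
of them pin roughly 200M, which matches the figure quoted above.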

>
> >
> >>
> >> The deposited PMD tables are stored as a singly-linked stack using only
> >> page->lru.next as the link pointer. A doubly-linked list using the
> >> standard list_head mechanism would cause memory corruption: list_del()
> >> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
> >> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
> >> tables have their own deposited PTE tables stored in pmd_huge_pte,
> >> poisoning lru.prev would corrupt the PTE table list and cause crashes
> >> when withdrawing PTE tables during split. PMD THPs don't have this
> >> problem because their deposited PTE tables don't have sub-deposits.
> >> Using only lru.next avoids the overlap entirely.
> >
> > Yeah this is horrendous and a hack, I don't consider this at all
> > upstreamable.
> >
> > You need to completely rework this.
>
> Hopefully [2] is the path forward!

Ack
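
To make the objection concrete, here is a minimal userspace model of the
"singly-linked stack through one pointer" scheme the commit message describes.
It is a sketch only: struct sketch_table is a hypothetical stand-in, not the
kernel's struct ptdesc, and sub_deposit merely models the field (pmd_huge_pte)
that a full list_del() would poison.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page-table descriptor (not struct ptdesc). */
struct sketch_table {
	struct sketch_table *next; /* the only link used, like lru.next */
	void *sub_deposit;         /* models pmd_huge_pte; must stay intact */
};

/* Push: the new table points at the old top, then becomes the top. */
void sketch_push(struct sketch_table **top, struct sketch_table *t)
{
	t->next = *top;
	*top = t;
}

/* Pop: return the top (or NULL if empty) and advance the head. */
struct sketch_table *sketch_pop(struct sketch_table **top)
{
	struct sketch_table *t = *top;

	if (!t)
		return NULL;
	*top = t->next;
	t->next = NULL; /* only the link is touched, never sub_deposit */
	return t;
}
```

A dedicated structure with an explicit link like this is roughly what "not
reusing list_head" would look like; whether it belongs in ptdesc is exactly the
open design question.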

> >
> >>
> >> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
> >> have. The page_vma_mapped_walk() function is extended to recognize and
> >> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
> >> flag tells the unmap path to split PUD THPs before proceeding, since
> >> there is no PUD-level migration entry format - the split converts the
> >> single PUD mapping into individual PTE mappings that can be migrated
> >> or swapped normally.
> >
> > Individual PTE... mappings? You need to be a lot clearer here, page tables
> > are naturally confusing with entries vs. tables.
> >
> > Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
> > tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
> >
>
> Yes that is correct, Thanks! I will change the commit message in the next revision
> to what you have written: 1 PMD table and 512 PTE tables reserved, spanning
> 1 PUD entry and 262,144 PTE entries.

Yeah :) my concerns remain :)

>
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >
> > How does this change interact with existing DAX/VFIO code, which now it
> > seems will be subject to the mechanisms you introduce here?
>
> I think what you mean here is the change in try_to_migrate_one?
>
>
> So one

Unfinished sentence? :P

No I mean currently we support 1G THP for DAX/VFIO right? So how does this
interplay with how that currently works? Does that change how DAX/VFIO works?
Will that impact existing users?

Or are we extending the existing mechanism?

>
> >
> > Right now DAX/VFIO is only obtainable via a specially THP-aligned
> > get_unmapped_area() + then can only be obtained at fault time.
> >
> > Is that the intent here also?
> >
>
> Ah thanks for pointing this out. This is something the series is missing.
>
> What I did in the selftest and benchmark was fault on an address that was already aligned.
> i.e. basically call the below function before faulting in.
>
> static inline void *pud_align(void *addr)
> {
> 	return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
> }

Right yeah :)
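
For anyone wanting to reproduce that from userspace, one way to get an aligned
address is to over-reserve by 1G and round up. The sketch below assumes a 1 GiB
PUD size (x86-64, 4K pages); sketch_map_pud_aligned() and its parameters are
hypothetical helpers for illustration, not taken from the selftests.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define SKETCH_PUD_SIZE (1UL << 30) /* assumed: 1 GiB PUD */

/* Round addr up to the next 1 GiB boundary (same arithmetic as pud_align()). */
void *sketch_pud_align(void *addr)
{
	return (void *)(((uintptr_t)addr + SKETCH_PUD_SIZE - 1) &
			~(SKETCH_PUD_SIZE - 1));
}

/*
 * Reserve len + 1 GiB of anonymous address space so a 1 GiB-aligned window
 * of len bytes is guaranteed to fit inside it. Returns the aligned start,
 * or NULL on failure. *base_out and *total_out describe the whole
 * reservation so the caller can munmap() it later.
 */
void *sketch_map_pud_aligned(size_t len, void **base_out, size_t *total_out)
{
	size_t total = len + SKETCH_PUD_SIZE;
	void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (base == MAP_FAILED)
		return NULL;
	*base_out = base;
	*total_out = total;
	return sketch_pud_align(base);
}
```

The caller faults on the returned pointer and eventually calls
munmap(base, total). This wastes up to 1G of address space per mapping, which
is why doing the alignment in the kernel is the better long-term answer.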

>
>
> What I think you are suggesting this series is missing is the below diff? (it's untested).
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 87b2c21df4a49..461158a0840db 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1236,6 +1236,12 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>         unsigned long ret;
>         loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>
> +       if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && len >= PUD_SIZE) {
> +               ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PUD_SIZE, vm_flags);
> +               if (ret)
> +                       return ret;
> +       }

No not that, that's going to cause issues, see commit d4148aeab4 for details as
to why this can go wrong.

I mean in __get_unmapped_area(), alongside the current 'if PMD size aligned then
align area' logic, something like that.
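
As a rough model of that decision (not kernel code), the size-based choice could
look like the function below. The sizes assume x86-64 with 4K pages, and the
"anonymous mapping, no address hint" conditions are an assumption about how the
existing PMD case is gated; sketch_pick_thp_align() is a made-up name.

```c
#include <stdbool.h>

#define SKETCH_PMD_SIZE (1UL << 21) /* assumed: 2 MiB */
#define SKETCH_PUD_SIZE (1UL << 30) /* assumed: 1 GiB */

/*
 * Return the alignment (in bytes) to request for the mapping address,
 * or 0 for no special alignment. Larger alignments are tried first.
 */
unsigned long sketch_pick_thp_align(unsigned long len, bool is_file,
				    bool has_hint)
{
	/* Assumption: only hint-less anonymous mappings get aligned. */
	if (is_file || has_hint || !len)
		return 0;
	if (len % SKETCH_PUD_SIZE == 0)
		return SKETCH_PUD_SIZE;
	if (len % SKETCH_PMD_SIZE == 0)
		return SKETCH_PMD_SIZE;
	return 0;
}
```

The point of keying on the length being a multiple of the huge page size, rather
than merely being at least that large, is to avoid padding every big mapping.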

> +
>
>
> > What is your intent - that khugepaged do this, or on alloc? How does it
> > interact with MADV_COLLAPSE?
> >
>
> Ah basically what I mentioned in [3], we want to go slow. Only enable PUD THP
> page faults at the start. If there is data supporting that khugepaged will work
> then we enable it, but until then we keep it disabled.

Yes, I think khugepaged is probably never going to be all that good an idea with
this.

>
> > I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
> > alters __thp_vma_allowable_orders() behaviour, that change belongs here...
> >
> >
>
> Thanks for this! I only tried to split this code into logical commits
> after the whole thing was working. Some things are tightly coupled
> and I would need to move them to the right commit.

Yes there's a bunch of things that need tweaking here, to reiterate let's try to
pay down technical debt here and avoid copy/pasting :>)

>
> >> ---
> >>  include/linux/huge_mm.h  |  5 +++
> >>  include/linux/mm.h       | 19 ++++++++
> >>  include/linux/mm_types.h |  5 ++-
> >>  include/linux/pgtable.h  |  8 ++++
> >>  include/linux/rmap.h     |  7 ++-
> >>  mm/huge_memory.c         |  8 ++++
> >>  mm/internal.h            |  3 ++
> >>  mm/page_vma_mapped.c     | 35 +++++++++++++++
> >>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
> >>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
> >>  10 files changed, 260 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index a4d9f964dfdea..e672e45bb9cc7 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  		unsigned long address);
> >>
> >>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +			   unsigned long address);
> >>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
> >>  		    unsigned long cp_flags);
> >>  #else
> >> +static inline void
> >> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +		      unsigned long address) {}
> >>  static inline int
> >>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index ab2e7e30aef96..a15e18df0f771 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
> >>   * considered ready to switch to split PUD locks yet; there may be places
> >>   * which need to be converted from page_table_lock.
> >>   */
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +static inline struct page *pud_pgtable_page(pud_t *pud)
> >> +{
> >> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
> >> +
> >> +	return virt_to_page((void *)((unsigned long)pud & mask));
> >> +}
> >> +
> >> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
> >> +{
> >> +	return page_ptdesc(pud_pgtable_page(pud));
> >> +}
> >> +
> >> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
> >> +#endif
> >> +
> >>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
> >>  {
> >>  	return &mm->page_table_lock;
> >> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
> >>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
> >>  {
> >>  	__pagetable_ctor(ptdesc);
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +	ptdesc->pud_huge_pmd = NULL;
> >> +#endif
> >>  }
> >>
> >>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 78950eb8926dc..26a38490ae2e1 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -577,7 +577,10 @@ struct ptdesc {
> >>  		struct list_head pt_list;
> >>  		struct {
> >>  			unsigned long _pt_pad_1;
> >> -			pgtable_t pmd_huge_pte;
> >> +			union {
> >> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
> >> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
> >> +			};
> >>  		};
> >>  	};
> >>  	unsigned long __page_mapping;
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
> >>  #define arch_needs_pgtable_deposit() (false)
> >>  #endif
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> >> +					   pmd_t *pmd_table);
> >> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
> >> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
> >> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
> >
> > These are useless extern's.
> >
>
>
> ack
>
> These are coming from the existing functions from the file:
> extern void pgtable_trans_huge_deposit
> extern pgtable_t pgtable_trans_huge_withdraw
>
> I think the externs can be removed from these as well? We can
> fix those in a separate patch.

Generally the approach is to remove externs when adding/changing new stuff as
otherwise we get completely useless churn on that and annoying git history
changes.

>
>
> >> +#endif
> >> +
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>  /*
> >>   * This is an implementation of pmdp_establish() that is only suitable for an
> >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >> index daa92a58585d9..08cd0a0eb8763 100644
> >> --- a/include/linux/rmap.h
> >> +++ b/include/linux/rmap.h
> >> @@ -101,6 +101,7 @@ enum ttu_flags {
> >>  					 * do a final flush if necessary */
> >>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
> >>  					 * caller holds it */
> >> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
> >>  };
> >>
> >>  #ifdef CONFIG_MMU
> >> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
> >>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
> >>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
> >>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
> >> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
> >> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
> >>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
> >>  		unsigned long address, rmap_t flags);
> >>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> >> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
> >>  	pgoff_t pgoff;
> >>  	struct vm_area_struct *vma;
> >>  	unsigned long address;
> >> +	pud_t *pud;
> >>  	pmd_t *pmd;
> >>  	pte_t *pte;
> >>  	spinlock_t *ptl;
> >> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> >>  static inline void
> >>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
> >>  {
> >> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
> >> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
> >>
> >>  	if (likely(pvmw->ptl))
> >>  		spin_unlock(pvmw->ptl);
> >> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
> >>  		WARN_ON_ONCE(1);
> >>
> >>  	pvmw->ptl = NULL;
> >> +	pvmw->pud = NULL;
> >>  	pvmw->pmd = NULL;
> >>  	pvmw->pte = NULL;
> >>  }
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 40cf59301c21a..3128b3beedb0a 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  	spin_unlock(ptl);
> >>  	mmu_notifier_invalidate_range_end(&range);
> >>  }
> >> +
> >> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +			   unsigned long address)
> >> +{
> >> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
> >> +	if (pud_trans_huge(*pud))
> >> +		__split_huge_pud_locked(vma, pud, address);
> >> +}
> >>  #else
> >>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  		unsigned long address)
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index 9ee336aa03656..21d5c00f638dc 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
> >>   * in mm/rmap.c:
> >>   */
> >>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
> >> +#endif
> >>
> >>  /*
> >>   * in mm/page_alloc.c
> >> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> >> index b38a1d00c971b..d31eafba38041 100644
> >> --- a/mm/page_vma_mapped.c
> >> +++ b/mm/page_vma_mapped.c
> >> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> >>  	return true;
> >>  }
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
> >> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> >> +{
> >> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
> >> +		return false;
> >> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
> >> +		return false;
> >> +	return true;
> >> +}
> >> +#endif
> >> +
> >>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
> >>  {
> >>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
> >> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >>  	pud_t *pud;
> >>  	pmd_t pmde;
> >>
> >> +	/* The only possible pud mapping has been handled on last iteration */
> >> +	if (pvmw->pud && !pvmw->pmd)
> >> +		return not_found(pvmw);
> >> +
> >>  	/* The only possible pmd mapping has been handled on last iteration */
> >>  	if (pvmw->pmd && !pvmw->pte)
> >>  		return not_found(pvmw);
> >> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >>  			continue;
> >>  		}
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >
> > Said it elsewhere, but it's really weird to treat an arch having the
> > ability to do something as a go ahead for doing it.
> >
> >> +		/* Check for PUD-mapped THP */
> >> +		if (pud_trans_huge(*pud)) {
> >> +			pvmw->pud = pud;
> >> +			pvmw->ptl = pud_lock(mm, pud);
> >> +			if (likely(pud_trans_huge(*pud))) {
> >> +				if (pvmw->flags & PVMW_MIGRATION)
> >> +					return not_found(pvmw);
> >> +				if (!check_pud(pud_pfn(*pud), pvmw))
> >> +					return not_found(pvmw);
> >> +				return true;
> >> +			}
> >> +			/* PUD was split under us, retry at PMD level */
> >> +			spin_unlock(pvmw->ptl);
> >> +			pvmw->ptl = NULL;
> >> +			pvmw->pud = NULL;
> >> +		}
> >> +#endif
> >> +
> >
> > Yeah, as I said elsewhere, we got to be refactoring not copy/pasting with
> > modifications :)
> >
>
> Yeah there is repeated code in multiple places, where all I did was replace
> what was done from PMD into PUD. In a lot of places, its actually difficult
> to not repeat the code (unless we want function macros, which is much worse
> IMO).

Not if we actually refactor the existing code :)

When I wanted to make functional changes to mremap I took a lot of time to
refactor the code into something sane before even starting that.

Because I _could_ have added the features there as-is, but it would have been
hellish to do so and would have added more confusion etc.

So yeah, I think a similar mentality has to be had with this change.

>
> >
> >>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
> >>  		/*
> >>  		 * Make sure the pmd value isn't cached in a register by the
> >> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> >> index d3aec7a9926ad..2047558ddcd79 100644
> >> --- a/mm/pgtable-generic.c
> >> +++ b/mm/pgtable-generic.c
> >> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
> >>  }
> >>  #endif
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/*
> >> + * Deposit page tables for PUD THP.
> >> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
> >> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
> >> + *
> >> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
> >> + * list_head. This is because lru.prev (offset 16) overlaps with
> >> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
> >> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
> >
> > This is horrible and feels like a hack? Treating a doubly-linked list as a
> > singly-linked one like this is not upstreamable.
> >
> >> + *
> >> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
> >> + */
> >> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> >> +				    pmd_t *pmd_table)
> >
> > This is horrid, you're depositing the PMD using the... questionable
> > list_head abuse, but then also have pud_deposit_pte()... But here we're
> > depositing a PMD shouldn't the name reflect that?
> >
> >> +{
> >> +	pgtable_t pmd_page = virt_to_page(pmd_table);
> >> +
> >> +	assert_spin_locked(pud_lockptr(mm, pudp));
> >> +
> >> +	/* Push onto stack using only lru.next as the link */
> >> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
> >
> > Yikes...
> >
> >> +	pud_huge_pmd(pudp) = pmd_page;
> >> +}
> >> +
> >> +/*
> >> + * Withdraw the deposited PMD table for PUD THP split or zap.
> >> + * Called with PUD lock held.
> >> + * Returns NULL if no more PMD tables are deposited.
> >> + */
> >> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
> >> +{
> >> +	pgtable_t pmd_page;
> >> +
> >> +	assert_spin_locked(pud_lockptr(mm, pudp));
> >> +
> >> +	pmd_page = pud_huge_pmd(pudp);
> >> +	if (!pmd_page)
> >> +		return NULL;
> >> +
> >> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
> >> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
> >
> > Where's the popping? You're just assigning here.
>
>
> Ack on all of the above. Hopefully [1] is better.

Thanks!
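
For anyone following along, the push/pop being discussed above is just an
intrusive singly-linked stack. A minimal userspace sketch follows; the names
struct node, stack_push and stack_pop are hypothetical and are not from the
patch, and the `aliased` member only models a field that shares storage with
what would be list_head.prev:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a page-table descriptor. `next` is the only
 * link the stack touches; `aliased` models a field that must survive
 * push/pop unchanged.
 */
struct node {
	struct node *next;
	void *aliased;
};

/* Push: the new node points at the old top, then becomes the top. */
static void stack_push(struct node **top, struct node *n)
{
	assert(n != NULL);
	n->next = *top;
	*top = n;
}

/* Pop: detach and return the top node, or NULL when the stack is empty. */
static struct node *stack_pop(struct node **top)
{
	struct node *n = *top;

	if (!n)
		return NULL;
	*top = n->next;
	n->next = NULL;
	return n;
}
```

This is not a proposal for how the kernel code should look. It only shows why
touching `next` alone leaves the aliased field intact, whereas list_del()
would also write a poison value into prev.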

> >
> >> +
> >> +	return page_address(pmd_page);
> >> +}
> >> +
> >> +/*
> >> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
> >> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
> >> + * No lock assertion since the PMD isn't visible yet.
> >> + */
> >> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
> >> +{
> >> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> >> +
> >> +	/* FIFO - add to front of list */
> >> +	if (!ptdesc->pmd_huge_pte)
> >> +		INIT_LIST_HEAD(&pgtable->lru);
> >> +	else
> >> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
> >> +	ptdesc->pmd_huge_pte = pgtable;
> >> +}
> >> +
> >> +/*
> >> + * Withdraw a PTE table from a standalone PMD table.
> >> + * Returns NULL if no more PTE tables are deposited.
> >> + */
> >> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
> >> +{
> >> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> >> +	pgtable_t pgtable;
> >> +
> >> +	pgtable = ptdesc->pmd_huge_pte;
> >> +	if (!pgtable)
> >> +		return NULL;
> >> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
> >> +							struct page, lru);
> >> +	if (ptdesc->pmd_huge_pte)
> >> +		list_del(&pgtable->lru);
> >> +	return pgtable;
> >> +}
> >> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> >> +
> >>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
> >>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> >>  		     pmd_t *pmdp)
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 7b9879ef442d9..69acabd763da4 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >>  	return pmd;
> >>  }
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/*
> >> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
> >> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
> >> + * represents.
> >> + */
> >> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
> >
> > This series seems to be full of copy/paste.
> >
> > It's just not acceptable given the state of THP code as I said in reply to
> > the cover letter - you need to _refactor_ the code.
> >
> > The code is bug-prone and difficult to maintain as-is, your series has to
> > improve the technical debt, not add to it.
> >
>
> In some cases we might not be able to avoid the copy, but this is definitely
> a place where we don't need to. I will change it here. Thanks!

I disagree, see above :) But thanks on this one

>
> >> +{
> >> +	pgd_t *pgd;
> >> +	p4d_t *p4d;
> >> +	pud_t *pud = NULL;
> >> +
> >> +	pgd = pgd_offset(mm, address);
> >> +	if (!pgd_present(*pgd))
> >> +		goto out;
> >> +
> >> +	p4d = p4d_offset(pgd, address);
> >> +	if (!p4d_present(*p4d))
> >> +		goto out;
> >> +
> >> +	pud = pud_offset(p4d, address);
> >> +out:
> >> +	return pud;
> >> +}
> >> +#endif
> >> +
> >>  struct folio_referenced_arg {
> >>  	int mapcount;
> >>  	int referenced;
> >> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> >>  			SetPageAnonExclusive(page);
> >>  			break;
> >>  		case PGTABLE_LEVEL_PUD:
> >> -			/*
> >> -			 * Keep the compiler happy, we don't support anonymous
> >> -			 * PUD mappings.
> >> -			 */
> >> -			WARN_ON_ONCE(1);
> >> +			SetPageAnonExclusive(page);
> >>  			break;
> >>  		default:
> >>  			BUILD_BUG();
> >> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
> >>  #endif
> >>  }
> >>
> >> +/**
> >> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
> >> + * @folio:	The folio to add the mapping to
> >> + * @page:	The first page to add
> >> + * @vma:	The vm area in which the mapping is added
> >> + * @address:	The user virtual address of the first page to map
> >> + * @flags:	The rmap flags
> >> + *
> >> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
> >> + *
> >> + * The caller needs to hold the page table lock, and the page must be locked in
> >> + * the anon_vma case: to serialize mapping,index checking after setting.
> >> + */
> >> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
> >> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
> >> +{
> >> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> >> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> >> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
> >> +			      PGTABLE_LEVEL_PUD);
> >> +#else
> >> +	WARN_ON_ONCE(true);
> >> +#endif
> >> +}
> >
> > More copy/paste... Maybe unavoidable in this case, but be good to try.
> >
> >> +
> >>  /**
> >>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
> >>   * @folio:	The folio to add the mapping to.
> >> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>  		}
> >>
> >>  		if (!pvmw.pte) {
> >> +			/*
> >> +			 * Check for PUD-mapped THP first.
> >> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
> >> +			 * split the PUD to PMD level and restart the walk.
> >> +			 */
> >
> > This is literally describing the code below, it's not useful.
>
> Ack, Will remove this comment, Thanks!

Thanks

> >
> >> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> >> +				if (flags & TTU_SPLIT_HUGE_PUD) {
> >> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
> >> +					flags &= ~TTU_SPLIT_HUGE_PUD;
> >> +					page_vma_mapped_walk_restart(&pvmw);
> >> +					continue;
> >> +				}
> >> +			}
> >> +
> >>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> >>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> >>  					goto walk_done;
> >> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> >>  	mmu_notifier_invalidate_range_start(&range);
> >>
> >>  	while (page_vma_mapped_walk(&pvmw)) {
> >> +		/* Handle PUD-mapped THP first */
> >
> > How did/will this interact with DAX, VFIO PUD THP?
>
> It won't interact with DAX. try_to_migrate() does the below and just returns:
>
> 	if (folio_is_zone_device(folio) &&
> 	    (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
> 		return;
>
> so DAX would never reach here.

Hmm folio_is_zone_device() always returns true for DAX?

Also that's just one rmap call right?

>
> I think VFIO pages are pinned and therefore can't be migrated? (I have
> not looked at the VFIO code; I will try to get a better understanding tomorrow,
> but please let me know if that sounds wrong.)

OK, I've not dug into this either, so please do check. It would also be really
good to test this code against actual DAX/VFIO scenarios if you can find a way
to do that, thanks!

>
>
> >
> >> +		if (!pvmw.pte && !pvmw.pmd) {
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >
> > Won't pud_trans_huge() imply this...
> >
>
> Agreed, I think it should cover it.
Thanks!

>
> >> +			/*
> >> +			 * PUD-mapped THP: skip migration to preserve the huge
> >> +			 * page. Splitting would defeat the purpose of PUD THPs.
> >> +			 * Return false to indicate migration failure, which
> >> +			 * will cause alloc_contig_range() to try a different
> >> +			 * memory region.
> >> +			 */
> >> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> >> +				page_vma_mapped_walk_done(&pvmw);
> >> +				ret = false;
> >> +				break;
> >> +			}
> >> +#endif
> >> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
> >> +			page_vma_mapped_walk_done(&pvmw);
> >> +			break;
> >> +		}
> >> +
> >>  		/* PMD-mapped THP migration entry */
> >>  		if (!pvmw.pte) {
> >>  			__maybe_unused unsigned long pfn;
> >> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
> >>
> >>  	/*
> >>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
> >> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> >> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> >>  	 */
> >>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
> >> -					TTU_SYNC | TTU_BATCH_FLUSH)))
> >> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
> >>  		return;
> >>
> >>  	if (folio_is_zone_device(folio) &&
> >> --
> >> 2.47.3
> >>
> >
> > This isn't a final review, I'll have to look more thoroughly through here
> > over time and you're going to have to be patient in general :)
> >
> > Cheers, Lorenzo
>
>
> Thanks for the review, this is awesome!

Ack, will do more when I have time, and obviously you're getting a lot of input
from others too.

Be good to get a summary at next THP cabal ;)

>
>
> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/
> [2] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
> [3] https://lore.kernel.org/all/2efaa5ed-bd09-41f0-9c07-5cd6cccc4595@gmail.com/
>
>
>

cheers, Lorenzo


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-03 22:07       ` Usama Arif
@ 2026-02-05  4:17         ` Matthew Wilcox
  2026-02-05  4:21           ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2026-02-05  4:17 UTC (permalink / raw)
  To: Usama Arif
  Cc: Zi Yan, Kiryl Shutsemau, lorenzo.stoakes, Andrew Morton,
	David Hildenbrand, linux-mm, hannes, riel, shakeel.butt, baohua,
	dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On Tue, Feb 03, 2026 at 02:07:25PM -0800, Usama Arif wrote:
> Ah I should have looked at your patches more! I started working by just using lru
> and was using list_add/list_del which was of course corrupting the list and took me
> way more time than I would like to admit to debug what was going on! The diagrams
> in your 2nd link are really useful. I ended up drawing those by hand to debug
> the corruption issue. I will point to that link in the next series :)
> 
> How about something like the below diff over this patch? (Not included the comment
> changes that I will make everywhere)

Why are you even talking about "the next series"?  The approach is
wrong.  You need to put this POC aside and solve the problems that
you've bypassed to create this POC.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-05  4:17         ` Matthew Wilcox
@ 2026-02-05  4:21           ` Matthew Wilcox
  2026-02-05  5:13             ` Usama Arif
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2026-02-05  4:21 UTC (permalink / raw)
  To: Usama Arif
  Cc: Zi Yan, Kiryl Shutsemau, lorenzo.stoakes, Andrew Morton,
	David Hildenbrand, linux-mm, hannes, riel, shakeel.butt, baohua,
	dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On Thu, Feb 05, 2026 at 04:17:19AM +0000, Matthew Wilcox wrote:
> Why are you even talking about "the next series"?  The approach is
> wrong.  You need to put this POC aside and solve the problems that
> you've bypassed to create this POC.

... and gmail is rejecting this email as being spam.  You need to stop
using gmail for kernel development work.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-05  4:21           ` Matthew Wilcox
@ 2026-02-05  5:13             ` Usama Arif
  2026-02-05 17:40               ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 52+ messages in thread
From: Usama Arif @ 2026-02-05  5:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, Kiryl Shutsemau, lorenzo.stoakes, Andrew Morton,
	David Hildenbrand, linux-mm, hannes, riel, shakeel.butt, baohua,
	dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 04/02/2026 20:21, Matthew Wilcox wrote:
> On Thu, Feb 05, 2026 at 04:17:19AM +0000, Matthew Wilcox wrote:
>> Why are you even talking about "the next series"?  The approach is
>> wrong.  You need to put this POC aside and solve the problems that
>> you've bypassed to create this POC.

Ah, is the issue the code duplication that Lorenzo has raised (of course I
completely agree that there is quite a bit), the lru.next patch I did,
which [1] hopefully improves, or investigating whether it might be
interfering with DAX/VFIO as Lorenzo pointed out (I will of course
investigate before sending the next revision)? The mapcount work
(I think David is working on this?) that is needed to allow splitting
PUDs to PMDs is a completely separate issue and can be tackled in parallel
to this.

> 
> ... and gmail is rejecting this email as being spam.  You need to stop
> using gmail for kernel development work.

I asked a couple of folks now and it seems they got it without any issue.
I have used it for a long time. I will try and see if something has changed.

[1] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-04  0:08     ` Frank van der Linden
@ 2026-02-05  5:46       ` Usama Arif
  0 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-05  5:46 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: Zi Yan, Andrew Morton, David Hildenbrand, lorenzo.stoakes,
	linux-mm, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 03/02/2026 16:08, Frank van der Linden wrote:
> On Tue, Feb 3, 2026 at 3:29 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 02/02/2026 08:24, Zi Yan wrote:
>>> On 1 Feb 2026, at 19:50, Usama Arif wrote:
>>>
>>>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>>>> applications to benefit from reduced TLB pressure without requiring
>>>> hugetlbfs. The patches are based on top of
>>>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
>>>
>>> It is nice to see you are working on 1GB THP.
>>>
>>>>
>>>> Motivation: Why 1GB THP over hugetlbfs?
>>>> =======================================
>>>>
>>>> While hugetlbfs provides 1GB huge pages today, it has significant limitations
>>>> that make it unsuitable for many workloads:
>>>>
>>>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>>>>    or runtime, taking memory away. This requires capacity planning,
>>>>    administrative overhead, and makes workload orchestration much, much more
>>>>    complex, especially colocating with workloads that don't use hugetlbfs.
>>>
>>> But you are using CMA, the same allocation mechanism as hugetlb_cma. What
>>> is the difference?
>>>
>>
>> So we don't really need to use CMA. CMA can help a lot of course, but we don't *need* it.
>> For example, I can run the very simple case [1] of trying to get 1G pages in the upstream
>> kernel without CMA on my server and it works. The server has been up for more than a week
>> (so pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
>> and I tried to get 20x1G pages on it and it worked.
>> It uses folio_alloc_gigantic, which is exactly what this series uses:
>>
>> $ uptime -p
>> up 1 week, 3 days, 5 hours, 7 minutes
>> $ cat /proc/meminfo | grep -i cma
>> CmaTotal:              0 kB
>> CmaFree:               0 kB
>> $ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> 20
>> $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> 20
>> $ free -h
>>                total        used        free      shared  buff/cache   available
>> Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
>> Swap:          129Gi       3.5Gi       126Gi
>> $ ./map_1g_hugepages
>> Mapping 20 x 1GB huge pages (20 GB total)
>> Mapped at 0x7f43c0000000
>> Touched page 0 at 0x7f43c0000000
>> Touched page 1 at 0x7f4400000000
>> Touched page 2 at 0x7f4440000000
>> Touched page 3 at 0x7f4480000000
>> Touched page 4 at 0x7f44c0000000
>> Touched page 5 at 0x7f4500000000
>> Touched page 6 at 0x7f4540000000
>> Touched page 7 at 0x7f4580000000
>> Touched page 8 at 0x7f45c0000000
>> Touched page 9 at 0x7f4600000000
>> Touched page 10 at 0x7f4640000000
>> Touched page 11 at 0x7f4680000000
>> Touched page 12 at 0x7f46c0000000
>> Touched page 13 at 0x7f4700000000
>> Touched page 14 at 0x7f4740000000
>> Touched page 15 at 0x7f4780000000
>> Touched page 16 at 0x7f47c0000000
>> Touched page 17 at 0x7f4800000000
>> Touched page 18 at 0x7f4840000000
>> Touched page 19 at 0x7f4880000000
>> Unmapped successfully
>>
>>
>>
>>
>>>>
>>>> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>>>>    rather than falling back to smaller pages. This makes it fragile under
>>>>    memory pressure.
>>>
>>> True.
>>>
>>>>
>>>> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>>>>    is needed, leading to memory waste and preventing partial reclaim.
>>>
>>> Since you have PUD THP implementation, have you run any workload on it?
>>> How often you see a PUD THP split?
>>>
>>
>> Ah so running non-upstream kernels in production is a bit more difficult
>> (and also risky). I was trying to use the 512M experiment on arm as a comparison,
>> although I know it's not the same thing with PAGE_SIZE and pageblock order.
>>
>> I can try some other upstream benchmarks if it helps? Although will need to find
>> ones that create VMA > 1G.
>>
>>> Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
>>> any split stats to show the necessity of THP split?
>>>
>>>>
>>>> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>>>>    be easily shared with regular memory pools.
>>>
>>> True.
>>>
>>>>
>>>> PUD THP solves these limitations by integrating 1GB pages into the existing
>>>> THP infrastructure.
>>>
>>> The main advantage of PUD THP over hugetlb is that it can be split and mapped
>>> at sub-folio level. Do you have any data to support the necessity of them?
>>> I wonder if it would be easier to just support 1GB folio in core-mm first
>>> and we can add 1GB THP split and sub-folio mapping later. With that, we
>>> can move hugetlb users to 1GB folio.
>>>
>>
>> I would say it's not the main advantage? But it's definitely one of them.
>> The 2 main areas where split would be helpful are munmap of a partial
>> range and reclaim (MADV_PAGEOUT). For example, jemalloc/tcmalloc can now start
>> taking advantage of 1G pages. My knowledge is not that great when it comes
>> to memory allocators, but I believe they track how long certain areas
>> have been cold and can trigger reclaim as an example. Then split will be useful.
>> Having memory allocators use hugetlb is probably going to be a no?
>>
>>
>>> BTW, without split support, you can apply HVO to 1GB folio to save memory.
>>> That is a disadvantage of PUD THP. Have you taken that into consideration?
>>> Basically, switching from hugetlb to PUD THP, you will lose memory due
>>> to vmemmap usage.
>>>
>>
>> Yeah so HVO saves 16M per 1G, and the page deposit mechanism adds ~2M per 1G.
>> We have HVO enabled in the meta fleet. I think we should not only think of PUD THP
>> as a replacement for hugetlb, but also as a way to enable further use cases where hugetlb
>> would not be feasible.
>>
>> After the basic infrastructure for 1G is there, we can work on optimizing, I think
>> there would be a lot of interesting work we can do. HVO for 1G THP would be one
>> of them?
>>
>>>>
>>>> Performance Results
>>>> ===================
>>>>
>>>> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>>>>
>>>> Test: True Random Memory Access [1] test of 4GB memory region with pointer
>>>> chasing workload (4M random pointer dereferences through memory):
>>>>
>>>> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
>>>> |-------------------|---------------|---------------|--------------|
>>>> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
>>>> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>>>>
>>>> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
>>>> For long-running workloads this will be a one-off cost, and the 34%
>>>> improvement in access latency provides significant benefit.
>>>>
>>>> ARM with 64K PAGE_SIZE supports 512M PMD THPs. In meta, we have a CPU
>>>> bound workload running on a large number of ARM servers (256G). I enabled
>>>> the 512M THP settings to always for 100 servers in production (didn't
>>>> really have high expectations :)). The average memory used for the workload
>>>> increased from 217G to 233G. The amount of memory backed by 512M pages was
>>>> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
>>>> by 5.9% (This is a very significant improvement in workload performance).
>>>> A significant number of these THPs were faulted in at application start when
>>>> were present across different VMAs. Of course getting these 512M pages is
>>>> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>>>>
>>>> I am hoping that these patches for 1G THP can be used to provide similar
>>>> benefits for x86. I expect workloads to fault them in at start time when there
>>>> is plenty of free memory available.
>>>>
>>>>
>>>> Previous attempt by Zi Yan
>>>> ==========================
>>>>
>>>> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
>>>> significant changes in kernel since then, including folio conversion, mTHP
>>>> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
>>>> code as reference for making 1G PUD THP work. I am hoping Zi can provide
>>>> guidance on these patches!
>>>
>>> I am more than happy to help you. :)
>>>
>>
>> Thanks!!!
>>
>>>>
>>>> Major Design Decisions
>>>> ======================
>>>>
>>>> 1. No shared 1G zero page: The memory cost would be quite significant!
>>>>
>>>> 2. Page Table Pre-deposit Strategy
>>>>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>>>>    page tables (one for each potential PMD entry after split).
>>>>    We allocate a PMD page table and use its pmd_huge_pte list to store
>>>>    the deposited PTE tables. This ensures split operations don't fail due
>>>>    to page table allocation failures (at the cost of 2M per PUD THP)
>>>>
>>>> 3. Split to Base Pages
>>>>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>>>>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>>>>    to 2M pages and then to 4K pages if needed. However, this would require
>>>>    significant rmap and mapcount tracking changes.
>>>>
>>>> 4. COW and fork handling via split
>>>>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>>>>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>>>>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>>>>    Probably this should only be done on CoW and not fork?
>>>>
>>>> 5. Migration via split
>>>>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>>>>    to find 1G of contiguous memory to migrate to. Maybe it's better to not
>>>>    allow migration of PUDs at all? I am more tempted to not allow migration,
>>>>    but have kept splitting in this RFC.
>>>
>>> Without migration, PUD THP loses its flexibility and transparency. But with
>>> its 1GB size, I also wonder what the purpose of PUD THP migration can be.
>>> It does not create memory fragmentation, since it is the largest folio size
>>> we have and contiguous. NUMA balancing 1GB THP seems too much work.
>>
>> Yeah this is exactly what I was thinking as well. It is going to be expensive
>> and difficult to migrate 1G pages, and I am not sure if what we get out of it
>> is worth it? I kept the splitting code in this RFC as I wanted to show that
>> it's possible to split and migrate, and the code for rejecting migration is a lot easier.
>>
>>>
>>> BTW, I posted many questions, but that does not mean I object the patchset.
>>> I just want to understand your use case better, reduce unnecessary
>>> code changes, and hopefully get it upstreamed this time. :)
>>>
>>> Thank you for the work.
>>>
>>
>> Ah no this is awesome! Thanks for the questions! It's basically the discussion I
>> wanted to start with the RFC.
>>
>>
>> [1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991
>>
>>
> 
> It looks like the scenario you're going for is an application that
> allocates a sizeable chunk of memory upfront, and would like it to be
> 1G pages as much as possible, right?
> 

Hello!

Yes. But also it doesn't need to be a single chunk (VMA).

> You can do that with 1G THPs, the advantage being that any failures to
> get 1G pages are not explicit, so you're not left with having to grow
> the number of hugetlb pages yourself, and see how many you can use.
> 
> 1G THPs seem useful for that. I don't recall all of the discussion
> here, but I assume that hooking 1G THP support in to khugepaged is
> quite something else - the potential churn to get an 1G page could
> well cause more system interference than you'd like.
> 

Yes completely agree.

> The CMA scenario Rik was talking about is similar: you set
> hugetlb_cma=NG, and then, when you need 1G pages, you grow the hugetlb
> pool and use them. Disadvantage: you have to do it explicitly.
> 
> However, hugetlb_cma does give you a much larger chance of getting
> those 1G pages. The example you give, 20 1G pages on a 1T system where
> there is 292G free, isn't much of a problem in my experience. You
> should have no problem getting that amount of 1G pages. Things get
> more difficult when most of your memory is taken - hugetlb_cma really
> helps there. E.g. we have systems that have 90% hugetlb_cma, and there
> is a pretty good success rate converting back and forth between
> hugetlb and normal page allocator pages with hugetlb_cma, while
> operating close to that 90% hugetlb coverage. Without CMA, the success
> rate drops quite a bit at that level.

Yes agreed.
> 
> CMA balancing is a related issue, for hugetlb. It fixes a problem that
> has been known for years: the more memory you set aside for movable
> only allocations (e.g. hugetlb_cma), the less breathing room you have
> for unmovable allocations. So you risk the 'false OOM' scenario, where
> the kernel can't make an unmovable allocation, even though there is
> enough memory available, even outside of CMA. It's just that those
> MOVABLE pageblocks were used for movable allocations. So ideally, you
> would migrate those movable allocations to CMA under those
> circumstances. Which is what CMA balancing does. It's worked out very
> well for us in the scenario I list above (most memory being
> hugetlb_cma).
> 
> Anyway, I'm rambling on a bit. Let's see if I got this right:
> 
> 1G THP
>   - advantages: transparent interface
>   - disadvantage: no HVO, lower success rate under higher memory
> pressure than hugetlb_cma
> 

Yes! But also, the problem of having no HVO for THPs I think can be worked
on once the support for it is there. The lower success rate is a much more
difficult problem to solve.
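
To put the "HVO saves 16M per 1G" and "~2M deposit per 1G" figures from
earlier in the thread in context, here is a rough back-of-the-envelope
sketch. It assumes 4K base pages, a 64-byte struct page and 512 entries per
page table (typical for x86-64, but assumptions nonetheless), and the helper
names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE   4096ULL      /* assumed 4K base pages */
#define STRUCT_PAGE_SIZE 64ULL        /* assumed sizeof(struct page) */
#define PTRS_PER_TABLE   512ULL       /* assumed entries per page table */
#define PUD_THP_SIZE     (1ULL << 30) /* one 1G PUD THP */

/* vmemmap needed to describe every base page of one 1G folio. */
static uint64_t vmemmap_bytes_per_pud(void)
{
	return (PUD_THP_SIZE / BASE_PAGE_SIZE) * STRUCT_PAGE_SIZE;
}

/* Pre-deposited tables: 512 PTE tables plus the one PMD table holding them. */
static uint64_t deposit_bytes_per_pud(void)
{
	return (PTRS_PER_TABLE + 1) * BASE_PAGE_SIZE;
}

/* Sanity check of the two figures under the assumptions above. */
static void check_pud_costs(void)
{
	assert(vmemmap_bytes_per_pud() == 16ULL << 20);           /* 16 MiB */
	assert(deposit_bytes_per_pud() == (2ULL << 20) + 4096ULL); /* ~2 MiB */
}
```

Under those assumptions the vmemmap for 1G is 16 MiB, which is the upper
bound of what HVO could give back, and the deposit is 2 MiB plus one page.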

> hugetlb_cma
>    - disadvantage: explicit interface, for higher values needs 'false
> OOM' avoidance
>    - advantage: better success rate under pressure.
> 
> I think 1G THPs are a good solution for "nice to have" scenarios, but
> there will still be use cases where a higher success rate is preferred
> and HugeTLB is preferred.
> 

Agreed. I don't think 1G THPs can completely replace hugetlb. Maybe after
several years of work to optimize it there might be a path to it,
but not at the very start.


> Lastly, there's also the ZONE_MOVABLE story. I think 1G THPs and
> ZONE_MOVABLE could work well together, improving the success rate. But
> then the issue of pinning raises its head again, and whether that
> should be allowed or configurable per zone..
> 

Ack

> - Frank



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC 02/12] mm/thp: add mTHP stats infrastructure for PUD THP
  2026-02-02 11:56   ` Lorenzo Stoakes
@ 2026-02-05  5:53     ` Usama Arif
  0 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-05  5:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team



On 02/02/2026 03:56, Lorenzo Stoakes wrote:
> On Sun, Feb 01, 2026 at 04:50:19PM -0800, Usama Arif wrote:
>> Extend the mTHP (multi-size THP) statistics infrastructure to support
>> PUD-sized transparent huge pages.
>>
>> The mTHP framework tracks statistics for each supported THP size through
>> per-order counters exposed via sysfs. To add PUD THP support, PUD_ORDER
>> must be included in the set of tracked orders.
>>
>> With this change, PUD THP events (allocations, faults, splits, swaps)
>> are tracked and exposed through the existing sysfs interface at
>> /sys/kernel/mm/transparent_hugepage/hugepages-1048576kB/stats/. This
>> provides visibility into PUD THP behavior for debugging and performance
>> analysis.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> 
> Yeah we really need to be basing this on mm-unstable once Nico's series is
> landed.
> 
> I think it's quite important as well for you to check that khugepaged mTHP works
> with all of this.
> 
>> ---
>>  include/linux/huge_mm.h | 42 +++++++++++++++++++++++++++++++++++++----
>>  mm/huge_memory.c        |  3 ++-
>>  2 files changed, 40 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index e672e45bb9cc7..5509ba8555b6e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -76,7 +76,13 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>>   * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>>   * (which is a limitation of the THP implementation).
>>   */
>> -#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +#define THP_ORDERS_ALL_ANON_PUD		BIT(PUD_ORDER)
>> +#else
>> +#define THP_ORDERS_ALL_ANON_PUD		0
>> +#endif
>> +#define THP_ORDERS_ALL_ANON	(((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1))) | \
>> +				 THP_ORDERS_ALL_ANON_PUD)
> 
> Err what is this change doing in a 'stats' change? This quietly updates
> __thp_vma_allowable_orders() to also support PUD order for anon memory... Can we
> put this in the right place?
> 

Yeah, I didn't put it in the right place. Thanks!
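
For reference, the effect of the mask change Lorenzo is pointing at can be
seen with a small userspace sketch. PMD_ORDER = 9 and PUD_ORDER = 18 are
assumed here (x86-64 with 4K pages), and ORDERS_UP_TO_PMD, ORDERS_WITH_PUD
and order_allowed() are made-up names used only for illustration:

```c
#include <assert.h>

#define BIT(n)     (1UL << (n))
#define PMD_ORDER  9    /* assumed: x86-64 with 4K pages */
#define PUD_ORDER  18   /* assumed: x86-64 with 4K pages */

/* Orders 2..PMD_ORDER: order-0 and order-1 are masked out. */
#define ORDERS_UP_TO_PMD  ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
/* Same mask with the single PUD_ORDER bit added on top. */
#define ORDERS_WITH_PUD   (ORDERS_UP_TO_PMD | BIT(PUD_ORDER))

/* Returns non-zero when `order` is set in `mask`. */
static int order_allowed(unsigned long mask, int order)
{
	assert(order >= 0 && order < (int)(8 * sizeof(mask)));
	return (mask & BIT(order)) != 0;
}
```

With these values the mask goes from 0x3fc to 0x403fc, so any code that
consults the anon mask starts accepting order 18 as soon as this hunk is
applied, which is why it does not belong in a stats-only patch.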

>>
>>  /*
>>   * Mask of all large folio orders supported for file THP. Folios in a DAX
>> @@ -146,18 +152,46 @@ enum mthp_stat_item {
>>  };
>>
>>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS)
>> +
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> 
> By the way I'm not a fan of us treating an 'arch has' as a 'will use'.
> 
>> +#define MTHP_STAT_COUNT		(PMD_ORDER + 2)
> 
> Yeah I hate this. This is just 'one more thing to remember'.
> 
>> +#define MTHP_STAT_PUD_INDEX	(PMD_ORDER + 1)  /* PUD uses last index */
>> +#else
>> +#define MTHP_STAT_COUNT		(PMD_ORDER + 1)
>> +#endif
>> +
>>  struct mthp_stat {
>> -	unsigned long stats[ilog2(MAX_PTRS_PER_PTE) + 1][__MTHP_STAT_COUNT];
>> +	unsigned long stats[MTHP_STAT_COUNT][__MTHP_STAT_COUNT];
>>  };
>>
>>  DECLARE_PER_CPU(struct mthp_stat, mthp_stats);
>>
>> +static inline int mthp_stat_order_to_index(int order)
>> +{
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +	if (order == PUD_ORDER)
>> +		return MTHP_STAT_PUD_INDEX;
> 
> This seems like a hack again.
> 
>> +#endif
>> +	return order;
>> +}
>> +
>>  static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta)
>>  {
>> -	if (order <= 0 || order > PMD_ORDER)
>> +	int index;
>> +
>> +	if (order <= 0)
>> +		return;
>> +
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +	if (order != PUD_ORDER && order > PMD_ORDER)
>>  		return;
>> +#else
>> +	if (order > PMD_ORDER)
>> +		return;
>> +#endif
> 
> Or we could actually define a max order... except now the hack contorts this
> code.
> 
> Is it really that bad to just take up memory for the order between PMD_ORDER and
> PUD_ORDER? ~72 bytes * cores and we avoid having to do this silly dance.


Up until a few hours before I sent the series, what you are describing is exactly
what I was doing, i.e. allocating stats for every order up to PUD order. It's not a
lot of wasted memory, but it is there, and I saw this patch as an easy way to avoid
it. For a server with 512 cores, this is 36KB. That's not much, because a server
with 512 cores will also have several hundred GBs or TBs of memory.

I know it's not elegant, but I do like the approach in this patch more. If there is a
very strong preference for covering all orders up to PUD order because it would make
the code more elegant, then I can switch to it.

> 
>>
>> -	this_cpu_add(mthp_stats.stats[order][item], delta);
>> +	index = mthp_stat_order_to_index(order);
>> +	this_cpu_add(mthp_stats.stats[index][item], delta);
>>  }
>>
>>  static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 3128b3beedb0a..d033624d7e1f2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -598,11 +598,12 @@ static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
>>  {
>>  	unsigned long sum = 0;
>>  	int cpu;
>> +	int index = mthp_stat_order_to_index(order);
>>
>>  	for_each_possible_cpu(cpu) {
>>  		struct mthp_stat *this = &per_cpu(mthp_stats, cpu);
>>
>> -		sum += this->stats[order][item];
>> +		sum += this->stats[index][item];
>>  	}
>>
>>  	return sum;
>> --
>> 2.47.3
>>




* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-04 11:08     ` Lorenzo Stoakes
  2026-02-04 11:50       ` Dev Jain
@ 2026-02-05  6:08       ` Usama Arif
  1 sibling, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-05  6:08 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team



On 04/02/2026 03:08, Lorenzo Stoakes wrote:
> On Tue, Feb 03, 2026 at 05:00:10PM -0800, Usama Arif wrote:
>>
>>
>> On 02/02/2026 03:20, Lorenzo Stoakes wrote:
>>> OK so this is somewhat unexpected :)
>>>
>>> It would have been nice to discuss it in the THP cabal or at a conference
>>> etc. so we could discuss approaches ahead of time. Communication is important,
>>> especially with major changes like this.
>>
>> Makes sense!
>>
>>>
>>> And PUD THP is especially problematic in that it requires pages that the page
>>> allocator can't give us, presumably you're doing something with CMA and... it's
>>> a whole kettle of fish.
>>
>> So we don't need CMA. It helps of course, but we don't *need* it.
>> It's summarized in the first reply I gave to Zi in [1]:
>>
>>>
>>> It's also complicated by the fact we _already_ support it in the DAX, VFIO cases
>>> but it's kinda a weird sorta special case that we need to keep supporting.
>>>
>>> There's questions about how this will interact with khugepaged, MADV_COLLAPSE,
>>> mTHP (and really I want to see Nico's series land before we really consider
>>> this).
>>
>>
>> So I have numbers and experiments for page faults which are in the cover letter,
>> but not for khugepaged. I would be very surprised (although pleasantly :)) if
>> khugepaged by some magic finds 262144 pages that meet all the khugepaged requirements
>> to collapse the page. In the basic infrastructure support which this series is adding,
>> I want to keep khugepaged collapse disabled for 1G pages. This is also the initial
>> approach that was taken for other mTHP sizes. We should go slow with 1G THPs.
> 
> Yes we definitely want to limit to page faults for now.
> 
> But keep in mind for that to be viable you'd surely need to update who gets
> appropriate alignment in __get_unmapped_area()... not read through series far
> enough to see so not sure if you update that though!
> 
> I guess that'd be the sanest place to start, if an allocation _size_ is aligned
> 1 GB, then align the unmapped area _address_ to 1 GB for maximum chance of 1 GB
> fault-in.


Yeah, this was definitely missing. I was manually aligning the fault address in the
selftests and benchmarks with the trick used in other selftests:
(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1))

Thanks for pointing this out! This is basically what I wanted with the RFC, to find out
what I am missing and not testing. Will look into VFIO and DAX as you mentioned as well.

> 
> Oh by the way I made some rough THP notes at
> https://publish.obsidian.md/mm/Transparent+Huge+Pages+(THP) which are helpful
> for reminding me about what does what where, useful for a top-down view of how
> things are now.
> 

Thanks!

>>
>>>
>>> So overall, I want to be very cautious and SLOW here. So let's please not drop
>>> the RFC tag until David and I are ok with that?
>>>
>>> Also the THP code base is in _dire_ need of rework, and I don't really want to
>>> add major new features without us paying down some technical debt, to be honest.
>>>
>>> So let's proceed with caution, and treat this as a very early bit of
>>> experimental code.
>>>
>>> Thanks, Lorenzo
>>
>> Ack, yeah so this is mainly an RFC to discuss what the major design choices will be.
>> I got a kernel with selftests for allocation, memory integrity, fork, partial munmap,
>> mprotect, reclaim and migration passing and am running them with DEBUG_VM to make sure
>> we don't get the VM bugs/warnings and the numbers are good, so just wanted to share it
>> upstream and get your opinions! Basically try and trigger a discussion similar to what
>> Zi asked in [2]! And also if someone could point out if there is something fundamental
>> we are missing in this series.
> 
> Well that's fair enough :)
> 
> But do come to a THP cabal so we can chat, face-to-face (ok, digital face to
> digital face ;). It's usually a force-multiplier I find, esp. if multiple people
> have input which I think is the case here. We're friendly :)


Yes, Thanks for this! It would be really helpful to discuss in a call. I didn't
know there was a meeting but have requested details (date/time) in another thread.

> 
> In any case, conversations are already kicking off so that's definitely positive!
> 
> I think we will definitely get there with this at _some point_ but I would urge
> patience and also I really want to underline my desire for us in THP to start
> paying down some of this technical debt.
> 
> I know people are already making efforts (Vernon, Luiz), and sorry that I've not
> been great at review recently (should be gradually increasing over time), but I
> feel that for large features to be added like this now we really do require some
> refactoring work before we take it.
> 

Yes, agreed! I will definitely need guidance from you and others on what needs to be
properly refactored so that this fits well with the current code.

> We definitely need to rebase this once Nico's series lands (should do next
> cycle) and think about how it plays with this, I'm not sure if arm64 supports
> mTHP between PMD and PUD size (Dev? Do you know?) so maybe that one is moot, but
> in general want to make sure it plays nice.
> 

Will do!


>>
>> Thanks for the reviews! Really do appreciate it!
> 
> No worries! :)
> 
>>
>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/#t
>> [2] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/
> 
> Cheers, Lorenzo




* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-04 12:55       ` Lorenzo Stoakes
@ 2026-02-05  6:40         ` Usama Arif
  0 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-05  6:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: ziy, Andrew Morton, David Hildenbrand, linux-mm, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team



On 04/02/2026 04:55, Lorenzo Stoakes wrote:
> On Tue, Feb 03, 2026 at 11:38:02PM -0800, Usama Arif wrote:
>>
>>
>> On 02/02/2026 04:15, Lorenzo Stoakes wrote:
>>> I think I'm going to have to do several passes on this, so this is just a
>>> first one :)
>>>
>>
>> Thanks! Really appreciate the reviews!
> 
> No worries!
> 
>>
>> One thing over here is the higher level design decision when it comes to migration
>> of 1G pages. As Zi said in [1]:
>> "I also wonder what the purpose of PUD THP migration can be.
>> It does not create memory fragmentation, since it is the largest folio size
>> we have and contiguous. NUMA balancing 1GB THP seems too much work."
>>
>>> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>>>> For page table management, PUD THPs need to pre-deposit page tables
>>>> that will be used when the huge page is later split. When a PUD THP
>>>> is allocated, we cannot know in advance when or why it might need to
>>>> be split (COW, partial unmap, reclaim), but we need page tables ready
>>>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>>>> table, PUD THPs deposit a PMD table which itself contains deposited
>>>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>>>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>>>> deposited PMD.
>>>
>>> This feels like you're hacking this support in, honestly. The list_head
>>> abuse only adds to that feeling.
>>>
>>
>> Yeah so I hope turning it to something like [2] is the way forward.
> 
> Right, that's one option, though David suggested avoiding this altogether by
> only pre-allocating PTEs?

Maybe I don't understand it properly, but that won't work, right?

You need 1 PMD table and 512 PTE tables to split a PUD. We can't just have PTE tables, right?


> 
>>
>>> And are we now not required to store rather a lot of memory to keep all of
>>> this coherent?
>>
>> PMD THP allocates 1 4K page (pte_alloc_one) at fault time so that split
>> doesn't fail.
>>
>> For PUD we allocate 2M worth of PTE page tables and 1 4K PMD table at fault
>> time so that split doesn't fail due to there not being enough memory.
>> It's not great, but it's not bad either.
>> The alternative is to allocate this at split time, so that we are not
>> pre-reserving them. Then there is a chance that the allocation, and therefore
>> the split, fails, so the tradeoff is some memory vs reliability. This patch
>> favours reliability.
> 
> That's a significant amount of unmovable, unreclaimable memory though. Going
> from 4K to 2M is a pretty huge uptick.
> 

Yeah, I don't like it either, but it's the only way to make sure the split doesn't fail.
I think there will always be a tradeoff between reliability and memory.

>>
>> Lets say a user gets 100x1G THPs. They would end up using ~200M for it.
>> I think that is okish. If the user has 100G, 200M might not be an issue
>> for them :)
> 
> But there's more than one user on boxes big enough for this, so this makes me
> think we want this to be somehow opt-in right?
>

Do you mean madvise?

Also an idea (although probably a very bad one :)) is to have MADV_HUGEPAGE_1G.

> And that means we're incurring an unmovable memory penalty, the kind which we're
> trying to avoid in general elsewhere in the kernel.
> 

ack.

>>
>>>
>>>>
>>>> The deposited PMD tables are stored as a singly-linked stack using only
>>>> page->lru.next as the link pointer. A doubly-linked list using the
>>>> standard list_head mechanism would cause memory corruption: list_del()
>>>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>>>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>>>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>>>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>>>> when withdrawing PTE tables during split. PMD THPs don't have this
>>>> problem because their deposited PTE tables don't have sub-deposits.
>>>> Using only lru.next avoids the overlap entirely.
>>>
>>> Yeah this is horrendous and a hack, I don't consider this at all
>>> upstreamable.
>>>
>>> You need to completely rework this.
>>
>> Hopefully [2] is the path forward!
> 
> Ack
> 
>>>
>>>>
>>>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>>>> have. The page_vma_mapped_walk() function is extended to recognize and
>>>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>>>> flag tells the unmap path to split PUD THPs before proceeding, since
>>>> there is no PUD-level migration entry format - the split converts the
>>>> single PUD mapping into individual PTE mappings that can be migrated
>>>> or swapped normally.
>>>
>>> Individual PTE... mappings? You need to be a lot clearer here, page tables
>>> are naturally confusing with entries vs. tables.
>>>
>>> Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
>>> tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
>>>
>>
>> Yes that is correct, Thanks! I will change the commit message in the next revision
>> to what you have written: 1 PMD table and 512 PTE tables reserved, spanning
>> 1 PUD entry and 262,144 PTE entries.
> 
> Yeah :) my concerns remain :)
> 
>>
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>
>>> How does this change interact with existing DAX/VFIO code, which now it
>>> seems will be subject to the mechanisms you introduce here?
>>
>> I think what you mean here is the change in try_to_migrate_one?
>>
>>
>> So one
> 
> Unfinished sentence? :P
> 
> No I mean currently we support 1G THP for DAX/VFIO right? So how does this
> interplay with how that currently works? Does that change how DAX/VFIO works?
> Will that impact existing users?
> 
> Or are we extending the existing mechanism?
> 

A lot of the code, like copy_huge_pud, zap_huge_pud, __split_huge_pud_locked,
create_huge_pud, wp_huge_pud..., is guarded by a vma_is_anonymous() check. I will
try to do a better audit of DAX and VFIO.

>>
>>>
>>> Right now DAX/VFIO is only obtainable via a specially THP-aligned
>>> get_unmapped_area() + then can only be obtained at fault time.
>>> Is that the intent here also?
>>>
>>
>> Ah thanks for pointing this out. This is something the series is missing.
>>
>> What I did in the selftest and benchmark was fault on an address that was already aligned.
>> i.e. basically call the below function before faulting in.
>>
>> static inline void *pud_align(void *addr)
>> {
>> 	return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
>> }
> 
> Right yeah :)
> 
>>
>>
>> What I think you are suggesting this series is missing is the below diff? (It's untested.)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 87b2c21df4a49..461158a0840db 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1236,6 +1236,12 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>         unsigned long ret;
>>         loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>>
>> +       if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && len >= PUD_SIZE) {
>> +               ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PUD_SIZE, vm_flags);
>> +               if (ret)
>> +                       return ret;
>> +       }
> 
> No not that, that's going to cause issues, see commit d4148aeab4 for details as
> to why this can go wrong.
> 
> In __get_unmapped_area() where the current 'if PMD size aligned then align area'
> logic, like that.


Ack, thanks for pointing to this. I will also add another selftest to check that we actually
get this alignment from __get_unmapped_area() when we don't do the pud_align trick I currently
have in the selftests.

> 
>> +
>>
>>
>>> What is your intent - that khugepaged do this, or on alloc? How does it
>>> interact with MADV_COLLAPSE?
>>>
>>
>> Ah basically what I mentioned in [3], we want to go slow. Only enable PUD THP
>> page faults at the start. If there is data showing that khugepaged will work,
>> then we do it, but for now we keep it disabled.
> 
> Yes I think khugepaged is probably never going to be all that a good idea with
> this.
> 
>>
>>> I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
>>> alters __thp_vma_allowable_orders() behaviour, that change belongs here...
>>>
>>>
>>
>> Thanks for this! I only tried to split this code into logical commits
>> after the whole thing was working. Some things are tightly coupled
>> and I would need to move them to the right commit.
> 
> Yes there's a bunch of things that need tweaking here, to reiterate let's try to
> pay down technical debt here and avoid copy/pasting :>)

Yes, will spend a lot more time thinking about how to avoid copy/paste.


> 
>>
>>>> ---
>>>>  include/linux/huge_mm.h  |  5 +++
>>>>  include/linux/mm.h       | 19 ++++++++
>>>>  include/linux/mm_types.h |  5 ++-
>>>>  include/linux/pgtable.h  |  8 ++++
>>>>  include/linux/rmap.h     |  7 ++-
>>>>  mm/huge_memory.c         |  8 ++++
>>>>  mm/internal.h            |  3 ++
>>>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index a4d9f964dfdea..e672e45bb9cc7 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>>  		unsigned long address);
>>>>
>>>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> +			   unsigned long address);
>>>>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
>>>>  		    unsigned long cp_flags);
>>>>  #else
>>>> +static inline void
>>>> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> +		      unsigned long address) {}
>>>>  static inline int
>>>>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index ab2e7e30aef96..a15e18df0f771 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>>>>   * considered ready to switch to split PUD locks yet; there may be places
>>>>   * which need to be converted from page_table_lock.
>>>>   */
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +static inline struct page *pud_pgtable_page(pud_t *pud)
>>>> +{
>>>> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
>>>> +
>>>> +	return virt_to_page((void *)((unsigned long)pud & mask));
>>>> +}
>>>> +
>>>> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
>>>> +{
>>>> +	return page_ptdesc(pud_pgtable_page(pud));
>>>> +}
>>>> +
>>>> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
>>>> +#endif
>>>> +
>>>>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>>>>  {
>>>>  	return &mm->page_table_lock;
>>>> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>>>>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>>>>  {
>>>>  	__pagetable_ctor(ptdesc);
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +	ptdesc->pud_huge_pmd = NULL;
>>>> +#endif
>>>>  }
>>>>
>>>>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 78950eb8926dc..26a38490ae2e1 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -577,7 +577,10 @@ struct ptdesc {
>>>>  		struct list_head pt_list;
>>>>  		struct {
>>>>  			unsigned long _pt_pad_1;
>>>> -			pgtable_t pmd_huge_pte;
>>>> +			union {
>>>> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
>>>> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
>>>> +			};
>>>>  		};
>>>>  	};
>>>>  	unsigned long __page_mapping;
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>>>>  #define arch_needs_pgtable_deposit() (false)
>>>>  #endif
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>>>> +					   pmd_t *pmd_table);
>>>> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
>>>> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
>>>> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
>>>
>>> These are useless extern's.
>>>
>>
>>
>> ack
>>
>> These are coming from the existing functions from the file:
>> extern void pgtable_trans_huge_deposit
>> extern pgtable_t pgtable_trans_huge_withdraw
>>
>> I think the externs can be removed from these as well? We can
>> fix those in a separate patch.
> 
> Generally the approach is to remove externs when adding/changing new stuff as
> otherwise we get completely useless churn on that and annoying git history
> changes.

Ack

>>
>>
>>>> +#endif
>>>> +
>>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>  /*
>>>>   * This is an implementation of pmdp_establish() that is only suitable for an
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index daa92a58585d9..08cd0a0eb8763 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -101,6 +101,7 @@ enum ttu_flags {
>>>>  					 * do a final flush if necessary */
>>>>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>>>>  					 * caller holds it */
>>>> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
>>>>  };
>>>>
>>>>  #ifdef CONFIG_MMU
>>>> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>>>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>>>>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>>>>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
>>>> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
>>>> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
>>>>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>>>  		unsigned long address, rmap_t flags);
>>>>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>>> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>>>>  	pgoff_t pgoff;
>>>>  	struct vm_area_struct *vma;
>>>>  	unsigned long address;
>>>> +	pud_t *pud;
>>>>  	pmd_t *pmd;
>>>>  	pte_t *pte;
>>>>  	spinlock_t *ptl;
>>>> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>>>>  static inline void
>>>>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>>>  {
>>>> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
>>>> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>>>>
>>>>  	if (likely(pvmw->ptl))
>>>>  		spin_unlock(pvmw->ptl);
>>>> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>>>  		WARN_ON_ONCE(1);
>>>>
>>>>  	pvmw->ptl = NULL;
>>>> +	pvmw->pud = NULL;
>>>>  	pvmw->pmd = NULL;
>>>>  	pvmw->pte = NULL;
>>>>  }
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 40cf59301c21a..3128b3beedb0a 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>>  	spin_unlock(ptl);
>>>>  	mmu_notifier_invalidate_range_end(&range);
>>>>  }
>>>> +
>>>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> +			   unsigned long address)
>>>> +{
>>>> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
>>>> +	if (pud_trans_huge(*pud))
>>>> +		__split_huge_pud_locked(vma, pud, address);
>>>> +}
>>>>  #else
>>>>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>>  		unsigned long address)
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 9ee336aa03656..21d5c00f638dc 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>>>>   * in mm/rmap.c:
>>>>   */
>>>>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
>>>> +#endif
>>>>
>>>>  /*
>>>>   * in mm/page_alloc.c
>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>> index b38a1d00c971b..d31eafba38041 100644
>>>> --- a/mm/page_vma_mapped.c
>>>> +++ b/mm/page_vma_mapped.c
>>>> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>>>  	return true;
>>>>  }
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
>>>> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>>> +{
>>>> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
>>>> +		return false;
>>>> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
>>>> +		return false;
>>>> +	return true;
>>>> +}
>>>> +#endif
>>>> +
>>>>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>>>>  {
>>>>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
>>>> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>  	pud_t *pud;
>>>>  	pmd_t pmde;
>>>>
>>>> +	/* The only possible pud mapping has been handled on last iteration */
>>>> +	if (pvmw->pud && !pvmw->pmd)
>>>> +		return not_found(pvmw);
>>>> +
>>>>  	/* The only possible pmd mapping has been handled on last iteration */
>>>>  	if (pvmw->pmd && !pvmw->pte)
>>>>  		return not_found(pvmw);
>>>> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>  			continue;
>>>>  		}
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>
>>> Said it elsewhere, but it's really weird to treat an arch having the
>>> ability to do something as a go ahead for doing it.
>>>
>>>> +		/* Check for PUD-mapped THP */
>>>> +		if (pud_trans_huge(*pud)) {
>>>> +			pvmw->pud = pud;
>>>> +			pvmw->ptl = pud_lock(mm, pud);
>>>> +			if (likely(pud_trans_huge(*pud))) {
>>>> +				if (pvmw->flags & PVMW_MIGRATION)
>>>> +					return not_found(pvmw);
>>>> +				if (!check_pud(pud_pfn(*pud), pvmw))
>>>> +					return not_found(pvmw);
>>>> +				return true;
>>>> +			}
>>>> +			/* PUD was split under us, retry at PMD level */
>>>> +			spin_unlock(pvmw->ptl);
>>>> +			pvmw->ptl = NULL;
>>>> +			pvmw->pud = NULL;
>>>> +		}
>>>> +#endif
>>>> +
>>>
>>> Yeah, as I said elsewhere, we got to be refactoring not copy/pasting with
>>> modifications :)
>>>
>>
>> Yeah there is repeated code in multiple places, where all I did was replace
>> what was done for PMD with PUD. In a lot of places, it's actually difficult
>> to not repeat the code (unless we want function macros, which is much worse
>> IMO).
> 
> Not if we actually refactor the existing code :)
> 
> When I wanted to make functional changes to mremap I took a lot of time to
> refactor the code into something sane before even starting that.
> 
> Because I _could_ have added the features there as-is, but it would have been
> hellish to do so as-is and added more confusion etc.
> 
> So yeah, I think a similar mentality has to be had with this change.

Ack, I will spend a lot more time thinking about the refactoring.
> 
>>
>>>
>>>>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
>>>>  		/*
>>>>  		 * Make sure the pmd value isn't cached in a register by the
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index d3aec7a9926ad..2047558ddcd79 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>>  }
>>>>  #endif
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/*
>>>> + * Deposit page tables for PUD THP.
>>>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>>>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>>>> + *
>>>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>>>> + * list_head. This is because lru.prev (offset 16) overlaps with
>>>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>>>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>>>
>>> This is horrible and feels like a hack? Treating a doubly-linked list as a
>>> singly-linked one like this is not upstreamable.
>>>
>>>> + *
>>>> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
>>>> + */
>>>> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>>>> +				    pmd_t *pmd_table)
>>>
>>> This is a horrid, you're depositing the PMD using the... questionable
>>> list_head abuse, but then also have pud_deposit_pte()... But here we're
>>> depositing a PMD shouldn't the name reflect that?
>>>
>>>> +{
>>>> +	pgtable_t pmd_page = virt_to_page(pmd_table);
>>>> +
>>>> +	assert_spin_locked(pud_lockptr(mm, pudp));
>>>> +
>>>> +	/* Push onto stack using only lru.next as the link */
>>>> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
>>>
>>> Yikes...
>>>
>>>> +	pud_huge_pmd(pudp) = pmd_page;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Withdraw the deposited PMD table for PUD THP split or zap.
>>>> + * Called with PUD lock held.
>>>> + * Returns NULL if no more PMD tables are deposited.
>>>> + */
>>>> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
>>>> +{
>>>> +	pgtable_t pmd_page;
>>>> +
>>>> +	assert_spin_locked(pud_lockptr(mm, pudp));
>>>> +
>>>> +	pmd_page = pud_huge_pmd(pudp);
>>>> +	if (!pmd_page)
>>>> +		return NULL;
>>>> +
>>>> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
>>>> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
>>>
>>> Where's the popping? You're just assigning here.
>>
>>
>> Ack on all of the above. Hopefully [1] is better.
> 
> Thanks!
> 
>>>
>>>> +
>>>> +	return page_address(pmd_page);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
>>>> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
>>>> + * No lock assertion since the PMD isn't visible yet.
>>>> + */
>>>> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
>>>> +{
>>>> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>>>> +
>>>> +	/* FIFO - add to front of list */
>>>> +	if (!ptdesc->pmd_huge_pte)
>>>> +		INIT_LIST_HEAD(&pgtable->lru);
>>>> +	else
>>>> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
>>>> +	ptdesc->pmd_huge_pte = pgtable;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Withdraw a PTE table from a standalone PMD table.
>>>> + * Returns NULL if no more PTE tables are deposited.
>>>> + */
>>>> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
>>>> +{
>>>> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>>>> +	pgtable_t pgtable;
>>>> +
>>>> +	pgtable = ptdesc->pmd_huge_pte;
>>>> +	if (!pgtable)
>>>> +		return NULL;
>>>> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
>>>> +							struct page, lru);
>>>> +	if (ptdesc->pmd_huge_pte)
>>>> +		list_del(&pgtable->lru);
>>>> +	return pgtable;
>>>> +}
>>>> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>>> +
>>>>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>>>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>>  		     pmd_t *pmdp)
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 7b9879ef442d9..69acabd763da4 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>>>>  	return pmd;
>>>>  }
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/*
>>>> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
>>>> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
>>>> + * represents.
>>>> + */
>>>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
>>>
>>> This series seems to be full of copy/paste.
>>>
>>> It's just not acceptable given the state of THP code as I said in reply to
>>> the cover letter - you need to _refactor_ the code.
>>>
>>> The code is bug-prone and difficult to maintain as-is, your series has to
>>> improve the technical debt, not add to it.
>>>
>>
>> In some cases we might not be able to avoid the copy, but this is definitely
>> a place where we don't need to. I will change it here. Thanks!
> 
> I disagree, see above :) But thanks on this one
> 
>>
>>>> +{
>>>> +	pgd_t *pgd;
>>>> +	p4d_t *p4d;
>>>> +	pud_t *pud = NULL;
>>>> +
>>>> +	pgd = pgd_offset(mm, address);
>>>> +	if (!pgd_present(*pgd))
>>>> +		goto out;
>>>> +
>>>> +	p4d = p4d_offset(pgd, address);
>>>> +	if (!p4d_present(*p4d))
>>>> +		goto out;
>>>> +
>>>> +	pud = pud_offset(p4d, address);
>>>> +out:
>>>> +	return pud;
>>>> +}
>>>> +#endif
>>>> +
>>>>  struct folio_referenced_arg {
>>>>  	int mapcount;
>>>>  	int referenced;
>>>> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>>>>  			SetPageAnonExclusive(page);
>>>>  			break;
>>>>  		case PGTABLE_LEVEL_PUD:
>>>> -			/*
>>>> -			 * Keep the compiler happy, we don't support anonymous
>>>> -			 * PUD mappings.
>>>> -			 */
>>>> -			WARN_ON_ONCE(1);
>>>> +			SetPageAnonExclusive(page);
>>>>  			break;
>>>>  		default:
>>>>  			BUILD_BUG();
>>>> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>>>>  #endif
>>>>  }
>>>>
>>>> +/**
>>>> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
>>>> + * @folio:	The folio to add the mapping to
>>>> + * @page:	The first page to add
>>>> + * @vma:	The vm area in which the mapping is added
>>>> + * @address:	The user virtual address of the first page to map
>>>> + * @flags:	The rmap flags
>>>> + *
>>>> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
>>>> + *
>>>> + * The caller needs to hold the page table lock, and the page must be locked in
>>>> + * the anon_vma case: to serialize mapping,index checking after setting.
>>>> + */
>>>> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
>>>> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
>>>> +{
>>>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
>>>> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
>>>> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
>>>> +			      PGTABLE_LEVEL_PUD);
>>>> +#else
>>>> +	WARN_ON_ONCE(true);
>>>> +#endif
>>>> +}
>>>
>>> More copy/paste... Maybe unavoidable in this case, but be good to try.
>>>
>>>> +
>>>>  /**
>>>>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>>>>   * @folio:	The folio to add the mapping to.
>>>> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>  		}
>>>>
>>>>  		if (!pvmw.pte) {
>>>> +			/*
>>>> +			 * Check for PUD-mapped THP first.
>>>> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
>>>> +			 * split the PUD to PMD level and restart the walk.
>>>> +			 */
>>>
>>> This is literally describing the code below, it's not useful.
>>
>> Ack, will remove this comment, thanks!
> 
> Thanks
> 
>>>
>>>> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>>>> +				if (flags & TTU_SPLIT_HUGE_PUD) {
>>>> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
>>>> +					flags &= ~TTU_SPLIT_HUGE_PUD;
>>>> +					page_vma_mapped_walk_restart(&pvmw);
>>>> +					continue;
>>>> +				}
>>>> +			}
>>>> +
>>>>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>>>>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>>>>  					goto walk_done;
>>>> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>  	mmu_notifier_invalidate_range_start(&range);
>>>>
>>>>  	while (page_vma_mapped_walk(&pvmw)) {
>>>> +		/* Handle PUD-mapped THP first */
>>>
>>> How did/will this interact with DAX, VFIO PUD THP?
>>
>> It won't interact with DAX. try_to_migrate() does the below and just returns:
>>
>> 	if (folio_is_zone_device(folio) &&
>> 	    (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
>> 		return;
>>
>> so DAX would never reach here.
> 
> Hmm folio_is_zone_device() always returns true for DAX?
> 

Yes, that is my understanding. Both fsdax and devdax call into
devm_memremap_pages() -> memremap_pages() in mm/memremap.c, which
unconditionally places all pages in ZONE_DEVICE.

> Also that's just one rmap call right?
> 
Yes.

>>
>> I think VFIO pages are pinned and therefore can't be migrated? (I have
>> not looked at the VFIO code; I will try to get a better understanding tomorrow,
>> but please let me know if that sounds wrong.)
> 
> OK, I've not dug into this either, so please do check. It'd also be good to test
> this code against actual DAX/VFIO scenarios if you can find a way to do that, thanks!

I think DAX is OK; I will look more into VFIO. I will also CC the people who added
DAX and VFIO PUD support in the next RFC.

> 
>>
>>
>>>
>>>> +		if (!pvmw.pte && !pvmw.pmd) {
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>
>>> Won't pud_trans_huge() imply this...
>>>
>>
>> Agreed, I think it should cover it.
> Thanks!
> 
>>
>>>> +			/*
>>>> +			 * PUD-mapped THP: skip migration to preserve the huge
>>>> +			 * page. Splitting would defeat the purpose of PUD THPs.
>>>> +			 * Return false to indicate migration failure, which
>>>> +			 * will cause alloc_contig_range() to try a different
>>>> +			 * memory region.
>>>> +			 */
>>>> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>>>> +				page_vma_mapped_walk_done(&pvmw);
>>>> +				ret = false;
>>>> +				break;
>>>> +			}
>>>> +#endif
>>>> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
>>>> +			page_vma_mapped_walk_done(&pvmw);
>>>> +			break;
>>>> +		}
>>>> +
>>>>  		/* PMD-mapped THP migration entry */
>>>>  		if (!pvmw.pte) {
>>>>  			__maybe_unused unsigned long pfn;
>>>> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>>>>
>>>>  	/*
>>>>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
>>>> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>>>> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>>>>  	 */
>>>>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>>>> -					TTU_SYNC | TTU_BATCH_FLUSH)))
>>>> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>>>>  		return;
>>>>
>>>>  	if (folio_is_zone_device(folio) &&
>>>> --
>>>> 2.47.3
>>>>
>>>
>>> This isn't a final review, I'll have to look more thoroughly through here
>>> over time and you're going to have to be patient in general :)
>>>
>>> Cheers, Lorenzo
>>
>>
>> Thanks for the review, this is awesome!
> 
> Ack, will do more when I have time, and obviously you're getting a lot of input
> from others too.
> 
> Be good to get a summary at next THP cabal ;)
> 
>>
>>
>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/
>> [2] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
>> [3] https://lore.kernel.org/all/2efaa5ed-bd09-41f0-9c07-5cd6cccc4595@gmail.com/
>>
>>
>>
> 
> cheers, Lorenzo




* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-02 15:50     ` Zi Yan
  2026-02-04 10:56       ` Lorenzo Stoakes
@ 2026-02-05 11:22       ` David Hildenbrand (arm)
  1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-05 11:22 UTC (permalink / raw)
  To: Zi Yan, Lorenzo Stoakes
  Cc: Rik van Riel, Usama Arif, Andrew Morton, linux-mm, hannes,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team, Frank van der Linden

On 2/2/26 16:50, Zi Yan wrote:
> On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:
> 
>> On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
>>> To address the obvious objection "but how could we
>>> possibly allocate 1GB huge pages while the workload
>>> is running?", I am planning to pick up the CMA balancing
>>> patch series (thank you, Frank) and get that in an
>>> upstream ready shape soon.
>>>
>>> https://lkml.org/2025/9/15/1735
>>
>> That link doesn't work?
>>
>> Did a quick search for CMA balancing on lore, couldn't find anything, could you
>> provide a lore link?
> 
> https://lwn.net/Articles/1038263/
> 
>>
>>>
>>> That patch set looks like another case where no
>>> amount of internal testing will find every single
>>> corner case, and we'll probably just want to
>>> merge it upstream, deploy it experimentally, and
>>> aggressively deal with anything that might pop up.
>>
>> I'm not really in favour of this kind of approach. There's plenty of things that
>> were considered 'temporary' upstream that became rather permanent :)
>>
>> Maybe we can't cover all corner-cases, but we need to make sure whatever we do
>> send upstream is maintainable, conceptually sensible and doesn't paint us into
>> any corners, etc.
>>
>>>
>>> With CMA balancing, it would be possible to just
>>> have half (or even more) of system memory for
>>> movable allocations only, which would make it possible
>>> to allocate 1GB huge pages dynamically.
>>
>> Could you expand on that?
> 
> I also would like to hear David’s opinion on using CMA for 1GB THP.
> He did not like it[1] when I posted my patch back in 2020, but it has
> been more than 5 years. :)

Hehe, not particularly excited about that.

We really have to avoid short-term hacks by any means. We have enough of 
that in THP land already.

We talked about challenges in the past like:
* Controlling who gets to allocate them.
* Having a reasonable swap/migration mechanism
* Reliably allocating them without hacks, while being future-proof
* Long-term pinning them when they are actually on ZONE_MOVABLE or CMA
   (the latter could be made working but requires thought)

I agree with Lorenzo that this RFC is a bit surprising, because I assume 
none of the real challenges were tackled.

That said, it will take me some time to come back to this RFC;
other stuff that piled up is more urgent and more important.

But I'll note that we really have to clean up the THP mess before we add
more stuff on it.

For example, I still wonder whether we can just stop pre-allocating page 
tables for THPs and instead let code fail+retry in case we cannot remap 
the page. I wanted to look into the details a long time ago but never 
got to it.

Avoiding that would make the remapping much easier; and we should then 
remap from PUD->PMD->PTEs.

Implementing 1 GiB support for shmem might be a reasonable first step, 
before we start digging into the anonymous memory land with all these 
nasty things involved.

-- 
Cheers,

David


* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-04 10:56       ` Lorenzo Stoakes
@ 2026-02-05 11:29         ` David Hildenbrand (arm)
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-05 11:29 UTC (permalink / raw)
  To: Lorenzo Stoakes, Zi Yan
  Cc: Rik van Riel, Usama Arif, Andrew Morton, linux-mm, hannes,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team, Frank van der Linden

On 2/4/26 11:56, Lorenzo Stoakes wrote:
> On Mon, Feb 02, 2026 at 10:50:35AM -0500, Zi Yan wrote:
>> On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:
>>
>>>
>>> That link doesn't work?
>>>
>>> Did a quick search for CMA balancing on lore, couldn't find anything, could you
>>> provide a lore link?
>>
>> https://lwn.net/Articles/1038263/
>>
>>>
>>>
>>> I'm not really in favour of this kind of approach. There's plenty of things that
>>> were considered 'temporary' upstream that became rather permanent :)
>>>
>>> Maybe we can't cover all corner-cases, but we need to make sure whatever we do
>>> send upstream is maintainable, conceptually sensible and doesn't paint us into
>>> any corners, etc.
>>>
>>>
>>> Could you expand on that?
>>
>> I also would like to hear David’s opinion on using CMA for 1GB THP.
>> He did not like it[1] when I posted my patch back in 2020, but it has
>> been more than 5 years. :)
> 
> Yes please David :)

Heh, read Zi's mail first :)

> 
> I find the idea of using the CMA for this a bit gross. And I fear we're
> essentially expanding the hacks for DAX to everyone.

Jup.

> 
> Again I really feel that we should be tackling technical debt here, rather
> than adding features on shaky foundations and just making things worse.
> 

Jup.

> We are inundated with series-after-series for THP trying to add features
> but really not very many that are tackling this debt, and I think it's time
> to get firmer about that.

Almost nobody wants to do cleanups because there is the belief that only
features are important; and some companies seem to value features more
than cleanups when it comes to promotions etc.

And cleanups in that area are hard, because you'll very likely just 
break stuff because it's all so weirdly interconnected.

See max_ptes_none discussion ...

> 
>>
>> The other direction I explored is to get 1GB THP from buddy allocator.
>> That means we need to:
>> 1. bump MAX_PAGE_ORDER to 18 or make it a runtime variable so that only 1GB
>>     THP users need to bump it,
> 
> Would we need to bump the page block size too to stand more of a chance of
> avoiding fragmentation?

We discussed one idea of another level of anti-fragmentation on top (I
forgot what we called it; essentially bigger blocks that group pages in
the buddy). But implementing that is non-trivial.

But long-term we really need something better than pageblocks and using 
hacky CMA reservations for anything larger.

-- 
Cheers,

David



* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-05  5:13             ` Usama Arif
@ 2026-02-05 17:40               ` David Hildenbrand (Arm)
  2026-02-05 18:05                 ` Usama Arif
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 17:40 UTC (permalink / raw)
  To: Usama Arif, Matthew Wilcox
  Cc: Zi Yan, Kiryl Shutsemau, lorenzo.stoakes, Andrew Morton, linux-mm,
	hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On 2/5/26 06:13, Usama Arif wrote:
> 
> 
> On 04/02/2026 20:21, Matthew Wilcox wrote:
>> On Thu, Feb 05, 2026 at 04:17:19AM +0000, Matthew Wilcox wrote:
>>> Why are you even talking about "the next series"?  The approach is
>>> wrong.  You need to put this POC aside and solve the problems that
>>> you've bypassed to create this POC.
> 
> 
> Ah, is the issue the code duplication that Lorenzo has raised (of course
> I completely agree that there is quite a bit), the lru.next patch I did,
> which [1] hopefully makes better, or investigating whether it might be
> interfering with DAX/VFIO, as Lorenzo pointed out (I will of course
> investigate before sending the next revision)? The mapcount work
> (I think David is working on this?) that is needed to allow splitting
> PUDs to PMDs is a completely separate issue and can be tackled in parallel
> to this.

I would enjoy seeing an investigation where we see what might have to be 
done to avoid preallocating page tables for anonymous memory THPs, and 
instead, try allocating them on demand when remapping. If allocation 
fails, it's just another -ENOMEM or -EAGAIN.

That would not only reduce the page table overhead when using THPs, it 
would also avoid the preallocation of two levels like you need here.

Maybe it's doable, maybe not.

Last time I looked into it I was like "there must be a better way to 
achieve that" :)

Spinlocks might require preallocating etc.

(as raised elsewhere, starting with shmem support avoids the page table
problem)

> 
>>
>> ... and gmail is rejecting this email as being spam.  You need to stop
>> using gmail for kernel development work.
> 
> I asked a couple of folks now and it seems they got it without any issue.
> I have used it for a long time. I will try and see if something has changed.

Gmail is absolutely horrible for upstream development. For example, 
linux-mm recently un-subscribed all gmail addresses.

When I moved to my kernel.org address I thought using gmail as a backend
would be a great choice. I was wrong: I kept getting daily bounce
notifications from MLs (even though my spam filter rules essentially
allowed everything). So I moved to something else (I now pay 3 Euro a
month, omg! :) ).

-- 
Cheers,

David


* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-05 17:40               ` David Hildenbrand (Arm)
@ 2026-02-05 18:05                 ` Usama Arif
  2026-02-05 18:11                   ` Usama Arif
  0 siblings, 1 reply; 52+ messages in thread
From: Usama Arif @ 2026-02-05 18:05 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Matthew Wilcox
  Cc: Zi Yan, Kiryl Shutsemau, lorenzo.stoakes, Andrew Morton, linux-mm,
	hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team



On 05/02/2026 09:40, David Hildenbrand (Arm) wrote:
> On 2/5/26 06:13, Usama Arif wrote:
>>
>>
>> On 04/02/2026 20:21, Matthew Wilcox wrote:
>>> On Thu, Feb 05, 2026 at 04:17:19AM +0000, Matthew Wilcox wrote:
>>>> Why are you even talking about "the next series"?  The approach is
>>>> wrong.  You need to put this POC aside and solve the problems that
>>>> you've bypassed to create this POC.
>>
>>
>> Ah, is the issue the code duplication that Lorenzo has raised (of course
>> I completely agree that there is quite a bit), the lru.next patch I did,
>> which [1] hopefully makes better, or investigating whether it might be
>> interfering with DAX/VFIO, as Lorenzo pointed out (I will of course
>> investigate before sending the next revision)? The mapcount work
>> (I think David is working on this?) that is needed to allow splitting
>> PUDs to PMDs is a completely separate issue and can be tackled in parallel
>> to this.
> 
> I would enjoy seeing an investigation where we see what might have to be done to avoid preallocating page tables for anonymous memory THPs, and instead, try allocating them on demand when remapping. If allocation fails, it's just another -ENOMEM or -EAGAIN.
> 
> That would not only reduce the page table overhead when using THPs, it would also avoid the preallocation of two levels like you need here.
> 
> Maybe it's doable, maybe not.
> 
> Last time I looked into it I was like "there must be a better way to achieve that" :)
> 
> Spinlocks might require preallocating etc.


Thanks for this! I am going to try to implement this now and stress test it for 2M THPs as well.
I have access to some production workloads that use a lot of THPs, and I can add
counters to see how often this even happens in prod workloads, i.e. how often page table
allocation fails for 2M THPs if it's done on demand instead of being preallocated.

> 
> (as raised elsewhere, starting with shmem support avoids the page table problem)
> 








* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-03 23:29   ` Usama Arif
  2026-02-04  0:08     ` Frank van der Linden
@ 2026-02-05 18:07     ` Zi Yan
  2026-02-07 23:22       ` Usama Arif
  1 sibling, 1 reply; 52+ messages in thread
From: Zi Yan @ 2026-02-05 18:07 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team

On 3 Feb 2026, at 18:29, Usama Arif wrote:

> On 02/02/2026 08:24, Zi Yan wrote:
>> On 1 Feb 2026, at 19:50, Usama Arif wrote:
>>
>>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>>> applications to benefit from reduced TLB pressure without requiring
>>> hugetlbfs. The patches are based on top of
>>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
>>
>> It is nice to see you are working on 1GB THP.
>>
>>>
>>> Motivation: Why 1GB THP over hugetlbfs?
>>> =======================================
>>>
>>> While hugetlbfs provides 1GB huge pages today, it has significant limitations
>>> that make it unsuitable for many workloads:
>>>
>>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>>>    or runtime, taking memory away. This requires capacity planning,
>>>    administrative overhead, and makes workload orchestration much more
>>>    complex, especially colocating with workloads that don't use hugetlbfs.
>>
>> But you are using CMA, the same allocation mechanism as hugetlb_cma. What
>> is the difference?
>>
>
> So we don't really need to use CMA. CMA can help a lot of course, but we don't *need* it.
> E.g. I can run the very simple case [1] of trying to get 1G pages in the upstream
> kernel without CMA on my server and it works. The server has been up for more than a week
> (so pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
> and I tried to get 20x1G pages on it and it worked.
> It uses folio_alloc_gigantic, which is exactly what this series uses:
>
> $ uptime -p
> up 1 week, 3 days, 5 hours, 7 minutes
> $ cat /proc/meminfo | grep -i cma
> CmaTotal:              0 kB
> CmaFree:               0 kB
> $ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 20
> $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 20
> $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
> Swap:          129Gi       3.5Gi       126Gi
> $ ./map_1g_hugepages
> Mapping 20 x 1GB huge pages (20 GB total)
> Mapped at 0x7f43c0000000
> Touched page 0 at 0x7f43c0000000
> Touched page 1 at 0x7f4400000000
> Touched page 2 at 0x7f4440000000
> Touched page 3 at 0x7f4480000000
> Touched page 4 at 0x7f44c0000000
> Touched page 5 at 0x7f4500000000
> Touched page 6 at 0x7f4540000000
> Touched page 7 at 0x7f4580000000
> Touched page 8 at 0x7f45c0000000
> Touched page 9 at 0x7f4600000000
> Touched page 10 at 0x7f4640000000
> Touched page 11 at 0x7f4680000000
> Touched page 12 at 0x7f46c0000000
> Touched page 13 at 0x7f4700000000
> Touched page 14 at 0x7f4740000000
> Touched page 15 at 0x7f4780000000
> Touched page 16 at 0x7f47c0000000
> Touched page 17 at 0x7f4800000000
> Touched page 18 at 0x7f4840000000
> Touched page 19 at 0x7f4880000000
> Unmapped successfully
>

OK, I see the subtle difference among CMA, hugetlb_cma, alloc_contig_pages(),
although CMA and hugetlb_cma use alloc_contig_pages() behind the scenes:

1. CMA and hugetlb_cma reserve some amount of memory at boot as MIGRATE_CMA
and only CMA allocations are allowed. It is a carveout.

2. alloc_contig_pages() without CMA needs to look for a contiguous physical
range without any unmovable pages or pinned movable pages, so that the allocation
can succeed.

Your example is quite optimistic, since the free memory is much bigger than
the requested 1GB pages, 292GB vs 20GB. Unless the worst-case scenario happens,
where each 1GB of the free memory has 1 unmovable page, alloc_contig_pages()
will succeed. But does it represent the production environment, where free memory
is scarce? And in that case, how long does alloc_contig_pages() take to get
1GB of memory? Is that delay tolerable?

This discussion all comes back to
“should we have a dedicated source for 1GB folio?” Yu Zhao’s TAO[1] was
interesting, since it has a dedicated zone for large folios and split is
replaced by migrating after-split folios to a different zone. But how to
adjust that dedicated zone size is still not determined. Lots of ideas,
but no conclusion yet.

[1] https://lwn.net/Articles/964097/

>
>
>
>>>
>>> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>>>    rather than falling back to smaller pages. This makes it fragile under
>>>    memory pressure.
>>
>> True.
>>
>>>
>>> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>>>    is needed, leading to memory waste and preventing partial reclaim.
>>
>> Since you have PUD THP implementation, have you run any workload on it?
>> How often you see a PUD THP split?
>>
>
> Ah, so running non-upstream kernels in production is a bit more difficult
> (and also risky). I was trying to use the 512M experiment on arm as a comparison,
> although I know it's not the same thing with PAGE_SIZE and pageblock order.
>
> I can try some other upstream benchmarks if it helps? Although will need to find
> ones that create VMA > 1G.

I think getting split stats from ARM 512MB PMD THP can give some clue about
1GB THP, since the THP sizes are similar (yeah, base page to THP size ratios
are 32x different but the gap between base page size and THP size is still
much bigger than 4KB vs 2MB).
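To make the ratios concrete, here is a quick userspace sketch of the
arithmetic (not kernel code; the helper name base_pages_per_thp() is made up
for illustration, and it assumes 64KB base pages with 512MB PMD THP on arm64
versus 4KB base pages with 1GB PUD THP and 2MB PMD THP on x86-64):

```c
#include <assert.h>
#include <stdint.h>

#define KB(x) ((uint64_t)(x) << 10)
#define MB(x) ((uint64_t)(x) << 20)
#define GB(x) ((uint64_t)(x) << 30)

/* Hypothetical helper: number of base pages covered by one THP. */
static inline uint64_t base_pages_per_thp(uint64_t thp_size,
					  uint64_t base_page_size)
{
	return thp_size / base_page_size;
}

/* Compile-time checks of the ratios, under the assumptions stated above. */
static_assert(MB(512) / KB(64) == 8192,
	      "arm64 64K pages: 512MB PMD THP spans 8192 base pages");
static_assert(GB(1) / KB(4) == 262144,
	      "x86-64 4K pages: 1GB PUD THP spans 262144 base pages");
static_assert(MB(2) / KB(4) == 512,
	      "x86-64 4K pages: 2MB PMD THP spans 512 base pages");
static_assert((GB(1) / KB(4)) / (MB(512) / KB(64)) == 32,
	      "1GB/4KB ratio is 32x the 512MB/64KB ratio");
```

So a 512MB THP on a 64K kernel spans 8192 base pages and a 1GB THP on a 4K
kernel spans 262144, i.e. 16x and 512x the 512 base pages of a 2MB THP.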

>
>> Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
>> any split stats to show the necessity of THP split?
>>
>>>
>>> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>>>    be easily shared with regular memory pools.
>>
>> True.
>>
>>>
>>> PUD THP solves these limitations by integrating 1GB pages into the existing
>>> THP infrastructure.
>>
>> The main advantage of PUD THP over hugetlb is that it can be split and mapped
>> at sub-folio level. Do you have any data to support the necessity of them?
>> I wonder if it would be easier to just support 1GB folio in core-mm first
>> and we can add 1GB THP split and sub-folio mapping later. With that, we
>> can move hugetlb users to 1GB folio.
>>
>
> I would say it's not the main advantage? But it's definitely one of them.
> The 2 main areas where split would be helpful are munmap of a partial
> range and reclaim (MADV_PAGEOUT). E.g. jemalloc/tcmalloc can now start
> taking advantage of 1G pages. My knowledge is not that great when it comes
> to memory allocators, but I believe they track how long certain areas
> have been cold and can trigger reclaim, as an example. Then split will be useful.
> Having memory allocators use hugetlb is probably going to be a no?

To take advantage of 1GB pages, memory allocators would want to keep that
whole GB mapped by PUD, otherwise TLB wise there is no difference from
using 2MB pages, right? I guess memory allocators would want to promote
a set of stable memory objects to 1GB and demote them from 1GB if any
is gone (promote by migrating them into a 1GB folio, demote by migrating
them out of a 1GB folio) and this can avoid split.

>
>
>> BTW, without split support, you can apply HVO to 1GB folio to save memory.
>> That is a disadvantage of PUD THP. Have you taken that into consideration?
>> Basically, switching from hugetlb to PUD THP, you will lose memory due
>> to vmemmap usage.
>>
>
> Yeah, so HVO saves 16M per 1G, and the page table deposit mechanism adds ~2M per 1G.
> We have HVO enabled in the Meta fleet. I think we should not only think of PUD THP
> as a replacement for hugetlb, but also as a way to enable further use cases where hugetlb
> would not be feasible.
>
> After the basic infrastructure for 1G is there, we can work on optimizing; I think
> there would be a lot of interesting work we can do. HVO for 1G THP would be one
> of them?

HVO would prevent folio split, right? Since most of the struct pages are mapped
to the same memory area. You will need to allocate more memory, 16MB, to split
1GB. That further decreases the motivation for splitting 1GB.
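
A back-of-the-envelope sketch of where the 16MB and ~2MB numbers come from
(userspace arithmetic only; it assumes 4KB base pages, a 64-byte struct page
and 512 entries per page table, which I believe is the common x86-64
configuration, and the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE   4096ULL       /* assumed 4KB base pages */
#define STRUCT_PAGE_SIZE 64ULL         /* assumed sizeof(struct page) */
#define PTRS_PER_TABLE   512ULL        /* assumed entries per page table */
#define PUD_THP_SIZE     (1ULL << 30)  /* 1GB */

/* Hypothetical helper: vmemmap bytes describing one folio of folio_size. */
static inline uint64_t vmemmap_bytes(uint64_t folio_size)
{
	return folio_size / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE;
}

/*
 * Hypothetical helper: bytes of page tables deposited per PUD THP,
 * i.e. one PMD table plus one PTE table for each of its PMD entries.
 */
static inline uint64_t pud_deposit_bytes(void)
{
	return (1 + PTRS_PER_TABLE) * BASE_PAGE_SIZE;
}

/* 262144 struct pages * 64 bytes = 16MB of vmemmap per 1GB folio. */
static_assert(PUD_THP_SIZE / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE == (16ULL << 20),
	      "16MB of struct pages per 1GB");
/* 513 page tables * 4KB = 2MB + 4KB deposited per PUD THP. */
static_assert((1 + PTRS_PER_TABLE) * BASE_PAGE_SIZE == (2ULL << 20) + 4096,
	      "2MB + 4KB of deposited page tables per 1GB");
```

HVO frees almost all of that 16MB for a hugetlb page, which is why a split
would have to allocate it back, while the pre-deposited page tables cost
about 2MB per PUD THP.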

>
>>>
>>> Performance Results
>>> ===================
>>>
>>> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>>>
>>> Test: True Random Memory Access [1] test of 4GB memory region with pointer
>>> chasing workload (4M random pointer dereferences through memory):
>>>
>>> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
>>> |-------------------|---------------|---------------|--------------|
>>> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
>>> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>>>
>>> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
>>> For long-running workloads this will be a one-off cost, and the 34%
>>> improvement in access latency provides significant benefit.
>>>
>>> ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a CPU
>>> bound workload running on a large number of ARM servers (256G). I enabled
>>> the 512M THP settings to always for 100 servers in production (didn't
>>> really have high expectations :)). The average memory used for the workload
>>> increased from 217G to 233G. The amount of memory backed by 512M pages was
>>> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
>>> by 5.9% (This is a very significant improvement in workload performance).
>>> A significant number of these THPs were faulted in at application start and
>>> were present across different VMAs. Of course getting these 512M pages is
>>> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>>>
>>> I am hoping that these patches for 1G THP can be used to provide similar
>>> benefits for x86. I expect workloads to fault them in at start time when there
>>> is plenty of free memory available.
>>>
>>>
>>> Previous attempt by Zi Yan
>>> ==========================
>>>
>>> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
>>> significant changes in kernel since then, including folio conversion, mTHP
>>> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
>>> code as reference for making 1G PUD THP work. I am hoping Zi can provide
>>> guidance on these patches!
>>
>> I am more than happy to help you. :)
>>
>
> Thanks!!!
>
>>>
>>> Major Design Decisions
>>> ======================
>>>
>>> 1. No shared 1G zero page: The memory cost would be quite significant!
>>>
>>> 2. Page Table Pre-deposit Strategy
>>>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>>>    page tables (one for each potential PMD entry after split).
>>>    We allocate a PMD page table and use its pmd_huge_pte list to store
>>>    the deposited PTE tables. This ensures split operations don't fail due
>>>    to page table allocation failures (at the cost of 2M per PUD THP)
>>>
>>> 3. Split to Base Pages
>>>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>>>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>>>    to 2M pages and then to 4K pages if needed. However, this would require
>>>    significant rmap and mapcount tracking changes.
>>>
>>> 4. COW and fork handling via split
>>>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>>>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>>>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>>>    Probably this should only be done on CoW and not fork?
>>>
>>> 5. Migration via split
>>>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>>>    to find 1G of contiguous memory to migrate to. Maybe it's better to not
>>>    allow migration of PUDs at all? I am more tempted to not allow migration,
>>>    but have kept splitting in this RFC.
>>
>> Without migration, PUD THP loses its flexibility and transparency. But with
>> its 1GB size, I also wonder what the purpose of PUD THP migration can be.
>> It does not create memory fragmentation, since it is the largest folio size
>> we have and contiguous. NUMA balancing 1GB THP seems too much work.
>
> Yeah this is exactly what I was thinking as well. It is going to be expensive
> and difficult to migrate 1G pages, and I am not sure if what we get out of it
> is worth it? I kept the splitting code in this RFC as I wanted to show that
> it's possible to split and migrate, and the code to reject migration is a lot easier.

Got it. Maybe reframing this patchset as 1GB folio support without split or
migration is better?

>
>>
>> BTW, I posted many questions, but that does not mean I object to the patchset.
>> I just want to understand your use case better, reduce unnecessary
>> code changes, and hopefully get it upstreamed this time. :)
>>
>> Thank you for the work.
>>
>
> Ah no this is awesome! Thanks for the questions! It's basically the discussion I
> wanted to start with the RFC.
>
>
> [1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991


Best Regards,
Yan, Zi



* Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
  2026-02-05 18:05                 ` Usama Arif
@ 2026-02-05 18:11                   ` Usama Arif
  0 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-05 18:11 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Matthew Wilcox
  Cc: Zi Yan, Kiryl Shutsemau, lorenzo.stoakes, Andrew Morton, linux-mm,
	hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team


>>
>> (as raised elsewhere, starting with shmem support avoids the page table problem)
>>
> 

Also, I forgot to add here: I will look into this first, before PUD anon THPs.





* Re: [RFC 00/12] mm: PUD (1GB) THP implementation
  2026-02-05 18:07     ` Zi Yan
@ 2026-02-07 23:22       ` Usama Arif
  0 siblings, 0 replies; 52+ messages in thread
From: Usama Arif @ 2026-02-07 23:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, David Hildenbrand, lorenzo.stoakes, linux-mm,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team



On 05/02/2026 18:07, Zi Yan wrote:
> On 3 Feb 2026, at 18:29, Usama Arif wrote:
> 
>> On 02/02/2026 08:24, Zi Yan wrote:
>>> On 1 Feb 2026, at 19:50, Usama Arif wrote:
>>>
>>>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>>>> applications to benefit from reduced TLB pressure without requiring
>>>> hugetlbfs. The patches are based on top of
>>>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
>>>
>>> It is nice to see you are working on 1GB THP.
>>>
>>>>
>>>> Motivation: Why 1GB THP over hugetlbfs?
>>>> =======================================
>>>>
>>>> While hugetlbfs provides 1GB huge pages today, it has significant limitations
>>>> that make it unsuitable for many workloads:
>>>>
>>>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>>>>    or runtime, taking memory away. This requires capacity planning,
>>>>    administrative overhead, and makes workload orchestration much more
>>>>    complex, especially colocating with workloads that don't use hugetlbfs.
>>>
>>> But you are using CMA, the same allocation mechanism as hugetlb_cma. What
>>> is the difference?
>>>
>>
>> So we don't really need to use CMA. CMA can help a lot, of course, but we don't *need* it.
>> For example, I can run the very simple case [1] of trying to get 1G pages in the upstream
>> kernel without CMA on my server and it works. The server has been up for more than a week
>> (so pretty fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
>> and I tried to get 20x1G pages on it and it worked.
>> It uses folio_alloc_gigantic, which is exactly what this series uses:
>>
>> $ uptime -p
>> up 1 week, 3 days, 5 hours, 7 minutes
>> $ cat /proc/meminfo | grep -i cma
>> CmaTotal:              0 kB
>> CmaFree:               0 kB
>> $ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> 20
>> $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> 20
>> $ free -h
>>                total        used        free      shared  buff/cache   available
>> Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
>> Swap:          129Gi       3.5Gi       126Gi
>> $ ./map_1g_hugepages
>> Mapping 20 x 1GB huge pages (20 GB total)
>> Mapped at 0x7f43c0000000
>> Touched page 0 at 0x7f43c0000000
>> Touched page 1 at 0x7f4400000000
>> Touched page 2 at 0x7f4440000000
>> Touched page 3 at 0x7f4480000000
>> Touched page 4 at 0x7f44c0000000
>> Touched page 5 at 0x7f4500000000
>> Touched page 6 at 0x7f4540000000
>> Touched page 7 at 0x7f4580000000
>> Touched page 8 at 0x7f45c0000000
>> Touched page 9 at 0x7f4600000000
>> Touched page 10 at 0x7f4640000000
>> Touched page 11 at 0x7f4680000000
>> Touched page 12 at 0x7f46c0000000
>> Touched page 13 at 0x7f4700000000
>> Touched page 14 at 0x7f4740000000
>> Touched page 15 at 0x7f4780000000
>> Touched page 16 at 0x7f47c0000000
>> Touched page 17 at 0x7f4800000000
>> Touched page 18 at 0x7f4840000000
>> Touched page 19 at 0x7f4880000000
>> Unmapped successfully
>>
> 
> OK, I see the subtle difference among CMA, hugetlb_cma, alloc_contig_pages(),
> although CMA and hugetlb_cma use alloc_contig_pages() behind the scenes:
> 
> 1. CMA and hugetlb_cma reserve some amount of memory at boot as MIGRATE_CMA
> and only CMA allocations are allowed. It is a carveout.

Yes. Also, there is always going to be some amount of movable non-pinned memory in
the system, so it is OK to have a certain percentage of memory dedicated to
CMA even if we never make 1G allocations, as we aren't really taking it away from
the system. When it's needed for 1G allocations, the memory will just be migrated out.

> 
> 2. alloc_contig_pages() without CMA needs to look for a contiguous physical
> range without any unmovable page or pinned movable pages, so that the allocation
> can succeed.
> 
> Your example is quite optimistic, since the free memory is much bigger than
> the requested 1GB pages, 292GB vs 20GB. Unless the worst-case scenario happens,
> where each 1GB of free memory has one unmovable page, alloc_contig_pages()
> will succeed. But does it represent the production environment, where free memory
> is scarce? And in that case, how long does alloc_contig_pages() take to get
> 1GB memory? Is that delay tolerable?

So this was my personal server, which had been up for more than a week. I was
expecting the worst case as you described, but it seems that doesn't really happen.
I will also try requesting a larger number of 1G pages.

The majority of use cases for this would be applications getting the 1G pages
when they are started (when there is plenty of free memory), and holding them
for a long time. The delay is large (as I showed in the numbers below), but if
the application gets the 1G page at the start and keeps it for a long time, it's
a one-off cost.
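
As a rough illustration of that amortisation, here is a sketch that uses only
the two rows of the benchmark table quoted further down (88 ms vs 134 ms per
4M-dereference pass, 898 ms vs 331 ms fault time). It assumes the per-pass
saving stays constant, which a real workload will not guarantee, and the
function name is made up for illustration:

```python
import math

def passes_to_break_even(fault_pud_ms, fault_pmd_ms, access_pud_ms, access_pmd_ms):
    """Number of access passes after which the extra 1G fault cost is repaid."""
    # One-off extra cost of faulting in 1G pages instead of 2M pages.
    extra_fault_ms = fault_pud_ms - fault_pmd_ms
    # Time saved on every subsequent pass over the mapped region.
    saving_per_pass_ms = access_pmd_ms - access_pud_ms
    if saving_per_pass_ms <= 0:
        raise ValueError("PUD mapping is not faster per pass; it never breaks even")
    return math.ceil(extra_fault_ms / saving_per_pass_ms)

# Numbers taken from the benchmark table in the cover letter.
passes = passes_to_break_even(898, 331, 88, 134)
print(passes)
```

With those numbers the extra 567 ms of fault time is repaid after 13 passes of
the pointer-chasing loop, so under this simple model the fault cost is quickly
hidden for anything long-running.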

> 
> This discussion all comes back to
> “should we have a dedicated source for 1GB folio?” Yu Zhao’s TAO[1] was
> interesting, since it has a dedicated zone for large folios and split is
> replaced by migrating after-split folios to a different zone. But how to
> adjust that dedicated zone size is still not determined. Lots of ideas,
> but no conclusion yet.
> 
> [1] https://lwn.net/Articles/964097/
> 

Actually I wasn't a big fan of TAO. I would rather have CMA than TAO, as at least
you wouldn't make the memory unusable if there are no 1G allocations. But as can
be seen, neither is actually needed.

>>
>>
>>
>>>>
>>>> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>>>>    rather than falling back to smaller pages. This makes it fragile under
>>>>    memory pressure.
>>>
>>> True.
>>>
>>>>
>>>> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>>>>    is needed, leading to memory waste and preventing partial reclaim.
>>>
>>> Since you have PUD THP implementation, have you run any workload on it?
>>> How often do you see a PUD THP split?
>>>
>>
>> Ah so running non-upstream kernels in production is a bit more difficult
>> (and also risky). I was trying to use the 512M experiment on arm as a comparison,
>> although I know it's not the same thing with PAGE_SIZE and pageblock order.
>>
>> I can try some other upstream benchmarks if it helps? Although will need to find
>> ones that create VMAs > 1G.
> 
> I think getting split stats from ARM 512MB PMD THP can give some clue about
> 1GB THP, since the THP sizes are similar (yeah, base page to THP size ratios
> are 32x different but the gap between base page size and THP size is still
> much bigger than 4KB vs 2MB).
> 

There were splits. I was running with max_ptes_none = 0, as I didn't want jobs
to OOM, and the THP shrinker was kicking in. I don't have the numbers on hand, but
I can try and run the job again next week (it takes some time and effort to
set things up).
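
For reference, the size ratios mentioned in the quoted paragraph above can be
checked with a few lines. This is only a sketch of the arithmetic; the helper
name is hypothetical and the page sizes are the ones named in the thread
(64K/512M on ARM, 4K/2M/1G on x86):

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

def base_pages_per_thp(thp_size, base_page_size):
    """How many base pages one huge page of the given size covers."""
    return thp_size // base_page_size

arm_512m = base_pages_per_thp(512 * MIB, 64 * KIB)  # ARM, 64K base pages
x86_1g = base_pages_per_thp(1 * GIB, 4 * KIB)       # x86 PUD THP
x86_2m = base_pages_per_thp(2 * MIB, 4 * KIB)       # x86 PMD THP

print(arm_512m, x86_1g, x86_2m)
print(x86_1g // arm_512m)   # how much larger the 1G/4K ratio is
print(arm_512m // x86_2m)   # how much larger the 512M/64K ratio is than 2M/4K
```

So a 512M THP on a 64K kernel covers 8192 base pages, a 1G THP on a 4K kernel
covers 262144 (the 32x difference), and both are far above the 512 base pages
of a 2M THP.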

>>
>>> Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
>>> any split stats to show the necessity of THP split?
>>>
>>>>
>>>> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>>>>    be easily shared with regular memory pools.
>>>
>>> True.
>>>
>>>>
>>>> PUD THP solves these limitations by integrating 1GB pages into the existing
>>>> THP infrastructure.
>>>
>>> The main advantage of PUD THP over hugetlb is that it can be split and mapped
>>> at sub-folio level. Do you have any data to support the necessity of them?
>>> I wonder if it would be easier to just support 1GB folio in core-mm first
>>> and we can add 1GB THP split and sub-folio mapping later. With that, we
>>> can move hugetlb users to 1GB folio.
>>>
>>
>> I would say it's not the main advantage? But it's definitely one of them.
>> The 2 main areas where split would be helpful are munmap of a partial
>> range and reclaim (MADV_PAGEOUT). For example, jemalloc/tcmalloc can now start
>> taking advantage of 1G pages. My knowledge is not that great when it comes
>> to memory allocators, but I believe they track for how long certain areas
>> have been cold and can trigger reclaim as an example. Then split will be useful.
>> Having memory allocators use hugetlb is probably going to be a no?
> 
> To take advantage of 1GB pages, memory allocators would want to keep that
> whole GB mapped by PUD, otherwise TLB wise there is no difference from
> using 2MB pages, right?

Yes

> I guess memory allocators would want to promote
> a set of stable memory objects to 1GB and demote them from 1GB if any
> is gone (promote by migrating them into a 1GB folio, demote by migrating
> them out of a 1GB folio) and this can avoid split.
> 
>>
>>
>>> BTW, without split support, you can apply HVO to 1GB folio to save memory.
>>> That is a disadvantage of PUD THP. Have you taken that into consideration?
>>> Basically, switching from hugetlb to PUD THP, you will lose memory due
>>> to vmemmap usage.
>>>
>>
>> Yeah so HVO saves 16M per 1G, and the page table deposit mechanism adds ~2M per 1G.
>> We have HVO enabled in the Meta fleet. I think we should not only think of PUD THP
>> as a replacement for hugetlb, but also as enabling further use cases where hugetlb
>> would not be feasible.
>>
>> After the basic infrastructure for 1G is there, we can work on optimizing. I think
>> there would be a lot of interesting work we can do. HVO for 1G THP would be one
>> of them?
> 
> HVO would prevent folio split, right? Since most of struct pages are mapped
> to the same memory area. You will need to allocate more memory, 16MB, to split
> 1GB. That further decreases the motivation of splitting 1GB.

Yes, that's right.
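
The 16M and ~2M figures quoted above can be reproduced with a small
calculation. This is a sketch under assumptions not stated in the thread: 4K
base pages, a 64-byte struct page, and 512 entries per page table (typical for
x86-64). It also treats the HVO saving as the whole vmemmap of the folio,
ignoring the small number of vmemmap pages that HVO keeps:

```python
PAGE_SIZE = 4096          # assumption: 4K base pages
STRUCT_PAGE_SIZE = 64     # assumption: common x86-64 struct page size
PTRS_PER_TABLE = 512      # assumption: entries per page table on x86-64
MIB, GIB = 1 << 20, 1 << 30

base_pages = GIB // PAGE_SIZE                    # PTEs needed after a full split
vmemmap_bytes = base_pages * STRUCT_PAGE_SIZE    # struct pages describing 1G

# The cover letter deposits 512 PTE tables and one PMD table per PUD THP.
deposited_pte_tables = base_pages // PTRS_PER_TABLE
deposit_bytes = (deposited_pte_tables + 1) * PAGE_SIZE

print(base_pages)
print(vmemmap_bytes / MIB)
print(deposit_bytes / MIB)
```

Under these assumptions a 1G folio needs 262144 PTEs after a split, its
vmemmap is 16M, and the pre-deposited page tables cost 2M plus one extra 4K
page for the PMD table.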

>>
>>>>
>>>> Performance Results
>>>> ===================
>>>>
>>>> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>>>>
>>>> Test: True Random Memory Access [1] test of 4GB memory region with pointer
>>>> chasing workload (4M random pointer dereferences through memory):
>>>>
>>>> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
>>>> |-------------------|---------------|---------------|--------------|
>>>> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
>>>> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>>>>
>>>> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
>>>> For long-running workloads this will be a one-off cost, and the 34%
>>>> improvement in access latency provides significant benefit.
>>>>
>>>> ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a CPU
>>>> bound workload running on a large number of ARM servers (256G). I enabled
>>>> the 512M THP settings to always for 100 servers in production (didn't
>>>> really have high expectations :)). The average memory used for the workload
>>>> increased from 217G to 233G. The amount of memory backed by 512M pages was
>>>> 68G! The dTLB misses went down by 26% and the PID multiplier increased input
>>>> by 5.9% (This is a very significant improvement in workload performance).
>>>> A significant number of these THPs were faulted in at application start when
>>>> were present across different VMAs. Of course getting these 512M pages is
>>>> easier on ARM due to bigger PAGE_SIZE and pageblock order.
>>>>
>>>> I am hoping that these patches for 1G THP can be used to provide similar
>>>> benefits for x86. I expect workloads to fault them in at start time when there
>>>> is plenty of free memory available.
>>>>
>>>>
>>>> Previous attempt by Zi Yan
>>>> ==========================
>>>>
>>>> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
>>>> significant changes in kernel since then, including folio conversion, mTHP
>>>> framework, ptdesc, rmap changes, etc. I found it easier to use the current PMD
>>>> code as reference for making 1G PUD THP work. I am hoping Zi can provide
>>>> guidance on these patches!
>>>
>>> I am more than happy to help you. :)
>>>
>>
>> Thanks!!!
>>
>>>>
>>>> Major Design Decisions
>>>> ======================
>>>>
>>>> 1. No shared 1G zero page: The memory cost would be quite significant!
>>>>
>>>> 2. Page Table Pre-deposit Strategy
>>>>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>>>>    page tables (one for each potential PMD entry after split).
>>>>    We allocate a PMD page table and use its pmd_huge_pte list to store
>>>>    the deposited PTE tables. This ensures split operations don't fail due
>>>>    to page table allocation failures (at the cost of 2M per PUD THP)
>>>>
>>>> 3. Split to Base Pages
>>>>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>>>>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>>>>    to 2M pages and then to 4K pages if needed. However, this would require
>>>>    significant rmap and mapcount tracking changes.
>>>>
>>>> 4. COW and fork handling via split
>>>>    Copy-on-write and fork for PUD THP triggers a split to base pages, then
>>>>    uses existing PTE-level COW infrastructure. Getting another 1G region is
>>>>    hard and could fail. If only a 4K is written, copying 1G is a waste.
>>>>    Probably this should only be done on CoW and not fork?
>>>>
>>>> 5. Migration via split
>>>>    Split PUD to PTEs and migrate individual pages. It is going to be difficult
>>>>    to find 1G of contiguous memory to migrate to. Maybe it's better to not
>>>>    allow migration of PUDs at all? I am more tempted to not allow migration,
>>>>    but have kept splitting in this RFC.
>>>
>>> Without migration, PUD THP loses its flexibility and transparency. But with
>>> its 1GB size, I also wonder what the purpose of PUD THP migration can be.
>>> It does not create memory fragmentation, since it is the largest folio size
>>> we have and contiguous. NUMA balancing 1GB THP seems too much work.
>>
>> Yeah this is exactly what I was thinking as well. It is going to be expensive
>> and difficult to migrate 1G pages, and I am not sure if what we get out of it
>> is worth it? I kept the splitting code in this RFC as I wanted to show that
>> it's possible to split and migrate, and the code to reject migration is a lot easier.
> 
> Got it. Maybe reframing this patchset as 1GB folio support without split or
> migration is better?

I think split support is good to have, for example on CoW, partial unmap and mprotect.
I do agree that migration support seems to have little benefit at a high cost, so
it's simplest to not have it.

> 
>>
>>>
>>> BTW, I posted many questions, but that does not mean I object to the patchset.
>>> I just want to understand your use case better, reduce unnecessary
>>> code changes, and hopefully get it upstreamed this time. :)
>>>
>>> Thank you for the work.
>>>
>>
>> Ah no this is awesome! Thanks for the questions! It's basically the discussion I
>> wanted to start with the RFC.
>>
>>
>> [1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991
> 
> 
> Best Regards,
> Yan, Zi




end of thread, other threads:[~2026-02-07 23:22 UTC | newest]

Thread overview: 52+ messages
2026-02-02  0:50 [RFC 00/12] mm: PUD (1GB) THP implementation Usama Arif
2026-02-02  0:50 ` [RFC 01/12] mm: add PUD THP ptdesc and rmap support Usama Arif
2026-02-02  3:10   ` kernel test robot
2026-02-02 10:44   ` Kiryl Shutsemau
2026-02-02 16:01     ` Zi Yan
2026-02-03 22:07       ` Usama Arif
2026-02-05  4:17         ` Matthew Wilcox
2026-02-05  4:21           ` Matthew Wilcox
2026-02-05  5:13             ` Usama Arif
2026-02-05 17:40               ` David Hildenbrand (Arm)
2026-02-05 18:05                 ` Usama Arif
2026-02-05 18:11                   ` Usama Arif
2026-02-02 12:15   ` Lorenzo Stoakes
2026-02-04  7:38     ` Usama Arif
2026-02-04 12:55       ` Lorenzo Stoakes
2026-02-05  6:40         ` Usama Arif
2026-02-02  0:50 ` [RFC 02/12] mm/thp: add mTHP stats infrastructure for PUD THP Usama Arif
2026-02-02 11:56   ` Lorenzo Stoakes
2026-02-05  5:53     ` Usama Arif
2026-02-02  0:50 ` [RFC 03/12] mm: thp: add PUD THP allocation and fault handling Usama Arif
2026-02-02  0:50 ` [RFC 04/12] mm: thp: implement PUD THP split to PTE level Usama Arif
2026-02-02  0:50 ` [RFC 05/12] mm: thp: add reclaim and migration support for PUD THP Usama Arif
2026-02-02  4:44   ` kernel test robot
2026-02-02  9:12   ` kernel test robot
2026-02-02  0:50 ` [RFC 06/12] selftests/mm: add PUD THP basic allocation test Usama Arif
2026-02-02  0:50 ` [RFC 07/12] selftests/mm: add PUD THP read/write access test Usama Arif
2026-02-02  0:50 ` [RFC 08/12] selftests/mm: add PUD THP fork COW test Usama Arif
2026-02-02  0:50 ` [RFC 09/12] selftests/mm: add PUD THP partial munmap test Usama Arif
2026-02-02  0:50 ` [RFC 10/12] selftests/mm: add PUD THP mprotect split test Usama Arif
2026-02-02  0:50 ` [RFC 11/12] selftests/mm: add PUD THP reclaim test Usama Arif
2026-02-02  0:50 ` [RFC 12/12] selftests/mm: add PUD THP migration test Usama Arif
2026-02-02  2:44 ` [RFC 00/12] mm: PUD (1GB) THP implementation Rik van Riel
2026-02-02 11:30   ` Lorenzo Stoakes
2026-02-02 15:50     ` Zi Yan
2026-02-04 10:56       ` Lorenzo Stoakes
2026-02-05 11:29         ` David Hildenbrand (arm)
2026-02-05 11:22       ` David Hildenbrand (arm)
2026-02-02  4:00 ` Matthew Wilcox
2026-02-02  9:06   ` David Hildenbrand (arm)
2026-02-03 21:11     ` Usama Arif
2026-02-02 11:20 ` Lorenzo Stoakes
2026-02-04  1:00   ` Usama Arif
2026-02-04 11:08     ` Lorenzo Stoakes
2026-02-04 11:50       ` Dev Jain
2026-02-04 12:01         ` Dev Jain
2026-02-05  6:08       ` Usama Arif
2026-02-02 16:24 ` Zi Yan
2026-02-03 23:29   ` Usama Arif
2026-02-04  0:08     ` Frank van der Linden
2026-02-05  5:46       ` Usama Arif
2026-02-05 18:07     ` Zi Yan
2026-02-07 23:22       ` Usama Arif
