Linux-mm Archive on lore.kernel.org
* [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation
@ 2026-06-02 10:10 Muchun Song
  2026-06-02 10:10 ` [PATCH v3 01/19] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages Muchun Song
                   ` (19 more replies)
  0 siblings, 20 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

This series is split out from the earlier larger series "mm: Generalize
HVO for HugeTLB and device DAX" [1]. It collects the first 19 patches of
that series as a standalone set of fixes and preparatory cleanups around
bootmem HugeTLB handling, sparse initialization ordering, and related
vmemmap setup.

The first four patches fix bugs found while reviewing the existing
code: incorrect bootmem HVO handling, wrong vmemmap registration
arguments, a powerpc compound-vmemmap tracking bug, and too-late
initialization of gigantic bootmem HugeTLB struct pages.

The rest of the series reorders early memory initialization so the
relevant zone state is available before sparse and HugeTLB boot-time
setup runs, then simplifies the remaining bootmem gigantic hugepage
allocation path and removes code made obsolete by that rework.

At a high level:
  - patches [1-4] fix boot-time and arch-specific bugs
  - patches [5-12] reorder and simplify sparse/mm/hugetlb early init
  - patches [13-19] refactor bootmem gigantic hugepage allocation and
    remove obsolete helpers and state

Changes since v2:
  - patch 1: add a comment explaining why shared tail pages must be
    initialized from gather_bootmem_prealloc() before
    hugetlb_vmemmap_init() runs
  - patch 1: update the stale sparse-vmemmap comment to point to
    gather_bootmem_prealloc() as the shared-tail initialization site
  - patch 2: collect Acked-by from Oscar Salvador
  - patch 4: rename the helper to hugetlb_bootmem_struct_page_init() to
    make its bootmem-only scope explicit
  - patch 19: fold __init_page_from_nid() into __init_deferred_page()
    instead of only making it static

[1] https://lore.kernel.org/linux-mm/20260513130542.35604-1-songmuchun@bytedance.com/

Muchun Song (19):
  mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages
  mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios()
  powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population
  mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier
  mm/mm_init: Simplify deferred_free_pages() migratetype init
  mm/sparse: Panic on memmap and usemap allocation failure
  mm/sparse: Move subsection_map_init() into sparse_init()
  mm/mm_init: Defer sparse_init() until after zone initialization
  mm/mm_init: Defer hugetlb reservation until after zone initialization
  mm/mm_init: Remove set_pageblock_order() call from sparse_init()
  mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid()
  mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time
  mm/hugetlb: Refactor early boot gigantic hugepage allocation
  mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation
  mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
  mm/hugetlb: Remove obsolete bootmem cross-zone checks
  mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late()
  mm/hugetlb: Remove unused bootmem cma field
  mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page()

 arch/powerpc/mm/book3s64/radix_pgtable.c |   7 +-
 arch/powerpc/mm/hugetlbpage.c            |  13 +-
 include/linux/hugetlb.h                  |  24 +--
 include/linux/mmzone.h                   |   7 -
 mm/cma.c                                 |   3 +-
 mm/hugetlb.c                             | 259 +++++++++++------------
 mm/hugetlb_cma.c                         |  44 ++--
 mm/hugetlb_cma.h                         |   8 +-
 mm/hugetlb_vmemmap.c                     |  94 ++------
 mm/hugetlb_vmemmap.h                     |   5 -
 mm/internal.h                            |  14 +-
 mm/mm_init.c                             |  88 +++-----
 mm/sparse-vmemmap.c                      |  26 ++-
 mm/sparse.c                              |  48 +----
 14 files changed, 241 insertions(+), 399 deletions(-)


base-commit: 08484c504b55a98bd100527fbe10a3caf55ff3ff
-- 
2.54.0




* [PATCH v3 01/19] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 02/19] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios() Muchun Song
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched
HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[].

Those shared tail pages are initialized in hugetlb_vmemmap_init(), but
bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc().
With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access
pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages
already point to the shared tail page. On CONFIG_DEBUG_VM kernels,
get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized
shared tail page and can panic during boot.

Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before
bootmem HugeTLB folios are processed, and drop the later initialization
from hugetlb_vmemmap_init().

This bug only affects CONFIG_DEBUG_VM kernels, where the relevant
assertion is evaluated.

Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
---
v2->v3:
- add a comment explaining why shared tail pages must be initialized from
  gather_bootmem_prealloc() before hugetlb_vmemmap_init() runs (per Oscar
  Salvador)
- update the stale sparse-vmemmap comment to point to gather_bootmem_prealloc()
  as the bootmem HugeTLB shared-tail initialization site (reported by Oscar
  Salvador)
---
 mm/hugetlb.c         | 25 +++++++++++++++++++++++++
 mm/hugetlb_vmemmap.c | 17 -----------------
 mm/sparse-vmemmap.c  |  2 +-
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 571212b80835..cd55524c7e30 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3365,6 +3365,31 @@ static void __init gather_bootmem_prealloc(void)
 		.max_threads	= num_node_state(N_MEMORY),
 		.numa_aware	= true,
 	};
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
+			struct page *tail, *p;
+			unsigned int order;
+
+			tail = zone->vmemmap_tails[i];
+			if (!tail)
+				continue;
+
+			order = i + VMEMMAP_TAIL_MIN_ORDER;
+			p = page_to_virt(tail);
+			/*
+			 * prep_and_add_bootmem_folios() can access pageblock
+			 * flags on bootmem HugeTLB pages, so initialize the
+			 * shared tail struct pages here before bootmem folios
+			 * start using them.
+			 */
+			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
+				init_compound_tail(p + j, NULL, order, zone);
+		}
+	}
+#endif
 
 	padata_do_multithreaded(&job);
 }
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 133b46dfb09f..c713c0d2593a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
 static int __init hugetlb_vmemmap_init(void)
 {
 	const struct hstate *h;
-	struct zone *zone;
 
 	/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
 	BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
 
-	for_each_zone(zone) {
-		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
-			struct page *tail, *p;
-			unsigned int order;
-
-			tail = zone->vmemmap_tails[i];
-			if (!tail)
-				continue;
-
-			order = i + VMEMMAP_TAIL_MIN_ORDER;
-			p = page_to_virt(tail);
-			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
-				init_compound_tail(p + j, NULL, order, zone);
-		}
-	}
-
 	for_each_hstate(h) {
 		if (hugetlb_vmemmap_optimizable(h)) {
 			register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 112ccf9c71ca..8f41b73fb674 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -342,7 +342,7 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
 	 *
 	 * Any initialization done here will be overwritten by memmap_init().
 	 *
-	 * hugetlb_vmemmap_init() will take care of initialization after
+	 * gather_bootmem_prealloc() will take care of initialization after
 	 * memmap_init().
 	 */
 
-- 
2.54.0




* [PATCH v3 02/19] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios()
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
  2026-06-02 10:10 ` [PATCH v3 01/19] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 03/19] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population Muchun Song
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

__hugetlb_vmemmap_optimize_folios() uses incorrect arguments when handling
bootmem HugeTLB folios.

The section number passed to register_page_bootmem_memmap() is derived from
the vmemmap virtual address of folio->page instead of the folio PFN, so the
bootmem memmap metadata can be registered against the wrong section. The
helper is also given HUGETLB_VMEMMAP_RESERVE_SIZE even though it expects a
page count, not a size in bytes. In addition, the write-protect range
adds pages_per_huge_page(h), a page count, to a vmemmap byte address, so
it does not cover the full HugeTLB vmemmap area and can leave part of the
shared tail vmemmap mapping writable.

Fix the section lookup to use folio_pfn(folio), use
HUGETLB_VMEMMAP_RESERVE_PAGES when registering the reserved memmap pages, and
use hugetlb_vmemmap_size(h) for the write-protect range.
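As a rough illustration of the unit mismatch, the userspace sketch below
assumes 4 KiB base pages, a 64-byte struct page and a 2 MiB HugeTLB page.
The toy_* names are hypothetical and are not kernel helpers; they only
restate the arithmetic. Under those assumptions pages_per_huge_page(h) is
512, so the old end address covers just 512 bytes (8 struct pages) of
vmemmap, while hugetlb_vmemmap_size(h) spans 32768 bytes (8 vmemmap pages).

```c
#include <assert.h>
#include <stdio.h>

/* Assumed geometry for this sketch only; real values depend on the config. */
#define TOY_PAGE_SIZE        4096UL
#define TOY_STRUCT_PAGE_SIZE 64UL
#define TOY_HPAGE_SIZE       (2UL << 20)

/* Base pages per HugeTLB page: what pages_per_huge_page(h) counts. */
static unsigned long toy_pages_per_huge_page(void)
{
	return TOY_HPAGE_SIZE / TOY_PAGE_SIZE;
}

/* Bytes of struct page array describing one HugeTLB page. */
static unsigned long toy_vmemmap_size(void)
{
	return toy_pages_per_huge_page() * TOY_STRUCT_PAGE_SIZE;
}

/* Old code: a page count added to a vmemmap byte address. */
static unsigned long toy_old_wrprotect_end(unsigned long start)
{
	return start + toy_pages_per_huge_page();
}

/* Fixed code: the full vmemmap size in bytes added to the start address. */
static unsigned long toy_new_wrprotect_end(unsigned long start)
{
	return start + toy_vmemmap_size();
}

/* Print how many struct pages each write-protect range actually covers. */
static void toy_report(unsigned long start)
{
	printf("old range: %lu struct pages, new range: %lu struct pages\n",
	       (toy_old_wrprotect_end(start) - start) / TOY_STRUCT_PAGE_SIZE,
	       (toy_new_wrprotect_end(start) - start) / TOY_STRUCT_PAGE_SIZE);
}
```

For any start address, the old end covers 8 struct pages and the new end
covers all 512 struct pages of the HugeTLB page.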

Fixes: 752fe17af693 ("mm/hugetlb: add pre-HVO framework")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
---
v2->v3:
- collect Acked-by from Oscar Salvador
---
 mm/hugetlb_vmemmap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c713c0d2593a..ea6af85bfec1 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -635,12 +635,12 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 			 * mirrored tail page structs RO.
 			 */
 			spfn = (unsigned long)&folio->page;
-			epfn = spfn + pages_per_huge_page(h);
+			epfn = spfn + hugetlb_vmemmap_size(h);
 			vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
 					HUGETLB_VMEMMAP_RESERVE_SIZE);
-			register_page_bootmem_memmap(pfn_to_section_nr(spfn),
+			register_page_bootmem_memmap(pfn_to_section_nr(folio_pfn(folio)),
 					&folio->page,
-					HUGETLB_VMEMMAP_RESERVE_SIZE);
+					HUGETLB_VMEMMAP_RESERVE_PAGES);
 			continue;
 		}
 
-- 
2.54.0




* [PATCH v3 03/19] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
  2026-06-02 10:10 ` [PATCH v3 01/19] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages Muchun Song
  2026-06-02 10:10 ` [PATCH v3 02/19] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios() Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-03 14:36   ` Ritesh Harjani
  2026-06-02 10:10 ` [PATCH v3 04/19] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier Muchun Song
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

vmemmap_populate_compound_pages() uses addr_pfn to determine the PFN
offset within a compound page and to decide whether the current
vmemmap slot should be populated as a head page mapping or should reuse
a tail page mapping.

However, addr_pfn is advanced manually in parallel with addr.  The loop
itself progresses in vmemmap address space, so each PAGE_SIZE step in
addr covers PAGE_SIZE / sizeof(struct page) struct page slots.  Since
addr_pfn is compared against nr_pages in data-PFN units, it should
advance by the same number of PFNs.  The existing manual increments
(1 or 2 per vmemmap page, and PMD_SIZE >> PAGE_SHIFT for a skipped PMD
mapping) do not match that, so addr_pfn drifts away from the PFN that
corresponds to the current addr.

As a result, pfn_offset can be computed from the wrong PFN and the code
can make the head/tail decision for the wrong compound-page position.

Fix this by deriving addr_pfn directly from the current vmemmap address
instead of carrying it as loop state.
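The sketch below models the PFN derivation with a linear vmemmap whose
base is a hypothetical TOY_VMEMMAP_START, assuming 4 KiB pages and a
64-byte struct page. In that model page_to_pfn() on a vmemmap address
reduces to (addr - base) / sizeof(struct page). It only shows why one
PAGE_SIZE step in addr corresponds to 64 PFNs rather than 1; it does not
reproduce the powerpc code, and all toy_* names are illustrative.

```c
#include <assert.h>

/* Assumed geometry and a hypothetical vmemmap base for this sketch only. */
#define TOY_PAGE_SIZE        4096UL
#define TOY_STRUCT_PAGE_SIZE 64UL
#define TOY_VMEMMAP_START    0x40000000UL

/* Models page_to_pfn((struct page *)addr) for a linear vmemmap. */
static unsigned long toy_vmemmap_addr_to_pfn(unsigned long addr)
{
	return (addr - TOY_VMEMMAP_START) / TOY_STRUCT_PAGE_SIZE;
}

/* PFNs described by one PAGE_SIZE page of vmemmap. */
static unsigned long toy_pfns_per_vmemmap_page(void)
{
	return TOY_PAGE_SIZE / TOY_STRUCT_PAGE_SIZE;
}

/* Old style: bump a separate counter by 1 per vmemmap page stepped over. */
static unsigned long toy_old_addr_pfn(unsigned long start_pfn,
				      unsigned long vmemmap_pages_stepped)
{
	return start_pfn + vmemmap_pages_stepped;
}

/*
 * Offset of the PFN behind addr inside its compound page, like
 * addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages); nr_pages must be a power of two.
 */
static unsigned long toy_pfn_offset(unsigned long addr, unsigned long nr_pages)
{
	unsigned long pfn = toy_vmemmap_addr_to_pfn(addr);

	return pfn - (pfn & ~(nr_pages - 1));
}
```

After three vmemmap pages the derived PFN is 192 while the old counter
says 3, and eight vmemmap pages land exactly on the head of the next
512-page compound page (offset 0), which the old counter would report as
offset 8.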

Fixes: f2b79c0d7968 ("powerpc/book3s64/radix: add support for vmemmap optimization for radix")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
---
v2->v3:
- collect Acked-by from Oscar Salvador
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 10aced261cff..cf692b2b5f7b 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1314,7 +1314,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	 * covering out both edges.
 	 */
 	unsigned long addr;
-	unsigned long addr_pfn = start_pfn;
 	unsigned long next;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -1335,7 +1334,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 
 		if (pmd_leaf(READ_ONCE(*pmd))) {
 			/* existing huge mapping. Skip the range */
-			addr_pfn += (PMD_SIZE >> PAGE_SHIFT);
 			next = pmd_addr_end(addr, end);
 			continue;
 		}
@@ -1348,11 +1346,11 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 			 * page whose VMEMMAP_RESERVE_NR pages were mapped and
 			 * this request fall in those pages.
 			 */
-			addr_pfn += 1;
 			next = addr + PAGE_SIZE;
 			continue;
 		} else {
 			unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
+			unsigned long addr_pfn = page_to_pfn((struct page *)addr);
 			unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages);
 			pte_t *tail_page_pte;
 
@@ -1376,7 +1374,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 				if (!pte)
 					return -ENOMEM;
 
-				addr_pfn += 2;
 				next = addr + 2 * PAGE_SIZE;
 				continue;
 			}
@@ -1392,7 +1389,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 					return -ENOMEM;
 				vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 
-				addr_pfn += 1;
 				next = addr + PAGE_SIZE;
 				continue;
 			}
@@ -1402,7 +1398,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 				return -ENOMEM;
 			vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 
-			addr_pfn += 1;
 			next = addr + PAGE_SIZE;
 			continue;
 		}
-- 
2.54.0




* [PATCH v3 04/19] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (2 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 03/19] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 05/19] mm/mm_init: Simplify deferred_free_pages() migratetype init Muchun Song
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

Gigantic bootmem HugeTLB pages are currently initialized from hugetlb_init(),
but page_alloc_init_late() runs earlier and walks pageblocks to determine
zone contiguity.

If a bootmem HugeTLB region is marked noinit, set_zone_contiguous() can
observe still-uninitialized struct pages through __pageblock_pfn_to_page().
This may not trigger an immediate failure, but it can make
set_zone_contiguous() compute the wrong zone contiguity state. If extra
poisoned-page checks are added in this path, such as PF_POISONED_CHECK()
in page_zone_id(), it can also trigger an early boot panic.

Initialize gigantic bootmem HugeTLB struct pages from page_alloc_init_late(),
before zone contiguity is evaluated, so later page allocator setup only
sees valid struct page state. This also makes the initialization order
more natural, as struct pages should be initialized before later code
inspects them.
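A toy userspace model of the ordering problem follows. It is loosely
patterned on set_zone_contiguous() checking the first struct page of
each pageblock, and it assumes never-initialized struct pages read back
as a 0xff poison pattern. The array size, pageblock size and toy_* names
are hypothetical; real zone checks go through __pageblock_pfn_to_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NR_PAGES        64
#define TOY_PAGEBLOCK_PAGES 8
#define TOY_POISON          0xff  /* stand-in for poisoned memmap bytes */

struct toy_page {
	unsigned char zone_id;  /* stand-in for the zone bits in page->flags */
};

static struct toy_page toy_memmap[TOY_NR_PAGES];

/* Every struct page starts out poisoned (never initialized). */
static void toy_poison_all(void)
{
	memset(toy_memmap, TOY_POISON, sizeof(toy_memmap));
}

/* Initialize struct pages [start, end) as belonging to zone_id. */
static void toy_init_range(unsigned long start, unsigned long end,
			   unsigned char zone_id)
{
	for (unsigned long pfn = start; pfn < end; pfn++)
		toy_memmap[pfn].zone_id = zone_id;
}

/* Zone is "contiguous" only if each pageblock's first page is in the zone. */
static bool toy_zone_contiguous(unsigned char zone_id)
{
	for (unsigned long pfn = 0; pfn < TOY_NR_PAGES; pfn += TOY_PAGEBLOCK_PAGES)
		if (toy_memmap[pfn].zone_id != zone_id)
			return false;
	return true;
}
```

If only the regular range is initialized and the noinit bootmem HugeTLB
range is still poisoned, the check reports a non-contiguous zone;
initializing the HugeTLB struct pages first makes it report the right
answer.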

Fixes: fde1c4ecf916 ("mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
---
v2->v3:
- rename the helper to hugetlb_bootmem_struct_page_init() to make the
  bootmem-only scope explicit (per Oscar Salvador)
---
 include/linux/hugetlb.h | 5 +++++
 mm/hugetlb.c            | 5 ++---
 mm/mm_init.c            | 1 +
 mm/sparse-vmemmap.c     | 4 ++--
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2abaf99321e9..3700c0a1f6ff 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -171,6 +171,7 @@ extern int movable_gigantic_pages __read_mostly;
 extern int sysctl_hugetlb_shm_group __read_mostly;
 extern struct list_head huge_boot_pages[MAX_NUMNODES];
 
+void hugetlb_bootmem_struct_page_init(void);
 void hugetlb_bootmem_alloc(void);
 extern nodemask_t hugetlb_bootmem_nodes;
 void hugetlb_bootmem_set_nodes(void);
@@ -1293,6 +1294,10 @@ static inline bool hugetlbfs_pagecache_present(
 static inline void hugetlb_bootmem_alloc(void)
 {
 }
+
+static inline void hugetlb_bootmem_struct_page_init(void)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cd55524c7e30..2bf9fe16abb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3353,7 +3353,7 @@ static void __init gather_bootmem_prealloc_parallel(unsigned long start,
 		gather_bootmem_prealloc_node(nid);
 }
 
-static void __init gather_bootmem_prealloc(void)
+void __init hugetlb_bootmem_struct_page_init(void)
 {
 	struct padata_mt_job job = {
 		.thread_fn	= gather_bootmem_prealloc_parallel,
@@ -3582,7 +3582,7 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
  * - For gigantic pages, this is called early in the boot process and
  *   pages are allocated from memblock allocated or something similar.
  *   Gigantic pages are actually added to pools later with the routine
- *   gather_bootmem_prealloc.
+ *   hugetlb_bootmem_struct_page_init.
  * - For non-gigantic pages, this is called later in the boot process after
  *   all of mm is up and functional.  Pages are allocated from buddy and
  *   then added to hugetlb pools.
@@ -4152,7 +4152,6 @@ static int __init hugetlb_init(void)
 	}
 
 	hugetlb_init_hstates();
-	gather_bootmem_prealloc();
 	report_hugepages();
 
 	hugetlb_sysfs_init();
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 6de3a77eb9ae..1890bda948b8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2338,6 +2338,7 @@ void __init page_alloc_init_late(void)
 	/* Reinit limits that are based on free pages after the kernel is up */
 	files_maxfiles_init();
 #endif
+	hugetlb_bootmem_struct_page_init();
 
 	/* Accounting of total+free memory is stable at this point. */
 	mem_init_print_info();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8f41b73fb674..db9cfe57e827 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -342,8 +342,8 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
 	 *
 	 * Any initialization done here will be overwritten by memmap_init().
 	 *
-	 * gather_bootmem_prealloc() will take care of initialization after
-	 * memmap_init().
+	 * hugetlb_bootmem_struct_page_init() will take care of initialization
+	 * after memmap_init().
 	 */
 
 	p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-- 
2.54.0




* [PATCH v3 05/19] mm/mm_init: Simplify deferred_free_pages() migratetype init
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (3 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 04/19] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 06/19] mm/sparse: Panic on memmap and usemap allocation failure Muchun Song
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

deferred_free_pages() open-codes two loops to initialize the pageblock
migratetype for a range of pages.

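Replace them with pageblock_migratetype_init_range() to remove the
duplication and make the code clearer. Since deferred_free_pages() may
be called from atomic context, give the helper an atomic argument so it
skips cond_resched() in that case, and build it for
CONFIG_DEFERRED_STRUCT_PAGE_INIT as well as CONFIG_ZONE_DEVICE.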
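The helper's loop can be modeled in userspace as below. The structure
(start at the first pageblock-aligned PFN, step by pageblock_nr_pages,
yield only on section boundaries when !atomic) follows the diff; the
toy_* counters and the 512-page pageblock and 32768-page section sizes
are assumptions for illustration. toy_pageblock_align() rounds up to the
next pageblock boundary, matching the pageblock_aligned() check in the
old per-page loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed geometry for this sketch only. */
#define TOY_PAGEBLOCK_NR_PAGES 512UL
#define TOY_PAGES_PER_SECTION  32768UL

static unsigned long toy_blocks_initialized;
static unsigned long toy_resched_calls;

/* Stand-ins that just count calls. */
static void toy_init_pageblock_migratetype(unsigned long pfn)
{
	(void)pfn;
	toy_blocks_initialized++;
}

static void toy_cond_resched(void)
{
	toy_resched_calls++;
}

/* Round pfn up to the next pageblock boundary. */
static unsigned long toy_pageblock_align(unsigned long pfn)
{
	return (pfn + TOY_PAGEBLOCK_NR_PAGES - 1) & ~(TOY_PAGEBLOCK_NR_PAGES - 1);
}

static void toy_migratetype_init_range(unsigned long pfn,
				       unsigned long nr_pages, bool atomic)
{
	const unsigned long end = pfn + nr_pages;

	for (pfn = toy_pageblock_align(pfn); pfn < end;
	     pfn += TOY_PAGEBLOCK_NR_PAGES) {
		toy_init_pageblock_migratetype(pfn);
		/* Only yield when the caller is allowed to sleep. */
		if (!atomic && pfn % TOY_PAGES_PER_SECTION == 0)
			toy_cond_resched();
	}
}
```

A naturally aligned 1024-page chunk in atomic mode touches two pageblocks
and never yields, while a non-atomic 65536-page range touches 128
pageblocks and yields once per section.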

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
---
v2->v3:
- collect Acked-by from Oscar Salvador
---
 mm/mm_init.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 1890bda948b8..be652b6990a2 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -674,15 +674,15 @@ static inline void fixup_hashdist(void)
 static inline void fixup_hashdist(void) {}
 #endif /* CONFIG_NUMA */
 
-#ifdef CONFIG_ZONE_DEVICE
+#if defined(CONFIG_ZONE_DEVICE) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
 static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
-		unsigned long nr_pages, int migratetype)
+		unsigned long nr_pages, int migratetype, bool atomic)
 {
 	const unsigned long end = pfn + nr_pages;
 
 	for (pfn = pageblock_align(pfn); pfn < end; pfn += pageblock_nr_pages) {
 		init_pageblock_migratetype(pfn_to_page(pfn), migratetype, false);
-		if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
+		if (!atomic && IS_ALIGNED(pfn, PAGES_PER_SECTION))
 			cond_resched();
 	}
 }
@@ -1142,7 +1142,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 				     compound_nr_pages(pfn, altmap, pgmap));
 	}
 
-	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);
+	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false);
 
 	pr_debug("%s initialised %lu pages in %ums\n", __func__,
 		nr_pages, jiffies_to_msecs(jiffies - start));
@@ -1996,12 +1996,12 @@ static void __init deferred_free_pages(unsigned long pfn,
 	if (!nr_pages)
 		return;
 
+	pageblock_migratetype_init_range(pfn, nr_pages, mt, true);
+
 	page = pfn_to_page(pfn);
 
 	/* Free a large naturally-aligned chunk if possible */
 	if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
-		for (i = 0; i < nr_pages; i += pageblock_nr_pages)
-			init_pageblock_migratetype(page + i, mt, false);
 		__free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
 		return;
 	}
@@ -2009,11 +2009,8 @@ static void __init deferred_free_pages(unsigned long pfn,
 	/* Accept chunks smaller than MAX_PAGE_ORDER upfront */
 	accept_memory(PFN_PHYS(pfn), nr_pages * PAGE_SIZE);
 
-	for (i = 0; i < nr_pages; i++, page++, pfn++) {
-		if (pageblock_aligned(pfn))
-			init_pageblock_migratetype(page, mt, false);
-		__free_pages_core(page, 0, MEMINIT_EARLY);
-	}
+	for (i = 0; i < nr_pages; i++)
+		__free_pages_core(page + i, 0, MEMINIT_EARLY);
 }
 
 /* Completion tracking for deferred_init_memmap() threads */
-- 
2.54.0




* [PATCH v3 06/19] mm/sparse: Panic on memmap and usemap allocation failure
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (4 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 05/19] mm/mm_init: Simplify deferred_free_pages() migratetype init Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 07/19] mm/sparse: Move subsection_map_init() into sparse_init() Muchun Song
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

When memmap or usemap allocation fails, sparse_init_nid() currently
marks the section non-present and continues. Later boot-time code can
still walk PFNs in that section without checking for this partial setup,
which leads to invalid accesses. subsection_map_init() can also touch an
unallocated usemap.

Auditing and fixing all early PFN walkers for this case is not worth the
complexity. These allocation failures are expected to be fatal anyway,
and other memory models already treat them that way.

Make memmap and usemap allocation failures panic immediately instead of
trying to recover and crashing later in less obvious ways.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
---
v2->v3:
- collect Acked-by from Oscar Salvador
---
 mm/sparse.c | 44 +++++++++-----------------------------------
 1 file changed, 9 insertions(+), 35 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index 16ac6df3c89f..c92bbc3f3aa3 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -239,15 +239,8 @@ struct page __init *__populate_section_memmap(unsigned long pfn,
 		struct dev_pagemap *pgmap)
 {
 	unsigned long size = section_map_size();
-	struct page *map;
-	phys_addr_t addr = __pa(MAX_DMA_ADDRESS);
 
-	map = memmap_alloc(size, size, addr, nid, false);
-	if (!map)
-		panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%pa\n",
-		      __func__, size, PAGE_SIZE, nid, &addr);
-
-	return map;
+	return memmap_alloc(size, size, __pa(MAX_DMA_ADDRESS), nid, false);
 }
 #endif /* !CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -300,17 +293,14 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 				   unsigned long map_count)
 {
 	unsigned long pnum;
-	struct page *map;
-	struct mem_section *ms;
 
-	if (sparse_usage_init(nid, map_count)) {
-		pr_err("%s: node[%d] usemap allocation failed", __func__, nid);
-		goto failed;
-	}
+	if (sparse_usage_init(nid, map_count))
+		panic("Failed to allocate usemap for node %d\n", nid);
 
 	sparse_vmemmap_init_nid_early(nid);
 
 	for_each_present_section_nr(pnum_begin, pnum) {
+		struct mem_section *ms;
 		unsigned long pfn = section_nr_to_pfn(pnum);
 
 		if (pnum >= pnum_end)
@@ -318,34 +308,18 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 
 		ms = __nr_to_section(pnum);
 		if (!preinited_vmemmap_section(ms)) {
+			struct page *map;
+
 			map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-					nid, NULL, NULL);
-			if (!map) {
-				pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
-				       __func__, nid);
-				pnum_begin = pnum;
-				sparse_usage_fini();
-				goto failed;
-			}
+							nid, NULL, NULL);
+			if (!map)
+				panic("Failed to allocate memmap for section %lu\n", pnum);
 			memmap_boot_pages_add(DIV_ROUND_UP(PAGES_PER_SECTION * sizeof(struct page),
 							   PAGE_SIZE));
 			sparse_init_early_section(nid, map, pnum, 0);
 		}
 	}
 	sparse_usage_fini();
-	return;
-failed:
-	/*
-	 * We failed to allocate, mark all the following pnums as not present,
-	 * except the ones already initialized earlier.
-	 */
-	for_each_present_section_nr(pnum_begin, pnum) {
-		if (pnum >= pnum_end)
-			break;
-		ms = __nr_to_section(pnum);
-		if (!preinited_vmemmap_section(ms))
-			ms->section_mem_map = 0;
-	}
 }
 
 /*
-- 
2.54.0




* [PATCH v3 07/19] mm/sparse: Move subsection_map_init() into sparse_init()
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (5 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 06/19] mm/sparse: Panic on memmap and usemap allocation failure Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 08/19] mm/mm_init: Defer sparse_init() until after zone initialization Muchun Song
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

subsection_map_init() is part of sparse memory initialization, but it is
currently called from free_area_init().

Move it into sparse_init() so the sparse-specific setup stays together
instead of being split across the generic free_area_init() path.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
---
v2->v3:
- collect Acked-by from Oscar Salvador
---
 mm/internal.h       |  5 ++---
 mm/mm_init.c        | 10 ++--------
 mm/sparse-vmemmap.c | 11 ++++++++++-
 mm/sparse.c         |  1 +
 4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5602393054f3..e71ba519f7f2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -994,10 +994,9 @@ static inline void sparse_init(void) {}
  * mm/sparse-vmemmap.c
  */
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-void sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages);
+void sparse_init_subsection_map(void);
 #else
-static inline void sparse_init_subsection_map(unsigned long pfn,
-		unsigned long nr_pages)
+static inline void sparse_init_subsection_map(void)
 {
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index be652b6990a2..3a57bf5a9b46 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1879,18 +1879,12 @@ static void __init free_area_init(void)
 			       (u64)zone_movable_pfn[i] << PAGE_SHIFT);
 	}
 
-	/*
-	 * Print out the early node map, and initialize the
-	 * subsection-map relative to active online memory ranges to
-	 * enable future "sub-section" extensions of the memory map.
-	 */
+	/* Print out the early node map. */
 	pr_info("Early memory node ranges\n");
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
 		pr_info("  node %3d: [mem %#018Lx-%#018Lx]\n", nid,
 			(u64)start_pfn << PAGE_SHIFT,
 			((u64)end_pfn << PAGE_SHIFT) - 1);
-		sparse_init_subsection_map(start_pfn, end_pfn - start_pfn);
-	}
 
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index db9cfe57e827..3b036251a2f4 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -596,7 +596,7 @@ static void subsection_mask_set(unsigned long *map, unsigned long pfn,
 	bitmap_set(map, idx, end - idx + 1);
 }
 
-void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages)
+static void __init sparse_init_subsection_map_range(unsigned long pfn, unsigned long nr_pages)
 {
 	int end_sec_nr = pfn_to_section_nr(pfn + nr_pages - 1);
 	unsigned long nr, start_sec_nr = pfn_to_section_nr(pfn);
@@ -619,6 +619,15 @@ void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages
 	}
 }
 
+void __init sparse_init_subsection_map(void)
+{
+	int i, nid;
+	unsigned long start, end;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
+		sparse_init_subsection_map_range(start, end - start);
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 
 /* Mark all memory sections within the pfn range as online */
diff --git a/mm/sparse.c b/mm/sparse.c
index c92bbc3f3aa3..85557ef387c7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -361,5 +361,6 @@ void __init sparse_init(void)
 	}
 	/* cover the last node */
 	sparse_init_nid(nid_begin, pnum_begin, pnum_end, map_count);
+	sparse_init_subsection_map();
 	vmemmap_populate_print_last();
 }
-- 
2.54.0




* [PATCH v3 08/19] mm/mm_init: Defer sparse_init() until after zone initialization
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (6 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 07/19] mm/sparse: Move subsection_map_init() into sparse_init() Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 09/19] mm/mm_init: Defer hugetlb reservation " Muchun Song
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song,
	Oscar Salvador (SUSE)

free_area_init() is responsible for initializing pgdat and zone state.
Calling sparse_init() from there mixes in later vmemmap and struct page
setup, which makes the initialization flow less clear.

Move the sparse_init(), sparse_vmemmap_init_nid_late(), and memmap_init()
calls out of free_area_init() and into mm_core_init_early(), after
free_area_init() returns and zone initialization is complete. This keeps
free_area_init() focused on zone setup and ensures that sparse_init()
runs with the relevant zone state already available.

This is also a prerequisite for later hugetlb vmemmap changes that need
zone information during early sparse vmemmap setup.
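
The resulting call order can be sketched in userspace C. The functions
below are stubs that only share names with the kernel functions; they
record the order in which they run, and the zones_ready flag is a
hypothetical marker for "zone state initialized". The
hugetlb_cma_reserve() and hugetlb_bootmem_alloc() calls that still
precede free_area_init() at this point are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stubs named after the kernel functions; they only record call order. */
static bool zones_ready;
static const char *trace[16];
static int ntrace;

static void record(const char *stage)
{
	assert(ntrace < 16);
	trace[ntrace++] = stage;
}

static bool trace_is(int i, const char *stage)
{
	return i < ntrace && strcmp(trace[i], stage) == 0;
}

static void free_area_init(void)
{
	zones_ready = true;		/* pgdat and zone state only */
	record("free_area_init");
}

static void sparse_init(void)
{
	assert(zones_ready);		/* the ordering this patch provides */
	record("sparse_init");
}

static void sparse_vmemmap_init_nid_late(int nid)
{
	(void)nid;
	record("sparse_vmemmap_init_nid_late");
}

static void memmap_init(void)
{
	assert(zones_ready);
	record("memmap_init");
}

/* New order: zone setup first, then sparse and memmap setup. */
static void mm_core_init_early(int nr_memory_nodes)
{
	free_area_init();
	sparse_init();
	for (int nid = 0; nid < nr_memory_nodes; nid++)
		sparse_vmemmap_init_nid_late(nid);
	memmap_init();
}
```

With two memory nodes, the trace is free_area_init, sparse_init, two
sparse_vmemmap_init_nid_late entries, and memmap_init; the asserts in
the stubs would fire if sparse or memmap setup ran before zone setup.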

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
---
v2->v3:
- collect Reviewed-by from Oscar Salvador
---
 mm/mm_init.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 3a57bf5a9b46..f349a6f34139 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1829,7 +1829,6 @@ static void __init free_area_init(void)
 	bool descending;
 
 	arch_zone_limits_init(max_zone_pfn);
-	sparse_init();
 
 	start_pfn = PHYS_PFN(memblock_start_of_DRAM());
 	descending = arch_has_descending_max_zone_pfns();
@@ -1918,11 +1917,7 @@ static void __init free_area_init(void)
 		}
 	}
 
-	for_each_node_state(nid, N_MEMORY)
-		sparse_vmemmap_init_nid_late(nid);
-
 	calc_nr_kernel_pages();
-	memmap_init();
 
 	/* disable hash distribution for systems with a single node */
 	fixup_hashdist();
@@ -2694,10 +2689,17 @@ void __init __weak mem_init(void)
 
 void __init mm_core_init_early(void)
 {
+	int nid;
+
 	hugetlb_cma_reserve();
 	hugetlb_bootmem_alloc();
 
 	free_area_init();
+
+	sparse_init();
+	for_each_node_state(nid, N_MEMORY)
+		sparse_vmemmap_init_nid_late(nid);
+	memmap_init();
 }
 
 /*
-- 
2.54.0




* [PATCH v3 09/19] mm/mm_init: Defer hugetlb reservation until after zone initialization
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (7 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 08/19] mm/mm_init: Defer sparse_init() until after zone initialization Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 10/19] mm/mm_init: Remove set_pageblock_order() call from sparse_init() Muchun Song
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song,
	Oscar Salvador (SUSE)

hugetlb_cma_reserve() and hugetlb_bootmem_alloc() currently run before
free_area_init(), so HugeTLB reservation happens before zone state is
initialized.

Move the reservation step to after free_area_init() so that the relevant
zone information is available before HugeTLB reserves memory. Later
hugetlb changes need this to validate boot-time HugeTLB reservations
against zone boundaries.
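
The kind of zone-boundary check this enables can be sketched as follows,
in the spirit of pfn_range_intersects_zones(), which later patches in
this series call. The zone layout, the struct zone_span type and the
helper names are hypothetical; the sketch also treats a range that
overlaps no zone as invalid. Either way, the check is meaningless until
zone spans are known, which is why the reservation has to run after
free_area_init().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone spans of one node, in PFNs, each [start, end). */
struct zone_span { unsigned long start, end; };

static const struct zone_span node_zones[] = {
	{ 0x1000,   0x100000 },	/* e.g. DMA32: up to 4 GiB with 4 KiB pages */
	{ 0x100000, 0x440000 },	/* e.g. Normal */
};

static bool span_overlaps(const struct zone_span *z,
			  unsigned long pfn, unsigned long nr_pages)
{
	return pfn < z->end && pfn + nr_pages > z->start;
}

/*
 * Returns true if [pfn, pfn + nr_pages) cannot be placed in a single zone:
 * it overlaps two or more zones, or (in this sketch) no zone at all.
 */
static bool range_crosses_zones(unsigned long pfn, unsigned long nr_pages)
{
	size_t i, hits = 0;

	for (i = 0; i < sizeof(node_zones) / sizeof(node_zones[0]); i++)
		if (span_overlaps(&node_zones[i], pfn, nr_pages))
			hits++;
	return hits != 1;
}
```

For a 1 GiB page (0x40000 PFNs with 4 KiB pages), a candidate at PFN
0xE0000 straddles the 4 GiB DMA32/Normal boundary and is rejected, while
one at 0xC0000 ends exactly at the boundary and stays inside DMA32.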

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
---
v2->v3:
- collect Reviewed-by from Mike Rapoport
- collect Reviewed-by from Oscar Salvador
---
 mm/mm_init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index f349a6f34139..4601e5d659eb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2691,11 +2691,11 @@ void __init mm_core_init_early(void)
 {
 	int nid;
 
+	free_area_init();
+
 	hugetlb_cma_reserve();
 	hugetlb_bootmem_alloc();
 
-	free_area_init();
-
 	sparse_init();
 	for_each_node_state(nid, N_MEMORY)
 		sparse_vmemmap_init_nid_late(nid);
-- 
2.54.0




* [PATCH v3 10/19] mm/mm_init: Remove set_pageblock_order() call from sparse_init()
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (8 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 09/19] mm/mm_init: Defer hugetlb reservation " Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 11/19] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

With CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, free_area_init() already sets
pageblock_order, and free_area_init() now runs before sparse_init(), so
sparse_init() does not need to call set_pageblock_order() again.

With that call removed, set_pageblock_order() is only used in
mm/mm_init.c, so make it static.
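
A userspace sketch of why a single call suffices: the setter clamps
pageblock_order to the huge page order, and sparse_init() merely relies
on the value already being set. PAGE_BLOCK_MAX_ORDER and
hugetlb_page_order use assumed values here, and the early-return guard
and the set_calls counter are assumptions of the sketch rather than a
copy of the kernel function.

```c
#include <assert.h>

/* Assumed values; in the kernel these depend on Kconfig and the arch. */
#define PAGE_BLOCK_MAX_ORDER 10U
static unsigned int hugetlb_page_order = 8;	/* may be set at boot */

static unsigned int pageblock_order;		/* 0: not set up yet */
static unsigned int set_calls;			/* counts setter invocations */

/* Sketch of an idempotent setter: clamp to the huge page order, once. */
static void set_pageblock_order(void)
{
	unsigned int order = PAGE_BLOCK_MAX_ORDER;

	set_calls++;
	if (pageblock_order)
		return;
	if (hugetlb_page_order < order)
		order = hugetlb_page_order;
	pageblock_order = order;
}

/* Boot flow after this patch: only free_area_init() calls the setter. */
static void free_area_init(void) { set_pageblock_order(); }
static void sparse_init(void)    { assert(pageblock_order != 0); }
```

Running free_area_init() and then sparse_init() leaves pageblock_order
at 8 after exactly one setter call.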

Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@suse.de>
---
v2->v3:
- collect Reviewed-by from Oscar Salvador
---
 mm/internal.h | 1 -
 mm/mm_init.c  | 4 ++--
 mm/sparse.c   | 3 ---
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e71ba519f7f2..004a3f1d5006 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1435,7 +1435,6 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
         unsigned long, unsigned long);
 
-extern void set_pageblock_order(void);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					    struct list_head *folio_list);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4601e5d659eb..44512f3b3544 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1511,7 +1511,7 @@ static inline void setup_usemap(struct zone *zone) {}
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
 
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
-void __init set_pageblock_order(void)
+static void __init set_pageblock_order(void)
 {
 	unsigned int order = PAGE_BLOCK_MAX_ORDER;
 
@@ -1537,7 +1537,7 @@ void __init set_pageblock_order(void)
  * include/linux/pageblock-flags.h for the values of pageblock_order based on
  * the kernel config
  */
-void __init set_pageblock_order(void)
+static inline void __init set_pageblock_order(void)
 {
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 85557ef387c7..324213d8bdcb 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -343,9 +343,6 @@ void __init sparse_init(void)
 	pnum_begin = first_present_section_nr();
 	nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
 
-	/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
-	set_pageblock_order();
-
 	for_each_present_section_nr(pnum_begin + 1, pnum_end) {
 		int nid = sparse_early_nid(__nr_to_section(pnum_end));
 
-- 
2.54.0




* [PATCH v3 11/19] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid()
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (9 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 10/19] mm/mm_init: Remove set_pageblock_order() call from sparse_init() Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 12/19] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time Muchun Song
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song,
	Oscar Salvador (SUSE)

sparse_vmemmap_init_nid_late() is still called separately from
mm_core_init_early(), away from the rest of the sparse initialization
path.

Now that sparse_init() runs after zone initialization, call
sparse_vmemmap_init_nid_late() from sparse_init_nid() instead. This
keeps both sparse_vmemmap_init_nid_early() and
sparse_vmemmap_init_nid_late() in the sparse setup path.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
---
v2->v3:
- collect Reviewed-by from Oscar Salvador
---
 mm/mm_init.c | 4 ----
 mm/sparse.c  | 1 +
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 44512f3b3544..41b83dd18c01 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2689,16 +2689,12 @@ void __init __weak mem_init(void)
 
 void __init mm_core_init_early(void)
 {
-	int nid;
-
 	free_area_init();
 
 	hugetlb_cma_reserve();
 	hugetlb_bootmem_alloc();
 
 	sparse_init();
-	for_each_node_state(nid, N_MEMORY)
-		sparse_vmemmap_init_nid_late(nid);
 	memmap_init();
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 324213d8bdcb..3917a47153d8 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -320,6 +320,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 		}
 	}
 	sparse_usage_fini();
+	sparse_vmemmap_init_nid_late(nid);
 }
 
 /*
-- 
2.54.0




* [PATCH v3 12/19] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (10 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 11/19] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 13/19] mm/hugetlb: Refactor early boot gigantic hugepage allocation Muchun Song
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

Hugetlb CMA allocation currently has to cope with CMA areas that span
multiple zones.

Validate the reserved CMA range up front in hugetlb_cma_reserve() so
later hugetlb CMA allocations can assume a zone-consistent area. If a
node's area spans zones, warn and do not use it for hugetlb.

Also drop the pfn_valid() check from cma_validate_zones(). Because
hugetlb_cma_reserve() now runs before sparse_init(), mem_section is not
yet fully initialized when it calls cma_validate_zones(), so the check
can trigger false warnings. Move the sanity check to cma_activate_area()
instead.
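
The reserve-then-validate pattern in hugetlb_cma_reserve() can be
sketched in userspace C. struct cma_area, declare_area(), the per-node
outcome tables and reserve_all_nodes() are hypothetical stand-ins for
cma_declare_contiguous_multi() and cma_validate_zones(); the point is
that a node whose area fails either step ends up with a NULL pointer,
so later allocations never see a zone-spanning area.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

/* Hypothetical stand-in for a per-node CMA area. */
struct cma_area { int nid; bool spans_zones; };

static struct cma_area areas[NR_NODES];
static struct cma_area *hugetlb_cma[NR_NODES];

/* Assumed per-node outcomes: node 1 fails to reserve, node 2 spans zones. */
static const int  reserve_err[NR_NODES] = { 0, -ENOMEM, 0, 0 };
static const bool spans_zones[NR_NODES] = { false, false, true, false };

static int declare_area(int nid, struct cma_area **out)
{
	if (reserve_err[nid])
		return reserve_err[nid];
	areas[nid].nid = nid;
	areas[nid].spans_zones = spans_zones[nid];
	*out = &areas[nid];
	return 0;
}

static void reserve_all_nodes(void)
{
	for (int nid = 0; nid < NR_NODES; nid++) {
		int res = declare_area(nid, &hugetlb_cma[nid]);

		/* || short-circuits, so only reserved areas are validated. */
		if (res || hugetlb_cma[nid]->spans_zones) {
			fprintf(stderr, "node %d: %s\n", nid,
				res ? "reservation failed" : "reserved area spans zones");
			hugetlb_cma[nid] = NULL;	/* later allocations skip it */
		}
	}
}
```

After reserve_all_nodes(), only nodes 0 and 3 keep a usable area.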

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@suse.de>
---
v2->v3:
- collect Reviewed-by from Oscar Salvador
---
 mm/cma.c         | 3 ++-
 mm/hugetlb_cma.c | 6 ++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index a13ce4999b39..31073738f2ac 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -126,7 +126,6 @@ bool cma_validate_zones(struct cma *cma)
 		 * to be in the same zone. Simplify by forcing the entire
 		 * CMA resv range to be in the same zone.
 		 */
-		WARN_ON_ONCE(!pfn_valid(base_pfn));
 		if (pfn_range_intersects_zones(cma->nid, base_pfn, cmr->count)) {
 			set_bit(CMA_ZONES_INVALID, &cma->flags);
 			return false;
@@ -165,6 +164,8 @@ static void __init cma_activate_area(struct cma *cma)
 			bitmap_set(cmr->bitmap, 0, bitmap_count);
 		}
 
+		WARN_ON_ONCE(!pfn_valid(cmr->base_pfn));
+
 		for (pfn = early_pfn[r]; pfn < cmr->base_pfn + cmr->count;
 		     pfn += pageblock_nr_pages)
 			init_cma_reserved_pageblock(pfn_to_page(pfn));
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index 39344d6c78d8..ce999391cc14 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -231,9 +231,11 @@ void __init hugetlb_cma_reserve(void)
 		res = cma_declare_contiguous_multi(size, gigantic_page_size,
 					HUGETLB_PAGE_ORDER, name,
 					&hugetlb_cma[nid], nid);
-		if (res) {
-			pr_warn("hugetlb_cma: reservation failed: err %d, node %d",
+		if (res || !cma_validate_zones(hugetlb_cma[nid])) {
+			pr_warn("hugetlb_cma: %s: err %d, node %d\n",
+				res ? "reservation failed" : "reserved area spans zones",
 				res, nid);
+			hugetlb_cma[nid] = NULL;
 			continue;
 		}
 
-- 
2.54.0




* [PATCH v3 13/19] mm/hugetlb: Refactor early boot gigantic hugepage allocation
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (11 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 12/19] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 10:10 ` [PATCH v3 14/19] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation Muchun Song
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

The early boot gigantic hugepage allocation helpers currently mix
allocation with huge_bootmem_page setup, and leave part of the
initialization flow in architecture code.

Refactor the interface so that the architecture hook returns the
allocated huge page pointer, and move the huge_bootmem_page setup into
the generic hugetlb code. The architecture-specific paths are then
responsible only for finding memory, while the common code handles node
placement and early page metadata setup in one place.

This also lets powerpc benefit from memblock_reserved_mark_noinit(),
which its bootmem allocation path did not call before.

In addition, the cross-zone validation for boot-time gigantic HugeTLB
reservations added later in this series is common logic. With this
refactoring, it can stay in the generic code instead of being
duplicated in architecture-specific paths.
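
A userspace sketch of the resulting split: the architecture hook only
returns an opaque pointer, and generic code overlays the metadata header
on the start of the page and queues it on the list of the node the
memory actually came from. The fake_mem[] pool, addr_to_nid() (standing
in for early_pfn_to_nid(PHYS_PFN(__pa(p)))), hstate_id and the
argument-free arch hook are simplifications made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular list, standing in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* Private to generic code; the arch hook never sees this layout. */
struct huge_bootmem_page {
	struct list_head list;
	int hstate_id;			/* hypothetical stand-in for struct hstate * */
	unsigned long flags;
};

#define NR_NODES 2
#define FAKE_HUGE_PAGE_BYTES 4096

/* Fake memory: one "huge page" per node, suitably aligned for the header. */
static union {
	struct huge_bootmem_page hdr;
	unsigned char bytes[FAKE_HUGE_PAGE_BYTES];
} fake_mem[NR_NODES];
static int next_free = NR_NODES - 1;

static struct list_head huge_boot_pages[NR_NODES];

static void boot_lists_init(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		list_init(&huge_boot_pages[nid]);
}

/* Arch hook after the refactor: find memory, return an opaque pointer. */
static void *arch_alloc_bootmem_huge_page(void)
{
	return next_free < 0 ? NULL : (void *)&fake_mem[next_free--];
}

/* Hypothetical replacement for early_pfn_to_nid(PHYS_PFN(__pa(p))). */
static int addr_to_nid(const void *p)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		if (p == (const void *)&fake_mem[nid])
			return nid;
	return -1;
}

/* Generic code: overlay the metadata and queue by the node actually used. */
static bool alloc_bootmem_huge_page(int hstate_id)
{
	struct huge_bootmem_page *m = arch_alloc_bootmem_huge_page();
	int nid;

	if (!m)
		return false;
	nid = addr_to_nid(m);
	assert(nid >= 0);
	m->hstate_id = hstate_id;
	m->flags = 0;
	list_add(&m->list, &huge_boot_pages[nid]);
	return true;
}
```

Because the node is derived from the returned address rather than from
the request, a fallback allocation on another node still lands on the
correct list, and the arch hook needs no knowledge of the header type.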

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@suse.de>
---
v2->v3:
- keep powerpc code independent of struct huge_bootmem_page by switching
  it to void * (per Mike Rapoport)
- move huge_bootmem_page internals out of include/linux/hugetlb.h and keep
  them in mm-private scope so the arch code does not need to see the type
  (per Mike Rapoport, echoed by Oscar Salvador)
---
 arch/powerpc/mm/hugetlbpage.c | 13 ++---
 include/linux/hugetlb.h       | 18 ++-----
 mm/hugetlb.c                  | 95 ++++++++++++++---------------------
 mm/hugetlb_cma.c              | 13 ++---
 mm/hugetlb_cma.h              |  8 ++-
 mm/internal.h                 |  9 ++++
 6 files changed, 64 insertions(+), 92 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 558fafb82b8a..a298746dc143 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -104,17 +104,14 @@ void __init pseries_add_gpage(u64 addr, u64 page_size, unsigned long number_of_p
 	}
 }
 
-static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
+static __init void *pseries_alloc_bootmem_huge_page(struct hstate *hstate)
 {
-	struct huge_bootmem_page *m;
+	void *m;
 	if (nr_gpages == 0)
-		return 0;
+		return NULL;
 	m = phys_to_virt(gpage_freearray[--nr_gpages]);
 	gpage_freearray[nr_gpages] = 0;
-	list_add(&m->list, &huge_boot_pages[0]);
-	m->hstate = hstate;
-	m->flags = 0;
-	return 1;
+	return m;
 }
 
 bool __init hugetlb_node_alloc_supported(void)
@@ -124,7 +121,7 @@ bool __init hugetlb_node_alloc_supported(void)
 #endif
 
 
-int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
+void *__init arch_alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3700c0a1f6ff..09f28dd773b7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -674,19 +674,11 @@ struct hstate {
 	char name[HSTATE_NAME_LEN];
 };
 
-struct cma;
-
-struct huge_bootmem_page {
-	struct list_head list;
-	struct hstate *hstate;
-	unsigned long flags;
-	struct cma *cma;
-};
-
 #define HUGE_BOOTMEM_HVO		0x0001
 #define HUGE_BOOTMEM_ZONES_VALID	0x0002
 #define HUGE_BOOTMEM_CMA		0x0004
 
+struct huge_bootmem_page;
 bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
 
 int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
@@ -706,8 +698,8 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address, struct folio *folio);
 
 /* arch callback */
-int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
-int __init alloc_bootmem_huge_page(struct hstate *h, int nid);
+void *__init __alloc_bootmem_huge_page(struct hstate *h, int nid);
+void *__init arch_alloc_bootmem_huge_page(struct hstate *h, int nid);
 bool __init hugetlb_node_alloc_supported(void);
 
 void __init hugetlb_add_hstate(unsigned order);
@@ -1138,9 +1130,9 @@ alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 	return NULL;
 }
 
-static inline int __alloc_bootmem_huge_page(struct hstate *h)
+static inline void *__alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
-	return 0;
+	return NULL;
 }
 
 static inline struct hstate *hstate_file(struct file *f)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2bf9fe16abb9..5e557c05d80a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3027,79 +3027,58 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
 {
-	struct huge_bootmem_page *m;
-	int listnode = nid;
-
 	if (hugetlb_early_cma(h))
-		m = hugetlb_cma_alloc_bootmem(h, &listnode, node_exact);
-	else {
-		if (node_exact)
-			m = memblock_alloc_exact_nid_raw(huge_page_size(h),
+		return hugetlb_cma_alloc_bootmem(h, nid, node_exact);
+
+	if (node_exact)
+		return memblock_alloc_exact_nid_raw(huge_page_size(h),
 				huge_page_size(h), 0,
 				MEMBLOCK_ALLOC_ACCESSIBLE, nid);
-		else {
-			m = memblock_alloc_try_nid_raw(huge_page_size(h),
+
+	return memblock_alloc_try_nid_raw(huge_page_size(h),
 				huge_page_size(h), 0,
 				MEMBLOCK_ALLOC_ACCESSIBLE, nid);
-			/*
-			 * For pre-HVO to work correctly, pages need to be on
-			 * the list for the node they were actually allocated
-			 * from. That node may be different in the case of
-			 * fallback by memblock_alloc_try_nid_raw. So,
-			 * extract the actual node first.
-			 */
-			if (m)
-				listnode = early_pfn_to_nid(PHYS_PFN(__pa(m)));
-		}
-
-		if (m) {
-			m->flags = 0;
-			m->cma = NULL;
-		}
-	}
-
-	if (m) {
-		/*
-		 * Use the beginning of the huge page to store the
-		 * huge_bootmem_page struct (until gather_bootmem
-		 * puts them into the mem_map).
-		 *
-		 * Put them into a private list first because mem_map
-		 * is not up yet.
-		 */
-		INIT_LIST_HEAD(&m->list);
-		list_add(&m->list, &huge_boot_pages[listnode]);
-		m->hstate = h;
-	}
-
-	return m;
 }
 
-int alloc_bootmem_huge_page(struct hstate *h, int nid)
+void *__init arch_alloc_bootmem_huge_page(struct hstate *h, int nid)
 	__attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
-int __alloc_bootmem_huge_page(struct hstate *h, int nid)
+void *__init __alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
-	struct huge_bootmem_page *m = NULL; /* initialize for clang */
 	int nr_nodes, node = nid;
 
 	/* do node specific alloc */
-	if (nid != NUMA_NO_NODE) {
-		m = alloc_bootmem(h, node, true);
-		if (!m)
-			return 0;
-		goto found;
-	}
+	if (nid != NUMA_NO_NODE)
+		return alloc_bootmem(h, node, true);
 
 	/* allocate from next node when distributing huge pages */
 	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
-				    &hugetlb_bootmem_nodes) {
-		m = alloc_bootmem(h, node, false);
-		if (!m)
-			return 0;
-		goto found;
-	}
+				    &hugetlb_bootmem_nodes)
+		return alloc_bootmem(h, node, false);
 
-found:
+	return NULL;
+}
+
+static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
+{
+	struct huge_bootmem_page *m = arch_alloc_bootmem_huge_page(h, nid);
+
+	if (!m)
+		return false;
+
+	nid = early_pfn_to_nid(PHYS_PFN(__pa(m)));
+	/*
+	 * Use the beginning of the huge page to store the huge_bootmem_page
+	 * struct (until gather_bootmem puts them into the mem_map).
+	 *
+	 * Put them into a private list first because mem_map is not up yet.
+	 */
+	INIT_LIST_HEAD(&m->list);
+	list_add(&m->list, &huge_boot_pages[nid]);
+	m->hstate = h;
+	if (!hugetlb_early_cma(h)) {
+		m->cma = NULL;
+		m->flags = 0;
+	}
 
 	/*
 	 * Only initialize the head struct page in memmap_init_reserved_pages,
@@ -3111,7 +3090,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 	memblock_reserved_mark_noinit(__pa((void *)m + PAGE_SIZE),
 		huge_page_size(h) - PAGE_SIZE);
 
-	return 1;
+	return true;
 }
 
 /* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index ce999391cc14..e487d0ffffc0 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -56,14 +56,13 @@ struct folio *hugetlb_cma_alloc_frozen_folio(int order, gfp_t gfp_mask,
 	return folio;
 }
 
-struct huge_bootmem_page * __init
-hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid, bool node_exact)
+void * __init hugetlb_cma_alloc_bootmem(struct hstate *h, int nid, bool node_exact)
 {
 	struct cma *cma;
 	struct huge_bootmem_page *m;
-	int node = *nid;
+	int node;
 
-	cma = hugetlb_cma[*nid];
+	cma = hugetlb_cma[nid];
 	m = cma_reserve_early(cma, huge_page_size(h));
 	if (!m) {
 		if (node_exact)
@@ -71,13 +70,11 @@ hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid, bool node_exact)
 
 		for_each_node_mask(node, hugetlb_bootmem_nodes) {
 			cma = hugetlb_cma[node];
-			if (!cma || node == *nid)
+			if (!cma || node == nid)
 				continue;
 			m = cma_reserve_early(cma, huge_page_size(h));
-			if (m) {
-				*nid = node;
+			if (m)
 				break;
-			}
 		}
 	}
 
diff --git a/mm/hugetlb_cma.h b/mm/hugetlb_cma.h
index c619c394b1ae..3aa483573d17 100644
--- a/mm/hugetlb_cma.h
+++ b/mm/hugetlb_cma.h
@@ -6,8 +6,7 @@
 void hugetlb_cma_free_frozen_folio(struct folio *folio);
 struct folio *hugetlb_cma_alloc_frozen_folio(int order, gfp_t gfp_mask,
 				      int nid, nodemask_t *nodemask);
-struct huge_bootmem_page *hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid,
-						    bool node_exact);
+void *hugetlb_cma_alloc_bootmem(struct hstate *h, int nid, bool node_exact);
 bool hugetlb_cma_exclusive_alloc(void);
 unsigned long hugetlb_cma_total_size(void);
 void hugetlb_cma_validate_params(void);
@@ -23,9 +22,8 @@ static inline struct folio *hugetlb_cma_alloc_frozen_folio(int order,
 	return NULL;
 }
 
-static inline
-struct huge_bootmem_page *hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid,
-						    bool node_exact)
+static inline void *hugetlb_cma_alloc_bootmem(struct hstate *h, int nid,
+					      bool node_exact)
 {
 	return NULL;
 }
diff --git a/mm/internal.h b/mm/internal.h
index 004a3f1d5006..6b9802460a7c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -23,6 +23,15 @@
 #include "vma.h"
 
 struct folio_batch;
+struct hstate;
+struct cma;
+
+struct huge_bootmem_page {
+	struct list_head list;
+	struct hstate *hstate;
+	unsigned long flags;
+	struct cma *cma;
+};
 
 /*
  * Maintains state across a page table move. The operation assumes both source
-- 
2.54.0




* [PATCH v3 14/19] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (12 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 13/19] mm/hugetlb: Refactor early boot gigantic hugepage allocation Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 15:41   ` Mike Rapoport
  2026-06-02 10:10 ` [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init Muchun Song
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

Now that hugetlb reservation runs after zone initialization, bootmem
gigantic page allocation can detect pages that span multiple zones.

Keep those cross-zone pages separate during allocation: queue them at
the head of the per-node huge_boot_pages list, and queue zone-valid
pages, marked with HUGE_BOOTMEM_ZONES_VALID, at the tail. After
allocation completes, free the cross-zone pages so later hugetlb
initialization only sees zone-valid gigantic pages.

Cross-zone gigantic pages are freed directly rather than retried. Such
cases are expected to be very rare in practice, so retry logic does not
seem justified at this point, and keeping the handling simple preserves
the previous behavior. If real-world reports of such cases appear later,
retry support can be reconsidered.
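
The head/tail partitioning is what lets the cleanup stop at the first
zone-valid page instead of scanning the whole list. A userspace sketch,
assuming a minimal list implementation, a hypothetical struct boot_page,
and a freed flag that stands in for memblock_free(); ZONES_VALID mirrors
HUGE_BOOTMEM_ZONES_VALID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular list, standing in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_insert(struct list_head *n, struct list_head *prev,
			struct list_head *next)
{
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

static void list_add(struct list_head *n, struct list_head *h) { list_insert(n, h, h->next); }
static void list_add_tail(struct list_head *n, struct list_head *h) { list_insert(n, h->prev, h); }

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

#define ZONES_VALID 0x2UL	/* mirrors HUGE_BOOTMEM_ZONES_VALID */

struct boot_page {
	struct list_head list;	/* first member: list pointer == page pointer */
	unsigned long flags;
	bool freed;		/* hypothetical: stands in for memblock_free() */
};

/* Allocation side: cross-zone pages go to the head, valid ones to the tail. */
static void queue_page(struct boot_page *p, bool crosses_zones,
		       struct list_head *head)
{
	if (crosses_zones) {
		p->flags = 0;
		list_add(&p->list, head);
	} else {
		p->flags = ZONES_VALID;
		list_add_tail(&p->list, head);
	}
}

/* Cleanup side: free from the head until the first zone-valid page. */
static unsigned long free_cross_zone_pages(struct list_head *head)
{
	unsigned long freed = 0;
	struct list_head *pos = head->next;

	while (pos != head) {
		struct boot_page *p = (struct boot_page *)pos;
		struct list_head *next = pos->next;

		if (p->flags & ZONES_VALID)
			break;
		list_del(pos);
		p->freed = true;
		freed++;
		pos = next;
	}
	return freed;
}
```

Queuing valid, cross-zone, valid, cross-zone pages leaves both
cross-zone pages at the head, so free_cross_zone_pages() frees exactly
those two and keeps the valid pages in allocation order.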

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 75 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 64 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5e557c05d80a..218fb1ca45f4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3060,12 +3060,15 @@ void *__init __alloc_bootmem_huge_page(struct hstate *h, int nid)
 
 static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
+	unsigned long pfn;
+	unsigned int nid_request = nid;
 	struct huge_bootmem_page *m = arch_alloc_bootmem_huge_page(h, nid);
 
 	if (!m)
 		return false;
 
-	nid = early_pfn_to_nid(PHYS_PFN(__pa(m)));
+	pfn = PHYS_PFN(__pa(m));
+	nid = early_pfn_to_nid(pfn);
 	/*
 	 * Use the beginning of the huge page to store the huge_bootmem_page
 	 * struct (until gather_bootmem puts them into the mem_map).
@@ -3073,22 +3076,38 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
 	 * Put them into a private list first because mem_map is not up yet.
 	 */
 	INIT_LIST_HEAD(&m->list);
-	list_add(&m->list, &huge_boot_pages[nid]);
 	m->hstate = h;
 	if (!hugetlb_early_cma(h)) {
 		m->cma = NULL;
 		m->flags = 0;
 	}
 
-	/*
-	 * Only initialize the head struct page in memmap_init_reserved_pages,
-	 * rest of the struct pages will be initialized by the HugeTLB
-	 * subsystem itself.
-	 * The head struct page is used to get folio information by the HugeTLB
-	 * subsystem like zone id and node id.
-	 */
-	memblock_reserved_mark_noinit(__pa((void *)m + PAGE_SIZE),
-		huge_page_size(h) - PAGE_SIZE);
+	/* CMA pages: zone-crossing is validated in hugetlb_cma_reserve(). */
+	if (!hugetlb_early_cma(h) &&
+	    pfn_range_intersects_zones(nid, pfn, pages_per_huge_page(h))) {
+		/*
+		 * If the allocated page is on a different node than requested
+		 * (e.g., on PowerPC LPARs), put it on the requested node's list,
+		 * because hugetlb_free_cross_zone_pages() only frees cross-zone
+		 * pages belonging to the requested node.
+		 */
+		if (WARN_ON_ONCE(nid_request != NUMA_NO_NODE && nid != nid_request))
+			list_add(&m->list, &huge_boot_pages[nid_request]);
+		else
+			list_add(&m->list, &huge_boot_pages[nid]);
+	} else {
+		list_add_tail(&m->list, &huge_boot_pages[nid]);
+		m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+		/*
+		 * Only initialize the head struct page in memmap_init_reserved_pages,
+		 * rest of the struct pages will be initialized by the HugeTLB
+		 * subsystem itself.
+		 * The head struct page is used to get folio information by the HugeTLB
+		 * subsystem like zone id and node id.
+		 */
+		memblock_reserved_mark_noinit(__pa((void *)m + PAGE_SIZE),
+				huge_page_size(h) - PAGE_SIZE);
+	}
 
 	return true;
 }
@@ -3373,6 +3392,34 @@ void __init hugetlb_bootmem_struct_page_init(void)
 	padata_do_multithreaded(&job);
 }
 
+static unsigned long __init hugetlb_free_cross_zone_pages(struct hstate *h, int nid)
+{
+	unsigned long freed = 0;
+	struct huge_bootmem_page *m, *tmp;
+
+	if (!hstate_is_gigantic(h))
+		return freed;
+
+	list_for_each_entry_safe(m, tmp, &huge_boot_pages[nid], list) {
+		if (m->flags & HUGE_BOOTMEM_ZONES_VALID)
+			break;
+
+		list_del(&m->list);
+		memblock_free(m, huge_page_size(h));
+		freed++;
+	}
+
+	if (freed) {
+		char buf[32];
+
+		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, sizeof(buf));
+		pr_warn("HugeTLB: freed %lu cross-zone hugepages of size %s on node %d.\n",
+			freed, buf, nid);
+	}
+
+	return freed;
+}
+
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 {
 	unsigned long i;
@@ -3403,6 +3450,8 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 		cond_resched();
 	}
 
+	i -= hugetlb_free_cross_zone_pages(h, nid);
+
 	if (!list_empty(&folio_list))
 		prep_and_add_allocated_folios(h, &folio_list);
 
@@ -3476,6 +3525,7 @@ static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned l
 
 static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 {
+	int nid;
 	unsigned long i;
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
@@ -3484,6 +3534,9 @@ static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 		cond_resched();
 	}
 
+	for_each_node(nid)
+		i -= hugetlb_free_cross_zone_pages(h, nid);
+
 	return i;
 }
 
-- 
2.54.0




* [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (13 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 14/19] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 15:41   ` Mike Rapoport
  2026-06-03 12:02   ` Usama Arif
  2026-06-02 10:10 ` [PATCH v3 16/19] mm/hugetlb: Remove obsolete bootmem cross-zone checks Muchun Song
                   ` (4 subsequent siblings)
  19 siblings, 2 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

Bootmem HugeTLB pages currently defer HVO setup to
hugetlb_vmemmap_init_late(), because the optimization needs zone
information.

Now that zones are initialized earlier, the bootmem HVO setup can be
done directly from hugetlb_vmemmap_init_early(). This lets HVO be
applied to gigantic HugeTLB pages as soon as they are allocated.

Bootmem gigantic pages that span multiple zones are now filtered out
when they are allocated, so the remaining bootmem gigantic pages seen by
later hugetlb initialization are already zone-valid. As a result,
hugetlb_vmemmap_init_late() no longer needs to handle bootmem HVO setup.
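The accounting passed to memmap_boot_pages_add() in the new early path can be illustrated with a short calculation. This is a sketch under stated assumptions: 4 KiB base pages, a 64-byte struct page, and HUGETLB_VMEMMAP_RESERVE_SIZE equal to PAGE_SIZE. The SKETCH_* constants and helper names are hypothetical. With these values, a 1 GiB page needs 16 MiB (4096 pages) of vmemmap without HVO, but only one page is charged when HVO succeeds.

```c
#include <assert.h>

/* Illustrative assumptions only; real values depend on arch and config. */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_STRUCT_PAGE_SIZE	64UL
#define SKETCH_RESERVE_SIZE	SKETCH_PAGE_SIZE	/* HUGETLB_VMEMMAP_RESERVE_SIZE */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Bytes of vmemmap describing one huge page, like hugetlb_vmemmap_size(). */
static unsigned long vmemmap_bytes(unsigned long huge_page_size)
{
	return (huge_page_size / SKETCH_PAGE_SIZE) * SKETCH_STRUCT_PAGE_SIZE;
}

/* Boot pages charged if the whole vmemmap is populated normally. */
static unsigned long boot_pages_plain(unsigned long huge_page_size)
{
	return DIV_ROUND_UP(vmemmap_bytes(huge_page_size), SKETCH_PAGE_SIZE);
}

/* Boot pages charged with HVO: only the reserved part of the vmemmap. */
static unsigned long boot_pages_hvo(void)
{
	return DIV_ROUND_UP(SKETCH_RESERVE_SIZE, SKETCH_PAGE_SIZE);
}
```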

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 67 +++++++++-----------------------------------
 1 file changed, 13 insertions(+), 54 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ea6af85bfec1..464578ee246e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -745,6 +745,8 @@ static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
 	return true;
 }
 
+static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn);
+
 /*
  * Initialize memmap section for a gigantic page, HVO-style.
  */
@@ -752,6 +754,7 @@ void __init hugetlb_vmemmap_init_early(int nid)
 {
 	unsigned long psize, paddr, section_size;
 	unsigned long ns, i, pnum, pfn, nr_pages;
+	unsigned long start, end;
 	struct huge_bootmem_page *m = NULL;
 	void *map;
 
@@ -761,6 +764,8 @@ void __init hugetlb_vmemmap_init_early(int nid)
 	section_size = (1UL << PA_SECTION_SHIFT);
 
 	list_for_each_entry(m, &huge_boot_pages[nid], list) {
+		struct zone *zone;
+
 		if (!vmemmap_should_optimize_bootmem_page(m))
 			continue;
 
@@ -769,6 +774,14 @@ void __init hugetlb_vmemmap_init_early(int nid)
 		paddr = virt_to_phys(m);
 		pfn = PHYS_PFN(paddr);
 		map = pfn_to_page(pfn);
+		start = (unsigned long)map;
+		end = start + hugetlb_vmemmap_size(m->hstate);
+		zone = pfn_to_zone(nid, pfn);
+
+		if (vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
+					 zone, HUGETLB_VMEMMAP_RESERVE_SIZE))
+			panic("Failed to allocate memmap for HugeTLB page\n");
+		memmap_boot_pages_add(DIV_ROUND_UP(HUGETLB_VMEMMAP_RESERVE_SIZE, PAGE_SIZE));
 
 		pnum = pfn_to_section_nr(pfn);
 		ns = psize / section_size;
@@ -800,60 +813,6 @@ static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
 
 void __init hugetlb_vmemmap_init_late(int nid)
 {
-	struct huge_bootmem_page *m, *tm;
-	unsigned long phys, nr_pages, start, end;
-	unsigned long pfn, nr_mmap;
-	struct zone *zone = NULL;
-	struct hstate *h;
-	void *map;
-
-	if (!READ_ONCE(vmemmap_optimize_enabled))
-		return;
-
-	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
-		if (!(m->flags & HUGE_BOOTMEM_HVO))
-			continue;
-
-		phys = virt_to_phys(m);
-		h = m->hstate;
-		pfn = PHYS_PFN(phys);
-		nr_pages = pages_per_huge_page(h);
-		map = pfn_to_page(pfn);
-		start = (unsigned long)map;
-		end = start + nr_pages * sizeof(struct page);
-
-		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
-			/*
-			 * Oops, the hugetlb page spans multiple zones.
-			 * Remove it from the list, and populate it normally.
-			 */
-			list_del(&m->list);
-
-			vmemmap_populate(start, end, nid, NULL);
-			nr_mmap = end - start;
-			memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
-
-			memblock_phys_free(phys, huge_page_size(h));
-			continue;
-		}
-
-		if (!zone || !zone_spans_pfn(zone, pfn))
-			zone = pfn_to_zone(nid, pfn);
-		if (WARN_ON_ONCE(!zone))
-			continue;
-
-		if (vmemmap_populate_hvo(start, end, huge_page_order(h), zone,
-					 HUGETLB_VMEMMAP_RESERVE_SIZE) < 0) {
-			/* Fallback if HVO population fails */
-			vmemmap_populate(start, end, nid, NULL);
-			nr_mmap = end - start;
-		} else {
-			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
-			nr_mmap = HUGETLB_VMEMMAP_RESERVE_SIZE;
-		}
-
-		memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
-	}
 }
 #endif
 
-- 
2.54.0




* [PATCH v3 16/19] mm/hugetlb: Remove obsolete bootmem cross-zone checks
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (14 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 15:41   ` Mike Rapoport
  2026-06-02 10:10 ` [PATCH v3 17/19] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late() Muchun Song
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

Bootmem gigantic HugeTLB pages used to be validated again during
gather_bootmem_prealloc_node() and any cross-zone pages were discarded
there.

That validation is no longer needed. Cross-zone bootmem gigantic pages
are now detected during allocation and freed before they reach the later
bootmem gathering path, so the remaining pages are already zone-valid.

Remove the obsolete cross-zone validation, invalid-page freeing, and the
associated discarded-page accounting.
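"Zone-valid" here means that the gigantic page's PFN range lies within a single zone. The following userspace sketch models that check with a node's zones represented as half-open PFN ranges. struct zone_span and range_intersects_zones() are hypothetical. Treating a range that touches no zone as invalid is our assumption about how pfn_range_intersects_zones() behaves, not something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one zone as a half-open PFN range [start, end). */
struct zone_span {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

static bool span_intersects(const struct zone_span *z,
			    unsigned long pfn, unsigned long nr)
{
	if (z->start_pfn == z->end_pfn)	/* empty zone */
		return false;
	return pfn < z->end_pfn && z->start_pfn < pfn + nr;
}

/* True if [pfn, pfn + nr) touches more than one zone, or (assumed) none. */
static bool range_intersects_zones(const struct zone_span *zones, size_t nr_zones,
				   unsigned long pfn, unsigned long nr)
{
	size_t hits = 0;

	for (size_t i = 0; i < nr_zones; i++)
		if (span_intersects(&zones[i], pfn, nr))
			hits++;
	return hits != 1;
}
```

With 4 KiB pages, a 1 GiB gigantic page spans 0x40000 PFNs. A range such as [0xc0000, 0x140000) that straddles a zone boundary at PFN 0x100000 is rejected.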

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h |  3 --
 mm/hugetlb.c            | 70 -----------------------------------------
 2 files changed, 73 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 09f28dd773b7..f68a390d43bd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -678,9 +678,6 @@ struct hstate {
 #define HUGE_BOOTMEM_ZONES_VALID	0x0002
 #define HUGE_BOOTMEM_CMA		0x0004
 
-struct huge_bootmem_page;
-bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
-
 int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
 void wait_for_freed_hugetlb_folios(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 218fb1ca45f4..47c3d6d11c58 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -58,7 +58,6 @@ struct hstate hstates[HUGE_MAX_HSTATE];
 
 __initdata nodemask_t hugetlb_bootmem_nodes;
 __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
-static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata;
 
 /*
  * Due to ordering constraints across the init code for various
@@ -3221,57 +3220,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	}
 }
 
-bool __init hugetlb_bootmem_page_zones_valid(int nid,
-					     struct huge_bootmem_page *m)
-{
-	unsigned long start_pfn;
-	bool valid;
-
-	if (m->flags & HUGE_BOOTMEM_ZONES_VALID) {
-		/*
-		 * Already validated, skip check.
-		 */
-		return true;
-	}
-
-	if (hugetlb_bootmem_page_earlycma(m)) {
-		valid = cma_validate_zones(m->cma);
-		goto out;
-	}
-
-	start_pfn = virt_to_phys(m) >> PAGE_SHIFT;
-
-	valid = !pfn_range_intersects_zones(nid, start_pfn,
-			pages_per_huge_page(m->hstate));
-out:
-	if (!valid)
-		hstate_boot_nrinvalid[hstate_index(m->hstate)]++;
-
-	return valid;
-}
-
-/*
- * Free a bootmem page that was found to be invalid (intersecting with
- * multiple zones).
- *
- * Since it intersects with multiple zones, we can't just do a free
- * operation on all pages at once, but instead have to walk all
- * pages, freeing them one by one.
- */
-static void __init hugetlb_bootmem_free_invalid_page(int nid, struct page *page,
-					     struct hstate *h)
-{
-	unsigned long npages = pages_per_huge_page(h);
-	unsigned long pfn;
-
-	while (npages--) {
-		pfn = page_to_pfn(page);
-		__init_page_from_nid(pfn, nid);
-		free_reserved_page(page);
-		page++;
-	}
-}
-
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
@@ -3287,17 +3235,6 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
 		struct folio *folio = (void *)page;
 
 		h = m->hstate;
-		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
-			/*
-			 * Can't use this page. Initialize the
-			 * page structures if that hasn't already
-			 * been done, and give them to the page
-			 * allocator.
-			 */
-			hugetlb_bootmem_free_invalid_page(nid, page, h);
-			continue;
-		}
-
 		/*
 		 * It is possible to have multiple huge page sizes (hstates)
 		 * in this list.  If so, process each size separately.
@@ -3692,20 +3629,13 @@ static void __init hugetlb_init_hstates(void)
 static void __init report_hugepages(void)
 {
 	struct hstate *h;
-	unsigned long nrinvalid;
 
 	for_each_hstate(h) {
 		char buf[32];
 
-		nrinvalid = hstate_boot_nrinvalid[hstate_index(h)];
-		h->max_huge_pages -= nrinvalid;
-
 		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
 		pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
 			buf, h->nr_huge_pages);
-		if (nrinvalid)
-			pr_info("HugeTLB: %s page size: %lu invalid page%s discarded\n",
-					buf, nrinvalid, str_plural(nrinvalid));
 		pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n",
 			hugetlb_vmemmap_optimizable_size(h) / SZ_1K, buf);
 	}
-- 
2.54.0




* [PATCH v3 17/19] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late()
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (15 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 16/19] mm/hugetlb: Remove obsolete bootmem cross-zone checks Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 15:41   ` Mike Rapoport
  2026-06-02 10:10 ` [PATCH v3 18/19] mm/hugetlb: Remove unused bootmem cma field Muchun Song
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

hugetlb_vmemmap_init_late() is now empty, so the remaining late-init
path through sparse_vmemmap_init_nid_late() is dead code.

Remove sparse_vmemmap_init_nid_late(), hugetlb_vmemmap_init_late(),
and their declarations.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mmzone.h |  7 -------
 mm/hugetlb_vmemmap.c   |  4 ----
 mm/hugetlb_vmemmap.h   |  5 -----
 mm/sparse-vmemmap.c    | 11 -----------
 mm/sparse.c            |  1 -
 5 files changed, 28 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1331a7b93f33..72883df17c72 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2170,8 +2170,6 @@ static inline int preinited_vmemmap_section(const struct mem_section *section)
 }
 
 void sparse_vmemmap_init_nid_early(int nid);
-void sparse_vmemmap_init_nid_late(int nid);
-
 #else
 static inline int preinited_vmemmap_section(const struct mem_section *section)
 {
@@ -2180,10 +2178,6 @@ static inline int preinited_vmemmap_section(const struct mem_section *section)
 static inline void sparse_vmemmap_init_nid_early(int nid)
 {
 }
-
-static inline void sparse_vmemmap_init_nid_late(int nid)
-{
-}
 #endif
 
 static inline int online_section_nr(unsigned long nr)
@@ -2388,7 +2382,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
 
 #else
 #define sparse_vmemmap_init_nid_early(_nid) do {} while (0)
-#define sparse_vmemmap_init_nid_late(_nid) do {} while (0)
 #define pfn_in_present_section pfn_valid
 #endif /* CONFIG_SPARSEMEM */
 
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 464578ee246e..cde6f3aba87b 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -810,10 +810,6 @@ static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
 
 	return NULL;
 }
-
-void __init hugetlb_vmemmap_init_late(int nid)
-{
-}
 #endif
 
 static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 18b490825215..7ac49c52457d 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -29,7 +29,6 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
 #ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
 void hugetlb_vmemmap_init_early(int nid);
-void hugetlb_vmemmap_init_late(int nid);
 #endif
 
 
@@ -81,10 +80,6 @@ static inline void hugetlb_vmemmap_init_early(int nid)
 {
 }
 
-static inline void hugetlb_vmemmap_init_late(int nid)
-{
-}
-
 static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
 {
 	return 0;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 3b036251a2f4..077686af394b 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -574,17 +574,6 @@ void __init sparse_vmemmap_init_nid_early(int nid)
 {
 	hugetlb_vmemmap_init_early(nid);
 }
-
-/*
- * This is called just before the initialization of page structures
- * through memmap_init. Zones are now initialized, so any work that
- * needs to be done that needs zone information can be done from
- * here.
- */
-void __init sparse_vmemmap_init_nid_late(int nid)
-{
-	hugetlb_vmemmap_init_late(nid);
-}
 #endif
 
 static void subsection_mask_set(unsigned long *map, unsigned long pfn,
diff --git a/mm/sparse.c b/mm/sparse.c
index 3917a47153d8..324213d8bdcb 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -320,7 +320,6 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 		}
 	}
 	sparse_usage_fini();
-	sparse_vmemmap_init_nid_late(nid);
 }
 
 /*
-- 
2.54.0




* [PATCH v3 18/19] mm/hugetlb: Remove unused bootmem cma field
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (16 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 17/19] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late() Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 15:41   ` Mike Rapoport
  2026-06-02 10:10 ` [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page() Muchun Song
  2026-06-02 10:34 ` [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Oscar Salvador (SUSE)
  19 siblings, 1 reply; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

struct huge_bootmem_page no longer needs to keep the CMA pointer. The
bootmem path only needs to remember whether a huge page came from CMA,
which is already encoded in the flags field.

Set HUGE_BOOTMEM_CMA when the page is allocated and drop the unused cma
field together with the redundant assignments.
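The reworked control flow of hugetlb_cma_alloc_bootmem() has three outcomes. It returns immediately if the requested node succeeds or node_exact is set. Otherwise it falls back to the other nodes in order, and the caller derives the flags from hugetlb_early_cma(). The following is a minimal userspace model of that flow. sketch_reserve_early() is a hypothetical mock of cma_reserve_early() backed by a per-node slot counter, and the node mask is simplified to "all nodes". Only the HUGE_BOOTMEM_CMA value (0x0004) comes from the header in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES	4
#define SKETCH_HUGE_BOOTMEM_CMA	0x0004UL	/* HUGE_BOOTMEM_CMA */

/* Hypothetical per-node CMA capacity; 0 means nothing left (or no area). */
static int sketch_cma_slots[SKETCH_MAX_NODES];
static char sketch_backing[SKETCH_MAX_NODES];

/* Mock of cma_reserve_early(): hands out a per-node address or NULL. */
static void *sketch_reserve_early(int node)
{
	if (sketch_cma_slots[node] <= 0)
		return NULL;
	sketch_cma_slots[node]--;
	return &sketch_backing[node];
}

/* Same shape as the reworked hugetlb_cma_alloc_bootmem(). */
static void *sketch_alloc_bootmem(int nid, bool node_exact)
{
	void *m = sketch_reserve_early(nid);

	if (m || node_exact)
		return m;

	for (int node = 0; node < SKETCH_MAX_NODES; node++) {
		if (node == nid)
			continue;
		m = sketch_reserve_early(node);
		if (m)
			return m;
	}
	return NULL;
}

/* Caller side: the CMA origin is recorded only in the flags. */
static unsigned long sketch_bootmem_flags(bool early_cma)
{
	return early_cma ? SKETCH_HUGE_BOOTMEM_CMA : 0;
}
```

Because the flag is set by the caller, the allocator no longer has to write into the page it just reserved. This is why it can return a plain void pointer.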

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c     |  5 +----
 mm/hugetlb_cma.c | 29 +++++++++++------------------
 mm/internal.h    |  2 --
 3 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 47c3d6d11c58..fb7ad2a4a26b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3076,10 +3076,7 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
 	 */
 	INIT_LIST_HEAD(&m->list);
 	m->hstate = h;
-	if (!hugetlb_early_cma(h)) {
-		m->cma = NULL;
-		m->flags = 0;
-	}
+	m->flags = hugetlb_early_cma(h) ? HUGE_BOOTMEM_CMA : 0;
 
 	/* CMA pages: zone-crossing is validated in hugetlb_cma_reserve(). */
 	if (!hugetlb_early_cma(h) &&
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index e487d0ffffc0..4dfce68b354a 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -59,31 +59,24 @@ struct folio *hugetlb_cma_alloc_frozen_folio(int order, gfp_t gfp_mask,
 void * __init hugetlb_cma_alloc_bootmem(struct hstate *h, int nid, bool node_exact)
 {
 	struct cma *cma;
-	struct huge_bootmem_page *m;
+	void *m;
 	int node;
 
 	cma = hugetlb_cma[nid];
 	m = cma_reserve_early(cma, huge_page_size(h));
-	if (!m) {
-		if (node_exact)
-			return NULL;
+	if (m || node_exact)
+		return m;
 
-		for_each_node_mask(node, hugetlb_bootmem_nodes) {
-			cma = hugetlb_cma[node];
-			if (!cma || node == nid)
-				continue;
-			m = cma_reserve_early(cma, huge_page_size(h));
-			if (m)
-				break;
-		}
-	}
-
-	if (m) {
-		m->flags = HUGE_BOOTMEM_CMA;
-		m->cma = cma;
+	for_each_node_mask(node, hugetlb_bootmem_nodes) {
+		cma = hugetlb_cma[node];
+		if (!cma || node == nid)
+			continue;
+		m = cma_reserve_early(cma, huge_page_size(h));
+		if (m)
+			return m;
 	}
 
-	return m;
+	return NULL;
 }
 
 static int __init cmdline_parse_hugetlb_cma(char *p)
diff --git a/mm/internal.h b/mm/internal.h
index 6b9802460a7c..8497673d0ac3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -24,13 +24,11 @@
 
 struct folio_batch;
 struct hstate;
-struct cma;
 
 struct huge_bootmem_page {
 	struct list_head list;
 	struct hstate *hstate;
 	unsigned long flags;
-	struct cma *cma;
 };
 
 /*
-- 
2.54.0




* [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page()
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (17 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 18/19] mm/hugetlb: Remove unused bootmem cma field Muchun Song
@ 2026-06-02 10:10 ` Muchun Song
  2026-06-02 14:46   ` Mike Rapoport
  2026-06-02 15:41   ` Mike Rapoport
  2026-06-02 10:34 ` [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Oscar Salvador (SUSE)
  19 siblings, 2 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Ritesh Harjani (IBM),
	Aneesh Kumar K.V, linuxppc-dev, Mike Kravetz, Muchun Song

__init_page_from_nid() no longer has external users and is only used
locally in mm/mm_init.c under CONFIG_DEFERRED_STRUCT_PAGE_INIT.

Fold it into its sole caller __init_deferred_page() and remove the
separate helper declaration.
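The body being folded locates the zone index by walking node_zones until zone_spans_pfn() matches. The following userspace model of that lookup uses hypothetical sketch_* types and initializes pgdat at its declaration. As written in the kernel loop, zid ends up equal to MAX_NR_ZONES when no zone spans the PFN. The sketch makes that explicit. Presumably callers only pass PFNs that lie inside some zone.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_NR_ZONES 3

/* Hypothetical stand-ins for struct zone / pg_data_t (not kernel types). */
struct sketch_zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

struct sketch_pgdat {
	struct sketch_zone node_zones[SKETCH_MAX_NR_ZONES];
};

/* Like zone_spans_pfn(): start <= pfn < start + spanned. */
static bool sketch_zone_spans(const struct sketch_zone *zone, unsigned long pfn)
{
	return zone->zone_start_pfn <= pfn &&
	       pfn < zone->zone_start_pfn + zone->spanned_pages;
}

/* First zone index spanning pfn, or SKETCH_MAX_NR_ZONES if none does. */
static int sketch_find_zid(const struct sketch_pgdat *pgdat, unsigned long pfn)
{
	int zid;

	for (zid = 0; zid < SKETCH_MAX_NR_ZONES; zid++)
		if (sketch_zone_spans(&pgdat->node_zones[zid], pfn))
			break;
	return zid;
}
```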

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
v2->v3:
- fold __init_page_from_nid() into __init_deferred_page() since it
  only has a single caller (suggested by Mike Rapoport)
---
 mm/internal.h |  1 -
 mm/mm_init.c  | 44 ++++++++++++++++++--------------------------
 2 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8497673d0ac3..b33fc87e4555 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1760,7 +1760,6 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte
 
 void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid);
-void __meminit __init_page_from_nid(unsigned long pfn, int nid);
 
 /* shrinker related functions */
 unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 41b83dd18c01..f1bbf3b9a321 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -688,31 +688,6 @@ static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
 }
 #endif
 
-/*
- * Initialize a reserved page unconditionally, finding its zone first.
- */
-void __meminit __init_page_from_nid(unsigned long pfn, int nid)
-{
-	pg_data_t *pgdat;
-	int zid;
-
-	pgdat = NODE_DATA(nid);
-
-	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-		struct zone *zone = &pgdat->node_zones[zid];
-
-		if (zone_spans_pfn(zone, pfn))
-			break;
-	}
-	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
-
-	if (pageblock_aligned(pfn)) {
-		enum migratetype mt =
-			kho_scratch_migratetype(pfn, MIGRATE_MOVABLE);
-		init_pageblock_migratetype(pfn_to_page(pfn), mt, false);
-	}
-}
-
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
 {
@@ -771,10 +746,27 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
 
 static void __meminit __init_deferred_page(unsigned long pfn, int nid)
 {
+	pg_data_t *pgdat;
+	int zid;
+
 	if (early_page_initialised(pfn, nid))
 		return;
 
-	__init_page_from_nid(pfn, nid);
+	pgdat = NODE_DATA(nid);
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		struct zone *zone = &pgdat->node_zones[zid];
+
+		if (zone_spans_pfn(zone, pfn))
+			break;
+	}
+	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+
+	if (pageblock_aligned(pfn)) {
+		enum migratetype mt =
+			kho_scratch_migratetype(pfn, MIGRATE_MOVABLE);
+		init_pageblock_migratetype(pfn_to_page(pfn), mt, false);
+	}
 }
 #else
 static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
-- 
2.54.0




* Re: [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation
  2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
                   ` (18 preceding siblings ...)
  2026-06-02 10:10 ` [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page() Muchun Song
@ 2026-06-02 10:34 ` Oscar Salvador (SUSE)
  2026-06-02 12:01   ` Muchun Song
  19 siblings, 1 reply; 37+ messages in thread
From: Oscar Salvador (SUSE) @ 2026-06-02 10:34 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue, Jun 02, 2026 at 06:10:20PM +0800, Muchun Song wrote:
> This series is split out from the earlier larger series "mm: Generalize
> HVO for HugeTLB and device DAX" [1]. It collects the first 19 patches of
> that series as a standalone set of fixes and preparatory cleanups around
> bootmem HugeTLB handling, sparse initialization ordering, and related
> vmemmap setup.

Thanks Muchun, this split-out really helps ease the review.
I think not many patches from this series escaped review, but I shall
get back to it later this week.

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation
  2026-06-02 10:34 ` [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Oscar Salvador (SUSE)
@ 2026-06-02 12:01   ` Muchun Song
  0 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-02 12:01 UTC (permalink / raw)
  To: Oscar Salvador (SUSE)
  Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz



> On Jun 2, 2026, at 18:34, Oscar Salvador (SUSE) <osalvador@kernel.org> wrote:
> 
> On Tue, Jun 02, 2026 at 06:10:20PM +0800, Muchun Song wrote:
>> This series is split out from the earlier larger series "mm: Generalize
>> HVO for HugeTLB and device DAX" [1]. It collects the first 19 patches of
>> that series as a standalone set of fixes and preparatory cleanups around
>> bootmem HugeTLB handling, sparse initialization ordering, and related
>> vmemmap setup.

Hi Oscar,

> 
> Thanks Muchun, this split-out really helps ease the review.
> I think not many patches from this series escaped review, but I shall
> get back to it later this week.

Sounds good! Thanks for taking the time to review. Looking forward to your
feedback later this week.

Best,
Muchun

> 
> 
> 
> -- 
> Oscar Salvador
> SUSE Labs




* Re: [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page()
  2026-06-02 10:10 ` [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page() Muchun Song
@ 2026-06-02 14:46   ` Mike Rapoport
  2026-06-02 15:41   ` Mike Rapoport
  1 sibling, 0 replies; 37+ messages in thread
From: Mike Rapoport @ 2026-06-02 14:46 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue, Jun 02, 2026 at 06:10:39PM +0800, Muchun Song wrote:
> __init_page_from_nid() no longer has external users and is only used
> locally in mm/mm_init.c under CONFIG_DEFERRED_STRUCT_PAGE_INIT.
> 
> Fold it into its sole caller __init_deferred_page() and remove the
> separate helper declaration.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
> v2->v3:
> - fold __init_page_from_nid() into __init_deferred_page() since it
>   only has a single caller (suggested by Mike Rapoport)
> ---
>  mm/internal.h |  1 -
>  mm/mm_init.c  | 44 ++++++++++++++++++--------------------------
>  2 files changed, 18 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 8497673d0ac3..b33fc87e4555 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1760,7 +1760,6 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte
>  
>  void __meminit __init_single_page(struct page *page, unsigned long pfn,
>  				unsigned long zone, int nid);
> -void __meminit __init_page_from_nid(unsigned long pfn, int nid);
>  
>  /* shrinker related functions */
>  unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 41b83dd18c01..f1bbf3b9a321 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -688,31 +688,6 @@ static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
>  }
>  #endif
>  
> -/*
> - * Initialize a reserved page unconditionally, finding its zone first.
> - */
> -void __meminit __init_page_from_nid(unsigned long pfn, int nid)
> -{
> -	pg_data_t *pgdat;
> -	int zid;
> -
> -	pgdat = NODE_DATA(nid);
> -
> -	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> -		struct zone *zone = &pgdat->node_zones[zid];
> -
> -		if (zone_spans_pfn(zone, pfn))
> -			break;
> -	}
> -	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
> -
> -	if (pageblock_aligned(pfn)) {
> -		enum migratetype mt =
> -			kho_scratch_migratetype(pfn, MIGRATE_MOVABLE);
> -		init_pageblock_migratetype(pfn_to_page(pfn), mt, false);
> -	}
> -}
> -
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
>  {
> @@ -771,10 +746,27 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
>  
>  static void __meminit __init_deferred_page(unsigned long pfn, int nid)
>  {
> +	pg_data_t *pgdat;
> +	int zid;
> +
>  	if (early_page_initialised(pfn, nid))
>  		return;
>  
> -	__init_page_from_nid(pfn, nid);
> +	pgdat = NODE_DATA(nid);

Nit: we can initialize pgdat at declaration line, other than that

Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>


> +
> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +		struct zone *zone = &pgdat->node_zones[zid];
> +
> +		if (zone_spans_pfn(zone, pfn))
> +			break;
> +	}
> +	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
> +
> +	if (pageblock_aligned(pfn)) {
> +		enum migratetype mt =
> +			kho_scratch_migratetype(pfn, MIGRATE_MOVABLE);
> +		init_pageblock_migratetype(pfn_to_page(pfn), mt, false);
> +	}
>  }
>  #else
>  static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
> -- 
> 2.54.0
> 
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 14/19] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation
  2026-06-02 10:10 ` [PATCH v3 14/19] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation Muchun Song
@ 2026-06-02 15:41   ` Mike Rapoport
  2026-06-03  2:53     ` Muchun Song
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2026-06-02 15:41 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue, 02 Jun 2026 18:10:34 +0800, Muchun Song <songmuchun@bytedance.com> wrote:

Hi Muchun,

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5e557c05d80a..218fb1ca45f4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3073,22 +3076,38 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
> [ ... skip 26 lines ... ]
> +		 * pages belonging to the requested node.
> +		 */
> +		if (WARN_ON_ONCE(nid_request != NUMA_NO_NODE && nid != nid_request))
> +			list_add(&m->list, &huge_boot_pages[nid_request]);
> +		else
> +			list_add(&m->list, &huge_boot_pages[nid]);

Can we just memblock_free() the page that intersects zones here?

Rather than making alloc_bootmem_huge_page() return bool (sorry, my bad :)),
we can make it return -ENOMEM when memblock_alloc() fails, 0 if the page is
not usable, and 1 (i.e. the number of allocated gigantic pages) if everything
is fine.

The callers would need a bit of massage, but it still seems simpler to
me than adding them to the list and then walking that list.
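For illustration, a minimal userspace sketch of that tri-state convention. try_alloc_gigantic(), alloc_gigantic_pages() and the scripted fake allocator are hypothetical names, not kernel code: a negative errno stops the caller, 0 means the page was unusable and released on the spot, and 1 is added to the count of allocated pages.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the allocator outcome: each call consumes
 * one scripted result.  In the kernel this would be memblock_alloc()
 * plus the cross-zone check; here it is just an array.
 */
enum fake_result { FAKE_OK, FAKE_CROSS_ZONE, FAKE_NOMEM };

struct fake_alloc {
	const enum fake_result *script;
	int len;
	int pos;
};

/*
 * Returns -ENOMEM if allocation failed, 0 if a page was allocated but
 * is unusable (the kernel would free it right here), and 1 if one
 * gigantic page was allocated successfully.
 */
static int try_alloc_gigantic(struct fake_alloc *fa)
{
	if (fa->pos >= fa->len)
		return -ENOMEM;

	switch (fa->script[fa->pos++]) {
	case FAKE_OK:
		return 1;
	case FAKE_CROSS_ZONE:
		/* Kernel equivalent: memblock_free() the range here. */
		return 0;
	default:
		return -ENOMEM;
	}
}

/* Caller: keep trying until 'wanted' pages are allocated or -ENOMEM. */
static int alloc_gigantic_pages(struct fake_alloc *fa, int wanted)
{
	int allocated = 0;

	while (allocated < wanted) {
		int ret = try_alloc_gigantic(fa);

		if (ret < 0)
			break;
		allocated += ret;	/* 0 for an unusable page, 1 otherwise */
	}
	return allocated;
}
```

The fake allocator consumes a fresh result on every call; a real allocator that kept returning the same unusable block would make the 0 case spin, which is the concern raised in the reply to this message.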

-- 
Sincerely yours,
Mike.




* Re: [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
  2026-06-02 10:10 ` [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init Muchun Song
@ 2026-06-02 15:41   ` Mike Rapoport
  2026-06-03  2:42     ` Muchun Song
  2026-06-03 12:02   ` Usama Arif
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2026-06-02 15:41 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue, 02 Jun 2026 18:10:35 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ea6af85bfec1..464578ee246e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -745,6 +745,8 @@ static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
>  	return true;
>  }
>  
> +static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn);
> +

Can we please move the entire function rather than add a forward
declaration?

Other than that

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

-- 
Sincerely yours,
Mike.




* Re: [PATCH v3 16/19] mm/hugetlb: Remove obsolete bootmem cross-zone checks
  2026-06-02 10:10 ` [PATCH v3 16/19] mm/hugetlb: Remove obsolete bootmem cross-zone checks Muchun Song
@ 2026-06-02 15:41   ` Mike Rapoport
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Rapoport @ 2026-06-02 15:41 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue, 02 Jun 2026 18:10:36 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> Bootmem gigantic HugeTLB pages used to be validated again during
> gather_bootmem_prealloc_node() and any cross-zone pages were discarded
> there.
> 
> That validation is no longer needed. Cross-zone bootmem gigantic pages
> are now detected during allocation and freed before they reach the later
> bootmem gathering path, so the remaining pages are already zone-valid.
> 
> [...]

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

-- 
Sincerely yours,
Mike.




* Re: [PATCH v3 17/19] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late()
  2026-06-02 10:10 ` [PATCH v3 17/19] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late() Muchun Song
@ 2026-06-02 15:41   ` Mike Rapoport
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Rapoport @ 2026-06-02 15:41 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue, 02 Jun 2026 18:10:37 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> hugetlb_vmemmap_init_late() no longer has any users, so the remaining
> late-init path in sparse_vmemmap_init_nid_late() is dead code.
> 
> Remove sparse_vmemmap_init_nid_late() and its declarations.

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

-- 
Sincerely yours,
Mike.




* Re: [PATCH v3 18/19] mm/hugetlb: Remove unused bootmem cma field
  2026-06-02 10:10 ` [PATCH v3 18/19] mm/hugetlb: Remove unused bootmem cma field Muchun Song
@ 2026-06-02 15:41   ` Mike Rapoport
  2026-06-03  2:41     ` Muchun Song
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2026-06-02 15:41 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue, 02 Jun 2026 18:10:38 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> struct huge_bootmem_page no longer needs to keep the CMA pointer. The
> bootmem path only needs to remember whether a huge page came from CMA,
> which is already encoded in the flags field.
> 
> Set HUGE_BOOTMEM_CMA when the page is allocated and drop the unused cma
> field together with the redundant assignments.

It looks like the commit does more refactoring; please mention it in the
changelog.
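A rough sketch of what replacing the CMA pointer with a flag bit amounts to. The struct and the SKETCH_* values are hypothetical stand-ins for struct huge_bootmem_page and HUGE_BOOTMEM_CMA; the real flag values live in the kernel headers and are not assumed here.

```c
#include <stdbool.h>

/* Hypothetical flag values; the kernel's HUGE_BOOTMEM_* may differ. */
#define SKETCH_BOOTMEM_HVO		0x0001UL
#define SKETCH_BOOTMEM_CMA		0x0004UL

/* Hypothetical mirror of a bootmem page descriptor without a cma pointer. */
struct bootmem_page_sketch {
	void *hstate;		/* stand-in for struct hstate * */
	unsigned long flags;
};

/* Record at allocation time that the page came from CMA. */
static void mark_from_cma(struct bootmem_page_sketch *m)
{
	m->flags |= SKETCH_BOOTMEM_CMA;
}

/* Later users only need this yes/no answer, not the CMA area itself. */
static bool bootmem_page_from_cma(const struct bootmem_page_sketch *m)
{
	return (m->flags & SKETCH_BOOTMEM_CMA) != 0;
}
```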

-- 
Sincerely yours,
Mike.




* Re: [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page()
  2026-06-02 10:10 ` [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page() Muchun Song
  2026-06-02 14:46   ` Mike Rapoport
@ 2026-06-02 15:41   ` Mike Rapoport
  2026-06-03  2:39     ` Muchun Song
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Rapoport @ 2026-06-02 15:41 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue, 02 Jun 2026 18:10:39 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 41b83dd18c01..f1bbf3b9a321 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -771,10 +746,27 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
>  
>  static void __meminit __init_deferred_page(unsigned long pfn, int nid)
>  {
> +	pg_data_t *pgdat;
> +	int zid;
> +
>  	if (early_page_initialised(pfn, nid))
>  		return;
>  
> -	__init_page_from_nid(pfn, nid);
> +	pgdat = NODE_DATA(nid);

Nit: we can initialize pgdat on its declaration line; other than that

Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
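A userspace sketch of the folded zone lookup with pgdat initialized at its declaration, as the nit suggests. The *_sketch types, the MAX_NR_ZONES value and find_zone_idx() are simplified stand-ins for pg_data_t, struct zone and the loop in the patch, not the actual kernel definitions.

```c
#include <stdbool.h>

#define MAX_NR_ZONES 4	/* assumed; the real value is config-dependent */

/* Simplified stand-in for struct zone: covers [zone_start_pfn, +spanned_pages). */
struct zone_sketch {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/* Simplified stand-in for pg_data_t. */
struct pgdat_sketch {
	struct zone_sketch node_zones[MAX_NR_ZONES];
};

static bool zone_spans_pfn_sketch(const struct zone_sketch *zone,
				  unsigned long pfn)
{
	return zone->zone_start_pfn <= pfn &&
	       pfn < zone->zone_start_pfn + zone->spanned_pages;
}

/*
 * Mirrors the lookup loop: &nodes[nid] plays the role of NODE_DATA(nid),
 * initialized directly at the declaration.
 */
static int find_zone_idx(struct pgdat_sketch *nodes, int nid,
			 unsigned long pfn)
{
	struct pgdat_sketch *pgdat = &nodes[nid];
	int zid;

	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
		if (zone_spans_pfn_sketch(&pgdat->node_zones[zid], pfn))
			break;
	}
	return zid;
}
```

If no zone spans the PFN, the loop falls through with zid == MAX_NR_ZONES, same as the loop in the patch; the sketch returns that value rather than handling it.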

-- 
Sincerely yours,
Mike.




* Re: [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page()
  2026-06-02 15:41   ` Mike Rapoport
@ 2026-06-03  2:39     ` Muchun Song
  0 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-03  2:39 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, linux-mm, linux-kernel,
	Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev



> On Jun 2, 2026, at 23:41, Mike Rapoport <rppt@kernel.org> wrote:
> 
> On Tue, 02 Jun 2026 18:10:39 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>> index 41b83dd18c01..f1bbf3b9a321 100644
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -771,10 +746,27 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
>> 
>> static void __meminit __init_deferred_page(unsigned long pfn, int nid)
>> {
>> + 	pg_data_t *pgdat;
>> + 	int zid;
>> +
>> 	if (early_page_initialised(pfn, nid))
>> 		return;
>> 
>> - 	__init_page_from_nid(pfn, nid);
>> + 	pgdat = NODE_DATA(nid);
> 
> Nit: we can initialize pgdat at declaration line, other than that

Yes, will do in the next version.

> 
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Thanks.

> 
> -- 
> Sincerely yours,
> Mike.
> 




* Re: [PATCH v3 18/19] mm/hugetlb: Remove unused bootmem cma field
  2026-06-02 15:41   ` Mike Rapoport
@ 2026-06-03  2:41     ` Muchun Song
  0 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-03  2:41 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, linux-mm, linux-kernel,
	Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev



> On Jun 2, 2026, at 23:41, Mike Rapoport <rppt@kernel.org> wrote:
> 
> On Tue, 02 Jun 2026 18:10:38 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
>> struct huge_bootmem_page no longer needs to keep the CMA pointer. The
>> bootmem path only needs to remember whether a huge page came from CMA,
>> which is already encoded in the flags field.
>> 
>> Set HUGE_BOOTMEM_CMA when the page is allocated and drop the unused cma
>> field together with the redundant assignments.
> 
> It looks like the commit does more refactoring, please mention it in the
> changelog.

Will do.

Thanks.


* Re: [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
  2026-06-02 15:41   ` Mike Rapoport
@ 2026-06-03  2:42     ` Muchun Song
  0 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-03  2:42 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, linux-mm, linux-kernel,
	Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz



> On Jun 2, 2026, at 23:41, Mike Rapoport <rppt@kernel.org> wrote:
> 
> On Tue, 02 Jun 2026 18:10:35 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ea6af85bfec1..464578ee246e 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -745,6 +745,8 @@ static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
>> 	return true;
>> }
>> 
>> +static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn);
>> +
> 
> Can we please move the entire function rather than add a forward
> declaration?

Yes, will update in the next version.

> 
> Other than that
> 
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Thanks.

> 
> -- 
> Sincerely yours,
> Mike.
> 




* Re: [PATCH v3 14/19] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation
  2026-06-02 15:41   ` Mike Rapoport
@ 2026-06-03  2:53     ` Muchun Song
  0 siblings, 0 replies; 37+ messages in thread
From: Muchun Song @ 2026-06-03  2:53 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, linux-mm, linux-kernel,
	Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev



> On Jun 2, 2026, at 23:41, Mike Rapoport <rppt@kernel.org> wrote:
> 
> On Tue, 02 Jun 2026 18:10:34 +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> 
> Hi Muchun,

Hi Mike,

> 
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 5e557c05d80a..218fb1ca45f4 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3073,22 +3076,38 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
>> [ ... skip 26 lines ... ]
>> + 	* pages belonging to the requested node.
>> + 	*/
>> + 	if (WARN_ON_ONCE(nid_request != NUMA_NO_NODE && nid != nid_request))
>> + 		list_add(&m->list, &huge_boot_pages[nid_request]);
>> + 	else
>> + 		list_add(&m->list, &huge_boot_pages[nid]);
> 
> Can we just memblock_free() the page that intersects zones here?

I had previously considered doing this, but then I realized that if we free the
allocated cross-zone memory here, memblock is very likely to pick the exact same
block for the next allocation. We would just end up with the same cross-zone
memory again, wasting allocation attempts. Absent a way to mark the block so
that memblock avoids handing it out again, I chose to defer the release to
prevent this.
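A toy model of this reasoning, assuming an allocator that always hands out the first free block (a simplification; memblock's real placement policy is more involved). Block 'bad' stands in for a cross-zone gigantic page; toy_alloc, alloc_usable() and the other names are hypothetical.

```c
#include <stdbool.h>

#define NR_BLOCKS 4

/* Toy allocator: NR_BLOCKS equal-sized blocks, first free one wins. */
struct toy_alloc {
	bool used[NR_BLOCKS];
};

static int toy_alloc_block(struct toy_alloc *a)
{
	for (int i = 0; i < NR_BLOCKS; i++) {
		if (!a->used[i]) {
			a->used[i] = true;
			return i;
		}
	}
	return -1;
}

static void toy_free_block(struct toy_alloc *a, int i)
{
	a->used[i] = false;
}

/*
 * Try to obtain 'wanted' usable blocks.  With defer_free, the bad block
 * stays allocated (parked) until the loop ends, so it cannot be handed
 * out again; otherwise it is freed immediately.  *attempts counts
 * allocation calls; max_attempts bounds the loop.
 */
static int alloc_usable(struct toy_alloc *a, int wanted, int bad,
			bool defer_free, int max_attempts, int *attempts)
{
	int parked[NR_BLOCKS];
	int nr_parked = 0, got = 0;

	*attempts = 0;
	while (got < wanted && *attempts < max_attempts) {
		int blk = toy_alloc_block(a);

		(*attempts)++;
		if (blk < 0)
			break;
		if (blk != bad) {
			got++;
			continue;
		}
		if (defer_free)
			parked[nr_parked++] = blk;	/* release after the loop */
		else
			toy_free_block(a, blk);		/* next call returns it again */
	}
	for (int i = 0; i < nr_parked; i++)
		toy_free_block(a, parked[i]);
	return got;
}
```

With immediate freeing and bad == 0, every call returns block 0 again, so the loop only stops at max_attempts; deferring the free lets the following calls move on to blocks 1 and 2.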

Thanks.

> 
> Rather than making alloc_bootmem_huge_page() bool (sorry my bad :)) we
> can make it return -ENOMEM when memblock_alloc() fails, 0 if the page is
> not usable and 1 (i.e. number of allocated gigantic pages) if everything
> is fine.
> 
> The callers would need a bit of massage, but it still seems simpler to
> me than adding them to the list and then walking that list.
> 
> -- 
> Sincerely yours,
> Mike.
> 




* Re: [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
  2026-06-02 10:10 ` [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init Muchun Song
  2026-06-02 15:41   ` Mike Rapoport
@ 2026-06-03 12:02   ` Usama Arif
  2026-06-03 12:24     ` Muchun Song
  1 sibling, 1 reply; 37+ messages in thread
From: Usama Arif @ 2026-06-03 12:02 UTC (permalink / raw)
  To: Muchun Song
  Cc: Usama Arif, Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz

On Tue,  2 Jun 2026 18:10:35 +0800 Muchun Song <songmuchun@bytedance.com> wrote:

> Bootmem HugeTLB pages currently defer HVO setup to
> hugetlb_vmemmap_init_late(), because the optimization needs zone
> information.
> 
> Now that zone initialization is available earlier, the bootmem HVO setup
> can be done directly from hugetlb_vmemmap_init_early(). This lets
> gigantic HugeTLB pages apply HVO as soon as they are allocated.
> 
> Bootmem gigantic pages that span multiple zones are now filtered out
> when they are allocated, so the remaining bootmem gigantic pages seen by
> later hugetlb initialization are already zone-valid. As a result,
> hugetlb_vmemmap_init_late() no longer needs to handle bootmem HVO setup.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb_vmemmap.c | 67 +++++++++-----------------------------------
>  1 file changed, 13 insertions(+), 54 deletions(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ea6af85bfec1..464578ee246e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -745,6 +745,8 @@ static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
>  	return true;
>  }
>  
> +static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn);
> +
>  /*
>   * Initialize memmap section for a gigantic page, HVO-style.
>   */
> @@ -752,6 +754,7 @@ void __init hugetlb_vmemmap_init_early(int nid)
>  {
>  	unsigned long psize, paddr, section_size;
>  	unsigned long ns, i, pnum, pfn, nr_pages;
> +	unsigned long start, end;
>  	struct huge_bootmem_page *m = NULL;
>  	void *map;
>  
> @@ -761,6 +764,8 @@ void __init hugetlb_vmemmap_init_early(int nid)
>  	section_size = (1UL << PA_SECTION_SHIFT);
>  
>  	list_for_each_entry(m, &huge_boot_pages[nid], list) {
> +		struct zone *zone;
> +
>  		if (!vmemmap_should_optimize_bootmem_page(m))
>  			continue;
>  
> @@ -769,6 +774,14 @@ void __init hugetlb_vmemmap_init_early(int nid)
>  		paddr = virt_to_phys(m);
>  		pfn = PHYS_PFN(paddr);
>  		map = pfn_to_page(pfn);
> +		start = (unsigned long)map;
> +		end = start + hugetlb_vmemmap_size(m->hstate);
> +		zone = pfn_to_zone(nid, pfn);
> +
> +		if (vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
> +					 zone, HUGETLB_VMEMMAP_RESERVE_SIZE))
> +			panic("Failed to allocate memmap for HugeTLB page\n");

The replaced hugetlb_vmemmap_init_late() path used to fall back to
vmemmap_populate() if vmemmap_populate_hvo() returned an error and
just lost the HVO optimization for that page.

The new path panics on any non-zero return.  Is the panic intended,
given that vmemmap_populate_hvo() returns -ENOMEM on allocation
failure and HVO is normally treated as an optimization rather than a
hard requirement?
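For scale, a rough sketch of the memmap accounting in the removed late path, assuming 4 KiB base pages, a 64-byte struct page and a one-page reserve per optimized gigantic page (the role HUGETLB_VMEMMAP_RESERVE_SIZE plays in the diff). Under those assumptions a 1 GiB page is charged 1 page of vmemmap with HVO and 4096 pages (16 MiB) on the vmemmap_populate() fallback. The helpers and SKETCH_* constants are hypothetical.

```c
/* Assumptions for illustration only. */
#define SKETCH_PAGE_SIZE		4096UL
#define SKETCH_STRUCT_PAGE		64UL
#define SKETCH_RESERVE_SIZE		SKETCH_PAGE_SIZE
#define SKETCH_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Bytes of vmemmap needed to describe one huge page of 'huge_size' bytes. */
static unsigned long vmemmap_bytes(unsigned long huge_size)
{
	return huge_size / SKETCH_PAGE_SIZE * SKETCH_STRUCT_PAGE;
}

/*
 * Base pages charged for one huge page's memmap: only the reserve when
 * HVO succeeds, the whole vmemmap range on the plain fallback.
 */
static unsigned long memmap_pages_charged(unsigned long huge_size, int hvo_ok)
{
	unsigned long nr_mmap = hvo_ok ? SKETCH_RESERVE_SIZE
				       : vmemmap_bytes(huge_size);

	return SKETCH_DIV_ROUND_UP(nr_mmap, SKETCH_PAGE_SIZE);
}
```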

> +		memmap_boot_pages_add(DIV_ROUND_UP(HUGETLB_VMEMMAP_RESERVE_SIZE, PAGE_SIZE));
>  
>  		pnum = pfn_to_section_nr(pfn);
>  		ns = psize / section_size;
> @@ -800,60 +813,6 @@ static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
>  
>  void __init hugetlb_vmemmap_init_late(int nid)
>  {
> -	struct huge_bootmem_page *m, *tm;
> -	unsigned long phys, nr_pages, start, end;
> -	unsigned long pfn, nr_mmap;
> -	struct zone *zone = NULL;
> -	struct hstate *h;
> -	void *map;
> -
> -	if (!READ_ONCE(vmemmap_optimize_enabled))
> -		return;
> -
> -	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
> -		if (!(m->flags & HUGE_BOOTMEM_HVO))
> -			continue;
> -
> -		phys = virt_to_phys(m);
> -		h = m->hstate;
> -		pfn = PHYS_PFN(phys);
> -		nr_pages = pages_per_huge_page(h);
> -		map = pfn_to_page(pfn);
> -		start = (unsigned long)map;
> -		end = start + nr_pages * sizeof(struct page);
> -
> -		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
> -			/*
> -			 * Oops, the hugetlb page spans multiple zones.
> -			 * Remove it from the list, and populate it normally.
> -			 */
> -			list_del(&m->list);
> -
> -			vmemmap_populate(start, end, nid, NULL);
> -			nr_mmap = end - start;
> -			memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
> -
> -			memblock_phys_free(phys, huge_page_size(h));
> -			continue;
> -		}
> -
> -		if (!zone || !zone_spans_pfn(zone, pfn))
> -			zone = pfn_to_zone(nid, pfn);
> -		if (WARN_ON_ONCE(!zone))
> -			continue;
> -
> -		if (vmemmap_populate_hvo(start, end, huge_page_order(h), zone,
> -					 HUGETLB_VMEMMAP_RESERVE_SIZE) < 0) {
> -			/* Fallback if HVO population fails */
> -			vmemmap_populate(start, end, nid, NULL);
> -			nr_mmap = end - start;
> -		} else {
> -			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
> -			nr_mmap = HUGETLB_VMEMMAP_RESERVE_SIZE;
> -		}
> -
> -		memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
> -	}
>  }
>  #endif
>  
> -- 
> 2.54.0
> 
> 



* Re: [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
  2026-06-03 12:02   ` Usama Arif
@ 2026-06-03 12:24     ` Muchun Song
  2026-06-03 12:35       ` Usama Arif
  0 siblings, 1 reply; 37+ messages in thread
From: Muchun Song @ 2026-06-03 12:24 UTC (permalink / raw)
  To: Usama Arif
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev, Muchun Song



On 2026/6/3 20:02, Usama Arif wrote:
> On Tue,  2 Jun 2026 18:10:35 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>
>> Bootmem HugeTLB pages currently defer HVO setup to
>> hugetlb_vmemmap_init_late(), because the optimization needs zone
>> information.
>>
>> Now that zone initialization is available earlier, the bootmem HVO setup
>> can be done directly from hugetlb_vmemmap_init_early(). This lets
>> gigantic HugeTLB pages apply HVO as soon as they are allocated.
>>
>> Bootmem gigantic pages that span multiple zones are now filtered out
>> when they are allocated, so the remaining bootmem gigantic pages seen by
>> later hugetlb initialization are already zone-valid. As a result,
>> hugetlb_vmemmap_init_late() no longer needs to handle bootmem HVO setup.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>>   mm/hugetlb_vmemmap.c | 67 +++++++++-----------------------------------
>>   1 file changed, 13 insertions(+), 54 deletions(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ea6af85bfec1..464578ee246e 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -745,6 +745,8 @@ static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
>>   	return true;
>>   }
>>   
>> +static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn);
>> +
>>   /*
>>    * Initialize memmap section for a gigantic page, HVO-style.
>>    */
>> @@ -752,6 +754,7 @@ void __init hugetlb_vmemmap_init_early(int nid)
>>   {
>>   	unsigned long psize, paddr, section_size;
>>   	unsigned long ns, i, pnum, pfn, nr_pages;
>> +	unsigned long start, end;
>>   	struct huge_bootmem_page *m = NULL;
>>   	void *map;
>>   
>> @@ -761,6 +764,8 @@ void __init hugetlb_vmemmap_init_early(int nid)
>>   	section_size = (1UL << PA_SECTION_SHIFT);
>>   
>>   	list_for_each_entry(m, &huge_boot_pages[nid], list) {
>> +		struct zone *zone;
>> +
>>   		if (!vmemmap_should_optimize_bootmem_page(m))
>>   			continue;
>>   
>> @@ -769,6 +774,14 @@ void __init hugetlb_vmemmap_init_early(int nid)
>>   		paddr = virt_to_phys(m);
>>   		pfn = PHYS_PFN(paddr);
>>   		map = pfn_to_page(pfn);
>> +		start = (unsigned long)map;
>> +		end = start + hugetlb_vmemmap_size(m->hstate);
>> +		zone = pfn_to_zone(nid, pfn);
>> +
>> +		if (vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
>> +					 zone, HUGETLB_VMEMMAP_RESERVE_SIZE))
>> +			panic("Failed to allocate memmap for HugeTLB page\n");
> The replaced hugetlb_vmemmap_init_late() path used to fall back to
> vmemmap_populate() if vmemmap_populate_hvo() returned an error and
> just lost the HVO optimization for that page.
>
> The new path panics on any non-zero return.  Is the panic intended,
> given that vmemmap_populate_hvo() returns -ENOMEM on allocation
> failure and HVO is normally treated as an optimization rather than a
> hard requirement?

This is intentional; see patch 6:

     mm/sparse: Panic on memmap and usemap allocation failure

We already panic on OOM anyway.

Muchun,
Thanks.

>
>> +		memmap_boot_pages_add(DIV_ROUND_UP(HUGETLB_VMEMMAP_RESERVE_SIZE, PAGE_SIZE));
>>   
>>   		pnum = pfn_to_section_nr(pfn);
>>   		ns = psize / section_size;
>> @@ -800,60 +813,6 @@ static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
>>   
>>   void __init hugetlb_vmemmap_init_late(int nid)
>>   {
>> -	struct huge_bootmem_page *m, *tm;
>> -	unsigned long phys, nr_pages, start, end;
>> -	unsigned long pfn, nr_mmap;
>> -	struct zone *zone = NULL;
>> -	struct hstate *h;
>> -	void *map;
>> -
>> -	if (!READ_ONCE(vmemmap_optimize_enabled))
>> -		return;
>> -
>> -	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
>> -		if (!(m->flags & HUGE_BOOTMEM_HVO))
>> -			continue;
>> -
>> -		phys = virt_to_phys(m);
>> -		h = m->hstate;
>> -		pfn = PHYS_PFN(phys);
>> -		nr_pages = pages_per_huge_page(h);
>> -		map = pfn_to_page(pfn);
>> -		start = (unsigned long)map;
>> -		end = start + nr_pages * sizeof(struct page);
>> -
>> -		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
>> -			/*
>> -			 * Oops, the hugetlb page spans multiple zones.
>> -			 * Remove it from the list, and populate it normally.
>> -			 */
>> -			list_del(&m->list);
>> -
>> -			vmemmap_populate(start, end, nid, NULL);
>> -			nr_mmap = end - start;
>> -			memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
>> -
>> -			memblock_phys_free(phys, huge_page_size(h));
>> -			continue;
>> -		}
>> -
>> -		if (!zone || !zone_spans_pfn(zone, pfn))
>> -			zone = pfn_to_zone(nid, pfn);
>> -		if (WARN_ON_ONCE(!zone))
>> -			continue;
>> -
>> -		if (vmemmap_populate_hvo(start, end, huge_page_order(h), zone,
>> -					 HUGETLB_VMEMMAP_RESERVE_SIZE) < 0) {
>> -			/* Fallback if HVO population fails */
>> -			vmemmap_populate(start, end, nid, NULL);
>> -			nr_mmap = end - start;
>> -		} else {
>> -			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
>> -			nr_mmap = HUGETLB_VMEMMAP_RESERVE_SIZE;
>> -		}
>> -
>> -		memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
>> -	}
>>   }
>>   #endif
>>   
>> -- 
>> 2.54.0
>>
>>




* Re: [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
  2026-06-03 12:24     ` Muchun Song
@ 2026-06-03 12:35       ` Usama Arif
  0 siblings, 0 replies; 37+ messages in thread
From: Usama Arif @ 2026-06-03 12:35 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman, Mike Rapoport,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, linux-mm,
	linux-kernel, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Ritesh Harjani (IBM), Aneesh Kumar K.V, linuxppc-dev, Muchun Song



On 03/06/2026 13:24, Muchun Song wrote:
> 
> 
> On 2026/6/3 20:02, Usama Arif wrote:
>> On Tue,  2 Jun 2026 18:10:35 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>>
>>> Bootmem HugeTLB pages currently defer HVO setup to
>>> hugetlb_vmemmap_init_late(), because the optimization needs zone
>>> information.
>>>
>>> Now that zone initialization is available earlier, the bootmem HVO setup
>>> can be done directly from hugetlb_vmemmap_init_early(). This lets
>>> gigantic HugeTLB pages apply HVO as soon as they are allocated.
>>>
>>> Bootmem gigantic pages that span multiple zones are now filtered out
>>> when they are allocated, so the remaining bootmem gigantic pages seen by
>>> later hugetlb initialization are already zone-valid. As a result,
>>> hugetlb_vmemmap_init_late() no longer needs to handle bootmem HVO setup.
>>>
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> ---

Acked-by: Usama Arif <usama.arif@linux.dev>




* Re: [PATCH v3 03/19] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population
  2026-06-02 10:10 ` [PATCH v3 03/19] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population Muchun Song
@ 2026-06-03 14:36   ` Ritesh Harjani
  0 siblings, 0 replies; 37+ messages in thread
From: Ritesh Harjani @ 2026-06-03 14:36 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	Madhavan Srinivasan, Michael Ellerman
  Cc: Muchun Song, Mike Rapoport, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, linux-mm, linux-kernel, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Aneesh Kumar K.V, linuxppc-dev,
	Mike Kravetz, Muchun Song

Muchun Song <songmuchun@bytedance.com> writes:

> vmemmap_populate_compound_pages() uses addr_pfn to determine the PFN
> offset within a compound page and to decide whether the current
> vmemmap slot should be populated as a head page mapping or should reuse
> a tail page mapping.
>
> However, addr_pfn is advanced manually in parallel with addr.  The loop
> itself progresses in vmemmap address space, so each PAGE_SIZE step in
> addr covers PAGE_SIZE / sizeof(struct page) struct page slots.  Since
> addr_pfn is compared against nr_pages in data-PFN units, it should
> advance by the same number of PFNs.  The existing manual increments do
> not match that and therefore do not reliably track the PFN
> corresponding to the current addr.
>
> As a result, pfn_offset can be computed from the wrong PFN and the code
> can make the head/tail decision for the wrong compound-page position.
>
> Fix this by deriving addr_pfn directly from the current vmemmap address
> instead of carrying it as loop state.
>
> Fixes: f2b79c0d7968 ("powerpc/book3s64/radix: add support for vmemmap optimization for radix")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Oscar Salvador <osalvador@suse.de>

Thanks for fixing it. I guess this was not caught because the section size
on powerpc is 16MB, and with a 64K page size we have 256 PFNs to map. The
vmemmap size required for this is 256 * sizeof(struct page) = 16KB, which
is < 64K (PAGE_SIZE). So we never actually loop in
vmemmap_populate_compound_pages(), because next = addr + PAGE_SIZE is
already past end after the first iteration.

But I agree this is a bug which needs fixing, and it is easily caught
with a 4K page size, where we have 4096 PFNs to map within a 16MB section.
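The arithmetic above can be checked with a small sketch, assuming 16 MiB sections and a 64-byte struct page. It also models the relationship the fix relies on: for a linear vmemmap, each PAGE_SIZE step in addr advances the described PFN by PAGE_SIZE / sizeof(struct page), not by 1. The helper names and the vmemmap_base parameter are hypothetical.

```c
/* Assumptions for illustration: 16 MiB sections, 64-byte struct page. */
#define SECTION_SIZE_SKETCH	(16UL << 20)
#define STRUCT_PAGE_SKETCH	64UL

/* Number of PAGE_SIZE steps needed to cover one section's vmemmap. */
static unsigned long loop_iterations(unsigned long page_size)
{
	unsigned long nr_pfns = SECTION_SIZE_SKETCH / page_size;
	unsigned long vmemmap_bytes = nr_pfns * STRUCT_PAGE_SKETCH;

	return (vmemmap_bytes + page_size - 1) / page_size;
}

/*
 * PFN described by a vmemmap address, for a linear vmemmap starting at
 * 'vmemmap_base' that describes PFN 0.  Deriving it from addr (rather
 * than carrying it as separate loop state) keeps the two in sync.
 */
static unsigned long vmemmap_addr_to_pfn(unsigned long addr,
					 unsigned long vmemmap_base)
{
	return (addr - vmemmap_base) / STRUCT_PAGE_SKETCH;
}
```

With 64K pages the section needs a single iteration, so the drifting addr_pfn is never consulted a second time; with 4K pages it takes 64 iterations, and each one must advance the PFN by 64.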


The change looks good to me. Can we please add stable tag too?
Cc: stable@kernel.org

Also, feel free to add:
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>




Thread overview: 37+ messages
2026-06-02 10:10 [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Muchun Song
2026-06-02 10:10 ` [PATCH v3 01/19] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages Muchun Song
2026-06-02 10:10 ` [PATCH v3 02/19] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios() Muchun Song
2026-06-02 10:10 ` [PATCH v3 03/19] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population Muchun Song
2026-06-03 14:36   ` Ritesh Harjani
2026-06-02 10:10 ` [PATCH v3 04/19] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier Muchun Song
2026-06-02 10:10 ` [PATCH v3 05/19] mm/mm_init: Simplify deferred_free_pages() migratetype init Muchun Song
2026-06-02 10:10 ` [PATCH v3 06/19] mm/sparse: Panic on memmap and usemap allocation failure Muchun Song
2026-06-02 10:10 ` [PATCH v3 07/19] mm/sparse: Move subsection_map_init() into sparse_init() Muchun Song
2026-06-02 10:10 ` [PATCH v3 08/19] mm/mm_init: Defer sparse_init() until after zone initialization Muchun Song
2026-06-02 10:10 ` [PATCH v3 09/19] mm/mm_init: Defer hugetlb reservation " Muchun Song
2026-06-02 10:10 ` [PATCH v3 10/19] mm/mm_init: Remove set_pageblock_order() call from sparse_init() Muchun Song
2026-06-02 10:10 ` [PATCH v3 11/19] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
2026-06-02 10:10 ` [PATCH v3 12/19] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time Muchun Song
2026-06-02 10:10 ` [PATCH v3 13/19] mm/hugetlb: Refactor early boot gigantic hugepage allocation Muchun Song
2026-06-02 10:10 ` [PATCH v3 14/19] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation Muchun Song
2026-06-02 15:41   ` Mike Rapoport
2026-06-03  2:53     ` Muchun Song
2026-06-02 10:10 ` [PATCH v3 15/19] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init Muchun Song
2026-06-02 15:41   ` Mike Rapoport
2026-06-03  2:42     ` Muchun Song
2026-06-03 12:02   ` Usama Arif
2026-06-03 12:24     ` Muchun Song
2026-06-03 12:35       ` Usama Arif
2026-06-02 10:10 ` [PATCH v3 16/19] mm/hugetlb: Remove obsolete bootmem cross-zone checks Muchun Song
2026-06-02 15:41   ` Mike Rapoport
2026-06-02 10:10 ` [PATCH v3 17/19] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late() Muchun Song
2026-06-02 15:41   ` Mike Rapoport
2026-06-02 10:10 ` [PATCH v3 18/19] mm/hugetlb: Remove unused bootmem cma field Muchun Song
2026-06-02 15:41   ` Mike Rapoport
2026-06-03  2:41     ` Muchun Song
2026-06-02 10:10 ` [PATCH v3 19/19] mm/mm_init: Fold __init_page_from_nid() into __init_deferred_page() Muchun Song
2026-06-02 14:46   ` Mike Rapoport
2026-06-02 15:41   ` Mike Rapoport
2026-06-03  2:39     ` Muchun Song
2026-06-02 10:34 ` [PATCH v3 00/19] mm: Refactor bootmem gigantic hugepage allocation Oscar Salvador (SUSE)
2026-06-02 12:01   ` Muchun Song
