* [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX
@ 2026-05-13 13:04 Muchun Song
2026-05-13 13:04 ` [PATCH v2 01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages Muchun Song
` (47 more replies)
0 siblings, 48 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In this series, HVO is redefined as Hugepage Vmemmap Optimization: a
general vmemmap optimization model for large hugepage-backed mappings,
rather than a HugeTLB-only implementation detail.
The existing code grew around the original HugeTLB-specific HVO path,
while device DAX developed similar but separate vmemmap optimization
handling. As a result, the current implementation carries duplicated
logic, boot-time special cases, and subsystem-specific interfaces around
what is fundamentally the same sparse-vmemmap optimization.
This series generalizes that optimization into a common framework used
by both HugeTLB and device DAX.
The first few patches include some minor bug fixes found during AI-aided
review of the current code. These fixes are not the main goal of the
series, but the later refactoring and unification work depends on them,
so they are included here as preparatory changes.
The series then reworks the relevant early boot and sparse
initialization paths, introduces a generic section-based sparse-vmemmap
optimization infrastructure, switches HugeTLB and device DAX over to the
shared implementation, and removes the old special-case code.
At a high level, the series does the following:
- apply a small set of preparatory bug fixes
- reorder early boot and sparse initialization so optimized vmemmap
setup has the required zone and pageblock state
- introduce generic section-based vmemmap optimization infrastructure
- switch HugeTLB and device DAX to the shared implementation
- consolidate HVO enablement and naming
- remove obsolete HugeTLB-specific boot-time and architecture-specific
optimization code
- rewrite the documentation around the unified design
This brings a few concrete benefits:
- HugeTLB and device DAX share one vmemmap optimization framework,
reducing duplicated logic and long-term maintenance overhead
- when CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, optimized struct
pages can skip the usual memmap_init() initialization work, which
helps reduce boot-time overhead
- all architectures that support HVO benefit from the generic
sparse-vmemmap optimization path without extra architecture-specific
preinit handling
- device DAX improves its struct page savings further by dropping the
extra reserved tail page
- shared vmemmap tail pages are mapped read-only, improving robustness
I have only built and tested this series on x86. I do not currently have
a powerpc test environment, so any testing or feedback on powerpc would
be much appreciated.
Changes since v1:
- rebased onto current next tree
- added the preparatory minor bug fixes found during AI-aided review
- added further refactoring on top of the new infrastructure
Muchun Song (69):
mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages
mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios()
powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population
mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier
mm/mm_init: Simplify deferred_free_pages() migratetype init
mm/sparse: Panic on memmap and usemap allocation failure
mm/sparse: Move subsection_map_init() into sparse_init()
mm/mm_init: Defer sparse_init() until after zone initialization
mm/mm_init: Defer hugetlb reservation until after zone initialization
mm/mm_init: Remove set_pageblock_order() call from sparse_init()
mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid()
mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time
mm/hugetlb: Refactor early boot gigantic hugepage allocation
mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation
mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
mm/hugetlb: Remove obsolete bootmem cross-zone checks
mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late()
mm/hugetlb: Remove unused bootmem cma field
mm/mm_init: Make __init_page_from_nid() static
mm/sparse-vmemmap: Drop VMEMMAP_POPULATE_PAGEREF
mm: Rename vmemmap optimization macros around folio semantics
mm/sparse: Drop power-of-2 size requirement for struct mem_section
mm/sparse-vmemmap: Track compound page order in struct mem_section
mm/mm_init: Skip initializing shared vmemmap tail pages
mm/sparse-vmemmap: Initialize shared tail vmemmap pages on allocation
mm/sparse-vmemmap: Support section-based vmemmap accounting
mm/sparse-vmemmap: Support section-based vmemmap optimization
mm/hugetlb: Use generic vmemmap optimization macros
mm/sparse: Mark memblocks present earlier
mm/hugetlb: Switch HugeTLB to section-based vmemmap optimization
mm/sparse: Remove section_map_size()
mm/mm_init: Factor out pfn_to_zone() as a shared helper
mm/sparse: Remove SPARSEMEM_VMEMMAP_PREINIT
mm/sparse: Inline usemap allocation into sparse_init_nid()
mm/hugetlb: Remove HUGE_BOOTMEM_HVO
mm/hugetlb: Remove HUGE_BOOTMEM_CMA
mm/sparse-vmemmap: Factor out shared vmemmap page allocation
mm/sparse-vmemmap: Introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
mm/sparse-vmemmap: Switch DAX to vmemmap_shared_tail_page()
powerpc/mm: Switch DAX to vmemmap_shared_tail_page()
mm/sparse-vmemmap: Drop the extra tail page from DAX reservation
mm/sparse-vmemmap: Switch DAX to section-based vmemmap optimization
mm/sparse-vmemmap: Unify DAX and HugeTLB population paths
mm/sparse-vmemmap: Remove the unused ptpfn argument
powerpc/mm: Make vmemmap_populate_compound_pages() static
mm/sparse-vmemmap: Map shared vmemmap tail pages read-only
powerpc/mm: Map shared vmemmap tail pages read-only
mm/sparse-vmemmap: Inline vmemmap_populate_address() into its caller
mm/hugetlb_vmemmap: Remove vmemmap_wrprotect_hvo()
mm/sparse: Simplify section_nr_vmemmap_pages()
mm/sparse-vmemmap: Introduce vmemmap_nr_struct_pages()
powerpc/mm: Drop powerpc vmemmap_can_optimize()
mm/sparse-vmemmap: Drop vmemmap_can_optimize()
mm/sparse-vmemmap: Drop @pgmap from vmemmap population APIs
mm/sparse: Decouple section activation from ZONE_DEVICE
mm: Redefine HVO as Hugepage Vmemmap Optimization
mm/sparse-vmemmap: Consolidate HVO enable checks
mm/hugetlb: Make HVO optimizable checks depend on generic logic
mm/sparse-vmemmap: Localize init_compound_tail()
mm/mm_init: Check zone consistency on optimized vmemmap sections
mm/hugetlb: Drop boot-time HVO handling for gigantic folios
mm/hugetlb: Simplify hugetlb_folio_init_vmemmap()
mm/hugetlb: Initialize the full bootmem hugepage in hugetlb code
mm/mm_init: Factor out compound page initialization
mm/mm_init: Make __init_single_page() static
mm/cma: Move CMA pageblock initialization into cma_activate_area()
mm/cma: Move init_cma_pageblock() into cma.c
mm/mm_init: Initialize pageblock migratetype in memmap init helpers
Documentation/mm: Rewrite vmemmap_dedup.rst for unified HVO
.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/admin-guide/mm/hugetlbpage.rst | 4 +-
.../admin-guide/mm/memory-hotplug.rst | 2 +-
Documentation/admin-guide/sysctl/vm.rst | 3 +-
Documentation/arch/powerpc/index.rst | 1 -
Documentation/arch/powerpc/vmemmap_dedup.rst | 101 ----
Documentation/mm/vmemmap_dedup.rst | 217 ++------
arch/arm64/mm/mmu.c | 5 +-
arch/loongarch/mm/init.c | 5 +-
arch/powerpc/include/asm/book3s/64/radix.h | 12 -
arch/powerpc/mm/book3s64/radix_pgtable.c | 154 +-----
arch/powerpc/mm/hugetlbpage.c | 11 +-
arch/powerpc/mm/init_64.c | 1 +
arch/powerpc/mm/mem.c | 5 +-
arch/riscv/mm/init.c | 5 +-
arch/s390/mm/init.c | 5 +-
arch/x86/Kconfig | 1 -
arch/x86/entry/vdso/vdso32/fake_32bit_build.h | 1 -
arch/x86/mm/init_64.c | 5 +-
drivers/dax/Kconfig | 1 +
fs/Kconfig | 6 +-
include/linux/hugetlb.h | 23 +-
include/linux/memory_hotplug.h | 12 +-
include/linux/mm.h | 44 +-
include/linux/mm_types.h | 3 +-
include/linux/mmzone.h | 151 ++++--
include/linux/page-flags-layout.h | 2 +
include/linux/page-flags.h | 31 +-
kernel/bounds.c | 5 +
mm/Kconfig | 9 +-
mm/bootmem_info.c | 5 +-
mm/cma.c | 18 +-
mm/hugetlb.c | 337 ++++--------
mm/hugetlb_cma.c | 41 +-
mm/hugetlb_cma.h | 4 +-
mm/hugetlb_vmemmap.c | 266 +--------
mm/hugetlb_vmemmap.h | 64 +--
mm/internal.h | 72 ++-
mm/memory-failure.c | 6 +-
mm/memory_hotplug.c | 22 +-
mm/memremap.c | 4 +-
mm/mm_init.c | 241 ++++-----
mm/sparse-vmemmap.c | 511 ++++++------------
mm/sparse.c | 129 +----
mm/util.c | 2 +-
scripts/gdb/linux/mm.py | 6 +-
46 files changed, 743 insertions(+), 1812 deletions(-)
delete mode 100644 Documentation/arch/powerpc/vmemmap_dedup.rst
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
--
2.54.0
* [PATCH v2 01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 02/69] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios() Muchun Song
` (46 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched
HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[].
Those shared tail pages were initialized in hugetlb_vmemmap_init(), but
bootmem HugeTLB folios are prepared earlier, in gather_bootmem_prealloc().
With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access
pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages
already point to the shared tail page. On CONFIG_DEBUG_VM kernels,
get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized
shared tail page and can panic during boot.
Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before
bootmem HugeTLB folios are processed, and drop the later initialization
from hugetlb_vmemmap_init().
This bug only affects CONFIG_DEBUG_VM kernels, where the relevant
assertion is evaluated.
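To make the ordering explicit, this is the flow the patch enforces (a
comment-only sketch using the function names from this message, not new
code):
	/*
	 * gather_bootmem_prealloc()
	 *   1) initialize the shared tail pages in zone->vmemmap_tails[]
	 *      (moved here by this patch, was in hugetlb_vmemmap_init())
	 *   2) prep_and_add_bootmem_folios()
	 *        -> pageblock flag accesses on bootmem HugeTLB pages, which
	 *           with CONFIG_DEBUG_VM reach get_pfnblock_bitmap_bitidx()
	 *           and dereference the shared tail struct page
	 */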
Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 19 +++++++++++++++++++
mm/hugetlb_vmemmap.c | 17 -----------------
2 files changed, 19 insertions(+), 17 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31b34ca0f402..d22683ab30a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void)
.max_threads = num_node_state(N_MEMORY),
.numa_aware = true,
};
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
+ struct page *tail, *p;
+ unsigned int order;
+
+ tail = zone->vmemmap_tails[i];
+ if (!tail)
+ continue;
+
+ order = i + VMEMMAP_TAIL_MIN_ORDER;
+ p = page_to_virt(tail);
+ for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
+ init_compound_tail(p + j, NULL, order, zone);
+ }
+ }
+#endif
padata_do_multithreaded(&job);
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4a077d231d3a..62e61af18c9a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
static int __init hugetlb_vmemmap_init(void)
{
const struct hstate *h;
- struct zone *zone;
/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
- for_each_zone(zone) {
- for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
- struct page *tail, *p;
- unsigned int order;
-
- tail = zone->vmemmap_tails[i];
- if (!tail)
- continue;
-
- order = i + VMEMMAP_TAIL_MIN_ORDER;
- p = page_to_virt(tail);
- for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
- init_compound_tail(p + j, NULL, order, zone);
- }
- }
-
for_each_hstate(h) {
if (hugetlb_vmemmap_optimizable(h)) {
register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
--
2.54.0
* [PATCH v2 02/69] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
2026-05-13 13:04 ` [PATCH v2 01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 03/69] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population Muchun Song
` (45 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
__hugetlb_vmemmap_optimize_folios() uses incorrect arguments when handling
bootmem HugeTLB folios.
The section number passed to register_page_bootmem_memmap() is derived from
the vmemmap virtual address of folio->page instead of the folio PFN, so the
bootmem memmap metadata can be registered against the wrong section. The
helper is also given HUGETLB_VMEMMAP_RESERVE_SIZE even though it expects a
page count, not a size in bytes. In addition, the write-protect range is
based on pages_per_huge_page(h), which does not cover the full HugeTLB
vmemmap area and can leave part of the shared tail vmemmap mapping writable.
Fix the section lookup to use folio_pfn(folio), use
HUGETLB_VMEMMAP_RESERVE_PAGES when registering the reserved memmap pages, and
use hugetlb_vmemmap_size(h) for the write-protect range.
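For reference, in current mainline the two macros differ exactly by that
unit conversion (a sketch of the definitions in mm/hugetlb_vmemmap.h; this
series may move them around):
	#define HUGETLB_VMEMMAP_RESERVE_SIZE	PAGE_SIZE
	#define HUGETLB_VMEMMAP_RESERVE_PAGES	(HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page))
so passing the _SIZE value where a page count is expected over-reports the
reserved memmap pages by a factor of sizeof(struct page).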
Fixes: 752fe17af693 ("mm/hugetlb: add pre-HVO framework")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb_vmemmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 62e61af18c9a..4f58cd940f61 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -635,12 +635,12 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
* mirrored tail page structs RO.
*/
spfn = (unsigned long)&folio->page;
- epfn = spfn + pages_per_huge_page(h);
+ epfn = spfn + hugetlb_vmemmap_size(h);
vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
HUGETLB_VMEMMAP_RESERVE_SIZE);
- register_page_bootmem_memmap(pfn_to_section_nr(spfn),
+ register_page_bootmem_memmap(pfn_to_section_nr(folio_pfn(folio)),
&folio->page,
- HUGETLB_VMEMMAP_RESERVE_SIZE);
+ HUGETLB_VMEMMAP_RESERVE_PAGES);
continue;
}
--
2.54.0
* [PATCH v2 03/69] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
2026-05-13 13:04 ` [PATCH v2 01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages Muchun Song
2026-05-13 13:04 ` [PATCH v2 02/69] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios() Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 04/69] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier Muchun Song
` (44 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
vmemmap_populate_compound_pages() uses addr_pfn to determine the PFN
offset within a compound page and to decide whether the current
vmemmap slot should be populated as a head page mapping or should reuse
a tail page mapping.
However, addr_pfn is advanced manually in parallel with addr. The loop
itself steps through vmemmap address space, and each PAGE_SIZE step in
addr covers PAGE_SIZE / sizeof(struct page) struct page slots, i.e. that
many data PFNs. Since addr_pfn is compared against nr_pages in data-PFN
units, it would have to advance by that same amount per step, but the
existing manual increments do not, so addr_pfn does not reliably track
the PFN corresponding to the current addr.
As a result, pfn_offset can be computed from the wrong PFN and the code
can make the head/tail decision for the wrong compound-page position.
Fix this by deriving addr_pfn directly from the current vmemmap address
instead of carrying it as loop state.
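As a concrete illustration (assuming the common case of a 64-byte struct
page and 4 KiB pages, which this patch does not depend on): one PAGE_SIZE
step through the vmemmap covers 4096 / 64 = 64 data PFNs, so bumping
addr_pfn by 1 per step under-counts by a factor of 64. Deriving the PFN
from the vmemmap address, as the patch does, avoids the bookkeeping
entirely:
	/* the struct page slot at 'addr' describes exactly this data PFN */
	unsigned long addr_pfn = page_to_pfn((struct page *)addr);
	unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages);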
Fixes: f2b79c0d7968 ("powerpc/book3s64/radix: add support for vmemmap optimization for radix")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 10aced261cff..cf692b2b5f7b 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1314,7 +1314,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
* covering out both edges.
*/
unsigned long addr;
- unsigned long addr_pfn = start_pfn;
unsigned long next;
pgd_t *pgd;
p4d_t *p4d;
@@ -1335,7 +1334,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
if (pmd_leaf(READ_ONCE(*pmd))) {
/* existing huge mapping. Skip the range */
- addr_pfn += (PMD_SIZE >> PAGE_SHIFT);
next = pmd_addr_end(addr, end);
continue;
}
@@ -1348,11 +1346,11 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
* page whose VMEMMAP_RESERVE_NR pages were mapped and
* this request fall in those pages.
*/
- addr_pfn += 1;
next = addr + PAGE_SIZE;
continue;
} else {
unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
+ unsigned long addr_pfn = page_to_pfn((struct page *)addr);
unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages);
pte_t *tail_page_pte;
@@ -1376,7 +1374,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
if (!pte)
return -ENOMEM;
- addr_pfn += 2;
next = addr + 2 * PAGE_SIZE;
continue;
}
@@ -1392,7 +1389,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
- addr_pfn += 1;
next = addr + PAGE_SIZE;
continue;
}
@@ -1402,7 +1398,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
- addr_pfn += 1;
next = addr + PAGE_SIZE;
continue;
}
--
2.54.0
* [PATCH v2 04/69] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (2 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 03/69] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 05/69] mm/mm_init: Simplify deferred_free_pages() migratetype init Muchun Song
` (43 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Gigantic bootmem HugeTLB pages are currently initialized from hugetlb_init(),
but page_alloc_init_late() runs earlier and walks pageblocks to determine
zone contiguity.
If a bootmem HugeTLB region is marked noinit, set_zone_contiguous() can
observe still-uninitialized struct pages through __pageblock_pfn_to_page().
This may not trigger an immediate failure, but it can make
set_zone_contiguous() compute the wrong zone contiguity state. If extra
poisoned-page checks are added in this path, such as PF_POISONED_CHECK()
in page_zone_id(), it can also trigger an early boot panic.
Initialize gigantic bootmem HugeTLB struct pages from page_alloc_init_late(),
before zone contiguity is evaluated, so later page allocator setup only
sees valid struct page state. This also makes the initialization order
more natural, as struct pages should be initialized before later code
inspects them.
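The before/after ordering, as a comment-only sketch of the call flow
described above:
	/*
	 * Before:
	 *   page_alloc_init_late()
	 *     set_zone_contiguous() -> __pageblock_pfn_to_page() may read
	 *       still-uninitialized struct pages of noinit bootmem regions
	 *   ...
	 *   hugetlb_init()
	 *     gather_bootmem_prealloc()   <- struct pages initialized here
	 *
	 * After:
	 *   page_alloc_init_late()
	 *     hugetlb_struct_page_init()  <- initialized first
	 *     set_zone_contiguous()       <- only sees valid struct pages
	 */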
Fixes: fde1c4ecf916 ("mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 3 +--
mm/mm_init.c | 1 +
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 93418625d3c5..52a2c30f866c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -173,6 +173,7 @@ extern int movable_gigantic_pages __read_mostly;
extern int sysctl_hugetlb_shm_group __read_mostly;
extern struct list_head huge_boot_pages[MAX_NUMNODES];
+void hugetlb_struct_page_init(void);
void hugetlb_bootmem_alloc(void);
extern nodemask_t hugetlb_bootmem_nodes;
void hugetlb_bootmem_set_nodes(void);
@@ -1307,6 +1308,10 @@ static inline bool hugetlbfs_pagecache_present(
static inline void hugetlb_bootmem_alloc(void)
{
}
+
+static inline void hugetlb_struct_page_init(void)
+{
+}
#endif /* CONFIG_HUGETLB_PAGE */
static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d22683ab30a1..b4999653a156 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3370,7 +3370,7 @@ static void __init gather_bootmem_prealloc_parallel(unsigned long start,
gather_bootmem_prealloc_node(nid);
}
-static void __init gather_bootmem_prealloc(void)
+void __init hugetlb_struct_page_init(void)
{
struct padata_mt_job job = {
.thread_fn = gather_bootmem_prealloc_parallel,
@@ -4163,7 +4163,6 @@ static int __init hugetlb_init(void)
}
hugetlb_init_hstates();
- gather_bootmem_prealloc();
report_hugepages();
hugetlb_sysfs_init();
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fde49f7bba6c..5a910cc5534c 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2335,6 +2335,7 @@ void __init page_alloc_init_late(void)
/* Reinit limits that are based on free pages after the kernel is up */
files_maxfiles_init();
#endif
+ hugetlb_struct_page_init();
/* Accounting of total+free memory is stable at this point. */
mem_init_print_info();
--
2.54.0
* [PATCH v2 05/69] mm/mm_init: Simplify deferred_free_pages() migratetype init
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (3 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 04/69] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 06/69] mm/sparse: Panic on memmap and usemap allocation failure Muchun Song
` (42 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
deferred_free_pages() open-codes two loops to initialize the pageblock
migratetype for a range of pages.
Replace them with pageblock_migratetype_init_range() to remove the
duplication and make the code clearer. Since deferred_free_pages() may be
called from atomic context, pageblock_migratetype_init_range() gains an
"atomic" argument so that such callers can skip cond_resched().
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Add Acked-by from Mike Rapoport
---
mm/mm_init.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5a910cc5534c..96e0f2d8c3ea 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -674,15 +674,15 @@ static inline void fixup_hashdist(void)
static inline void fixup_hashdist(void) {}
#endif /* CONFIG_NUMA */
-#ifdef CONFIG_ZONE_DEVICE
+#if defined(CONFIG_ZONE_DEVICE) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
- unsigned long nr_pages, int migratetype)
+ unsigned long nr_pages, int migratetype, bool atomic)
{
const unsigned long end = pfn + nr_pages;
for (pfn = pageblock_align(pfn); pfn < end; pfn += pageblock_nr_pages) {
init_pageblock_migratetype(pfn_to_page(pfn), migratetype, false);
- if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
+ if (!atomic && IS_ALIGNED(pfn, PAGES_PER_SECTION))
cond_resched();
}
}
@@ -1142,7 +1142,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
compound_nr_pages(pfn, altmap, pgmap));
}
- pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);
+ pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false);
pr_debug("%s initialised %lu pages in %ums\n", __func__,
nr_pages, jiffies_to_msecs(jiffies - start));
@@ -1993,12 +1993,12 @@ static void __init deferred_free_pages(unsigned long pfn,
if (!nr_pages)
return;
+ pageblock_migratetype_init_range(pfn, nr_pages, mt, true);
+
page = pfn_to_page(pfn);
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
- for (i = 0; i < nr_pages; i += pageblock_nr_pages)
- init_pageblock_migratetype(page + i, mt, false);
__free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
return;
}
@@ -2006,11 +2006,8 @@ static void __init deferred_free_pages(unsigned long pfn,
/* Accept chunks smaller than MAX_PAGE_ORDER upfront */
accept_memory(PFN_PHYS(pfn), nr_pages * PAGE_SIZE);
- for (i = 0; i < nr_pages; i++, page++, pfn++) {
- if (pageblock_aligned(pfn))
- init_pageblock_migratetype(page, mt, false);
- __free_pages_core(page, 0, MEMINIT_EARLY);
- }
+ for (i = 0; i < nr_pages; i++)
+ __free_pages_core(page + i, 0, MEMINIT_EARLY);
}
/* Completion tracking for deferred_init_memmap() threads */
--
2.54.0
* [PATCH v2 06/69] mm/sparse: Panic on memmap and usemap allocation failure
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (4 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 05/69] mm/mm_init: Simplify deferred_free_pages() migratetype init Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 07/69] mm/sparse: Move subsection_map_init() into sparse_init() Muchun Song
` (41 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
When vmemmap or usemap allocation fails, sparse_init_nid() currently
marks the section non-present and continues. Later boot-time code can
still walk PFNs in that section without checking for this partial setup,
which leads to invalid accesses. subsection_map_init() can also touch an
unallocated usemap.
Auditing and fixing all early PFN walkers for this case is not worth the
complexity, and these allocation failures are expected to be fatal anyway.
Make memmap and usemap allocation failures panic immediately instead of
trying to recover and crashing later in less obvious ways. This is also
consistent with how other memory model configurations handle memmap
allocation failures.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Add Acked-by from Mike Rapoport
- I refrained from adding panic() to memmap_alloc() as it wouldn't simplify
the code. However, panic() is still required in sparse_init_nid() because
the architecture-specific vmemmap_populate() bypasses memmap_alloc().
---
mm/sparse.c | 44 +++++++++-----------------------------------
1 file changed, 9 insertions(+), 35 deletions(-)
diff --git a/mm/sparse.c b/mm/sparse.c
index 16ac6df3c89f..c92bbc3f3aa3 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -239,15 +239,8 @@ struct page __init *__populate_section_memmap(unsigned long pfn,
struct dev_pagemap *pgmap)
{
unsigned long size = section_map_size();
- struct page *map;
- phys_addr_t addr = __pa(MAX_DMA_ADDRESS);
- map = memmap_alloc(size, size, addr, nid, false);
- if (!map)
- panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%pa\n",
- __func__, size, PAGE_SIZE, nid, &addr);
-
- return map;
+ return memmap_alloc(size, size, __pa(MAX_DMA_ADDRESS), nid, false);
}
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
@@ -300,17 +293,14 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
unsigned long map_count)
{
unsigned long pnum;
- struct page *map;
- struct mem_section *ms;
- if (sparse_usage_init(nid, map_count)) {
- pr_err("%s: node[%d] usemap allocation failed", __func__, nid);
- goto failed;
- }
+ if (sparse_usage_init(nid, map_count))
+ panic("Failed to allocate usemap for node %d\n", nid);
sparse_vmemmap_init_nid_early(nid);
for_each_present_section_nr(pnum_begin, pnum) {
+ struct mem_section *ms;
unsigned long pfn = section_nr_to_pfn(pnum);
if (pnum >= pnum_end)
@@ -318,34 +308,18 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
ms = __nr_to_section(pnum);
if (!preinited_vmemmap_section(ms)) {
+ struct page *map;
+
map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
- nid, NULL, NULL);
- if (!map) {
- pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
- __func__, nid);
- pnum_begin = pnum;
- sparse_usage_fini();
- goto failed;
- }
+ nid, NULL, NULL);
+ if (!map)
+ panic("Failed to allocate memmap for section %lu\n", pnum);
memmap_boot_pages_add(DIV_ROUND_UP(PAGES_PER_SECTION * sizeof(struct page),
PAGE_SIZE));
sparse_init_early_section(nid, map, pnum, 0);
}
}
sparse_usage_fini();
- return;
-failed:
- /*
- * We failed to allocate, mark all the following pnums as not present,
- * except the ones already initialized earlier.
- */
- for_each_present_section_nr(pnum_begin, pnum) {
- if (pnum >= pnum_end)
- break;
- ms = __nr_to_section(pnum);
- if (!preinited_vmemmap_section(ms))
- ms->section_mem_map = 0;
- }
}
/*
--
2.54.0
* [PATCH v2 07/69] mm/sparse: Move subsection_map_init() into sparse_init()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (5 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 06/69] mm/sparse: Panic on memmap and usemap allocation failure Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 08/69] mm/mm_init: Defer sparse_init() until after zone initialization Muchun Song
` (40 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
subsection_map_init() is part of sparse memory initialization, but it is
currently called from free_area_init().
Move it into sparse_init() so the sparse-specific setup stays together
instead of being split across the generic free_area_init() path.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Add Acked-by from Mike Rapoport
---
mm/internal.h | 5 ++---
mm/mm_init.c | 10 ++--------
mm/sparse-vmemmap.c | 11 ++++++++++-
mm/sparse.c | 1 +
4 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..28d179cbc451 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1003,10 +1003,9 @@ static inline void sparse_init(void) {}
* mm/sparse-vmemmap.c
*/
#ifdef CONFIG_SPARSEMEM_VMEMMAP
-void sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages);
+void sparse_init_subsection_map(void);
#else
-static inline void sparse_init_subsection_map(unsigned long pfn,
- unsigned long nr_pages)
+static inline void sparse_init_subsection_map(void)
{
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 96e0f2d8c3ea..12fe21c4e26c 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1876,18 +1876,12 @@ static void __init free_area_init(void)
(u64)zone_movable_pfn[i] << PAGE_SHIFT);
}
- /*
- * Print out the early node map, and initialize the
- * subsection-map relative to active online memory ranges to
- * enable future "sub-section" extensions of the memory map.
- */
+ /* Print out the early node map. */
pr_info("Early memory node ranges\n");
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
pr_info(" node %3d: [mem %#018Lx-%#018Lx]\n", nid,
(u64)start_pfn << PAGE_SHIFT,
((u64)end_pfn << PAGE_SHIFT) - 1);
- sparse_init_subsection_map(start_pfn, end_pfn - start_pfn);
- }
/* Initialise every node */
mminit_verify_pageflags_layout();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 112ccf9c71ca..fcf0ce5212f1 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -596,7 +596,7 @@ static void subsection_mask_set(unsigned long *map, unsigned long pfn,
bitmap_set(map, idx, end - idx + 1);
}
-void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages)
+static void __init sparse_init_subsection_map_range(unsigned long pfn, unsigned long nr_pages)
{
int end_sec_nr = pfn_to_section_nr(pfn + nr_pages - 1);
unsigned long nr, start_sec_nr = pfn_to_section_nr(pfn);
@@ -619,6 +619,15 @@ void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages
}
}
+void __init sparse_init_subsection_map(void)
+{
+ int i, nid;
+ unsigned long start, end;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
+ sparse_init_subsection_map_range(start, end - start);
+}
+
#ifdef CONFIG_MEMORY_HOTPLUG
/* Mark all memory sections within the pfn range as online */
diff --git a/mm/sparse.c b/mm/sparse.c
index c92bbc3f3aa3..85557ef387c7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -361,5 +361,6 @@ void __init sparse_init(void)
}
/* cover the last node */
sparse_init_nid(nid_begin, pnum_begin, pnum_end, map_count);
+ sparse_init_subsection_map();
vmemmap_populate_print_last();
}
--
2.54.0
* [PATCH v2 08/69] mm/mm_init: Defer sparse_init() until after zone initialization
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (6 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 07/69] mm/sparse: Move subsection_map_init() into sparse_init() Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 09/69] mm/mm_init: Defer hugetlb reservation " Muchun Song
` (39 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
free_area_init() is responsible for initializing pgdat and zone state.
Calling sparse_init() from there mixes in later vmemmap and struct page
setup, which makes the initialization flow less clear.
Defer sparse_init(), sparse_vmemmap_init_nid_late(), and memmap_init()
until after free_area_init() completes, when zone initialization is fully
done. This keeps free_area_init() focused on zone setup and ensures that
sparse_init() runs with the relevant zone state already available.
This is also a prerequisite for later hugetlb vmemmap changes that need
zone information during early sparse vmemmap setup.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Restore the set_pageblock_order() change suggested by Mike Rapoport
- Add Mike Rapoport's Reviewed-by
---
mm/mm_init.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 12fe21c4e26c..c14491c2dad3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1826,7 +1826,6 @@ static void __init free_area_init(void)
bool descending;
arch_zone_limits_init(max_zone_pfn);
- sparse_init();
start_pfn = PHYS_PFN(memblock_start_of_DRAM());
descending = arch_has_descending_max_zone_pfns();
@@ -1915,11 +1914,7 @@ static void __init free_area_init(void)
}
}
- for_each_node_state(nid, N_MEMORY)
- sparse_vmemmap_init_nid_late(nid);
-
calc_nr_kernel_pages();
- memmap_init();
/* disable hash distribution for systems with a single node */
fixup_hashdist();
@@ -2691,10 +2686,17 @@ void __init __weak mem_init(void)
void __init mm_core_init_early(void)
{
+ int nid;
+
hugetlb_cma_reserve();
hugetlb_bootmem_alloc();
free_area_init();
+
+ sparse_init();
+ for_each_node_state(nid, N_MEMORY)
+ sparse_vmemmap_init_nid_late(nid);
+ memmap_init();
}
/*
--
2.54.0
* [PATCH v2 09/69] mm/mm_init: Defer hugetlb reservation until after zone initialization
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (7 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 08/69] mm/mm_init: Defer sparse_init() until after zone initialization Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 10/69] mm/mm_init: Remove set_pageblock_order() call from sparse_init() Muchun Song
` (38 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
hugetlb_cma_reserve() and hugetlb_bootmem_alloc() currently run before
free_area_init(), so HugeTLB reservation happens before zone state is
initialized.
Move the reservation step after free_area_init() so the relevant zone
information is available before HugeTLB reserves memory. This is needed
for later hugetlb changes that validate boot-time HugeTLB reservations
against zone boundaries.
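For clarity, the resulting order in mm_core_init_early() after this patch
is (a sketch; a later patch in this series folds the
sparse_vmemmap_init_nid_late() loop into sparse_init()):
	free_area_init();		/* pgdat and zone state first */
	hugetlb_cma_reserve();		/* can now be checked against zones */
	hugetlb_bootmem_alloc();
	sparse_init();
	for_each_node_state(nid, N_MEMORY)
		sparse_vmemmap_init_nid_late(nid);
	memmap_init();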
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/mm_init.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c14491c2dad3..75f98abfed97 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2688,11 +2688,11 @@ void __init mm_core_init_early(void)
{
int nid;
+ free_area_init();
+
hugetlb_cma_reserve();
hugetlb_bootmem_alloc();
- free_area_init();
-
sparse_init();
for_each_node_state(nid, N_MEMORY)
sparse_vmemmap_init_nid_late(nid);
--
2.54.0
* [PATCH v2 10/69] mm/mm_init: Remove set_pageblock_order() call from sparse_init()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (8 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 09/69] mm/mm_init: Defer hugetlb reservation " Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 11/69] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
` (37 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
With CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, free_area_init() already sets
pageblock_order before sparse_init() runs, so sparse_init() does not need
to call set_pageblock_order() again.
With that call removed, set_pageblock_order() is only used in mm/mm_init.c.
Make it static.
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
v1->v2:
- Move the removal of set_pageblock_order() into this patch
- Update the commit message accordingly
- Add Reviewed-by from Mike Rapoport
---
mm/internal.h | 1 -
mm/mm_init.c | 4 ++--
mm/sparse.c | 3 ---
3 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 28d179cbc451..6bd9aa37b952 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1436,7 +1436,6 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long,
unsigned long, unsigned long,
unsigned long, unsigned long);
-extern void set_pageblock_order(void);
unsigned long reclaim_pages(struct list_head *folio_list);
unsigned int reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *folio_list);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 75f98abfed97..6646d4b47796 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1508,7 +1508,7 @@ static inline void setup_usemap(struct zone *zone) {}
#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
-void __init set_pageblock_order(void)
+static void __init set_pageblock_order(void)
{
unsigned int order = PAGE_BLOCK_MAX_ORDER;
@@ -1534,7 +1534,7 @@ void __init set_pageblock_order(void)
* include/linux/pageblock-flags.h for the values of pageblock_order based on
* the kernel config
*/
-void __init set_pageblock_order(void)
+static inline void __init set_pageblock_order(void)
{
}
diff --git a/mm/sparse.c b/mm/sparse.c
index 85557ef387c7..324213d8bdcb 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -343,9 +343,6 @@ void __init sparse_init(void)
pnum_begin = first_present_section_nr();
nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
- /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
- set_pageblock_order();
-
for_each_present_section_nr(pnum_begin + 1, pnum_end) {
int nid = sparse_early_nid(__nr_to_section(pnum_end));
--
2.54.0
* [PATCH v2 11/69] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (9 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 10/69] mm/mm_init: Remove set_pageblock_order() call from sparse_init() Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 12/69] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time Muchun Song
` (36 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
sparse_vmemmap_init_nid_late() is still called separately from
mm_core_init_early(), away from the rest of the sparse initialization
path.
Now that sparse_init() runs after zone initialization, call
sparse_vmemmap_init_nid_late() from sparse_init_nid() instead. This
keeps both sparse_vmemmap_init_nid_early() and
sparse_vmemmap_init_nid_late() in the sparse setup path.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Add Reviewed-by from Mike Rapoport
---
mm/mm_init.c | 4 ----
mm/sparse.c | 1 +
2 files changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 6646d4b47796..165b83c9a9c3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2686,16 +2686,12 @@ void __init __weak mem_init(void)
void __init mm_core_init_early(void)
{
- int nid;
-
free_area_init();
hugetlb_cma_reserve();
hugetlb_bootmem_alloc();
sparse_init();
- for_each_node_state(nid, N_MEMORY)
- sparse_vmemmap_init_nid_late(nid);
memmap_init();
}
diff --git a/mm/sparse.c b/mm/sparse.c
index 324213d8bdcb..3917a47153d8 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -320,6 +320,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
}
}
sparse_usage_fini();
+ sparse_vmemmap_init_nid_late(nid);
}
/*
--
2.54.0
* [PATCH v2 12/69] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (10 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 11/69] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 13/69] mm/hugetlb: Refactor early boot gigantic hugepage allocation Muchun Song
` (35 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Hugetlb CMA allocation currently has to cope with CMA areas that span
multiple zones.
Validate the reserved CMA range up front in hugetlb_cma_reserve() so
later hugetlb CMA allocations can assume a zone-consistent area.
Also drop the pfn_valid() check from cma_validate_zones(): mem_section is
not fully initialized at this point, so the check can trigger false
warnings. Do that sanity check in cma_activate_area() instead.
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
v1->v2:
- Update the warning message for zone validation failures
- Add Acked-by from Mike Rapoport
---
mm/cma.c | 3 ++-
mm/hugetlb_cma.c | 6 ++++--
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index c7ca567f4c5c..0369f04c7ba5 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -126,7 +126,6 @@ bool cma_validate_zones(struct cma *cma)
* to be in the same zone. Simplify by forcing the entire
* CMA resv range to be in the same zone.
*/
- WARN_ON_ONCE(!pfn_valid(base_pfn));
if (pfn_range_intersects_zones(cma->nid, base_pfn, cmr->count)) {
set_bit(CMA_ZONES_INVALID, &cma->flags);
return false;
@@ -165,6 +164,8 @@ static void __init cma_activate_area(struct cma *cma)
bitmap_set(cmr->bitmap, 0, bitmap_count);
}
+ WARN_ON_ONCE(!pfn_valid(cmr->base_pfn));
+
for (pfn = early_pfn[r]; pfn < cmr->base_pfn + cmr->count;
pfn += pageblock_nr_pages)
init_cma_reserved_pageblock(pfn_to_page(pfn));
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index 7693ccefd0c6..57a7b3acc758 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -234,9 +234,11 @@ void __init hugetlb_cma_reserve(void)
res = cma_declare_contiguous_multi(size, PAGE_SIZE << order,
HUGETLB_PAGE_ORDER, name,
&hugetlb_cma[nid], nid);
- if (res) {
- pr_warn("hugetlb_cma: reservation failed: err %d, node %d",
+ if (res || !cma_validate_zones(hugetlb_cma[nid])) {
+ pr_warn("hugetlb_cma: %s: err %d, node %d\n",
+ res ? "reservation failed" : "reserved area spans zones",
res, nid);
+ hugetlb_cma[nid] = NULL;
continue;
}
--
2.54.0
* [PATCH v2 13/69] mm/hugetlb: Refactor early boot gigantic hugepage allocation
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (11 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 12/69] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 14/69] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation Muchun Song
` (34 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
The early boot gigantic hugepage allocation helpers currently mix
allocation with huge_bootmem_page setup, and leave part of the
initialization flow in architecture code.
Refactor the interface to return the allocated huge page pointer and
move the huge_bootmem_page setup into the generic hugetlb code. This
makes the architecture-specific paths focus only on finding memory,
while the common code handles node placement and early page metadata
setup in one place.
This also lets powerpc benefit from memblock_reserved_mark_noinit(),
which it did not enable before.
In addition, upcoming cross-zone validation for boot-time gigantic
hugetlb reservation is common logic. With this refactoring, that logic
can stay in the generic code instead of being duplicated in
architecture-specific paths.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/mm/hugetlbpage.c | 11 ++--
include/linux/hugetlb.h | 8 +--
mm/hugetlb.c | 95 ++++++++++++++---------------------
mm/hugetlb_cma.c | 12 ++---
mm/hugetlb_cma.h | 4 +-
5 files changed, 52 insertions(+), 78 deletions(-)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 558fafb82b8a..ff8c5ec831bb 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -104,17 +104,14 @@ void __init pseries_add_gpage(u64 addr, u64 page_size, unsigned long number_of_p
}
}
-static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
+static __init void *pseries_alloc_bootmem_huge_page(struct hstate *hstate)
{
struct huge_bootmem_page *m;
if (nr_gpages == 0)
- return 0;
+ return NULL;
m = phys_to_virt(gpage_freearray[--nr_gpages]);
gpage_freearray[nr_gpages] = 0;
- list_add(&m->list, &huge_boot_pages[0]);
- m->hstate = hstate;
- m->flags = 0;
- return 1;
+ return m;
}
bool __init hugetlb_node_alloc_supported(void)
@@ -124,7 +121,7 @@ bool __init hugetlb_node_alloc_supported(void)
#endif
-int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
+void *__init arch_alloc_bootmem_huge_page(struct hstate *h, int nid)
{
#ifdef CONFIG_PPC_BOOK3S_64
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 52a2c30f866c..9a65271d167c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -720,8 +720,8 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
unsigned long address, struct folio *folio);
/* arch callback */
-int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
-int __init alloc_bootmem_huge_page(struct hstate *h, int nid);
+void *__init __alloc_bootmem_huge_page(struct hstate *h, int nid);
+void *__init arch_alloc_bootmem_huge_page(struct hstate *h, int nid);
bool __init hugetlb_node_alloc_supported(void);
void __init hugetlb_add_hstate(unsigned order);
@@ -1152,9 +1152,9 @@ alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
return NULL;
}
-static inline int __alloc_bootmem_huge_page(struct hstate *h)
+static inline void *__alloc_bootmem_huge_page(struct hstate *h, int nid)
{
- return 0;
+ return NULL;
}
static inline struct hstate *hstate_file(struct file *f)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b4999653a156..e9ba0be2eb17 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3044,79 +3044,58 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
{
- struct huge_bootmem_page *m;
- int listnode = nid;
-
if (hugetlb_early_cma(h))
- m = hugetlb_cma_alloc_bootmem(h, &listnode, node_exact);
- else {
- if (node_exact)
- m = memblock_alloc_exact_nid_raw(huge_page_size(h),
+ return hugetlb_cma_alloc_bootmem(h, nid, node_exact);
+
+ if (node_exact)
+ return memblock_alloc_exact_nid_raw(huge_page_size(h),
huge_page_size(h), 0,
MEMBLOCK_ALLOC_ACCESSIBLE, nid);
- else {
- m = memblock_alloc_try_nid_raw(huge_page_size(h),
+
+ return memblock_alloc_try_nid_raw(huge_page_size(h),
huge_page_size(h), 0,
MEMBLOCK_ALLOC_ACCESSIBLE, nid);
- /*
- * For pre-HVO to work correctly, pages need to be on
- * the list for the node they were actually allocated
- * from. That node may be different in the case of
- * fallback by memblock_alloc_try_nid_raw. So,
- * extract the actual node first.
- */
- if (m)
- listnode = early_pfn_to_nid(PHYS_PFN(__pa(m)));
- }
-
- if (m) {
- m->flags = 0;
- m->cma = NULL;
- }
- }
-
- if (m) {
- /*
- * Use the beginning of the huge page to store the
- * huge_bootmem_page struct (until gather_bootmem
- * puts them into the mem_map).
- *
- * Put them into a private list first because mem_map
- * is not up yet.
- */
- INIT_LIST_HEAD(&m->list);
- list_add(&m->list, &huge_boot_pages[listnode]);
- m->hstate = h;
- }
-
- return m;
}
-int alloc_bootmem_huge_page(struct hstate *h, int nid)
+void *__init arch_alloc_bootmem_huge_page(struct hstate *h, int nid)
__attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
-int __alloc_bootmem_huge_page(struct hstate *h, int nid)
+void *__init __alloc_bootmem_huge_page(struct hstate *h, int nid)
{
- struct huge_bootmem_page *m = NULL; /* initialize for clang */
int nr_nodes, node = nid;
/* do node specific alloc */
- if (nid != NUMA_NO_NODE) {
- m = alloc_bootmem(h, node, true);
- if (!m)
- return 0;
- goto found;
- }
+ if (nid != NUMA_NO_NODE)
+ return alloc_bootmem(h, node, true);
/* allocate from next node when distributing huge pages */
for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
- &hugetlb_bootmem_nodes) {
- m = alloc_bootmem(h, node, false);
- if (!m)
- return 0;
- goto found;
- }
+ &hugetlb_bootmem_nodes)
+ return alloc_bootmem(h, node, false);
-found:
+ return NULL;
+}
+
+static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
+{
+ struct huge_bootmem_page *m = arch_alloc_bootmem_huge_page(h, nid);
+
+ if (!m)
+ return false;
+
+ nid = early_pfn_to_nid(PHYS_PFN(__pa(m)));
+ /*
+ * Use the beginning of the huge page to store the huge_bootmem_page
+ * struct (until gather_bootmem puts them into the mem_map).
+ *
+ * Put them into a private list first because mem_map is not up yet.
+ */
+ INIT_LIST_HEAD(&m->list);
+ list_add(&m->list, &huge_boot_pages[nid]);
+ m->hstate = h;
+ if (!hugetlb_early_cma(h)) {
+ m->cma = NULL;
+ m->flags = 0;
+ }
/*
* Only initialize the head struct page in memmap_init_reserved_pages,
@@ -3128,7 +3107,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
memblock_reserved_mark_noinit(__pa((void *)m + PAGE_SIZE),
huge_page_size(h) - PAGE_SIZE);
- return 1;
+ return true;
}
/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index 57a7b3acc758..6b5c2aec4449 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -57,13 +57,13 @@ struct folio *hugetlb_cma_alloc_frozen_folio(int order, gfp_t gfp_mask,
}
struct huge_bootmem_page * __init
-hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid, bool node_exact)
+hugetlb_cma_alloc_bootmem(struct hstate *h, int nid, bool node_exact)
{
struct cma *cma;
struct huge_bootmem_page *m;
- int node = *nid;
+ int node;
- cma = hugetlb_cma[*nid];
+ cma = hugetlb_cma[nid];
m = cma_reserve_early(cma, huge_page_size(h));
if (!m) {
if (node_exact)
@@ -71,13 +71,11 @@ hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid, bool node_exact)
for_each_node_mask(node, hugetlb_bootmem_nodes) {
cma = hugetlb_cma[node];
- if (!cma || node == *nid)
+ if (!cma || node == nid)
continue;
m = cma_reserve_early(cma, huge_page_size(h));
- if (m) {
- *nid = node;
+ if (m)
break;
- }
}
}
diff --git a/mm/hugetlb_cma.h b/mm/hugetlb_cma.h
index c619c394b1ae..057852c792bd 100644
--- a/mm/hugetlb_cma.h
+++ b/mm/hugetlb_cma.h
@@ -6,7 +6,7 @@
void hugetlb_cma_free_frozen_folio(struct folio *folio);
struct folio *hugetlb_cma_alloc_frozen_folio(int order, gfp_t gfp_mask,
int nid, nodemask_t *nodemask);
-struct huge_bootmem_page *hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid,
+struct huge_bootmem_page *hugetlb_cma_alloc_bootmem(struct hstate *h, int nid,
bool node_exact);
bool hugetlb_cma_exclusive_alloc(void);
unsigned long hugetlb_cma_total_size(void);
@@ -24,7 +24,7 @@ static inline struct folio *hugetlb_cma_alloc_frozen_folio(int order,
}
static inline
-struct huge_bootmem_page *hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid,
+struct huge_bootmem_page *hugetlb_cma_alloc_bootmem(struct hstate *h, int nid,
bool node_exact)
{
return NULL;
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 14/69] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (12 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 13/69] mm/hugetlb: Refactor early boot gigantic hugepage allocation Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 15/69] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init Muchun Song
` (33 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Now that hugetlb reservation runs after zone initialization, bootmem
gigantic page allocation can detect pages that span multiple zones.
Keep those cross-zone pages separate during allocation and free them
after allocation completes, so later hugetlb initialization only sees
zone-valid gigantic pages.
Cross-zone gigantic pages are freed directly instead of retrying the
allocation. Such cross-zone cases are expected to be very rare in
practice, so retry logic does not seem justified at this point, and
keeping the handling simple preserves the previous behavior. If
real-world reports later show that these cases are common, retry
support can be reconsidered.
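The freeing pass relies on a small ordering invariant set up at
allocation time: cross-zone pages are added at the head of the per-node
list without HUGE_BOOTMEM_ZONES_VALID, while zone-valid pages are
appended at the tail with the flag set. A minimal sketch of the cleanup
walk this enables (names as in the hunks below, shown only for
illustration):
	list_for_each_entry_safe(m, tmp, &huge_boot_pages[nid], list) {
		if (m->flags & HUGE_BOOTMEM_ZONES_VALID)
			break;	/* everything past here is zone-valid */
		list_del(&m->list);
		memblock_free(m, huge_page_size(h));
		freed++;
	}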
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 75 ++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 64 insertions(+), 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e9ba0be2eb17..d5d324f69d7a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3077,12 +3077,15 @@ void *__init __alloc_bootmem_huge_page(struct hstate *h, int nid)
static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
{
+ unsigned long pfn;
+ int nid_request = nid;
struct huge_bootmem_page *m = arch_alloc_bootmem_huge_page(h, nid);
if (!m)
return false;
- nid = early_pfn_to_nid(PHYS_PFN(__pa(m)));
+ pfn = PHYS_PFN(__pa(m));
+ nid = early_pfn_to_nid(pfn);
/*
* Use the beginning of the huge page to store the huge_bootmem_page
* struct (until gather_bootmem puts them into the mem_map).
@@ -3090,22 +3093,38 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
* Put them into a private list first because mem_map is not up yet.
*/
INIT_LIST_HEAD(&m->list);
- list_add(&m->list, &huge_boot_pages[nid]);
m->hstate = h;
if (!hugetlb_early_cma(h)) {
m->cma = NULL;
m->flags = 0;
}
- /*
- * Only initialize the head struct page in memmap_init_reserved_pages,
- * rest of the struct pages will be initialized by the HugeTLB
- * subsystem itself.
- * The head struct page is used to get folio information by the HugeTLB
- * subsystem like zone id and node id.
- */
- memblock_reserved_mark_noinit(__pa((void *)m + PAGE_SIZE),
- huge_page_size(h) - PAGE_SIZE);
+ /* CMA pages: zone-crossing is validated in hugetlb_cma_reserve(). */
+ if (!hugetlb_early_cma(h) &&
+ pfn_range_intersects_zones(nid, pfn, pages_per_huge_page(h))) {
+ /*
+ * If the allocated page is on a different node than requested
+ * (e.g. on PowerPC LPARs), put it on the requested node's list.
+ * Otherwise, the cross-zone page will be stranded and never
+ * freed, as the cleanup code only operates on the requested node.
+ */
+ if (WARN_ON_ONCE(nid_request != NUMA_NO_NODE && nid != nid_request))
+ list_add(&m->list, &huge_boot_pages[nid_request]);
+ else
+ list_add(&m->list, &huge_boot_pages[nid]);
+ } else {
+ list_add_tail(&m->list, &huge_boot_pages[nid]);
+ m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+ /*
+ * Only initialize the head struct page in memmap_init_reserved_pages,
+ * rest of the struct pages will be initialized by the HugeTLB
+ * subsystem itself.
+ * The head struct page is used to get folio information by the HugeTLB
+ * subsystem like zone id and node id.
+ */
+ memblock_reserved_mark_noinit(__pa((void *)m + PAGE_SIZE),
+ huge_page_size(h) - PAGE_SIZE);
+ }
return true;
}
@@ -3384,6 +3403,34 @@ void __init hugetlb_struct_page_init(void)
padata_do_multithreaded(&job);
}
+static unsigned long __init hugetlb_free_cross_zone_pages(struct hstate *h, int nid)
+{
+ unsigned long freed = 0;
+ struct huge_bootmem_page *m, *tmp;
+
+ if (!hstate_is_gigantic(h))
+ return freed;
+
+ list_for_each_entry_safe(m, tmp, &huge_boot_pages[nid], list) {
+ if (m->flags & HUGE_BOOTMEM_ZONES_VALID)
+ break;
+
+ list_del(&m->list);
+ memblock_free(m, huge_page_size(h));
+ freed++;
+ }
+
+ if (freed) {
+ char buf[32];
+
+ string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, sizeof(buf));
+ pr_warn("HugeTLB: freed %lu cross-zone hugepages of size %s on node %d.\n",
+ freed, buf, nid);
+ }
+
+ return freed;
+}
+
static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
{
unsigned long i;
@@ -3414,6 +3461,8 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
cond_resched();
}
+ i -= hugetlb_free_cross_zone_pages(h, nid);
+
if (!list_empty(&folio_list))
prep_and_add_allocated_folios(h, &folio_list);
@@ -3487,6 +3536,7 @@ static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned l
static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
{
+ int nid;
unsigned long i;
for (i = 0; i < h->max_huge_pages; ++i) {
@@ -3495,6 +3545,9 @@ static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
cond_resched();
}
+ for_each_node(nid)
+ i -= hugetlb_free_cross_zone_pages(h, nid);
+
return i;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 15/69] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (13 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 14/69] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 16/69] mm/hugetlb: Remove obsolete bootmem cross-zone checks Muchun Song
` (32 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Bootmem HugeTLB pages currently defer HVO setup to
hugetlb_vmemmap_init_late(), because the optimization needs zone
information.
Now that zones are already initialized by this point, the bootmem HVO
setup can be done directly from hugetlb_vmemmap_init_early(). This lets
gigantic HugeTLB pages apply HVO as soon as they are allocated.
Bootmem gigantic pages that span multiple zones are now filtered out
when they are allocated, so the remaining bootmem gigantic pages seen by
later hugetlb initialization are already zone-valid. As a result,
hugetlb_vmemmap_init_late() no longer needs to handle bootmem HVO setup.
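To put the per-page effect in perspective, assuming 4 KiB base pages, a
64-byte struct page and a 1 GiB gigantic page (illustrative numbers
only, matching the memmap_boot_pages_add() accounting in the hunk
below):
	/*
	 * struct pages for one 1 GiB page:  SZ_1G / SZ_4K     = 262144
	 * full vmemmap backing:             262144 * 64 bytes = 16 MiB
	 * vmemmap kept with HVO applied:    HUGETLB_VMEMMAP_RESERVE_SIZE
	 *                                   (one 4 KiB page)
	 */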
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb_vmemmap.c | 67 +++++++++-----------------------------------
1 file changed, 13 insertions(+), 54 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4f58cd940f61..e2251bc47444 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -745,6 +745,8 @@ static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
return true;
}
+static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn);
+
/*
* Initialize memmap section for a gigantic page, HVO-style.
*/
@@ -752,6 +754,7 @@ void __init hugetlb_vmemmap_init_early(int nid)
{
unsigned long psize, paddr, section_size;
unsigned long ns, i, pnum, pfn, nr_pages;
+ unsigned long start, end;
struct huge_bootmem_page *m = NULL;
void *map;
@@ -761,6 +764,8 @@ void __init hugetlb_vmemmap_init_early(int nid)
section_size = (1UL << PA_SECTION_SHIFT);
list_for_each_entry(m, &huge_boot_pages[nid], list) {
+ struct zone *zone;
+
if (!vmemmap_should_optimize_bootmem_page(m))
continue;
@@ -769,6 +774,14 @@ void __init hugetlb_vmemmap_init_early(int nid)
paddr = virt_to_phys(m);
pfn = PHYS_PFN(paddr);
map = pfn_to_page(pfn);
+ start = (unsigned long)map;
+ end = start + hugetlb_vmemmap_size(m->hstate);
+ zone = pfn_to_zone(nid, pfn);
+
+ if (vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
+ zone, HUGETLB_VMEMMAP_RESERVE_SIZE))
+ panic("Failed to allocate memmap for HugeTLB page\n");
+ memmap_boot_pages_add(DIV_ROUND_UP(HUGETLB_VMEMMAP_RESERVE_SIZE, PAGE_SIZE));
pnum = pfn_to_section_nr(pfn);
ns = psize / section_size;
@@ -800,60 +813,6 @@ static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
void __init hugetlb_vmemmap_init_late(int nid)
{
- struct huge_bootmem_page *m, *tm;
- unsigned long phys, nr_pages, start, end;
- unsigned long pfn, nr_mmap;
- struct zone *zone = NULL;
- struct hstate *h;
- void *map;
-
- if (!READ_ONCE(vmemmap_optimize_enabled))
- return;
-
- list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
- if (!(m->flags & HUGE_BOOTMEM_HVO))
- continue;
-
- phys = virt_to_phys(m);
- h = m->hstate;
- pfn = PHYS_PFN(phys);
- nr_pages = pages_per_huge_page(h);
- map = pfn_to_page(pfn);
- start = (unsigned long)map;
- end = start + nr_pages * sizeof(struct page);
-
- if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
- /*
- * Oops, the hugetlb page spans multiple zones.
- * Remove it from the list, and populate it normally.
- */
- list_del(&m->list);
-
- vmemmap_populate(start, end, nid, NULL);
- nr_mmap = end - start;
- memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
-
- memblock_phys_free(phys, huge_page_size(h));
- continue;
- }
-
- if (!zone || !zone_spans_pfn(zone, pfn))
- zone = pfn_to_zone(nid, pfn);
- if (WARN_ON_ONCE(!zone))
- continue;
-
- if (vmemmap_populate_hvo(start, end, huge_page_order(h), zone,
- HUGETLB_VMEMMAP_RESERVE_SIZE) < 0) {
- /* Fallback if HVO population fails */
- vmemmap_populate(start, end, nid, NULL);
- nr_mmap = end - start;
- } else {
- m->flags |= HUGE_BOOTMEM_ZONES_VALID;
- nr_mmap = HUGETLB_VMEMMAP_RESERVE_SIZE;
- }
-
- memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
- }
}
#endif
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 16/69] mm/hugetlb: Remove obsolete bootmem cross-zone checks
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (14 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 15/69] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 17/69] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late() Muchun Song
` (31 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Bootmem gigantic HugeTLB pages used to be validated again during
gather_bootmem_prealloc_node() and any cross-zone pages were discarded
there.
That validation is no longer needed. Cross-zone bootmem gigantic pages
are now detected during allocation and freed before they reach the later
bootmem gathering path, so the remaining pages are already zone-valid.
Remove the obsolete cross-zone validation, invalid-page freeing, and the
associated discarded-page accounting.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 2 --
mm/hugetlb.c | 70 -----------------------------------------
2 files changed, 72 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9a65271d167c..ece4e6a4a4c6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -701,8 +701,6 @@ struct huge_bootmem_page {
#define HUGE_BOOTMEM_ZONES_VALID 0x0002
#define HUGE_BOOTMEM_CMA 0x0004
-bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
-
int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
void wait_for_freed_hugetlb_folios(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d5d324f69d7a..dcf8e09ec6be 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -58,7 +58,6 @@ struct hstate hstates[HUGE_MAX_HSTATE];
__initdata nodemask_t hugetlb_bootmem_nodes;
__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
-static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata;
/*
* Due to ordering constraints across the init code for various
@@ -3238,57 +3237,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
}
}
-bool __init hugetlb_bootmem_page_zones_valid(int nid,
- struct huge_bootmem_page *m)
-{
- unsigned long start_pfn;
- bool valid;
-
- if (m->flags & HUGE_BOOTMEM_ZONES_VALID) {
- /*
- * Already validated, skip check.
- */
- return true;
- }
-
- if (hugetlb_bootmem_page_earlycma(m)) {
- valid = cma_validate_zones(m->cma);
- goto out;
- }
-
- start_pfn = virt_to_phys(m) >> PAGE_SHIFT;
-
- valid = !pfn_range_intersects_zones(nid, start_pfn,
- pages_per_huge_page(m->hstate));
-out:
- if (!valid)
- hstate_boot_nrinvalid[hstate_index(m->hstate)]++;
-
- return valid;
-}
-
-/*
- * Free a bootmem page that was found to be invalid (intersecting with
- * multiple zones).
- *
- * Since it intersects with multiple zones, we can't just do a free
- * operation on all pages at once, but instead have to walk all
- * pages, freeing them one by one.
- */
-static void __init hugetlb_bootmem_free_invalid_page(int nid, struct page *page,
- struct hstate *h)
-{
- unsigned long npages = pages_per_huge_page(h);
- unsigned long pfn;
-
- while (npages--) {
- pfn = page_to_pfn(page);
- __init_page_from_nid(pfn, nid);
- free_reserved_page(page);
- page++;
- }
-}
-
/*
* Put bootmem huge pages into the standard lists after mem_map is up.
* Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
@@ -3304,17 +3252,6 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
struct folio *folio = (void *)page;
h = m->hstate;
- if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
- /*
- * Can't use this page. Initialize the
- * page structures if that hasn't already
- * been done, and give them to the page
- * allocator.
- */
- hugetlb_bootmem_free_invalid_page(nid, page, h);
- continue;
- }
-
/*
* It is possible to have multiple huge page sizes (hstates)
* in this list. If so, process each size separately.
@@ -3703,20 +3640,13 @@ static void __init hugetlb_init_hstates(void)
static void __init report_hugepages(void)
{
struct hstate *h;
- unsigned long nrinvalid;
for_each_hstate(h) {
char buf[32];
- nrinvalid = hstate_boot_nrinvalid[hstate_index(h)];
- h->max_huge_pages -= nrinvalid;
-
string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
buf, h->nr_huge_pages);
- if (nrinvalid)
- pr_info("HugeTLB: %s page size: %lu invalid page%s discarded\n",
- buf, nrinvalid, str_plural(nrinvalid));
pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n",
hugetlb_vmemmap_optimizable_size(h) / SZ_1K, buf);
}
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 17/69] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (15 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 16/69] mm/hugetlb: Remove obsolete bootmem cross-zone checks Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 18/69] mm/hugetlb: Remove unused bootmem cma field Muchun Song
` (30 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
hugetlb_vmemmap_init_late() is now an empty stub, so the remaining
late-init path in sparse_vmemmap_init_nid_late() only calls a function
that does nothing and is effectively dead code.
Remove sparse_vmemmap_init_nid_late(), the empty
hugetlb_vmemmap_init_late() stubs, and their declarations.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 7 -------
mm/hugetlb_vmemmap.c | 4 ----
mm/hugetlb_vmemmap.h | 5 -----
mm/sparse-vmemmap.c | 11 -----------
mm/sparse.c | 1 -
5 files changed, 28 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da5..362e16497533 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2167,8 +2167,6 @@ static inline int preinited_vmemmap_section(const struct mem_section *section)
}
void sparse_vmemmap_init_nid_early(int nid);
-void sparse_vmemmap_init_nid_late(int nid);
-
#else
static inline int preinited_vmemmap_section(const struct mem_section *section)
{
@@ -2177,10 +2175,6 @@ static inline int preinited_vmemmap_section(const struct mem_section *section)
static inline void sparse_vmemmap_init_nid_early(int nid)
{
}
-
-static inline void sparse_vmemmap_init_nid_late(int nid)
-{
-}
#endif
static inline int online_section_nr(unsigned long nr)
@@ -2385,7 +2379,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
#else
#define sparse_vmemmap_init_nid_early(_nid) do {} while (0)
-#define sparse_vmemmap_init_nid_late(_nid) do {} while (0)
#define pfn_in_present_section pfn_valid
#endif /* CONFIG_SPARSEMEM */
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index e2251bc47444..952216a49bcb 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -810,10 +810,6 @@ static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
return NULL;
}
-
-void __init hugetlb_vmemmap_init_late(int nid)
-{
-}
#endif
static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 18b490825215..7ac49c52457d 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -29,7 +29,6 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
void hugetlb_vmemmap_init_early(int nid);
-void hugetlb_vmemmap_init_late(int nid);
#endif
@@ -81,10 +80,6 @@ static inline void hugetlb_vmemmap_init_early(int nid)
{
}
-static inline void hugetlb_vmemmap_init_late(int nid)
-{
-}
-
static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
{
return 0;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index fcf0ce5212f1..17d45dac4324 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -574,17 +574,6 @@ void __init sparse_vmemmap_init_nid_early(int nid)
{
hugetlb_vmemmap_init_early(nid);
}
-
-/*
- * This is called just before the initialization of page structures
- * through memmap_init. Zones are now initialized, so any work that
- * needs to be done that needs zone information can be done from
- * here.
- */
-void __init sparse_vmemmap_init_nid_late(int nid)
-{
- hugetlb_vmemmap_init_late(nid);
-}
#endif
static void subsection_mask_set(unsigned long *map, unsigned long pfn,
diff --git a/mm/sparse.c b/mm/sparse.c
index 3917a47153d8..324213d8bdcb 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -320,7 +320,6 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
}
}
sparse_usage_fini();
- sparse_vmemmap_init_nid_late(nid);
}
/*
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 18/69] mm/hugetlb: Remove unused bootmem cma field
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (16 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 17/69] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late() Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 19/69] mm/mm_init: Make __init_page_from_nid() static Muchun Song
` (29 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
struct huge_bootmem_page no longer needs to keep the CMA pointer. The
bootmem path only needs to remember whether a huge page came from CMA,
which is already encoded in the flags field.
Set HUGE_BOOTMEM_CMA when the page is allocated and drop the unused cma
field together with the redundant assignments.
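Callers that previously consulted m->cma can key off the flag instead;
a minimal sketch of such a check (this mirrors the existing
hugetlb_bootmem_page_earlycma() helper and is shown only for
illustration):
	static bool __init hugetlb_bootmem_page_earlycma(struct huge_bootmem_page *m)
	{
		return m->flags & HUGE_BOOTMEM_CMA;
	}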
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 1 -
mm/hugetlb.c | 5 +----
mm/hugetlb_cma.c | 27 ++++++++++-----------------
3 files changed, 11 insertions(+), 22 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ece4e6a4a4c6..fd901bb3630c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -694,7 +694,6 @@ struct huge_bootmem_page {
struct list_head list;
struct hstate *hstate;
unsigned long flags;
- struct cma *cma;
};
#define HUGE_BOOTMEM_HVO 0x0001
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dcf8e09ec6be..1f0a0e31d624 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3093,10 +3093,7 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
*/
INIT_LIST_HEAD(&m->list);
m->hstate = h;
- if (!hugetlb_early_cma(h)) {
- m->cma = NULL;
- m->flags = 0;
- }
+ m->flags = hugetlb_early_cma(h) ? HUGE_BOOTMEM_CMA : 0;
/* CMA pages: zone-crossing is validated in hugetlb_cma_reserve(). */
if (!hugetlb_early_cma(h) &&
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index 6b5c2aec4449..fbe5ed7ffaa7 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -65,26 +65,19 @@ hugetlb_cma_alloc_bootmem(struct hstate *h, int nid, bool node_exact)
cma = hugetlb_cma[nid];
m = cma_reserve_early(cma, huge_page_size(h));
- if (!m) {
- if (node_exact)
- return NULL;
+ if (m || node_exact)
+ return m;
- for_each_node_mask(node, hugetlb_bootmem_nodes) {
- cma = hugetlb_cma[node];
- if (!cma || node == nid)
- continue;
- m = cma_reserve_early(cma, huge_page_size(h));
- if (m)
- break;
- }
- }
-
- if (m) {
- m->flags = HUGE_BOOTMEM_CMA;
- m->cma = cma;
+ for_each_node_mask(node, hugetlb_bootmem_nodes) {
+ cma = hugetlb_cma[node];
+ if (!cma || node == nid)
+ continue;
+ m = cma_reserve_early(cma, huge_page_size(h));
+ if (m)
+ return m;
}
- return m;
+ return NULL;
}
static int __init cmdline_parse_hugetlb_cma(char *p)
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 19/69] mm/mm_init: Make __init_page_from_nid() static
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (17 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 18/69] mm/hugetlb: Remove unused bootmem cma field Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 20/69] mm/sparse-vmemmap: Drop VMEMMAP_POPULATE_PAGEREF Muchun Song
` (28 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
__init_page_from_nid() no longer has external users and is only used
locally in mm/mm_init.c under CONFIG_DEFERRED_STRUCT_PAGE_INIT.
Make it static and keep it inside that block.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 1 -
mm/mm_init.c | 4 ++--
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 6bd9aa37b952..4a5053368078 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1754,7 +1754,6 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte
void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid);
-void __meminit __init_page_from_nid(unsigned long pfn, int nid);
/* shrinker related functions */
unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 165b83c9a9c3..c64e5d63c4ae 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -688,10 +688,11 @@ static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
}
#endif
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* Initialize a reserved page unconditionally, finding its zone first.
*/
-void __meminit __init_page_from_nid(unsigned long pfn, int nid)
+static void __meminit __init_page_from_nid(unsigned long pfn, int nid)
{
pg_data_t *pgdat;
int zid;
@@ -713,7 +714,6 @@ void __meminit __init_page_from_nid(unsigned long pfn, int nid)
}
}
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
{
pgdat->first_deferred_pfn = ULONG_MAX;
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 20/69] mm/sparse-vmemmap: Drop VMEMMAP_POPULATE_PAGEREF
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (18 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 19/69] mm/mm_init: Make __init_page_from_nid() static Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 21/69] mm: Rename vmemmap optimization macros around folio semantics Muchun Song
` (27 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
The page reference requested via VMEMMAP_POPULATE_PAGEREF is only
needed once the slab allocator is available, so the flag does not need
to be threaded through the vmemmap population call chain.
Drop the flag and test slab_is_available() directly in
vmemmap_pte_populate() instead.
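The two relevant callers make the equivalence visible in the hunks
below: the boot-time HVO path never passed the flag and runs before the
slab allocator is up, while the ZONE_DEVICE compound path always passed
it and always runs after. The resulting check in vmemmap_pte_populate()
(annotated sketch):
	/*
	 * ptpfn is a reused tail vmemmap page here.  Only the post-boot
	 * ZONE_DEVICE compound path wants a reference on it, and that is
	 * exactly the path that runs with slab available.
	 */
	if (slab_is_available())
		get_page(pfn_to_page(ptpfn));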
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/sparse-vmemmap.c | 40 ++++++++++++++--------------------------
1 file changed, 14 insertions(+), 26 deletions(-)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 17d45dac4324..d7e9fb47f7ee 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -33,13 +33,6 @@
#include <asm/tlbflush.h>
#include "hugetlb_vmemmap.h"
-
-/*
- * Flags for vmemmap_populate_range and friends.
- */
-/* Get a ref on the head page struct page, for ZONE_DEVICE compound pages */
-#define VMEMMAP_POPULATE_PAGEREF 0x0001
-
#include "internal.h"
/*
@@ -147,8 +140,8 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
}
static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
- struct vmem_altmap *altmap,
- unsigned long ptpfn, unsigned long flags)
+ struct vmem_altmap *altmap,
+ unsigned long ptpfn)
{
pte_t *pte = pte_offset_kernel(pmd, addr);
if (pte_none(ptep_get(pte))) {
@@ -170,7 +163,7 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
* and through vmemmap_populate_compound_pages() when
* slab is available.
*/
- if (flags & VMEMMAP_POPULATE_PAGEREF)
+ if (slab_is_available())
get_page(pfn_to_page(ptpfn));
}
entry = pfn_pte(ptpfn, PAGE_KERNEL);
@@ -243,8 +236,7 @@ static pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
struct vmem_altmap *altmap,
- unsigned long ptpfn,
- unsigned long flags)
+ unsigned long ptpfn)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -264,7 +256,7 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
pmd = vmemmap_pmd_populate(pud, addr, node);
if (!pmd)
return NULL;
- pte = vmemmap_pte_populate(pmd, addr, node, altmap, ptpfn, flags);
+ pte = vmemmap_pte_populate(pmd, addr, node, altmap, ptpfn);
if (!pte)
return NULL;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
@@ -275,15 +267,14 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
static int __meminit vmemmap_populate_range(unsigned long start,
unsigned long end, int node,
struct vmem_altmap *altmap,
- unsigned long ptpfn,
- unsigned long flags)
+ unsigned long ptpfn)
{
unsigned long addr = start;
pte_t *pte;
for (; addr < end; addr += PAGE_SIZE) {
pte = vmemmap_populate_address(addr, node, altmap,
- ptpfn, flags);
+ ptpfn);
if (!pte)
return -ENOMEM;
}
@@ -294,7 +285,7 @@ static int __meminit vmemmap_populate_range(unsigned long start,
int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
int node, struct vmem_altmap *altmap)
{
- return vmemmap_populate_range(start, end, node, altmap, -1, 0);
+ return vmemmap_populate_range(start, end, node, altmap, -1);
}
/*
@@ -370,7 +361,7 @@ int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
return -ENOMEM;
for (maddr = addr; maddr < addr + headsize; maddr += PAGE_SIZE) {
- pte = vmemmap_populate_address(maddr, node, NULL, -1, 0);
+ pte = vmemmap_populate_address(maddr, node, NULL, -1);
if (!pte)
return -ENOMEM;
}
@@ -378,8 +369,7 @@ int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
/*
* Reuse the last page struct page mapped above for the rest.
*/
- return vmemmap_populate_range(maddr, end, node, NULL,
- page_to_pfn(tail), 0);
+ return vmemmap_populate_range(maddr, end, node, NULL, page_to_pfn(tail));
}
#endif
@@ -503,8 +493,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
* with just tail struct pages.
*/
return vmemmap_populate_range(start, end, node, NULL,
- pte_pfn(ptep_get(pte)),
- VMEMMAP_POPULATE_PAGEREF);
+ pte_pfn(ptep_get(pte)));
}
size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
@@ -512,13 +501,13 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
unsigned long next, last = addr + size;
/* Populate the head page vmemmap page */
- pte = vmemmap_populate_address(addr, node, NULL, -1, 0);
+ pte = vmemmap_populate_address(addr, node, NULL, -1);
if (!pte)
return -ENOMEM;
/* Populate the tail pages vmemmap page */
next = addr + PAGE_SIZE;
- pte = vmemmap_populate_address(next, node, NULL, -1, 0);
+ pte = vmemmap_populate_address(next, node, NULL, -1);
if (!pte)
return -ENOMEM;
@@ -528,8 +517,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
*/
next += PAGE_SIZE;
rc = vmemmap_populate_range(next, last, node, NULL,
- pte_pfn(ptep_get(pte)),
- VMEMMAP_POPULATE_PAGEREF);
+ pte_pfn(ptep_get(pte)));
if (rc)
return -ENOMEM;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 21/69] mm: Rename vmemmap optimization macros around folio semantics
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (19 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 20/69] mm/sparse-vmemmap: Drop VMEMMAP_POPULATE_PAGEREF Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 22/69] mm/sparse: Drop power-of-2 size requirement for struct mem_section Muchun Song
` (26 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
The existing vmemmap optimization macros are named in terms of tail
pages, but they actually describe which folio sizes can use the
optimization and how much vmemmap backing an optimized folio keeps.
Rename them to reflect that meaning directly. This makes the names work
for both HugeTLB and other folio-based users such as DAX.
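For reference, with 4 KiB base pages and a 64-byte struct page (the
common x86-64 configuration), the renamed macros work out to the same
cut-off the old names expressed:
	/*
	 * OPTIMIZED_FOLIO_VMEMMAP_PAGES            = 1
	 * OPTIMIZED_FOLIO_VMEMMAP_SIZE             = 1 * 4096      = 4 KiB
	 * OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES  = 4096 / 64     = 64
	 * OPTIMIZABLE_FOLIO_MIN_ORDER              = ilog2(64) + 1 = 7
	 *
	 * An order-7 folio (128 base pages) is the smallest whose struct
	 * pages (128 * 64 = 8 KiB) span two vmemmap pages, i.e. the
	 * smallest folio for which sharing tail vmemmap pages can save
	 * anything -- the same value the old VMEMMAP_TAIL_MIN_ORDER
	 * (ilog2(2 * 4096 / 64) = 7) produced.
	 */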
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 18 ++++++++++--------
mm/hugetlb.c | 4 ++--
mm/hugetlb_vmemmap.c | 2 +-
mm/sparse-vmemmap.c | 4 ++--
4 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 362e16497533..40b1cea98b82 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -107,13 +107,15 @@
is_power_of_2(sizeof(struct page)) ? \
MAX_FOLIO_NR_PAGES * sizeof(struct page) : 0)
-/*
- * vmemmap optimization (like HVO) is only possible for page orders that fill
- * two or more pages with struct pages.
- */
-#define VMEMMAP_TAIL_MIN_ORDER (ilog2(2 * PAGE_SIZE / sizeof(struct page)))
-#define __NR_VMEMMAP_TAILS (MAX_FOLIO_ORDER - VMEMMAP_TAIL_MIN_ORDER + 1)
-#define NR_VMEMMAP_TAILS (__NR_VMEMMAP_TAILS > 0 ? __NR_VMEMMAP_TAILS : 0)
+/* The number of vmemmap pages required by a vmemmap-optimized folio. */
+#define OPTIMIZED_FOLIO_VMEMMAP_PAGES 1
+#define OPTIMIZED_FOLIO_VMEMMAP_SIZE (OPTIMIZED_FOLIO_VMEMMAP_PAGES * PAGE_SIZE)
+#define OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES (OPTIMIZED_FOLIO_VMEMMAP_SIZE / sizeof(struct page))
+#define OPTIMIZABLE_FOLIO_MIN_ORDER (ilog2(OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES) + 1)
+
+#define __NR_OPTIMIZABLE_FOLIO_ORDERS (MAX_FOLIO_ORDER - OPTIMIZABLE_FOLIO_MIN_ORDER + 1)
+#define NR_OPTIMIZABLE_FOLIO_ORDERS \
+ (__NR_OPTIMIZABLE_FOLIO_ORDERS > 0 ? __NR_OPTIMIZABLE_FOLIO_ORDERS : 0)
enum migratetype {
MIGRATE_UNMOVABLE,
@@ -1146,7 +1148,7 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
- struct page *vmemmap_tails[NR_VMEMMAP_TAILS];
+ struct page *vmemmap_tails[NR_OPTIMIZABLE_FOLIO_ORDERS];
#endif
} ____cacheline_internodealigned_in_smp;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1f0a0e31d624..53448b05ca11 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3318,7 +3318,7 @@ void __init hugetlb_struct_page_init(void)
struct zone *zone;
for_each_zone(zone) {
- for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
+ for (int i = 0; i < NR_OPTIMIZABLE_FOLIO_ORDERS; i++) {
struct page *tail, *p;
unsigned int order;
@@ -3326,7 +3326,7 @@ void __init hugetlb_struct_page_init(void)
if (!tail)
continue;
- order = i + VMEMMAP_TAIL_MIN_ORDER;
+ order = i + OPTIMIZABLE_FOLIO_MIN_ORDER;
p = page_to_virt(tail);
for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
init_compound_tail(p + j, NULL, order, zone);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 952216a49bcb..e9906d32a64c 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -495,7 +495,7 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
static struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
{
- const unsigned int idx = order - VMEMMAP_TAIL_MIN_ORDER;
+ const unsigned int idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
struct page *tail, *p;
int node = zone_to_nid(zone);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d7e9fb47f7ee..39529245d790 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -318,12 +318,12 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
unsigned int idx;
int node = zone_to_nid(zone);
- if (WARN_ON_ONCE(order < VMEMMAP_TAIL_MIN_ORDER))
+ if (WARN_ON_ONCE(order < OPTIMIZABLE_FOLIO_MIN_ORDER))
return NULL;
if (WARN_ON_ONCE(order > MAX_FOLIO_ORDER))
return NULL;
- idx = order - VMEMMAP_TAIL_MIN_ORDER;
+ idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
tail = zone->vmemmap_tails[idx];
if (tail)
return tail;
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 22/69] mm/sparse: Drop power-of-2 size requirement for struct mem_section
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (20 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 21/69] mm: Rename vmemmap optimization macros around folio semantics Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 23/69] mm/sparse-vmemmap: Track compound page order in " Muchun Song
` (25 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
struct mem_section is currently forced to a power-of-2 size so the
section-to-root lookup can use a mask instead of a modulo.
That requirement adds configuration-dependent padding, especially with
CONFIG_PAGE_EXTENSION, just to preserve the lookup scheme.
Drop the constraint and use a plain modulo for the lookup instead. The
divisor is constant, so the generated code remains cheap while avoiding
the extra padding. It also removes an unnecessary layout constraint
from the type.
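The lookup change itself is mechanical; a minimal before/after sketch
(SECTIONS_PER_ROOT is a compile-time constant in both cases):
	/* before: needs sizeof(struct mem_section) to be a power of two */
	ms = &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
	/* after: constant-divisor modulo, no layout requirement */
	ms = &mem_section[SECTION_NR_TO_ROOT(nr)][nr % SECTIONS_PER_ROOT];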
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 8 +-------
mm/sparse.c | 2 --
scripts/gdb/linux/mm.py | 6 ++----
3 files changed, 3 insertions(+), 13 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 40b1cea98b82..ae0271eaec05 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2027,12 +2027,7 @@ struct mem_section {
* section. (see page_ext.h about this.)
*/
struct page_ext *page_ext;
- unsigned long pad;
#endif
- /*
- * WARNING: mem_section must be a power-of-2 in size for the
- * calculation and use of SECTION_ROOT_MASK to make sense.
- */
};
#ifdef CONFIG_SPARSEMEM_EXTREME
@@ -2043,7 +2038,6 @@ struct mem_section {
#define SECTION_NR_TO_ROOT(sec) ((sec) / SECTIONS_PER_ROOT)
#define NR_SECTION_ROOTS DIV_ROUND_UP(NR_MEM_SECTIONS, SECTIONS_PER_ROOT)
-#define SECTION_ROOT_MASK (SECTIONS_PER_ROOT - 1)
#ifdef CONFIG_SPARSEMEM_EXTREME
extern struct mem_section **mem_section;
@@ -2067,7 +2061,7 @@ static inline struct mem_section *__nr_to_section(unsigned long nr)
if (!mem_section || !mem_section[root])
return NULL;
#endif
- return &mem_section[root][nr & SECTION_ROOT_MASK];
+ return &mem_section[root][nr % SECTIONS_PER_ROOT];
}
extern size_t mem_section_usage_size(void);
diff --git a/mm/sparse.c b/mm/sparse.c
index 324213d8bdcb..9457a4d6a6fc 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -331,8 +331,6 @@ void __init sparse_init(void)
unsigned long pnum_end, pnum_begin, map_count = 1;
int nid_begin;
- /* see include/linux/mmzone.h 'struct mem_section' definition */
- BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
memblocks_present();
if (compound_info_has_mask()) {
diff --git a/scripts/gdb/linux/mm.py b/scripts/gdb/linux/mm.py
index dffadccbb01d..da4e8e9655a6 100644
--- a/scripts/gdb/linux/mm.py
+++ b/scripts/gdb/linux/mm.py
@@ -70,7 +70,6 @@ class x86_page_ops():
self.SECTIONS_PER_ROOT = 1
self.NR_SECTION_ROOTS = DIV_ROUND_UP(self.NR_MEM_SECTIONS, self.SECTIONS_PER_ROOT)
- self.SECTION_ROOT_MASK = self.SECTIONS_PER_ROOT - 1
try:
self.SECTION_HAS_MEM_MAP = 1 << int(gdb.parse_and_eval('SECTION_HAS_MEM_MAP_BIT'))
@@ -100,7 +99,7 @@ class x86_page_ops():
def __nr_to_section(self, nr):
root = self.SECTION_NR_TO_ROOT(nr)
mem_section = gdb.parse_and_eval("mem_section")
- return mem_section[root][nr & self.SECTION_ROOT_MASK]
+ return mem_section[root][nr % self.SECTIONS_PER_ROOT]
def pfn_to_section_nr(self, pfn):
return pfn >> self.PFN_SECTION_SHIFT
@@ -249,7 +248,6 @@ class aarch64_page_ops():
self.SECTIONS_PER_ROOT = 1
self.NR_SECTION_ROOTS = DIV_ROUND_UP(self.NR_MEM_SECTIONS, self.SECTIONS_PER_ROOT)
- self.SECTION_ROOT_MASK = self.SECTIONS_PER_ROOT - 1
self.SUBSECTION_SHIFT = 21
self.SEBSECTION_SIZE = 1 << self.SUBSECTION_SHIFT
self.PFN_SUBSECTION_SHIFT = self.SUBSECTION_SHIFT - self.PAGE_SHIFT
@@ -304,7 +302,7 @@ class aarch64_page_ops():
def __nr_to_section(self, nr):
root = self.SECTION_NR_TO_ROOT(nr)
mem_section = gdb.parse_and_eval("mem_section")
- return mem_section[root][nr & self.SECTION_ROOT_MASK]
+ return mem_section[root][nr % self.SECTIONS_PER_ROOT]
def pfn_to_section_nr(self, pfn):
return pfn >> self.PFN_SECTION_SHIFT
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 23/69] mm/sparse-vmemmap: Track compound page order in struct mem_section
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (21 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 22/69] mm/sparse: Drop power-of-2 size requirement for struct mem_section Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 24/69] mm/mm_init: Skip initializing shared vmemmap tail pages Muchun Song
` (24 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
HugeTLB and DAX both rely on vmemmap optimization, but sparsemem does
not record what compound page order a section is populated with.
As a result, code that needs this information has to open-code
separate handling across users of vmemmap optimization. It also
prevents other memory management code, such as struct page
initialization, from skipping initialization of shared vmemmap pages
when needed.
Track the compound page order in struct mem_section and provide small
helpers to access it. A compound page larger than a section naturally
carries the same order across all covered sections.
This is a preparatory change for consolidating vmemmap optimization
handling and for letting later code make initialization decisions
based on the section's compound page order.
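As a concrete example of the "larger than a section" case, assuming
x86-64 with 4 KiB pages and 128 MiB sections (SECTION_SIZE_BITS = 27),
a sketch of how one 1 GiB gigantic page starting at pfn would be
recorded (illustrative only):
	/*
	 * A 1 GiB page is order 18 (2^18 base pages) and covers
	 * 1 GiB / 128 MiB = 8 consecutive sections; each section
	 * records the same order.
	 */
	for (i = 0; i < 8; i++)
		section_set_order(__pfn_to_section(pfn + i * PAGES_PER_SECTION), 18);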
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ae0271eaec05..6f112e6f42bb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2028,6 +2028,14 @@ struct mem_section {
*/
struct page_ext *page_ext;
#endif
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+ /*
+ * The order of compound pages in this section. Typically, the section
+ * holds compound pages of this order; a larger compound page will span
+ * multiple sections.
+ */
+ unsigned int order;
+#endif
};
#ifdef CONFIG_SPARSEMEM_EXTREME
@@ -2224,6 +2232,17 @@ static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long
*pfn = (*pfn & PAGE_SECTION_MASK) + (bit * PAGES_PER_SUBSECTION);
return true;
}
+
+static inline void section_set_order(struct mem_section *section, unsigned int order)
+{
+ VM_WARN_ON(section->order && order && section->order != order);
+ section->order = order;
+}
+
+static inline unsigned int section_order(const struct mem_section *section)
+{
+ return section->order;
+}
#else
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
@@ -2234,6 +2253,15 @@ static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long
{
return true;
}
+
+static inline void section_set_order(struct mem_section *section, unsigned int order)
+{
+}
+
+static inline unsigned int section_order(const struct mem_section *section)
+{
+ return 0;
+}
#endif
void sparse_init_early_section(int nid, struct page *map, unsigned long pnum,
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 24/69] mm/mm_init: Skip initializing shared vmemmap tail pages
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (22 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 23/69] mm/sparse-vmemmap: Track compound page order in " Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 25/69] mm/sparse-vmemmap: Initialize shared tail vmemmap pages on allocation Muchun Song
` (23 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
memmap_init_range() initializes every struct page in the target range.
For compound pages with vmemmap optimization, the tail struct pages are
backed by a shared vmemmap page.
Initializing those tail struct pages would overwrite the shared
vmemmap page contents, so users such as HugeTLB have to open-code
follow-up handling to restore the metadata afterwards.
Use the section's compound page order to detect struct pages that fall
into the shared tail vmemmap range and skip their initialization in
memmap_init_range(). Still initialize the pageblock migratetypes for
the skipped range so the surrounding setup remains intact.
This is a preparatory change for consolidating handling across users of
vmemmap optimization, and it also avoids redundant initialization of
shared tail vmemmap pages during early boot.
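For example, with 4 KiB pages, a 64-byte struct page and a section
filled with 2 MiB (order-9) compound pages, the skip works out as
follows:
	/*
	 * OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES = 4096 / 64 = 64, so
	 * within each 512-page compound only pfn offsets 0..63 (the one
	 * vmemmap page of struct pages kept by the optimization) are
	 * initialized here; offsets 64..511 are backed by the shared
	 * tail page and are skipped, with only their pageblock
	 * migratetypes set up.
	 */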
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 9 +++++++++
mm/internal.h | 16 ++++++++++++++++
mm/mm_init.c | 19 +++++++++++++------
3 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6f112e6f42bb..5fc968bac1f7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2264,6 +2264,11 @@ static inline unsigned int section_order(const struct mem_section *section)
}
#endif
+static inline unsigned int pfn_to_section_order(unsigned long pfn)
+{
+ return section_order(__pfn_to_section(pfn));
+}
+
void sparse_init_early_section(int nid, struct page *map, unsigned long pnum,
unsigned long flags);
@@ -2404,6 +2409,10 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
#else
#define sparse_vmemmap_init_nid_early(_nid) do {} while (0)
#define pfn_in_present_section pfn_valid
+static inline unsigned int pfn_to_section_order(unsigned long pfn)
+{
+ return 0;
+}
#endif /* CONFIG_SPARSEMEM */
/*
diff --git a/mm/internal.h b/mm/internal.h
index 4a5053368078..1f1c07eb70e2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1004,10 +1004,26 @@ static inline void sparse_init(void) {}
*/
#ifdef CONFIG_SPARSEMEM_VMEMMAP
void sparse_init_subsection_map(void);
+
+static inline bool vmemmap_page_optimizable(const struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long nr_pages = 1UL << pfn_to_section_order(pfn);
+
+ if (!is_power_of_2(sizeof(struct page)))
+ return false;
+
+ return (pfn & (nr_pages - 1)) >= OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES;
+}
#else
static inline void sparse_init_subsection_map(void)
{
}
+
+static inline bool vmemmap_page_optimizable(const struct page *page)
+{
+ return false;
+}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c64e5d63c4ae..3aaee1cf7bf0 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -674,19 +674,17 @@ static inline void fixup_hashdist(void)
static inline void fixup_hashdist(void) {}
#endif /* CONFIG_NUMA */
-#if defined(CONFIG_ZONE_DEVICE) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
- unsigned long nr_pages, int migratetype, bool atomic)
+ unsigned long nr_pages, int migratetype, bool isolate, bool atomic)
{
const unsigned long end = pfn + nr_pages;
for (pfn = pageblock_align(pfn); pfn < end; pfn += pageblock_nr_pages) {
- init_pageblock_migratetype(pfn_to_page(pfn), migratetype, false);
+ init_pageblock_migratetype(pfn_to_page(pfn), migratetype, isolate);
if (!atomic && IS_ALIGNED(pfn, PAGES_PER_SECTION))
cond_resched();
}
}
-#endif
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
@@ -916,6 +914,15 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
}
page = pfn_to_page(pfn);
+ if (vmemmap_page_optimizable(page)) {
+ unsigned long start = pfn;
+
+ pfn = min(ALIGN(start, 1UL << pfn_to_section_order(pfn)), end_pfn);
+ pageblock_migratetype_init_range(start, pfn - start, migratetype,
+ isolate_pageblock, false);
+ continue;
+ }
+
__init_single_page(page, pfn, zone, nid);
if (context == MEMINIT_HOTPLUG) {
#ifdef CONFIG_ZONE_DEVICE
@@ -1142,7 +1149,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
compound_nr_pages(pfn, altmap, pgmap));
}
- pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false);
+ pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false, false);
pr_debug("%s initialised %lu pages in %ums\n", __func__,
nr_pages, jiffies_to_msecs(jiffies - start));
@@ -1982,7 +1989,7 @@ static void __init deferred_free_pages(unsigned long pfn,
if (!nr_pages)
return;
- pageblock_migratetype_init_range(pfn, nr_pages, mt, true);
+ pageblock_migratetype_init_range(pfn, nr_pages, mt, false, true);
page = pfn_to_page(pfn);
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 25/69] mm/sparse-vmemmap: Initialize shared tail vmemmap pages on allocation
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (23 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 24/69] mm/mm_init: Skip initializing shared vmemmap tail pages Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 26/69] mm/sparse-vmemmap: Support section-based vmemmap accounting Muchun Song
` (22 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
The shared tail vmemmap page allocated in vmemmap_get_tail() used to be
left uninitialized, because memmap_init_range() would later overwrite
it. That forced users such as HugeTLB to defer the initialization to
their own setup paths.
Now that memmap_init_range() skips shared tail vmemmap pages, initialize
them immediately in vmemmap_get_tail() with init_compound_tail()
instead.
This moves the initialization to the point where the shared tail page is
allocated and avoids relying on deferred handling in individual users.
The remaining deferred initialization in HugeTLB will be removed once it
switches to the section compound page order mechanism.
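With memmap_init_range() no longer rewriting this page, the loop added
here becomes the single place the shared tail struct pages get their
state; roughly, assuming 4 KiB pages and a 64-byte struct page:
	/*
	 * One shared tail page holds 4096 / 64 = 64 struct pages.  Each
	 * is initialized once as a compound tail of the given order and
	 * then reused behind every optimized folio of that order in the
	 * zone (via zone->vmemmap_tails[]).
	 */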
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/sparse-vmemmap.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 39529245d790..60d5330a8399 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -328,18 +328,11 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
if (tail)
return tail;
- /*
- * Only allocate the page, but do not initialize it.
- *
- * Any initialization done here will be overwritten by memmap_init().
- *
- * hugetlb_vmemmap_init() will take care of initialization after
- * memmap_init().
- */
-
p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
+ for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+ init_compound_tail(p + i, NULL, order, zone);
tail = virt_to_page(p);
zone->vmemmap_tails[idx] = tail;
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 26/69] mm/sparse-vmemmap: Support section-based vmemmap accounting
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (24 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 25/69] mm/sparse-vmemmap: Initialize shared tail vmemmap pages on allocation Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 27/69] mm/sparse-vmemmap: Support section-based vmemmap optimization Muchun Song
` (21 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Teach section_nr_vmemmap_pages() to account for section-based vmemmap
optimization, so the helper can report the vmemmap page usage for a
memory section with or without shared tail vmemmap pages.
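For example, for one 128 MiB section (32768 base pages) fully populated
with 2 MiB (order-9) HugeTLB pages, assuming 4 KiB pages and a 64-byte
struct page:
	/*
	 * unoptimized: DIV_ROUND_UP(32768 * 64, 4096)              = 512 vmemmap pages
	 * optimized:   OPTIMIZED_FOLIO_VMEMMAP_PAGES * 32768 / 512 =  64 vmemmap pages
	 */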
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 8 ++++++++
mm/sparse-vmemmap.c | 13 +++++++++----
2 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5fc968bac1f7..0974205abd3d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2269,6 +2269,14 @@ static inline unsigned int pfn_to_section_order(unsigned long pfn)
return section_order(__pfn_to_section(pfn));
}
+static inline bool section_vmemmap_optimizable(const struct mem_section *section)
+{
+ if (!is_power_of_2(sizeof(struct page)))
+ return false;
+
+ return section_order(section) >= OPTIMIZABLE_FOLIO_MIN_ORDER;
+}
+
void sparse_init_early_section(int nid, struct page *map, unsigned long pnum,
unsigned long flags);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 60d5330a8399..94964363d95c 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -629,24 +629,29 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
static int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
{
- const unsigned int order = pgmap ? pgmap->vmemmap_shift : 0;
+ const struct mem_section *ms = __pfn_to_section(pfn);
+ const unsigned int order = pgmap ? pgmap->vmemmap_shift : section_order(ms);
const unsigned long pages_per_compound = 1UL << order;
+ unsigned int vmemmap_pages = OPTIMIZED_FOLIO_VMEMMAP_PAGES;
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
- if (!vmemmap_can_optimize(altmap, pgmap))
+ if (vmemmap_can_optimize(altmap, pgmap))
+ vmemmap_pages = VMEMMAP_RESERVE_NR;
+
+ if (!vmemmap_can_optimize(altmap, pgmap) && !section_vmemmap_optimizable(ms))
return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
if (order < PFN_SECTION_SHIFT) {
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
- return VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;
+ return vmemmap_pages * nr_pages / pages_per_compound;
}
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
if (IS_ALIGNED(pfn, pages_per_compound))
- return VMEMMAP_RESERVE_NR;
+ return vmemmap_pages;
return 0;
}
--
2.54.0
* [PATCH v2 27/69] mm/sparse-vmemmap: Support section-based vmemmap optimization
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (25 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 26/69] mm/sparse-vmemmap: Support section-based vmemmap accounting Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 28/69] mm/hugetlb: Use generic vmemmap optimization macros Muchun Song
` (20 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Teach the sparse-vmemmap population code to use the compound page order
recorded in the section metadata when deciding whether a vmemmap page
can be optimized.
With this information, the common sparse-vmemmap population path can
allocate or reuse shared tail vmemmap pages directly instead of relying
on HugeTLB/DAX-specific handling.
This centralizes vmemmap optimization logic in the sparse-vmemmap code,
based on section metadata, and prepares for sharing the same mechanism
across different users of vmemmap optimization, including HugeTLB and
DAX.
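Roughly, the per-PTE decision added here boils down to the following
sketch (simplified from the vmemmap_pte_populate() hunk below; error
handling and the already-mapped WARN path are omitted):

	if (vmemmap_page_optimizable((struct page *)addr) &&
	    ptpfn == (unsigned long)-1) {
		/* Tail part of an optimizable section: reuse the per-zone
		 * shared tail page recorded for this compound order. */
		ptpfn = page_to_pfn(vmemmap_get_tail(section_order(ms), zone));
	}
	if (ptpfn == (unsigned long)-1) {
		/* Head part, or nothing to optimize: allocate as before. */
		ptpfn = PHYS_PFN(__pa(vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap)));
	}
	set_pte_at(&init_mm, addr, pte, pfn_pte(ptpfn, PAGE_KERNEL));

Concretely, for an order-9 (2 MiB) folio on x86-64, and assuming the
optimized head area stays at one page as with the old HugeTLB reserve,
one of the folio's eight vmemmap pages is backed by a freshly allocated
page while the other seven PTEs all point at the zone's shared tail page
for that order.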
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 2 +-
mm/internal.h | 3 ++
mm/sparse-vmemmap.c | 89 +++++++++++++++++++++++++-----------------
mm/sparse.c | 34 +++++++++++++++-
4 files changed, 89 insertions(+), 39 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0974205abd3d..bf4c40818b63 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1147,7 +1147,7 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
struct page *vmemmap_tails[NR_OPTIMIZABLE_FOLIO_ORDERS];
#endif
} ____cacheline_internodealigned_in_smp;
diff --git a/mm/internal.h b/mm/internal.h
index 1f1c07eb70e2..2defdef1aedf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -995,6 +995,9 @@ static inline void __section_mark_present(struct mem_section *ms,
ms->section_mem_map |= SECTION_MARKED_PRESENT;
}
+
+int section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
+ struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
#else
static inline void sparse_init(void) {}
#endif /* CONFIG_SPARSEMEM */
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 94964363d95c..69ae40692e41 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -139,17 +139,49 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
start, end - 1);
}
+static struct zone __meminit *pfn_to_zone(unsigned long pfn, int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ for (enum zone_type zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+ struct zone *zone = &pgdat->node_zones[zone_type];
+
+ if (zone_spans_pfn(zone, pfn))
+ return zone;
+ }
+
+ return NULL;
+}
+
+static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone);
+
static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
struct vmem_altmap *altmap,
unsigned long ptpfn)
{
pte_t *pte = pte_offset_kernel(pmd, addr);
+
if (pte_none(ptep_get(pte))) {
pte_t entry;
- void *p;
+
+ if (vmemmap_page_optimizable((struct page *)addr) &&
+ ptpfn == (unsigned long)-1) {
+ struct page *page;
+ unsigned long pfn = page_to_pfn((struct page *)addr);
+ const struct mem_section *ms = __pfn_to_section(pfn);
+ struct zone *zone = pfn_to_zone(pfn, node);
+
+ if (WARN_ON_ONCE(!zone))
+ return NULL;
+ page = vmemmap_get_tail(section_order(ms), zone);
+ if (!page)
+ return NULL;
+ ptpfn = page_to_pfn(page);
+ }
if (ptpfn == (unsigned long)-1) {
- p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+ void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+
if (!p)
return NULL;
ptpfn = PHYS_PFN(__pa(p));
@@ -168,7 +200,8 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
}
entry = pfn_pte(ptpfn, PAGE_KERNEL);
set_pte_at(&init_mm, addr, pte, entry);
- }
+ } else if (WARN_ON_ONCE(vmemmap_page_optimizable((struct page *)addr)))
+ return NULL;
return pte;
}
@@ -311,7 +344,6 @@ void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
}
}
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
{
struct page *p, *tail;
@@ -340,6 +372,7 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
return tail;
}
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
unsigned int order, struct zone *zone,
unsigned long headsize)
@@ -388,6 +421,9 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
pmd_t *pmd;
for (addr = start; addr < end; addr = next) {
+ unsigned long pfn = page_to_pfn((struct page *)addr);
+ struct mem_section *ms = __pfn_to_section(pfn);
+
next = pmd_addr_end(addr, end);
pgd = vmemmap_pgd_populate(addr, node);
@@ -403,7 +439,7 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
return -ENOMEM;
pmd = pmd_offset(pud, addr);
- if (pmd_none(pmdp_get(pmd))) {
+ if (pmd_none(pmdp_get(pmd)) && !section_vmemmap_optimizable(ms)) {
void *p;
p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
@@ -421,8 +457,19 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
*/
return -ENOMEM;
}
- } else if (vmemmap_check_pmd(pmd, node, addr, next))
+ } else if (vmemmap_check_pmd(pmd, node, addr, next)) {
+ const struct mem_section *start_ms;
+ unsigned long align = max(1UL << section_order(ms), PAGES_PER_SECTION);
+
+ /* HVO-covered sections must not use PMD mappings. */
+ start_ms = __pfn_to_section(ALIGN_DOWN(pfn, align));
+ if (!IS_ALIGNED(pfn, align) && section_vmemmap_optimizable(start_ms))
+ return -ENOTSUPP;
+
+ /* PMD mappings end HVO coverage for this section. */
+ section_set_order(ms, 0);
continue;
+ }
if (vmemmap_populate_basepages(addr, next, node, altmap))
return -ENOMEM;
}
@@ -626,36 +673,6 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
}
}
-static int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
-{
- const struct mem_section *ms = __pfn_to_section(pfn);
- const unsigned int order = pgmap ? pgmap->vmemmap_shift : section_order(ms);
- const unsigned long pages_per_compound = 1UL << order;
- unsigned int vmemmap_pages = OPTIMIZED_FOLIO_VMEMMAP_PAGES;
-
- VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
- VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
-
- if (vmemmap_can_optimize(altmap, pgmap))
- vmemmap_pages = VMEMMAP_RESERVE_NR;
-
- if (!vmemmap_can_optimize(altmap, pgmap) && !section_vmemmap_optimizable(ms))
- return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
-
- if (order < PFN_SECTION_SHIFT) {
- VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
- return vmemmap_pages * nr_pages / pages_per_compound;
- }
-
- VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
-
- if (IS_ALIGNED(pfn, pages_per_compound))
- return vmemmap_pages;
-
- return 0;
-}
-
static struct page * __meminit populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
diff --git a/mm/sparse.c b/mm/sparse.c
index 9457a4d6a6fc..3e96478a63e0 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -284,6 +284,36 @@ static void __init sparse_usage_fini(void)
sparse_usagebuf = sparse_usagebuf_end = NULL;
}
+int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
+ struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+{
+ const struct mem_section *ms = __pfn_to_section(pfn);
+ const unsigned int order = pgmap ? pgmap->vmemmap_shift : section_order(ms);
+ const unsigned long pages_per_compound = 1UL << order;
+ unsigned int vmemmap_pages = OPTIMIZED_FOLIO_VMEMMAP_PAGES;
+
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
+ VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
+
+ if (vmemmap_can_optimize(altmap, pgmap))
+ vmemmap_pages = VMEMMAP_RESERVE_NR;
+
+ if (!vmemmap_can_optimize(altmap, pgmap) && !section_vmemmap_optimizable(ms))
+ return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
+
+ if (order < PFN_SECTION_SHIFT) {
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
+ return vmemmap_pages * nr_pages / pages_per_compound;
+ }
+
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
+
+ if (IS_ALIGNED(pfn, pages_per_compound))
+ return vmemmap_pages;
+
+ return 0;
+}
+
/*
* Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end)
* And number of present sections in this node is map_count.
@@ -314,8 +344,8 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
nid, NULL, NULL);
if (!map)
panic("Failed to allocate memmap for section %lu\n", pnum);
- memmap_boot_pages_add(DIV_ROUND_UP(PAGES_PER_SECTION * sizeof(struct page),
- PAGE_SIZE));
+ memmap_boot_pages_add(section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION,
+ NULL, NULL));
sparse_init_early_section(nid, map, pnum, 0);
}
}
--
2.54.0
* [PATCH v2 28/69] mm/hugetlb: Use generic vmemmap optimization macros
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (26 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 27/69] mm/sparse-vmemmap: Support section-based vmemmap optimization Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 29/69] mm/sparse: Mark memblocks present earlier Muchun Song
` (19 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Vmemmap optimization is no longer hugetlb-specific, so the remaining
hugetlb-local reserve macros are redundant.
Replace them with the generic definitions to remove duplication and keep
the hugetlb vmemmap code aligned with the common optimization macros.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 4 ++--
mm/hugetlb_vmemmap.c | 14 +++++++-------
mm/hugetlb_vmemmap.h | 9 +--------
3 files changed, 10 insertions(+), 17 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 53448b05ca11..8debe5c5abce 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3222,7 +3222,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
* be no contention.
*/
hugetlb_folio_init_tail_vmemmap(folio, h,
- HUGETLB_VMEMMAP_RESERVE_PAGES,
+ OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES,
pages_per_huge_page(h));
}
hugetlb_bootmem_init_migratetype(folio, h);
@@ -3261,7 +3261,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
WARN_ON(folio_ref_count(folio) != 1);
hugetlb_folio_init_vmemmap(folio, h,
- HUGETLB_VMEMMAP_RESERVE_PAGES);
+ OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES);
init_new_hugetlb_folio(folio);
if (hugetlb_bootmem_page_prehvo(m))
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index e9906d32a64c..4367118f8f57 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -407,7 +407,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
vmemmap_start = (unsigned long)&folio->page;
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
- vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
+ vmemmap_start += OPTIMIZED_FOLIO_VMEMMAP_SIZE;
/*
* The pages which the vmemmap virtual address range [@vmemmap_start,
@@ -637,10 +637,10 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
spfn = (unsigned long)&folio->page;
epfn = spfn + hugetlb_vmemmap_size(h);
vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
- HUGETLB_VMEMMAP_RESERVE_SIZE);
+ OPTIMIZED_FOLIO_VMEMMAP_SIZE);
register_page_bootmem_memmap(pfn_to_section_nr(folio_pfn(folio)),
&folio->page,
- HUGETLB_VMEMMAP_RESERVE_PAGES);
+ OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES);
continue;
}
@@ -779,9 +779,9 @@ void __init hugetlb_vmemmap_init_early(int nid)
zone = pfn_to_zone(nid, pfn);
if (vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
- zone, HUGETLB_VMEMMAP_RESERVE_SIZE))
+ zone, OPTIMIZED_FOLIO_VMEMMAP_SIZE))
panic("Failed to allocate memmap for HugeTLB page\n");
- memmap_boot_pages_add(DIV_ROUND_UP(HUGETLB_VMEMMAP_RESERVE_SIZE, PAGE_SIZE));
+ memmap_boot_pages_add(OPTIMIZED_FOLIO_VMEMMAP_PAGES);
pnum = pfn_to_section_nr(pfn);
ns = psize / section_size;
@@ -826,8 +826,8 @@ static int __init hugetlb_vmemmap_init(void)
{
const struct hstate *h;
- /* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
- BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
+ /* OPTIMIZED_FOLIO_VMEMMAP_SIZE should cover all used struct pages */
+ BUILD_BUG_ON(__NR_USED_SUBPAGE > OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES);
for_each_hstate(h) {
if (hugetlb_vmemmap_optimizable(h)) {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 7ac49c52457d..66e11893d076 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -12,13 +12,6 @@
#include <linux/io.h>
#include <linux/memblock.h>
-/*
- * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
- * Documentation/mm/vmemmap_dedup.rst.
- */
-#define HUGETLB_VMEMMAP_RESERVE_SIZE PAGE_SIZE
-#define HUGETLB_VMEMMAP_RESERVE_PAGES (HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page))
-
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);
long hugetlb_vmemmap_restore_folios(const struct hstate *h,
@@ -43,7 +36,7 @@ static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
*/
static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
{
- int size = hugetlb_vmemmap_size(h) - HUGETLB_VMEMMAP_RESERVE_SIZE;
+ int size = hugetlb_vmemmap_size(h) - OPTIMIZED_FOLIO_VMEMMAP_SIZE;
if (!is_power_of_2(sizeof(struct page)))
return 0;
--
2.54.0
* [PATCH v2 29/69] mm/sparse: Mark memblocks present earlier
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (27 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 28/69] mm/hugetlb: Use generic vmemmap optimization macros Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 30/69] mm/hugetlb: Switch HugeTLB to section-based vmemmap optimization Muchun Song
` (18 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Later patches need struct mem_section entries to be available before
HugeTLB bootmem allocation starts, so the section metadata can be set up
at that stage.
Move the memblock-based section present marking out of sparse_init() and
call it earlier from mm_core_init_early(). Rename the helper to
sparse_memblocks_present() while doing so.
This ensures the sparsemem section metadata is available before the
early HugeTLB setup path runs.
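For orientation, the resulting early ordering looks roughly like this
(my summary of the hunks below, abbreviated):

	mm_core_init_early()
		sparse_memblocks_present()	/* sections marked present here now */
		free_area_init()
		hugetlb_cma_reserve()
		/* ... HugeTLB bootmem allocation can rely on mem_section entries ... */

	sparse_init()				/* no longer calls memblocks_present() */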
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 2 ++
mm/mm_init.c | 1 +
mm/sparse.c | 4 +---
3 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 2defdef1aedf..bf30617c78d8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -962,6 +962,7 @@ void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
* mm/sparse.c
*/
#ifdef CONFIG_SPARSEMEM
+void sparse_memblocks_present(void);
void sparse_init(void);
int sparse_index_init(unsigned long section_nr, int nid);
@@ -999,6 +1000,7 @@ static inline void __section_mark_present(struct mem_section *ms,
int section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
#else
+static inline void sparse_memblocks_present(void) {}
static inline void sparse_init(void) {}
#endif /* CONFIG_SPARSEMEM */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 3aaee1cf7bf0..6723c604eefd 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2693,6 +2693,7 @@ void __init __weak mem_init(void)
void __init mm_core_init_early(void)
{
+ sparse_memblocks_present();
free_area_init();
hugetlb_cma_reserve();
diff --git a/mm/sparse.c b/mm/sparse.c
index 3e96478a63e0..33e89bf1ec0c 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -195,7 +195,7 @@ static void __init memory_present(int nid, unsigned long start, unsigned long en
* This is a convenience function that is useful to mark all of the systems
* memory as present during initialization.
*/
-static void __init memblocks_present(void)
+void __init sparse_memblocks_present(void)
{
unsigned long start, end;
int i, nid;
@@ -361,8 +361,6 @@ void __init sparse_init(void)
unsigned long pnum_end, pnum_begin, map_count = 1;
int nid_begin;
- memblocks_present();
-
if (compound_info_has_mask()) {
VM_WARN_ON_ONCE(!IS_ALIGNED((unsigned long) pfn_to_page(0),
MAX_FOLIO_VMEMMAP_ALIGN));
--
2.54.0
* [PATCH v2 30/69] mm/hugetlb: Switch HugeTLB to section-based vmemmap optimization
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (28 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 29/69] mm/sparse: Mark memblocks present earlier Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:04 ` [PATCH v2 31/69] mm/sparse: Remove section_map_size() Muchun Song
` (17 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
HugeTLB bootmem vmemmap optimization still carries its own early setup
path, including pre-populating optimized mappings before the generic
sparse-vmemmap code runs.
Now that section metadata records the compound page order, HugeTLB only
needs to mark the bootmem huge page range with that order. The generic
sparse-vmemmap population path can then allocate and map the shared tail
vmemmap pages without any HugeTLB-specific early population code.
Do that by setting the section order when a bootmem huge page is
allocated and dropping the dedicated pre-HVO helpers and related
special-casing.
This removes duplicate early setup logic and switches HugeTLB to the
section-based vmemmap optimization path.
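To make the eight-file diff below easier to follow, the resulting
bootmem flow is roughly (my summary, simplified; the exact call sites
are in the hunks):

	alloc_bootmem_huge_page()
		hugetlb_vmemmap_optimize_bootmem_page(m)
			section_set_order_range(pfn, pages_per_huge_page(h),
						huge_page_order(h));

	sparse_init()
		__populate_section_memmap()
			vmemmap_populate_hugepages() / vmemmap_pte_populate()
				/* sees the recorded section order and maps
				 * the shared tail vmemmap pages */

	gather_bootmem_prealloc_node()
		folio_set_hugetlb_vmemmap_optimized(folio);	/* optimized ranges */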
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 1 -
include/linux/mm.h | 3 -
include/linux/mmzone.h | 17 ++++++
mm/bootmem_info.c | 5 +-
mm/hugetlb.c | 26 ++-------
mm/hugetlb_vmemmap.c | 124 ++++++----------------------------------
mm/hugetlb_vmemmap.h | 13 ++---
mm/sparse-vmemmap.c | 29 ----------
8 files changed, 45 insertions(+), 173 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fd901bb3630c..dce8969961ea 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -171,7 +171,6 @@ struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio);
extern int movable_gigantic_pages __read_mostly;
extern int sysctl_hugetlb_shm_group __read_mostly;
-extern struct list_head huge_boot_pages[MAX_NUMNODES];
void hugetlb_struct_page_init(void);
void hugetlb_bootmem_alloc(void);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 31e27ff6a35f..f39f6fca6551 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4864,9 +4864,6 @@ int vmemmap_populate_hugepages(unsigned long start, unsigned long end,
int node, struct vmem_altmap *altmap);
int vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap);
-int vmemmap_populate_hvo(unsigned long start, unsigned long end,
- unsigned int order, struct zone *zone,
- unsigned long headsize);
void vmemmap_wrprotect_hvo(unsigned long start, unsigned long end, int node,
unsigned long headsize);
void vmemmap_populate_print_last(void);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bf4c40818b63..d6a5dd042c25 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2264,6 +2264,18 @@ static inline unsigned int section_order(const struct mem_section *section)
}
#endif
+static inline void section_set_order_range(unsigned long pfn, unsigned long nr_pages,
+ unsigned int order)
+{
+ unsigned long section_nr = pfn_to_section_nr(pfn);
+
+ if (!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION))
+ return;
+
+ for (unsigned long i = 0; i < nr_pages / PAGES_PER_SECTION; i++)
+ section_set_order(__nr_to_section(section_nr + i), order);
+}
+
static inline unsigned int pfn_to_section_order(unsigned long pfn)
{
return section_order(__pfn_to_section(pfn));
@@ -2417,6 +2429,11 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
#else
#define sparse_vmemmap_init_nid_early(_nid) do {} while (0)
#define pfn_in_present_section pfn_valid
+static inline void section_set_order_range(unsigned long pfn, unsigned long nr_pages,
+ unsigned int order)
+{
+}
+
static inline unsigned int pfn_to_section_order(unsigned long pfn)
{
return 0;
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
index 3d7675a3ae04..24f45d86ffb3 100644
--- a/mm/bootmem_info.c
+++ b/mm/bootmem_info.c
@@ -51,9 +51,8 @@ static void __init register_page_bootmem_info_section(unsigned long start_pfn)
section_nr = pfn_to_section_nr(start_pfn);
ms = __nr_to_section(section_nr);
- if (!preinited_vmemmap_section(ms))
- register_page_bootmem_memmap(section_nr, pfn_to_page(start_pfn),
- PAGES_PER_SECTION);
+ register_page_bootmem_memmap(section_nr, pfn_to_page(start_pfn),
+ PAGES_PER_SECTION);
usage = ms->usage;
page = virt_to_page(usage);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8debe5c5abce..080f130017e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -57,7 +57,7 @@ unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];
__initdata nodemask_t hugetlb_bootmem_nodes;
-__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
+static __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
/*
* Due to ordering constraints across the init code for various
@@ -3111,6 +3111,7 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
} else {
list_add_tail(&m->list, &huge_boot_pages[nid]);
m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+ hugetlb_vmemmap_optimize_bootmem_page(m);
/*
* Only initialize the head struct page in memmap_init_reserved_pages,
* rest of the struct pages will be initialized by the HugeTLB
@@ -3264,13 +3265,15 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES);
init_new_hugetlb_folio(folio);
- if (hugetlb_bootmem_page_prehvo(m))
+ if (hugetlb_bootmem_page_prehvo(m)) {
/*
* If pre-HVO was done, just set the
* flag, the HVO code will then skip
* this folio.
*/
folio_set_hugetlb_vmemmap_optimized(folio);
+ section_set_order_range(folio_pfn(folio), folio_nr_pages(folio), 0);
+ }
if (hugetlb_bootmem_page_earlycma(m))
folio_set_hugetlb_cma(folio);
@@ -3314,25 +3317,6 @@ void __init hugetlb_struct_page_init(void)
.max_threads = num_node_state(N_MEMORY),
.numa_aware = true,
};
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
- struct zone *zone;
-
- for_each_zone(zone) {
- for (int i = 0; i < NR_OPTIMIZABLE_FOLIO_ORDERS; i++) {
- struct page *tail, *p;
- unsigned int order;
-
- tail = zone->vmemmap_tails[i];
- if (!tail)
- continue;
-
- order = i + OPTIMIZABLE_FOLIO_MIN_ORDER;
- p = page_to_virt(tail);
- for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
- init_compound_tail(p + j, NULL, order, zone);
- }
- }
-#endif
padata_do_multithreaded(&job);
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4367118f8f57..730190390ba9 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -16,6 +16,7 @@
#include <linux/mmdebug.h>
#include <linux/pagewalk.h>
#include <linux/pgalloc.h>
+#include <linux/io.h>
#include <asm/tlbflush.h>
#include "hugetlb_vmemmap.h"
@@ -478,12 +479,8 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
return ret;
}
-/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
-static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *folio)
+static inline bool vmemmap_should_optimize(const struct hstate *h)
{
- if (folio_test_hugetlb_vmemmap_optimized(folio))
- return false;
-
if (!READ_ONCE(vmemmap_optimize_enabled))
return false;
@@ -493,6 +490,15 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
return true;
}
+/* Return true iff the vmemmap of a HugeTLB folio should and can be optimized. */
+static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *folio)
+{
+ if (folio_test_hugetlb_vmemmap_optimized(folio))
+ return false;
+
+ return vmemmap_should_optimize(h);
+}
+
static struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
{
const unsigned int idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
@@ -638,9 +644,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
epfn = spfn + hugetlb_vmemmap_size(h);
vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
OPTIMIZED_FOLIO_VMEMMAP_SIZE);
- register_page_bootmem_memmap(pfn_to_section_nr(folio_pfn(folio)),
- &folio->page,
- OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES);
continue;
}
@@ -706,111 +709,18 @@ void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head
__hugetlb_vmemmap_optimize_folios(h, folio_list, true);
}
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-
-/* Return true of a bootmem allocated HugeTLB page should be pre-HVO-ed */
-static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
-{
- unsigned long section_size, psize, pmd_vmemmap_size;
- phys_addr_t paddr;
-
- if (!READ_ONCE(vmemmap_optimize_enabled))
- return false;
-
- if (!hugetlb_vmemmap_optimizable(m->hstate))
- return false;
-
- psize = huge_page_size(m->hstate);
- paddr = virt_to_phys(m);
-
- /*
- * Pre-HVO only works if the bootmem huge page
- * is aligned to the section size.
- */
- section_size = (1UL << PA_SECTION_SHIFT);
- if (!IS_ALIGNED(paddr, section_size) ||
- !IS_ALIGNED(psize, section_size))
- return false;
-
- /*
- * The pre-HVO code does not deal with splitting PMDS,
- * so the bootmem page must be aligned to the number
- * of base pages that can be mapped with one vmemmap PMD.
- */
- pmd_vmemmap_size = (PMD_SIZE / (sizeof(struct page))) << PAGE_SHIFT;
- if (!IS_ALIGNED(paddr, pmd_vmemmap_size) ||
- !IS_ALIGNED(psize, pmd_vmemmap_size))
- return false;
-
- return true;
-}
-
-static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn);
-
-/*
- * Initialize memmap section for a gigantic page, HVO-style.
- */
-void __init hugetlb_vmemmap_init_early(int nid)
+void __init hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
{
- unsigned long psize, paddr, section_size;
- unsigned long ns, i, pnum, pfn, nr_pages;
- unsigned long start, end;
- struct huge_bootmem_page *m = NULL;
- void *map;
+ struct hstate *h = m->hstate;
+ unsigned long pfn = PHYS_PFN(__pa(m));
- if (!READ_ONCE(vmemmap_optimize_enabled))
+ if (!vmemmap_should_optimize(h))
return;
- section_size = (1UL << PA_SECTION_SHIFT);
-
- list_for_each_entry(m, &huge_boot_pages[nid], list) {
- struct zone *zone;
-
- if (!vmemmap_should_optimize_bootmem_page(m))
- continue;
-
- nr_pages = pages_per_huge_page(m->hstate);
- psize = nr_pages << PAGE_SHIFT;
- paddr = virt_to_phys(m);
- pfn = PHYS_PFN(paddr);
- map = pfn_to_page(pfn);
- start = (unsigned long)map;
- end = start + hugetlb_vmemmap_size(m->hstate);
- zone = pfn_to_zone(nid, pfn);
-
- if (vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
- zone, OPTIMIZED_FOLIO_VMEMMAP_SIZE))
- panic("Failed to allocate memmap for HugeTLB page\n");
- memmap_boot_pages_add(OPTIMIZED_FOLIO_VMEMMAP_PAGES);
-
- pnum = pfn_to_section_nr(pfn);
- ns = psize / section_size;
-
- for (i = 0; i < ns; i++) {
- sparse_init_early_section(nid, map, pnum,
- SECTION_IS_VMEMMAP_PREINIT);
- map += section_map_size();
- pnum++;
- }
-
+ section_set_order_range(pfn, pages_per_huge_page(h), huge_page_order(h));
+ if (section_vmemmap_optimizable(__pfn_to_section(pfn)))
m->flags |= HUGE_BOOTMEM_HVO;
- }
-}
-
-static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
-{
- struct zone *zone;
- enum zone_type zone_type;
-
- for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
- zone = &NODE_DATA(nid)->node_zones[zone_type];
- if (zone_spans_pfn(zone, pfn))
- return zone;
- }
-
- return NULL;
}
-#endif
static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
{
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 66e11893d076..0d8c88997066 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -9,8 +9,6 @@
#ifndef _LINUX_HUGETLB_VMEMMAP_H
#define _LINUX_HUGETLB_VMEMMAP_H
#include <linux/hugetlb.h>
-#include <linux/io.h>
-#include <linux/memblock.h>
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);
@@ -20,10 +18,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-void hugetlb_vmemmap_init_early(int nid);
-#endif
-
+void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m);
static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
{
@@ -69,13 +64,13 @@ static inline void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h,
{
}
-static inline void hugetlb_vmemmap_init_early(int nid)
+static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
{
+ return 0;
}
-static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
+static inline void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
{
- return 0;
}
#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 69ae40692e41..b86634903fc0 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -32,7 +32,6 @@
#include <asm/dma.h>
#include <asm/tlbflush.h>
-#include "hugetlb_vmemmap.h"
#include "internal.h"
/*
@@ -372,33 +371,6 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
return tail;
}
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
- unsigned int order, struct zone *zone,
- unsigned long headsize)
-{
- unsigned long maddr;
- struct page *tail;
- pte_t *pte;
- int node = zone_to_nid(zone);
-
- tail = vmemmap_get_tail(order, zone);
- if (!tail)
- return -ENOMEM;
-
- for (maddr = addr; maddr < addr + headsize; maddr += PAGE_SIZE) {
- pte = vmemmap_populate_address(maddr, node, NULL, -1);
- if (!pte)
- return -ENOMEM;
- }
-
- /*
- * Reuse the last page struct page mapped above for the rest.
- */
- return vmemmap_populate_range(maddr, end, node, NULL, page_to_pfn(tail));
-}
-#endif
-
void __weak __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
unsigned long addr, unsigned long next)
{
@@ -600,7 +572,6 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
*/
void __init sparse_vmemmap_init_nid_early(int nid)
{
- hugetlb_vmemmap_init_early(nid);
}
#endif
--
2.54.0
* [PATCH v2 31/69] mm/sparse: Remove section_map_size()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (29 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 30/69] mm/hugetlb: Switch HugeTLB to section-based vmemmap optimization Muchun Song
@ 2026-05-13 13:04 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 32/69] mm/mm_init: Factor out pfn_to_zone() as a shared helper Muchun Song
` (16 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
section_map_size() is no longer needed as a shared helper.
After the sparse-vmemmap changes, its only remaining user is the
!CONFIG_SPARSEMEM_VMEMMAP path in __populate_section_memmap(), which can
compute the size inline with PAGE_ALIGN(sizeof(struct page) *
PAGES_PER_SECTION).
Remove section_map_size() and inline the remaining calculation.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mm.h | 1 -
mm/sparse.c | 15 ++-------------
2 files changed, 2 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f39f6fca6551..fef39be8acd2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4845,7 +4845,6 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
}
#endif
-unsigned long section_map_size(void);
struct page * __populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap);
diff --git a/mm/sparse.c b/mm/sparse.c
index 33e89bf1ec0c..47349f6f463f 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -222,23 +222,12 @@ size_t mem_section_usage_size(void)
return sizeof(struct mem_section_usage) + usemap_size();
}
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-unsigned long __init section_map_size(void)
-{
- return ALIGN(sizeof(struct page) * PAGES_PER_SECTION, PMD_SIZE);
-}
-
-#else
-unsigned long __init section_map_size(void)
-{
- return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
-}
-
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
struct page __init *__populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
{
- unsigned long size = section_map_size();
+ unsigned long size = PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
return memmap_alloc(size, size, __pa(MAX_DMA_ADDRESS), nid, false);
}
--
2.54.0
* [PATCH v2 32/69] mm/mm_init: Factor out pfn_to_zone() as a shared helper
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (30 preceding siblings ...)
2026-05-13 13:04 ` [PATCH v2 31/69] mm/sparse: Remove section_map_size() Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 33/69] mm/sparse: Remove SPARSEMEM_VMEMMAP_PREINIT Muchun Song
` (15 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
pfn_to_zone() in sparse-vmemmap.c duplicates the zone lookup logic in
__init_page_from_nid().
Move it to mm_init.c, declare it in mm/internal.h, and reuse it from
__init_page_from_nid() instead of open-coding the zone walk there.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 1 +
mm/mm_init.c | 28 ++++++++++++++++------------
mm/sparse-vmemmap.c | 14 --------------
3 files changed, 17 insertions(+), 26 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index bf30617c78d8..18276cd15622 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1354,6 +1354,7 @@ static inline bool deferred_pages_enabled(void)
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+struct zone *pfn_to_zone(unsigned long pfn, int nid);
void init_deferred_page(unsigned long pfn, int nid);
enum mminit_level {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 6723c604eefd..35c99e5c215c 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -686,25 +686,29 @@ static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
}
}
+struct zone __meminit *pfn_to_zone(unsigned long pfn, int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ for (enum zone_type zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+ struct zone *zone = &pgdat->node_zones[zone_type];
+
+ if (zone_spans_pfn(zone, pfn))
+ return zone;
+ }
+
+ return NULL;
+}
+
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* Initialize a reserved page unconditionally, finding its zone first.
*/
static void __meminit __init_page_from_nid(unsigned long pfn, int nid)
{
- pg_data_t *pgdat;
- int zid;
-
- pgdat = NODE_DATA(nid);
-
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
- struct zone *zone = &pgdat->node_zones[zid];
-
- if (zone_spans_pfn(zone, pfn))
- break;
- }
- __init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+ struct zone *zone = pfn_to_zone(pfn, nid);
+ __init_single_page(pfn_to_page(pfn), pfn, zone_idx(zone), nid);
if (pageblock_aligned(pfn)) {
enum migratetype mt =
kho_scratch_migratetype(pfn, MIGRATE_MOVABLE);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index b86634903fc0..f1c3b2d0f23c 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -138,20 +138,6 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
start, end - 1);
}
-static struct zone __meminit *pfn_to_zone(unsigned long pfn, int nid)
-{
- pg_data_t *pgdat = NODE_DATA(nid);
-
- for (enum zone_type zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
- struct zone *zone = &pgdat->node_zones[zone_type];
-
- if (zone_spans_pfn(zone, pfn))
- return zone;
- }
-
- return NULL;
-}
-
static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone);
static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
--
2.54.0
* [PATCH v2 33/69] mm/sparse: Remove SPARSEMEM_VMEMMAP_PREINIT
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (31 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 32/69] mm/mm_init: Factor out pfn_to_zone() as a shared helper Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 34/69] mm/sparse: Inline usemap allocation into sparse_init_nid() Muchun Song
` (14 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
SPARSEMEM_VMEMMAP_PREINIT was only there to support HugeTLB's early
vmemmap optimization setup.
Now that HugeTLB bootmem vmemmap optimization uses the common
section-based sparse-vmemmap path, sparsemem no longer needs a separate
pre-initialization mechanism.
Remove the Kconfig symbols, section flag, and empty sparse-vmemmap early
hook, and always initialize present sections through the normal sparse
setup path.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/x86/Kconfig | 1 -
fs/Kconfig | 1 -
include/linux/mmzone.h | 25 -------------------------
mm/Kconfig | 5 -----
mm/sparse-vmemmap.c | 13 -------------
mm/sparse.c | 23 ++++++++---------------
6 files changed, 8 insertions(+), 60 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f24810015234..ed2aa0e4c472 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -148,7 +148,6 @@ config X86
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if X86_64
select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
- select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
select ARCH_WANTS_THP_SWAP if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
diff --git a/fs/Kconfig b/fs/Kconfig
index cf6ae64776e6..ccb9dd480523 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -278,7 +278,6 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
def_bool HUGETLB_PAGE
depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP
- select SPARSEMEM_VMEMMAP_PREINIT if ARCH_WANT_HUGETLB_VMEMMAP_PREINIT
config HUGETLB_PMD_PAGE_TABLE_SHARING
def_bool HUGETLB_PAGE
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d6a5dd042c25..b9baef8cca91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2092,9 +2092,6 @@ enum {
SECTION_IS_EARLY_BIT,
#ifdef CONFIG_ZONE_DEVICE
SECTION_TAINT_ZONE_DEVICE_BIT,
-#endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
- SECTION_IS_VMEMMAP_PREINIT_BIT,
#endif
SECTION_MAP_LAST_BIT,
};
@@ -2106,9 +2103,6 @@ enum {
#ifdef CONFIG_ZONE_DEVICE
#define SECTION_TAINT_ZONE_DEVICE BIT(SECTION_TAINT_ZONE_DEVICE_BIT)
#endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-#define SECTION_IS_VMEMMAP_PREINIT BIT(SECTION_IS_VMEMMAP_PREINIT_BIT)
-#endif
#define SECTION_MAP_MASK (~(BIT(SECTION_MAP_LAST_BIT) - 1))
#define SECTION_NID_SHIFT SECTION_MAP_LAST_BIT
@@ -2163,24 +2157,6 @@ static inline int online_device_section(const struct mem_section *section)
}
#endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-static inline int preinited_vmemmap_section(const struct mem_section *section)
-{
- return (section &&
- (section->section_mem_map & SECTION_IS_VMEMMAP_PREINIT));
-}
-
-void sparse_vmemmap_init_nid_early(int nid);
-#else
-static inline int preinited_vmemmap_section(const struct mem_section *section)
-{
- return 0;
-}
-static inline void sparse_vmemmap_init_nid_early(int nid)
-{
-}
-#endif
-
static inline int online_section_nr(unsigned long nr)
{
return online_section(__nr_to_section(nr));
@@ -2427,7 +2403,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
#endif
#else
-#define sparse_vmemmap_init_nid_early(_nid) do {} while (0)
#define pfn_in_present_section pfn_valid
static inline void section_set_order_range(unsigned long pfn, unsigned long nr_pages,
unsigned int order)
diff --git a/mm/Kconfig b/mm/Kconfig
index bb0202cf8b15..c26d2d2050d5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,8 +410,6 @@ config SPARSEMEM_VMEMMAP
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.
-config SPARSEMEM_VMEMMAP_PREINIT
- bool
#
# Select this config option from the architecture Kconfig, if it is preferred
# to enable the feature of HugeTLB/dev_dax vmemmap optimization.
@@ -422,9 +420,6 @@ config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
bool
-config ARCH_WANT_HUGETLB_VMEMMAP_PREINIT
- bool
-
config HAVE_MEMBLOCK_PHYS_MAP
bool
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index f1c3b2d0f23c..dde4486195ad 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -548,19 +548,6 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
return pfn_to_page(pfn);
}
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-/*
- * This is called just before initializing sections for a NUMA node.
- * Any special initialization that needs to be done before the
- * generic initialization can be done from here. Sections that
- * are initialized in hooks called from here will be skipped by
- * the generic initialization.
- */
-void __init sparse_vmemmap_init_nid_early(int nid)
-{
-}
-#endif
-
static void subsection_mask_set(unsigned long *map, unsigned long pfn,
unsigned long nr_pages)
{
diff --git a/mm/sparse.c b/mm/sparse.c
index 47349f6f463f..eab37504819d 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -316,27 +316,20 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
if (sparse_usage_init(nid, map_count))
panic("Failed to allocate usemap for node %d\n", nid);
- sparse_vmemmap_init_nid_early(nid);
-
for_each_present_section_nr(pnum_begin, pnum) {
- struct mem_section *ms;
unsigned long pfn = section_nr_to_pfn(pnum);
+ struct page *map;
if (pnum >= pnum_end)
break;
- ms = __nr_to_section(pnum);
- if (!preinited_vmemmap_section(ms)) {
- struct page *map;
-
- map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
- nid, NULL, NULL);
- if (!map)
- panic("Failed to allocate memmap for section %lu\n", pnum);
- memmap_boot_pages_add(section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION,
- NULL, NULL));
- sparse_init_early_section(nid, map, pnum, 0);
- }
+ map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
+ nid, NULL, NULL);
+ if (!map)
+ panic("Failed to allocate memmap for section %lu\n", pnum);
+ memmap_boot_pages_add(section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION,
+ NULL, NULL));
+ sparse_init_early_section(nid, map, pnum, 0);
}
sparse_usage_fini();
}
--
2.54.0
* [PATCH v2 34/69] mm/sparse: Inline usemap allocation into sparse_init_nid()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (32 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 33/69] mm/sparse: Remove SPARSEMEM_VMEMMAP_PREINIT Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 35/69] mm/hugetlb: Remove HUGE_BOOTMEM_HVO Muchun Song
` (13 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
After removing SPARSEMEM_VMEMMAP_PREINIT, sparse_init_nid() no longer
needs the transient sparse_usagebuf state and its helper wrappers.
Allocate the usemap buffer directly in sparse_init_nid(), pass it to
sparse_init_one_section(), and drop sparse_usage_init(),
sparse_usage_fini(), and sparse_init_early_section().
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 3 ---
mm/sparse.c | 46 +++++++-----------------------------------
2 files changed, 7 insertions(+), 42 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b9baef8cca91..a60fd5785fa5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2265,9 +2265,6 @@ static inline bool section_vmemmap_optimizable(const struct mem_section *section
return section_order(section) >= OPTIMIZABLE_FOLIO_MIN_ORDER;
}
-void sparse_init_early_section(int nid, struct page *map, unsigned long pnum,
- unsigned long flags);
-
#ifndef CONFIG_HAVE_ARCH_PFN_VALID
/**
* pfn_valid - check if there is a valid memory map entry for a PFN
diff --git a/mm/sparse.c b/mm/sparse.c
index eab37504819d..54c38ea08190 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -237,42 +237,6 @@ void __weak __meminit vmemmap_populate_print_last(void)
{
}
-static void *sparse_usagebuf __meminitdata;
-static void *sparse_usagebuf_end __meminitdata;
-
-/*
- * Helper function that is used for generic section initialization, and
- * can also be used by any hooks added above.
- */
-void __init sparse_init_early_section(int nid, struct page *map,
- unsigned long pnum, unsigned long flags)
-{
- BUG_ON(!sparse_usagebuf || sparse_usagebuf >= sparse_usagebuf_end);
- sparse_init_one_section(__nr_to_section(pnum), pnum, map,
- sparse_usagebuf, SECTION_IS_EARLY | flags);
- sparse_usagebuf = (void *)sparse_usagebuf + mem_section_usage_size();
-}
-
-static int __init sparse_usage_init(int nid, unsigned long map_count)
-{
- unsigned long size;
-
- size = mem_section_usage_size() * map_count;
- sparse_usagebuf = memblock_alloc_node(size, SMP_CACHE_BYTES, nid);
- if (!sparse_usagebuf) {
- sparse_usagebuf_end = NULL;
- return -ENOMEM;
- }
-
- sparse_usagebuf_end = sparse_usagebuf + size;
- return 0;
-}
-
-static void __init sparse_usage_fini(void)
-{
- sparse_usagebuf = sparse_usagebuf_end = NULL;
-}
-
int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
{
@@ -312,8 +276,11 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
unsigned long map_count)
{
unsigned long pnum;
+ struct mem_section_usage *usage;
- if (sparse_usage_init(nid, map_count))
+ usage = memblock_alloc_node(map_count * mem_section_usage_size(),
+ SMP_CACHE_BYTES, nid);
+ if (!usage)
panic("Failed to allocate usemap for node %d\n", nid);
for_each_present_section_nr(pnum_begin, pnum) {
@@ -329,9 +296,10 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
panic("Failed to allocate memmap for section %lu\n", pnum);
memmap_boot_pages_add(section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION,
NULL, NULL));
- sparse_init_early_section(nid, map, pnum, 0);
+ sparse_init_one_section(__nr_to_section(pnum), pnum, map, usage,
+ SECTION_IS_EARLY);
+ usage = (void *)usage + mem_section_usage_size();
}
- sparse_usage_fini();
}
/*
--
2.54.0
* [PATCH v2 35/69] mm/hugetlb: Remove HUGE_BOOTMEM_HVO
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (33 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 34/69] mm/sparse: Inline usemap allocation into sparse_init_nid() Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 36/69] mm/hugetlb: Remove HUGE_BOOTMEM_CMA Muchun Song
` (12 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
The HUGE_BOOTMEM_HVO flag tracked whether a bootmem huge page had
already gone through the old early vmemmap optimization path.
Now that HugeTLB uses section-based vmemmap optimization, that state is
already reflected in the section order.
Remove HUGE_BOOTMEM_HVO and its helper, and use the section state
directly when deciding whether to mark a folio as vmemmap-optimized.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 5 ++---
include/linux/mmzone.h | 7 ++++++-
mm/hugetlb.c | 12 +-----------
mm/hugetlb_vmemmap.c | 2 --
4 files changed, 9 insertions(+), 17 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index dce8969961ea..18af8f304b95 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -695,9 +695,8 @@ struct huge_bootmem_page {
unsigned long flags;
};
-#define HUGE_BOOTMEM_HVO 0x0001
-#define HUGE_BOOTMEM_ZONES_VALID 0x0002
-#define HUGE_BOOTMEM_CMA 0x0004
+#define HUGE_BOOTMEM_ZONES_VALID BIT(0)
+#define HUGE_BOOTMEM_CMA BIT(1)
int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a60fd5785fa5..9b87d798a365 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -117,6 +117,11 @@
#define NR_OPTIMIZABLE_FOLIO_ORDERS \
(__NR_OPTIMIZABLE_FOLIO_ORDERS > 0 ? __NR_OPTIMIZABLE_FOLIO_ORDERS : 0)
+static inline bool order_vmemmap_optimizable(unsigned int order)
+{
+ return order >= OPTIMIZABLE_FOLIO_MIN_ORDER;
+}
+
enum migratetype {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
@@ -2262,7 +2267,7 @@ static inline bool section_vmemmap_optimizable(const struct mem_section *section
if (!is_power_of_2(sizeof(struct page)))
return false;
- return section_order(section) >= OPTIMIZABLE_FOLIO_MIN_ORDER;
+ return order_vmemmap_optimizable(section_order(section));
}
#ifndef CONFIG_HAVE_ARCH_PFN_VALID
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 080f130017e3..abd79bb85b1c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3169,11 +3169,6 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
prep_compound_head(&folio->page, huge_page_order(h));
}
-static bool __init hugetlb_bootmem_page_prehvo(struct huge_bootmem_page *m)
-{
- return m->flags & HUGE_BOOTMEM_HVO;
-}
-
static bool __init hugetlb_bootmem_page_earlycma(struct huge_bootmem_page *m)
{
return m->flags & HUGE_BOOTMEM_CMA;
@@ -3265,12 +3260,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES);
init_new_hugetlb_folio(folio);
- if (hugetlb_bootmem_page_prehvo(m)) {
- /*
- * If pre-HVO was done, just set the
- * flag, the HVO code will then skip
- * this folio.
- */
+ if (order_vmemmap_optimizable(pfn_to_section_order(folio_pfn(folio)))) {
folio_set_hugetlb_vmemmap_optimized(folio);
section_set_order_range(folio_pfn(folio), folio_nr_pages(folio), 0);
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 730190390ba9..66362e553870 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -718,8 +718,6 @@ void __init hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
return;
section_set_order_range(pfn, pages_per_huge_page(h), huge_page_order(h));
- if (section_vmemmap_optimizable(__pfn_to_section(pfn)))
- m->flags |= HUGE_BOOTMEM_HVO;
}
static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
--
2.54.0
* [PATCH v2 36/69] mm/hugetlb: Remove HUGE_BOOTMEM_CMA
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (34 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 35/69] mm/hugetlb: Remove HUGE_BOOTMEM_HVO Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 37/69] mm/sparse-vmemmap: Factor out shared vmemmap page allocation Muchun Song
` (11 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Track early CMA hugetlb pages via hugetlb_early_cma() on the hstate
instead of storing a redundant bootmem flag. This removes the
now-redundant hugetlb_bootmem_page_earlycma() helper and keeps the
bootmem metadata minimal.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 1 -
mm/hugetlb.c | 9 ++-------
2 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 18af8f304b95..82dbb9ebead8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -696,7 +696,6 @@ struct huge_bootmem_page {
};
#define HUGE_BOOTMEM_ZONES_VALID BIT(0)
-#define HUGE_BOOTMEM_CMA BIT(1)
int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index abd79bb85b1c..74770c1648fc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3093,7 +3093,7 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
*/
INIT_LIST_HEAD(&m->list);
m->hstate = h;
- m->flags = hugetlb_early_cma(h) ? HUGE_BOOTMEM_CMA : 0;
+ m->flags = 0;
/* CMA pages: zone-crossing is validated in hugetlb_cma_reserve(). */
if (!hugetlb_early_cma(h) &&
@@ -3169,11 +3169,6 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
prep_compound_head(&folio->page, huge_page_order(h));
}
-static bool __init hugetlb_bootmem_page_earlycma(struct huge_bootmem_page *m)
-{
- return m->flags & HUGE_BOOTMEM_CMA;
-}
-
/*
* memblock-allocated pageblocks might not have the migrate type set
* if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
@@ -3265,7 +3260,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
section_set_order_range(folio_pfn(folio), folio_nr_pages(folio), 0);
}
- if (hugetlb_bootmem_page_earlycma(m))
+ if (hugetlb_early_cma(h))
folio_set_hugetlb_cma(folio);
list_add(&folio->lru, &folio_list);
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 37/69] mm/sparse-vmemmap: Factor out shared vmemmap page allocation
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (35 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 36/69] mm/hugetlb: Remove HUGE_BOOTMEM_CMA Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 38/69] mm/sparse-vmemmap: Introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION Muchun Song
` (10 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
HugeTLB and sparse-vmemmap each have their own helper to allocate the
shared tail page used by vmemmap optimization.
Factor that logic into a common vmemmap_shared_tail_page() helper in
sparse-vmemmap.c. It allocates the page through
vmemmap_alloc_block_zero(), initializes the tail struct pages, and uses
cmpxchg() to install the per-zone shared page.
This removes duplicate allocation logic while still handling both the
early boot and runtime paths through the same helper.
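Condensed, the install logic works like this (sketch of the new helper
in the mm/sparse-vmemmap.c hunk below; bounds checks and error handling
trimmed):
        page = READ_ONCE(zone->vmemmap_tails[idx]);
        if (page)
                return page;            /* fast path: already installed */
        addr = vmemmap_alloc_block_zero(PAGE_SIZE, zone_to_nid(zone));
        for (i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
                init_compound_tail((struct page *)addr + i, NULL, order, zone);
        page = virt_to_page(addr);
        if (cmpxchg(&zone->vmemmap_tails[idx], NULL, page) != NULL) {
                /* Lost the race: drop our copy, use the installed page. */
                if (slab_is_available())
                        __free_page(page);
                else
                        memblock_free(page_to_virt(page), PAGE_SIZE);
                page = READ_ONCE(zone->vmemmap_tails[idx]);
        }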
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mm.h | 1 +
mm/hugetlb_vmemmap.c | 28 +-----------------
mm/sparse-vmemmap.c | 67 ++++++++++++++++++--------------------------
3 files changed, 29 insertions(+), 67 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fef39be8acd2..5281f073230c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4866,6 +4866,7 @@ int vmemmap_populate(unsigned long start, unsigned long end, int node,
void vmemmap_wrprotect_hvo(unsigned long start, unsigned long end, int node,
unsigned long headsize);
void vmemmap_populate_print_last(void);
+struct page *vmemmap_shared_tail_page(unsigned int order, struct zone *zone);
#ifdef CONFIG_MEMORY_HOTPLUG
void vmemmap_free(unsigned long start, unsigned long end,
struct vmem_altmap *altmap);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 66362e553870..d24143dd6051 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -499,32 +499,6 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
return vmemmap_should_optimize(h);
}
-static struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
-{
- const unsigned int idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
- struct page *tail, *p;
- int node = zone_to_nid(zone);
-
- tail = READ_ONCE(zone->vmemmap_tails[idx]);
- if (likely(tail))
- return tail;
-
- tail = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
- if (!tail)
- return NULL;
-
- p = page_to_virt(tail);
- for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
- init_compound_tail(p + i, NULL, order, zone);
-
- if (cmpxchg(&zone->vmemmap_tails[idx], NULL, tail)) {
- __free_page(tail);
- tail = READ_ONCE(zone->vmemmap_tails[idx]);
- }
-
- return tail;
-}
-
static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
struct folio *folio,
struct list_head *vmemmap_pages,
@@ -541,7 +515,7 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
return ret;
nid = folio_nid(folio);
- vmemmap_tail = vmemmap_get_tail(h->order, folio_zone(folio));
+ vmemmap_tail = vmemmap_shared_tail_page(h->order, folio_zone(folio));
if (!vmemmap_tail)
return -ENOMEM;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index dde4486195ad..53a341fcde74 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -34,27 +34,13 @@
#include "internal.h"
-/*
- * Allocate a block of memory to be used to back the virtual memory map
- * or to back the page tables that are used to create the mapping.
- * Uses the main allocators if they are available, else bootmem.
- */
-
-static void * __ref __earlyonly_bootmem_alloc(int node,
- unsigned long size,
- unsigned long align,
- unsigned long goal)
-{
- return memmap_alloc(size, align, goal, node, false);
-}
-
-void * __meminit vmemmap_alloc_block(unsigned long size, int node)
+void __ref *vmemmap_alloc_block(unsigned long size, int node)
{
/* If the main allocator is up use that, fallback to bootmem. */
if (slab_is_available()) {
gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
int order = get_order(size);
- static bool warned __meminitdata;
+ static bool warned;
struct page *page;
page = alloc_pages_node(node, gfp_mask, order);
@@ -68,8 +54,7 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
}
return NULL;
} else
- return __earlyonly_bootmem_alloc(node, size, size,
- __pa(MAX_DMA_ADDRESS));
+ return memmap_alloc(size, size, __pa(MAX_DMA_ADDRESS), node, false);
}
static void * __meminit altmap_alloc_block_buf(unsigned long size,
@@ -138,8 +123,6 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
start, end - 1);
}
-static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone);
-
static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
struct vmem_altmap *altmap,
unsigned long ptpfn)
@@ -158,7 +141,7 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
if (WARN_ON_ONCE(!zone))
return NULL;
- page = vmemmap_get_tail(section_order(ms), zone);
+ page = vmemmap_shared_tail_page(section_order(ms), zone);
if (!page)
return NULL;
ptpfn = page_to_pfn(page);
@@ -190,7 +173,7 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
return pte;
}
-static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
+static void *vmemmap_alloc_block_zero(unsigned long size, int node)
{
void *p = vmemmap_alloc_block(size, node);
@@ -329,32 +312,36 @@ void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
}
}
-static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
+struct page __ref *vmemmap_shared_tail_page(unsigned int order, struct zone *zone)
{
- struct page *p, *tail;
- unsigned int idx;
- int node = zone_to_nid(zone);
+ void *addr;
+ struct page *page;
+ const unsigned int idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
- if (WARN_ON_ONCE(order < OPTIMIZABLE_FOLIO_MIN_ORDER))
- return NULL;
- if (WARN_ON_ONCE(order > MAX_FOLIO_ORDER))
+ if (WARN_ON_ONCE(idx >= ARRAY_SIZE(zone->vmemmap_tails)))
return NULL;
- idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
- tail = zone->vmemmap_tails[idx];
- if (tail)
- return tail;
+ page = READ_ONCE(zone->vmemmap_tails[idx]);
+ if (likely(page))
+ return page;
- p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
- if (!p)
+ addr = vmemmap_alloc_block_zero(PAGE_SIZE, zone_to_nid(zone));
+ if (!addr)
return NULL;
- for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
- init_compound_tail(p + i, NULL, order, zone);
- tail = virt_to_page(p);
- zone->vmemmap_tails[idx] = tail;
+ for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+ init_compound_tail((struct page *)addr + i, NULL, order, zone);
+
+ page = virt_to_page(addr);
+ if (cmpxchg(&zone->vmemmap_tails[idx], NULL, page) != NULL) {
+ if (slab_is_available())
+ __free_page(page);
+ else
+ memblock_free(page_to_virt(page), PAGE_SIZE);
+ page = READ_ONCE(zone->vmemmap_tails[idx]);
+ }
- return tail;
+ return page;
}
void __weak __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 38/69] mm/sparse-vmemmap: Introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (36 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 37/69] mm/sparse-vmemmap: Factor out shared vmemmap page allocation Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 39/69] mm/sparse-vmemmap: Switch DAX to vmemmap_shared_tail_page() Muchun Song
` (9 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
The generic sparse-vmemmap optimization code is still guarded by
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, even though it is no longer
HugeTLB-specific.
Introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION to represent the common
vmemmap optimization infrastructure. Have HugeTLB and DAX select it,
and use it to guard generic optimization code.
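Generic code then keys off the new option at compile time; for example
(taken in spirit from the page-flags.h hunk below):
        if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION))
                return false;
        return is_power_of_2(sizeof(struct page));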
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/x86/entry/vdso/vdso32/fake_32bit_build.h | 2 +-
drivers/dax/Kconfig | 1 +
fs/Kconfig | 1 +
include/linux/mmzone.h | 33 ++++++++++---------
include/linux/page-flags.h | 5 +--
mm/Kconfig | 4 +++
6 files changed, 26 insertions(+), 20 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso32/fake_32bit_build.h b/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
index bc3e549795c3..5f8424eade2b 100644
--- a/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
+++ b/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
@@ -11,7 +11,7 @@
#undef CONFIG_PGTABLE_LEVELS
#undef CONFIG_ILLEGAL_POINTER_VALUE
#undef CONFIG_SPARSEMEM_VMEMMAP
-#undef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+#undef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
#undef CONFIG_NR_CPUS
#undef CONFIG_PARAVIRT_XXL
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 602f9a0839a9..60cb05dce53d 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -8,6 +8,7 @@ if DAX
config DEV_DAX
tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
+ select SPARSEMEM_VMEMMAP_OPTIMIZATION if ARCH_WANT_OPTIMIZE_DAX_VMEMMAP && SPARSEMEM_VMEMMAP
help
Support raw access to differentiated (persistence, bandwidth,
latency...) memory via an mmap(2) capable character
diff --git a/fs/Kconfig b/fs/Kconfig
index ccb9dd480523..f6cee1bbb1fc 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -278,6 +278,7 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
def_bool HUGETLB_PAGE
depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP
+ select SPARSEMEM_VMEMMAP_OPTIMIZATION
config HUGETLB_PMD_PAGE_TABLE_SHARING
def_bool HUGETLB_PAGE
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9b87d798a365..5285d53b0c53 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -102,9 +102,9 @@
*
* HVO which is only active if the size of struct page is a power of 2.
*/
-#define MAX_FOLIO_VMEMMAP_ALIGN \
- (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) && \
- is_power_of_2(sizeof(struct page)) ? \
+#define MAX_FOLIO_VMEMMAP_ALIGN \
+ (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION) && \
+ is_power_of_2(sizeof(struct page)) ? \
MAX_FOLIO_NR_PAGES * sizeof(struct page) : 0)
/* The number of vmemmap pages required by a vmemmap-optimized folio. */
@@ -115,7 +115,8 @@
#define __NR_OPTIMIZABLE_FOLIO_ORDERS (MAX_FOLIO_ORDER - OPTIMIZABLE_FOLIO_MIN_ORDER + 1)
#define NR_OPTIMIZABLE_FOLIO_ORDERS \
- (__NR_OPTIMIZABLE_FOLIO_ORDERS > 0 ? __NR_OPTIMIZABLE_FOLIO_ORDERS : 0)
+ ((__NR_OPTIMIZABLE_FOLIO_ORDERS > 0 && \
+ IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION)) ? __NR_OPTIMIZABLE_FOLIO_ORDERS : 0)
static inline bool order_vmemmap_optimizable(unsigned int order)
{
@@ -2033,7 +2034,7 @@ struct mem_section {
*/
struct page_ext *page_ext;
#endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
/*
* The order of compound pages in this section. Typically, the section
* holds compound pages of this order; a larger compound page will span
@@ -2213,7 +2214,19 @@ static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long
*pfn = (*pfn & PAGE_SECTION_MASK) + (bit * PAGES_PER_SUBSECTION);
return true;
}
+#else
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+ return 1;
+}
+
+static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn)
+{
+ return true;
+}
+#endif
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
static inline void section_set_order(struct mem_section *section, unsigned int order)
{
VM_WARN_ON(section->order && order && section->order != order);
@@ -2225,16 +2238,6 @@ static inline unsigned int section_order(const struct mem_section *section)
return section->order;
}
#else
-static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
-{
- return 1;
-}
-
-static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn)
-{
- return true;
-}
-
static inline void section_set_order(struct mem_section *section, unsigned int order)
{
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0e03d816e8b9..12665b34586c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -208,14 +208,11 @@ enum pageflags {
static __always_inline bool compound_info_has_mask(void)
{
/*
- * Limit mask usage to HugeTLB vmemmap optimization (HVO) where it
- * makes a difference.
- *
* The approach with mask would work in the wider set of conditions,
* but it requires validating that struct pages are naturally aligned
* for all orders up to the MAX_FOLIO_ORDER, which can be tricky.
*/
- if (!IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP))
+ if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION))
return false;
return is_power_of_2(sizeof(struct page));
diff --git a/mm/Kconfig b/mm/Kconfig
index c26d2d2050d5..ddd10cb4d0a3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,6 +410,10 @@ config SPARSEMEM_VMEMMAP
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.
+config SPARSEMEM_VMEMMAP_OPTIMIZATION
+ bool
+ depends on SPARSEMEM_VMEMMAP
+
#
# Select this config option from the architecture Kconfig, if it is preferred
# to enable the feature of HugeTLB/dev_dax vmemmap optimization.
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 39/69] mm/sparse-vmemmap: Switch DAX to vmemmap_shared_tail_page()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (37 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 38/69] mm/sparse-vmemmap: Introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 40/69] powerpc/mm: " Muchun Song
` (8 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
DAX compound vmemmap population still has its own way to find a reusable
tail page by walking the previous section's PTEs.
Switch it to the common vmemmap_shared_tail_page() helper instead, so
DAX uses the same per-zone shared tail page as the other vmemmap
optimization users. This removes the PTE walk and lets both the section
reuse path and the populate path use the same shared page directly.
When the target zone is ZONE_DEVICE, mark the shared tail page entries
PG_reserved as well, so they match the initialization requirements for
device pages.
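Condensed from the mm/sparse-vmemmap.c hunk below, the ZONE_DEVICE
handling while filling the shared tail page is:
        for (i = 0; i < PAGE_SIZE / sizeof(struct page); i++) {
                page = (struct page *)addr + i;
                if (zone_is_zone_device(zone))
                        __SetPageReserved(page); /* device pages expect PG_reserved */
                init_compound_tail(page, NULL, order, zone);
        }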
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mmzone.h | 10 +++++++++
mm/memory_hotplug.c | 9 ++++++--
mm/sparse-vmemmap.c | 48 ++++++++++++++----------------------------
3 files changed, 33 insertions(+), 34 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5285d53b0c53..7484e7be7b6d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1693,11 +1693,21 @@ static inline bool zone_is_zone_device(const struct zone *zone)
{
return zone_idx(zone) == ZONE_DEVICE;
}
+
+static inline struct zone *device_zone(int nid)
+{
+ return &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+}
#else
static inline bool zone_is_zone_device(const struct zone *zone)
{
return false;
}
+
+static inline struct zone *device_zone(int nid)
+{
+ return NULL;
+}
#endif
/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 462d8dcd636d..9ff830703785 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -551,8 +551,13 @@ void remove_pfn_range_from_zone(struct zone *zone,
/* Select all remaining pages up to the next section boundary */
cur_nr_pages =
min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
- page_init_poison(pfn_to_page(pfn),
- sizeof(struct page) * cur_nr_pages);
+ /*
+ * This is a temporary workaround to prevent the shared vmemmap
+ * page from being overwritten; it will be removed later.
+ */
+ if (!zone_is_zone_device(zone))
+ page_init_poison(pfn_to_page(pfn),
+ sizeof(struct page) * cur_nr_pages);
}
/*
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 53a341fcde74..0c0b54e94c07 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -329,8 +329,12 @@ struct page __ref *vmemmap_shared_tail_page(unsigned int order, struct zone *zon
if (!addr)
return NULL;
- for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
- init_compound_tail((struct page *)addr + i, NULL, order, zone);
+ for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++) {
+ page = (struct page *)addr + i;
+ if (zone_is_zone_device(zone))
+ __SetPageReserved(page);
+ init_compound_tail(page, NULL, order, zone);
+ }
page = virt_to_page(addr);
if (cmpxchg(&zone->vmemmap_tails[idx], NULL, page) != NULL) {
@@ -442,23 +446,6 @@ static bool __meminit reuse_compound_section(unsigned long start_pfn,
return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
}
-static pte_t * __meminit compound_section_tail_page(unsigned long addr)
-{
- pte_t *pte;
-
- addr -= PAGE_SIZE;
-
- /*
- * Assuming sections are populated sequentially, the previous section's
- * page data can be reused.
- */
- pte = pte_offset_kernel(pmd_off_k(addr), addr);
- if (!pte)
- return NULL;
-
- return pte;
-}
-
static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
unsigned long start,
unsigned long end, int node,
@@ -467,19 +454,15 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
unsigned long size, addr;
pte_t *pte;
int rc;
+ struct page *page;
- if (reuse_compound_section(start_pfn, pgmap)) {
- pte = compound_section_tail_page(start);
- if (!pte)
- return -ENOMEM;
+ page = vmemmap_shared_tail_page(pgmap->vmemmap_shift, device_zone(node));
+ if (!page)
+ return -ENOMEM;
- /*
- * Reuse the page that was populated in the prior iteration
- * with just tail struct pages.
- */
+ if (reuse_compound_section(start_pfn, pgmap))
return vmemmap_populate_range(start, end, node, NULL,
- pte_pfn(ptep_get(pte)));
- }
+ page_to_pfn(page));
size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
for (addr = start; addr < end; addr += size) {
@@ -497,12 +480,12 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return -ENOMEM;
/*
- * Reuse the previous page for the rest of tail pages
+ * Reuse the shared page for the rest of tail pages
* See layout diagram in Documentation/mm/vmemmap_dedup.rst
*/
next += PAGE_SIZE;
rc = vmemmap_populate_range(next, last, node, NULL,
- pte_pfn(ptep_get(pte)));
+ page_to_pfn(page));
if (rc)
return -ENOMEM;
}
@@ -828,7 +811,8 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
* Poison uninitialized struct pages in order to catch invalid flags
* combinations.
*/
- page_init_poison(memmap, sizeof(struct page) * nr_pages);
+ if (!vmemmap_can_optimize(altmap, pgmap))
+ page_init_poison(memmap, sizeof(struct page) * nr_pages);
ms = __nr_to_section(section_nr);
__section_mark_present(ms, section_nr);
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 40/69] powerpc/mm: Switch DAX to vmemmap_shared_tail_page()
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (38 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 39/69] mm/sparse-vmemmap: Switch DAX to vmemmap_shared_tail_page() Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 41/69] mm/sparse-vmemmap: Drop the extra tail page from DAX reservation Muchun Song
` (7 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
powerpc compound vmemmap population still finds a reusable tail page by
walking the vmemmap page tables.
Switch it to the common vmemmap_shared_tail_page() helper instead, so it
can use the shared tail page directly without probing or populating
neighboring mappings.
This removes the powerpc-specific tail-page lookup and its fallback path
and aligns the radix vmemmap optimization path with the generic shared
tail-page scheme.
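Sketched from the radix_pgtable.c hunk below, the flow becomes: fetch
the per-zone shared tail page once, then install it for every tail PTE
instead of walking neighboring vmemmap page tables:
        tail_page = vmemmap_shared_tail_page(pgmap->vmemmap_shift, device_zone(node));
        if (!tail_page)
                return -ENOMEM;
        /* ... per-address population loop ... */
        pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, tail_page);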
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 76 ++----------------------
1 file changed, 6 insertions(+), 70 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cf692b2b5f7b..95e65ac8cdea 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1250,59 +1250,6 @@ static pte_t * __meminit radix__vmemmap_populate_address(unsigned long addr, int
return pte;
}
-static pte_t * __meminit vmemmap_compound_tail_page(unsigned long addr,
- unsigned long pfn_offset, int node)
-{
- pgd_t *pgd;
- p4d_t *p4d;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
- unsigned long map_addr;
-
- /* the second vmemmap page which we use for duplication */
- map_addr = addr - pfn_offset * sizeof(struct page) + PAGE_SIZE;
- pgd = pgd_offset_k(map_addr);
- p4d = p4d_offset(pgd, map_addr);
- pud = vmemmap_pud_alloc(p4d, node, map_addr);
- if (!pud)
- return NULL;
- pmd = vmemmap_pmd_alloc(pud, node, map_addr);
- if (!pmd)
- return NULL;
- if (pmd_leaf(*pmd))
- /*
- * The second page is mapped as a hugepage due to a nearby request.
- * Force our mapping to page size without deduplication
- */
- return NULL;
- pte = vmemmap_pte_alloc(pmd, node, map_addr);
- if (!pte)
- return NULL;
- /*
- * Check if there exist a mapping to the left
- */
- if (pte_none(*pte)) {
- /*
- * Populate the head page vmemmap page.
- * It can fall in different pmd, hence
- * vmemmap_populate_address()
- */
- pte = radix__vmemmap_populate_address(map_addr - PAGE_SIZE, node, NULL, NULL);
- if (!pte)
- return NULL;
- /*
- * Populate the tail pages vmemmap page
- */
- pte = radix__vmemmap_pte_populate(pmd, map_addr, node, NULL, NULL);
- if (!pte)
- return NULL;
- vmemmap_verify(pte, node, map_addr, map_addr + PAGE_SIZE);
- return pte;
- }
- return pte;
-}
-
int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
unsigned long start,
unsigned long end, int node,
@@ -1320,6 +1267,11 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ struct page *tail_page;
+
+ tail_page = vmemmap_shared_tail_page(pgmap->vmemmap_shift, device_zone(node));
+ if (!tail_page)
+ return -ENOMEM;
for (addr = start; addr < end; addr = next) {
@@ -1352,7 +1304,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
unsigned long addr_pfn = page_to_pfn((struct page *)addr);
unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages);
- pte_t *tail_page_pte;
/*
* if the address is aligned to huge page size it is the
@@ -1377,23 +1328,8 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
next = addr + 2 * PAGE_SIZE;
continue;
}
- /*
- * get the 2nd mapping details
- * Also create it if that doesn't exist
- */
- tail_page_pte = vmemmap_compound_tail_page(addr, pfn_offset, node);
- if (!tail_page_pte) {
-
- pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL);
- if (!pte)
- return -ENOMEM;
- vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
-
- next = addr + PAGE_SIZE;
- continue;
- }
- pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, pte_page(*tail_page_pte));
+ pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, tail_page);
if (!pte)
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 41/69] mm/sparse-vmemmap: Drop the extra tail page from DAX reservation
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (39 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 40/69] powerpc/mm: " Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 42/69] mm/sparse-vmemmap: Switch DAX to section-based vmemmap optimization Muchun Song
` (6 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
DAX compound vmemmap population still reserves one extra tail vmemmap
page after the head page, and only maps the remaining tail pages
through the shared tail page.
Drop that extra reservation and let the shared tail page cover all tail
vmemmap pages after the head page, so DAX follows the same reservation
model as HugeTLB.
This reduces the reserved vmemmap pages for optimized DAX mappings to
OPTIMIZED_FOLIO_VMEMMAP_PAGES and removes the now-unneeded first-tail
population from the generic and powerpc paths.
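As a rough worked example (assuming 4 KiB base pages, a 64-byte struct
page, and 2 MiB DAX compound pages): the full vmemmap for one compound
page is 512 * 64 bytes = 32 KiB, i.e. 8 vmemmap pages, and dropping the
extra reserved tail page saves one more 4 KiB page per compound page,
roughly 2 MiB of struct page memory per 1 GiB of optimized device DAX
capacity.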
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 44 +-----------------------
include/linux/mm.h | 2 +-
mm/sparse-vmemmap.c | 8 +----
3 files changed, 3 insertions(+), 51 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 95e65ac8cdea..fb8738016b30 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1217,39 +1217,6 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
return 0;
}
-static pte_t * __meminit radix__vmemmap_populate_address(unsigned long addr, int node,
- struct vmem_altmap *altmap,
- struct page *reuse)
-{
- pgd_t *pgd;
- p4d_t *p4d;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- pgd = pgd_offset_k(addr);
- p4d = p4d_offset(pgd, addr);
- pud = vmemmap_pud_alloc(p4d, node, addr);
- if (!pud)
- return NULL;
- pmd = vmemmap_pmd_alloc(pud, node, addr);
- if (!pmd)
- return NULL;
- if (pmd_leaf(*pmd))
- /*
- * The second page is mapped as a hugepage due to a nearby request.
- * Force our mapping to page size without deduplication
- */
- return NULL;
- pte = vmemmap_pte_alloc(pmd, node, addr);
- if (!pte)
- return NULL;
- radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL);
- vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
-
- return pte;
-}
-
int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
unsigned long start,
unsigned long end, int node,
@@ -1316,16 +1283,7 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
- /*
- * Populate the tail pages vmemmap page
- * It can fall in different pmd, hence
- * vmemmap_populate_address()
- */
- pte = radix__vmemmap_populate_address(addr + PAGE_SIZE, node, NULL, NULL);
- if (!pte)
- return -ENOMEM;
-
- next = addr + 2 * PAGE_SIZE;
+ next = addr + PAGE_SIZE;
continue;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5281f073230c..86d7cecb834e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4898,7 +4898,7 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif
-#define VMEMMAP_RESERVE_NR 2
+#define VMEMMAP_RESERVE_NR OPTIMIZED_FOLIO_VMEMMAP_PAGES
#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 0c0b54e94c07..b5c109b8af6f 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -473,17 +473,11 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
if (!pte)
return -ENOMEM;
- /* Populate the tail pages vmemmap page */
- next = addr + PAGE_SIZE;
- pte = vmemmap_populate_address(next, node, NULL, -1);
- if (!pte)
- return -ENOMEM;
-
/*
* Reuse the shared page for the rest of tail pages
* See layout diagram in Documentation/mm/vmemmap_dedup.rst
*/
- next += PAGE_SIZE;
+ next = addr + PAGE_SIZE;
rc = vmemmap_populate_range(next, last, node, NULL,
page_to_pfn(page));
if (rc)
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 42/69] mm/sparse-vmemmap: Switch DAX to section-based vmemmap optimization
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (40 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 41/69] mm/sparse-vmemmap: Drop the extra tail page from DAX reservation Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 43/69] mm/sparse-vmemmap: Unify DAX and HugeTLB population paths Muchun Song
` (5 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
DAX vmemmap optimization still uses pgmap-specific state to decide
whether a section should use the optimized layout.
Switch DAX to the compound page order recorded in struct mem_section, so
it follows the same section-based optimization state as the rest of
sparse-vmemmap.
This lets the DAX population, initialization, and teardown paths make
their optimization decisions from the section metadata instead of
carrying separate pgmap-specific state.
This makes DAX vmemmap optimization section-granular. Only
section-aligned ranges record a compound page order, so subsection
mappings remain unoptimized. The resulting loss of vmemmap savings
is negligible.
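Condensed from the section_activate() hunk below, the section-granular
decision is: record the compound order on the section at activation
time and require all sub-sections of a section to agree on it:
        order = vmemmap_can_optimize(altmap, pgmap) ? pgmap->vmemmap_shift : 0;
        /* All sub-sections within a section must share the same order. */
        if (nr_pages < PAGES_PER_SECTION && section_order(ms) && section_order(ms) != order)
                return ERR_PTR(-ENOTSUPP);
        /* ... usage allocation and early-section handling ... */
        section_set_order_range(pfn, nr_pages, order);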
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +++--
mm/memory_hotplug.c | 6 +-----
mm/mm_init.c | 13 ++++---------
mm/sparse-vmemmap.c | 24 ++++++++++++++++++------
mm/sparse.c | 2 +-
5 files changed, 27 insertions(+), 23 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index fb8738016b30..f0043c57694e 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1235,8 +1235,9 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
pmd_t *pmd;
pte_t *pte;
struct page *tail_page;
+ const struct mem_section *ms = __pfn_to_section(start_pfn);
- tail_page = vmemmap_shared_tail_page(pgmap->vmemmap_shift, device_zone(node));
+ tail_page = vmemmap_shared_tail_page(section_order(ms), device_zone(node));
if (!tail_page)
return -ENOMEM;
@@ -1268,7 +1269,7 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
next = addr + PAGE_SIZE;
continue;
} else {
- unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
+ unsigned long nr_pages = 1UL << section_order(ms);
unsigned long addr_pfn = page_to_pfn((struct page *)addr);
unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9ff830703785..c9c69f827efa 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -551,11 +551,7 @@ void remove_pfn_range_from_zone(struct zone *zone,
/* Select all remaining pages up to the next section boundary */
cur_nr_pages =
min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
- /*
- * This is a temporary workaround to prevent the shared vmemmap
- * page from being overwritten; it will be removed later.
- */
- if (!zone_is_zone_device(zone))
+ if (!section_vmemmap_optimizable(__pfn_to_section(pfn)))
page_init_poison(pfn_to_page(pfn),
sizeof(struct page) * cur_nr_pages);
}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 35c99e5c215c..2b94115e6dd5 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1071,16 +1071,11 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* of an altmap. See vmemmap_populate_compound_pages().
*/
static inline unsigned long compound_nr_pages(unsigned long pfn,
- struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
{
- /*
- * If DAX memory is hot-plugged into an unoccupied subsection
- * of an early section, the unoptimized boot memmap is reused.
- * See section_activate().
- */
- if (early_section(__pfn_to_section(pfn)) ||
- !vmemmap_can_optimize(altmap, pgmap))
+ const struct mem_section *ms = __pfn_to_section(pfn);
+
+ if (!section_vmemmap_optimizable(ms))
return pgmap_vmemmap_nr(pgmap);
return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
@@ -1150,7 +1145,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
continue;
memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
- compound_nr_pages(pfn, altmap, pgmap));
+ compound_nr_pages(pfn, pgmap));
}
pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false, false);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index b5c109b8af6f..ad3e5b54abf7 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -455,8 +455,9 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
pte_t *pte;
int rc;
struct page *page;
+ const struct mem_section *ms = __pfn_to_section(start_pfn);
- page = vmemmap_shared_tail_page(pgmap->vmemmap_shift, device_zone(node));
+ page = vmemmap_shared_tail_page(section_order(ms), device_zone(node));
if (!page)
return -ENOMEM;
@@ -464,7 +465,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return vmemmap_populate_range(start, end, node, NULL,
page_to_pfn(page));
- size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
+ size = min(end - start, (1UL << section_order(ms)) * sizeof(struct page));
for (addr = start; addr < end; addr += size) {
unsigned long next, last = addr + size;
@@ -501,7 +502,9 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
return NULL;
- if (vmemmap_can_optimize(altmap, pgmap))
+ /* This may occur in sub-section scenarios. */
+ if (vmemmap_can_optimize(altmap, pgmap) &&
+ section_vmemmap_optimizable(__pfn_to_section(pfn)))
r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
else
r = vmemmap_populate(start, end, nid, altmap);
@@ -718,8 +721,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
else if (memmap)
free_map_bootmem(memmap);
- if (empty)
+ if (empty) {
ms->section_mem_map = (unsigned long)NULL;
+ section_set_order(ms, 0);
+ }
}
static struct page * __meminit section_activate(int nid, unsigned long pfn,
@@ -729,8 +734,14 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
struct mem_section *ms = __pfn_to_section(pfn);
struct mem_section_usage *usage = NULL;
struct page *memmap;
+ unsigned int order;
int rc;
+ order = vmemmap_can_optimize(altmap, pgmap) ? pgmap->vmemmap_shift : 0;
+ /* All sub-sections within a section must share the same order. */
+ if (nr_pages < PAGES_PER_SECTION && section_order(ms) && section_order(ms) != order)
+ return ERR_PTR(-ENOTSUPP);
+
if (!ms->usage) {
usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
if (!usage)
@@ -756,6 +767,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
if (nr_pages < PAGES_PER_SECTION && early_section(ms))
return pfn_to_page(pfn);
+ section_set_order_range(pfn, nr_pages, order);
memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
if (!memmap) {
section_deactivate(pfn, nr_pages, altmap, pgmap);
@@ -801,14 +813,14 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
if (IS_ERR(memmap))
return PTR_ERR(memmap);
+ ms = __nr_to_section(section_nr);
/*
* Poison uninitialized struct pages in order to catch invalid flags
* combinations.
*/
- if (!vmemmap_can_optimize(altmap, pgmap))
+ if (!section_vmemmap_optimizable(ms))
page_init_poison(memmap, sizeof(struct page) * nr_pages);
- ms = __nr_to_section(section_nr);
__section_mark_present(ms, section_nr);
/* Align memmap to section boundary in the subsection case */
diff --git a/mm/sparse.c b/mm/sparse.c
index 54c38ea08190..6878f8941b4c 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -251,7 +251,7 @@ int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages
if (vmemmap_can_optimize(altmap, pgmap))
vmemmap_pages = VMEMMAP_RESERVE_NR;
- if (!vmemmap_can_optimize(altmap, pgmap) && !section_vmemmap_optimizable(ms))
+ if (!section_vmemmap_optimizable(ms))
return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
if (order < PFN_SECTION_SHIFT) {
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 43/69] mm/sparse-vmemmap: Unify DAX and HugeTLB population paths
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (41 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 42/69] mm/sparse-vmemmap: Switch DAX to section-based vmemmap optimization Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 44/69] mm/sparse-vmemmap: Remove the unused ptpfn argument Muchun Song
` (4 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Now that DAX and HugeTLB use the same optimized vmemmap layout, they no
longer need separate population flows.
Move the shared-tail-page handling into vmemmap_pte_populate() so both
users can go through the normal basepage population path. This removes
the compound-page-specific population helper and leaves the optimized
mapping decisions in one place.
At runtime, the optimized users are limited to ZONE_DEVICE memory, so
use device_zone() for the shared-tail-page allocation there;
pfn_to_zone() cannot be used at that point because the zone span has
not been set up yet.
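Condensed, the unified vmemmap_pte_populate() flow in the
mm/sparse-vmemmap.c hunk below is (error paths trimmed):
        if (vmemmap_page_optimizable(page)) {
                /* Runtime users are ZONE_DEVICE only; boot-time HugeTLB
                 * can still resolve the zone from the pfn. */
                zone = slab_is_available() ? device_zone(node) : pfn_to_zone(pfn, node);
                page = vmemmap_shared_tail_page(pfn_to_section_order(pfn), zone);
                if (slab_is_available())
                        get_page(page); /* paired with put_page_testzero() on free */
                ptpfn = page_to_pfn(page);
        } else {
                void *vaddr = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
                ptpfn = PHYS_PFN(__pa(vaddr));
        }
        entry = pfn_pte(ptpfn, PAGE_KERNEL);
        set_pte_at(&init_mm, addr, pte, entry);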
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +
mm/mm_init.c | 2 +-
mm/sparse-vmemmap.c | 183 ++++++-----------------
3 files changed, 50 insertions(+), 138 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index f0043c57694e..c7f2327681cc 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1121,7 +1121,10 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ unsigned long pfn = page_to_pfn((struct page *)start);
+ if (section_vmemmap_optimizable(__pfn_to_section(pfn)))
+ return vmemmap_populate_compound_pages(pfn, start, end, node, NULL);
/*
* If altmap is present, Make sure we align the start vmemmap addr
* to PAGE_SIZE so that we calculate the correct start_pfn in
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2b94115e6dd5..9ff118e35641 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1068,7 +1068,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* initialize is a lot smaller that the total amount of struct pages being
* mapped. This is a paired / mild layering violation with explicit knowledge
* of how the sparse_vmemmap internals handle compound pages in the lack
- * of an altmap. See vmemmap_populate_compound_pages().
+ * of an altmap.
*/
static inline unsigned long compound_nr_pages(unsigned long pfn,
struct dev_pagemap *pgmap)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index ad3e5b54abf7..4833a2295abb 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -127,49 +127,48 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
struct vmem_altmap *altmap,
unsigned long ptpfn)
{
- pte_t *pte = pte_offset_kernel(pmd, addr);
-
- if (pte_none(ptep_get(pte))) {
- pte_t entry;
-
- if (vmemmap_page_optimizable((struct page *)addr) &&
- ptpfn == (unsigned long)-1) {
- struct page *page;
- unsigned long pfn = page_to_pfn((struct page *)addr);
- const struct mem_section *ms = __pfn_to_section(pfn);
- struct zone *zone = pfn_to_zone(pfn, node);
-
- if (WARN_ON_ONCE(!zone))
- return NULL;
- page = vmemmap_shared_tail_page(section_order(ms), zone);
- if (!page)
- return NULL;
- ptpfn = page_to_pfn(page);
- }
+ pte_t entry, *pte = pte_offset_kernel(pmd, addr);
+ struct page *page = (struct page *)addr;
+
+ if (!pte_none(ptep_get(pte)))
+ return WARN_ON_ONCE(vmemmap_page_optimizable(page)) ? NULL : pte;
+
+ /* See layout diagram in Documentation/mm/vmemmap_dedup.rst. */
+ if (vmemmap_page_optimizable(page)) {
+ struct zone *zone;
+ unsigned long pfn = page_to_pfn(page);
+
+ /*
+ * At runtime (slab available), only ZONE_DEVICE pages (DAX)
+ * trigger vmemmap optimization, so device_zone() suffices.
+ * Note: pfn_to_zone() cannot be used at runtime because the
+ * zone span is not set up now.
+ */
+ zone = slab_is_available() ? device_zone(node) : pfn_to_zone(pfn, node);
+ if (WARN_ON_ONCE(!zone))
+ return NULL;
+ page = vmemmap_shared_tail_page(pfn_to_section_order(pfn), zone);
+ if (!page)
+ return NULL;
+
+ /*
+ * When a PTE entry is freed, a free_pages() call occurs. This
+ * get_page() pairs with put_page_testzero() on the freeing
+ * path. This can only occur when slab is available.
+ */
+ if (slab_is_available())
+ get_page(page);
+ ptpfn = page_to_pfn(page);
+ } else {
+ void *vaddr = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+
+ if (!vaddr)
+ return NULL;
+ ptpfn = PHYS_PFN(__pa(vaddr));
+ }
+ entry = pfn_pte(ptpfn, PAGE_KERNEL);
+ set_pte_at(&init_mm, addr, pte, entry);
- if (ptpfn == (unsigned long)-1) {
- void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-
- if (!p)
- return NULL;
- ptpfn = PHYS_PFN(__pa(p));
- } else {
- /*
- * When a PTE/PMD entry is freed from the init_mm
- * there's a free_pages() call to this page allocated
- * above. Thus this get_page() is paired with the
- * put_page_testzero() on the freeing path.
- * This can only called by certain ZONE_DEVICE path,
- * and through vmemmap_populate_compound_pages() when
- * slab is available.
- */
- if (slab_is_available())
- get_page(pfn_to_page(ptpfn));
- }
- entry = pfn_pte(ptpfn, PAGE_KERNEL);
- set_pte_at(&init_mm, addr, pte, entry);
- } else if (WARN_ON_ONCE(vmemmap_page_optimizable((struct page *)addr)))
- return NULL;
return pte;
}
@@ -265,30 +264,16 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
return pte;
}
-static int __meminit vmemmap_populate_range(unsigned long start,
- unsigned long end, int node,
- struct vmem_altmap *altmap,
- unsigned long ptpfn)
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+ int node, struct vmem_altmap *altmap)
{
- unsigned long addr = start;
- pte_t *pte;
-
- for (; addr < end; addr += PAGE_SIZE) {
- pte = vmemmap_populate_address(addr, node, altmap,
- ptpfn);
- if (!pte)
+ for (; start < end; start += PAGE_SIZE)
+ if (!vmemmap_populate_address(start, node, altmap, -1))
return -ENOMEM;
- }
return 0;
}
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
- int node, struct vmem_altmap *altmap)
-{
- return vmemmap_populate_range(start, end, node, altmap, -1);
-}
-
/*
* Write protect the mirrored tail page structs for HVO. This will be
* called from the hugetlb code when gathering and initializing the
@@ -425,94 +410,18 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
return 0;
}
-#ifndef vmemmap_populate_compound_pages
-/*
- * For compound pages bigger than section size (e.g. x86 1G compound
- * pages with 2M subsection size) fill the rest of sections as tail
- * pages.
- *
- * Note that memremap_pages() resets @nr_range value and will increment
- * it after each range successful onlining. Thus the value or @nr_range
- * at section memmap populate corresponds to the in-progress range
- * being onlined here.
- */
-static bool __meminit reuse_compound_section(unsigned long start_pfn,
- struct dev_pagemap *pgmap)
-{
- unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
- unsigned long offset = start_pfn -
- PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
-
- return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
-}
-
-static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
- unsigned long start,
- unsigned long end, int node,
- struct dev_pagemap *pgmap)
-{
- unsigned long size, addr;
- pte_t *pte;
- int rc;
- struct page *page;
- const struct mem_section *ms = __pfn_to_section(start_pfn);
-
- page = vmemmap_shared_tail_page(section_order(ms), device_zone(node));
- if (!page)
- return -ENOMEM;
-
- if (reuse_compound_section(start_pfn, pgmap))
- return vmemmap_populate_range(start, end, node, NULL,
- page_to_pfn(page));
-
- size = min(end - start, (1UL << section_order(ms)) * sizeof(struct page));
- for (addr = start; addr < end; addr += size) {
- unsigned long next, last = addr + size;
-
- /* Populate the head page vmemmap page */
- pte = vmemmap_populate_address(addr, node, NULL, -1);
- if (!pte)
- return -ENOMEM;
-
- /*
- * Reuse the shared page for the rest of tail pages
- * See layout diagram in Documentation/mm/vmemmap_dedup.rst
- */
- next = addr + PAGE_SIZE;
- rc = vmemmap_populate_range(next, last, node, NULL,
- page_to_pfn(page));
- if (rc)
- return -ENOMEM;
- }
-
- return 0;
-}
-
-#endif
-
struct page * __meminit __populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
{
unsigned long start = (unsigned long) pfn_to_page(pfn);
unsigned long end = start + nr_pages * sizeof(struct page);
- int r;
if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
return NULL;
- /* This may occur in sub-section scenarios. */
- if (vmemmap_can_optimize(altmap, pgmap) &&
- section_vmemmap_optimizable(__pfn_to_section(pfn)))
- r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
- else
- r = vmemmap_populate(start, end, nid, altmap);
-
- if (r < 0)
- return NULL;
-
- return pfn_to_page(pfn);
+ return vmemmap_populate(start, end, nid, altmap) ? NULL : (void *)start;
}
static void subsection_mask_set(unsigned long *map, unsigned long pfn,
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 44/69] mm/sparse-vmemmap: Remove the unused ptpfn argument
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (42 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 43/69] mm/sparse-vmemmap: Unify DAX and HugeTLB population paths Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 45/69] powerpc/mm: Make vmemmap_populate_compound_pages() static Muchun Song
` (3 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
vmemmap_pte_populate() no longer uses ptpfn as an input; it computes
the PFN locally, in both the shared-tail and the freshly allocated
case, before building the PTE.
Drop the argument and inline the PFN computation at the PTE creation
sites.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/sparse-vmemmap.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 4833a2295abb..182d0c7dd1e7 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -124,8 +124,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
}
static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
- struct vmem_altmap *altmap,
- unsigned long ptpfn)
+ struct vmem_altmap *altmap)
{
pte_t entry, *pte = pte_offset_kernel(pmd, addr);
struct page *page = (struct page *)addr;
@@ -158,15 +157,15 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
*/
if (slab_is_available())
get_page(page);
- ptpfn = page_to_pfn(page);
+
+ entry = pfn_pte(page_to_pfn(page), PAGE_KERNEL);
} else {
void *vaddr = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
if (!vaddr)
return NULL;
- ptpfn = PHYS_PFN(__pa(vaddr));
+ entry = pfn_pte(PHYS_PFN(__pa(vaddr)), PAGE_KERNEL);
}
- entry = pfn_pte(ptpfn, PAGE_KERNEL);
set_pte_at(&init_mm, addr, pte, entry);
return pte;
@@ -235,8 +234,7 @@ static pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
}
static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
- struct vmem_altmap *altmap,
- unsigned long ptpfn)
+ struct vmem_altmap *altmap)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -256,7 +254,7 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
pmd = vmemmap_pmd_populate(pud, addr, node);
if (!pmd)
return NULL;
- pte = vmemmap_pte_populate(pmd, addr, node, altmap, ptpfn);
+ pte = vmemmap_pte_populate(pmd, addr, node, altmap);
if (!pte)
return NULL;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
@@ -268,7 +266,7 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
int node, struct vmem_altmap *altmap)
{
for (; start < end; start += PAGE_SIZE)
- if (!vmemmap_populate_address(start, node, altmap, -1))
+ if (!vmemmap_populate_address(start, node, altmap))
return -ENOMEM;
return 0;
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 45/69] powerpc/mm: Make vmemmap_populate_compound_pages() static
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (43 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 44/69] mm/sparse-vmemmap: Remove the unused ptpfn argument Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:05 ` [PATCH v2 46/69] mm/sparse-vmemmap: Map shared vmemmap tail pages read-only Muchun Song
` (2 subsequent siblings)
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
vmemmap_populate_compound_pages() is no longer used outside
radix_pgtable.c.
Make it static and drop the unused dev_pagemap and start_pfn
arguments from its only remaining caller.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/include/asm/book3s/64/radix.h | 6 ------
arch/powerpc/mm/book3s64/radix_pgtable.c | 14 +++++++-------
2 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index da954e779744..8452a2714cb1 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -356,11 +356,5 @@ int radix__remove_section_mapping(unsigned long start, unsigned long end);
#define vmemmap_can_optimize vmemmap_can_optimize
bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
#endif
-
-#define vmemmap_populate_compound_pages vmemmap_populate_compound_pages
-int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
- unsigned long start,
- unsigned long end, int node,
- struct dev_pagemap *pgmap);
#endif /* __ASSEMBLER__ */
#endif
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index c7f2327681cc..18b24bb891b7 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1109,7 +1109,8 @@ static inline pte_t *vmemmap_pte_alloc(pmd_t *pmdp, int node,
return pte_offset_kernel(pmdp, address);
}
-
+static int __meminit vmemmap_populate_compound_pages(unsigned long start,
+ unsigned long end, int node);
int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap)
@@ -1124,7 +1125,7 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
unsigned long pfn = page_to_pfn((struct page *)start);
if (section_vmemmap_optimizable(__pfn_to_section(pfn)))
- return vmemmap_populate_compound_pages(pfn, start, end, node, NULL);
+ return vmemmap_populate_compound_pages(start, end, node);
/*
* If altmap is present, Make sure we align the start vmemmap addr
* to PAGE_SIZE so that we calculate the correct start_pfn in
@@ -1220,10 +1221,8 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
return 0;
}
-int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
- unsigned long start,
- unsigned long end, int node,
- struct dev_pagemap *pgmap)
+static int __meminit vmemmap_populate_compound_pages(unsigned long start,
+ unsigned long end, int node)
{
/*
* we want to map things as base page size mapping so that
@@ -1238,8 +1237,9 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
pmd_t *pmd;
pte_t *pte;
struct page *tail_page;
- const struct mem_section *ms = __pfn_to_section(start_pfn);
+ const struct mem_section *ms;
+ ms = __pfn_to_section(page_to_pfn((struct page *)start));
tail_page = vmemmap_shared_tail_page(section_order(ms), device_zone(node));
if (!tail_page)
return -ENOMEM;
--
2.54.0
* [PATCH v2 46/69] mm/sparse-vmemmap: Map shared vmemmap tail pages read-only
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (44 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 45/69] powerpc/mm: Make vmemmap_populate_compound_pages() static Muchun Song
@ 2026-05-13 13:05 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
2026-05-13 17:46 ` [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Andrew Morton
47 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:05 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Shared vmemmap tail pages are now installed through
vmemmap_pte_populate().
Map those shared pages with PAGE_KERNEL_RO so writes to shared tail
vmemmap entries fault immediately instead of silently corrupting shared
metadata.
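As a rough illustration (not taken from this patch; 'p' is just a
hypothetical pointer), a stray store through a struct page backed by one
of the shared, now read-only, tail vmemmap pages traps instead of
silently rewriting metadata shared with every other tail page:
	/* p: a tail struct page of an optimized folio, backed by the shared page */
	p->private = 0;	/* store hits a PAGE_KERNEL_RO PTE and faults */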
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/sparse-vmemmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 182d0c7dd1e7..9811c92ad258 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -158,7 +158,8 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
if (slab_is_available())
get_page(page);
- entry = pfn_pte(page_to_pfn(page), PAGE_KERNEL);
+ /* Map shared tail page read-only to catch illegal writes. */
+ entry = pfn_pte(page_to_pfn(page), PAGE_KERNEL_RO);
} else {
void *vaddr = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
--
2.54.0
* [PATCH v2 47/69] powerpc/mm: Map shared vmemmap tail pages read-only
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (45 preceding siblings ...)
2026-05-13 13:05 ` [PATCH v2 46/69] mm/sparse-vmemmap: Map shared vmemmap tail pages read-only Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 48/69] mm/sparse-vmemmap: Inline vmemmap_populate_address() into its caller Muchun Song
` (21 more replies)
2026-05-13 17:46 ` [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Andrew Morton
47 siblings, 22 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Shared vmemmap tail pages can also be installed through the powerpc
radix vmemmap populate path.
Map reused tail pages with PAGE_KERNEL_RO so writes to shared tail
vmemmap entries fault immediately instead of silently corrupting shared
metadata.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 18b24bb891b7..4c3d027c823c 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1053,7 +1053,8 @@ static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmdp, unsigned long
}
VM_BUG_ON(!PAGE_ALIGNED(addr));
- entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
+ entry = pfn_pte(__pa(p) >> PAGE_SHIFT,
+ reuse ? PAGE_KERNEL_RO : PAGE_KERNEL);
set_pte_at(&init_mm, addr, pte, entry);
asm volatile("ptesync": : :"memory");
}
--
2.54.0
* [PATCH v2 48/69] mm/sparse-vmemmap: Inline vmemmap_populate_address() into its caller
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 49/69] mm/hugetlb_vmemmap: Remove vmemmap_wrprotect_hvo() Muchun Song
` (20 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
vmemmap_populate_address() no longer has any callers that need the
returned PTE. Its only remaining user just checks whether the call
succeeded.
Inline it back into vmemmap_populate_basepages() and return -ENOMEM
directly on failure.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/sparse-vmemmap.c | 46 +++++++++++++++++++--------------------------
1 file changed, 19 insertions(+), 27 deletions(-)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 9811c92ad258..5d5cd5f73365 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -234,8 +234,8 @@ static pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
return pgd;
}
-static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
- struct vmem_altmap *altmap)
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+ int node, struct vmem_altmap *altmap)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -243,32 +243,24 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
pmd_t *pmd;
pte_t *pte;
- pgd = vmemmap_pgd_populate(addr, node);
- if (!pgd)
- return NULL;
- p4d = vmemmap_p4d_populate(pgd, addr, node);
- if (!p4d)
- return NULL;
- pud = vmemmap_pud_populate(p4d, addr, node);
- if (!pud)
- return NULL;
- pmd = vmemmap_pmd_populate(pud, addr, node);
- if (!pmd)
- return NULL;
- pte = vmemmap_pte_populate(pmd, addr, node, altmap);
- if (!pte)
- return NULL;
- vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
-
- return pte;
-}
-
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
- int node, struct vmem_altmap *altmap)
-{
- for (; start < end; start += PAGE_SIZE)
- if (!vmemmap_populate_address(start, node, altmap))
+ for (; start < end; start += PAGE_SIZE) {
+ pgd = vmemmap_pgd_populate(start, node);
+ if (!pgd)
+ return -ENOMEM;
+ p4d = vmemmap_p4d_populate(pgd, start, node);
+ if (!p4d)
return -ENOMEM;
+ pud = vmemmap_pud_populate(p4d, start, node);
+ if (!pud)
+ return -ENOMEM;
+ pmd = vmemmap_pmd_populate(pud, start, node);
+ if (!pmd)
+ return -ENOMEM;
+ pte = vmemmap_pte_populate(pmd, start, node, altmap);
+ if (!pte)
+ return -ENOMEM;
+ vmemmap_verify(pte, node, start, start + PAGE_SIZE);
+ }
return 0;
}
--
2.54.0
* [PATCH v2 49/69] mm/hugetlb_vmemmap: Remove vmemmap_wrprotect_hvo()
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
2026-05-13 13:20 ` [PATCH v2 48/69] mm/sparse-vmemmap: Inline vmemmap_populate_address() into its caller Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 50/69] mm/sparse: Simplify section_nr_vmemmap_pages() Muchun Song
` (19 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Shared vmemmap tail pages are now mapped read-only when their PTEs are
installed, so HugeTLB bootmem optimization no longer needs a separate
write-protect pass afterwards.
Remove vmemmap_wrprotect_hvo() and the bootmem-specific HugeTLB wrapper,
and let bootmem folios use the normal hugetlb_vmemmap_optimize_folios()
path.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mm.h | 2 --
mm/hugetlb.c | 2 +-
mm/hugetlb_vmemmap.c | 45 +++++++++-----------------------------------
mm/hugetlb_vmemmap.h | 6 ------
mm/sparse-vmemmap.c | 23 ----------------------
5 files changed, 10 insertions(+), 68 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 86d7cecb834e..5e38c9a16a0a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4863,8 +4863,6 @@ int vmemmap_populate_hugepages(unsigned long start, unsigned long end,
int node, struct vmem_altmap *altmap);
int vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap);
-void vmemmap_wrprotect_hvo(unsigned long start, unsigned long end, int node,
- unsigned long headsize);
void vmemmap_populate_print_last(void);
struct page *vmemmap_shared_tail_page(unsigned int order, struct zone *zone);
#ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 74770c1648fc..54ef7d12c585 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3202,7 +3202,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
struct folio *folio, *tmp_f;
/* Send list for bulk vmemmap optimization processing */
- hugetlb_vmemmap_optimize_bootmem_folios(h, folio_list);
+ hugetlb_vmemmap_optimize_folios(h, folio_list);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index d24143dd6051..fce772e95adc 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -595,31 +595,22 @@ static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *fol
return vmemmap_remap_split(vmemmap_start, vmemmap_end);
}
-static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
- struct list_head *folio_list,
- bool boot)
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
struct folio *folio;
- int nr_to_optimize;
+ unsigned long nr_to_optimize = 0;
LIST_HEAD(vmemmap_pages);
unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
- nr_to_optimize = 0;
list_for_each_entry(folio, folio_list, lru) {
int ret;
- unsigned long spfn, epfn;
-
- if (boot && folio_test_hugetlb_vmemmap_optimized(folio)) {
- /*
- * Already optimized by pre-HVO, just map the
- * mirrored tail page structs RO.
- */
- spfn = (unsigned long)&folio->page;
- epfn = spfn + hugetlb_vmemmap_size(h);
- vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
- OPTIMIZED_FOLIO_VMEMMAP_SIZE);
+
+ /*
+ * Bootmem gigantic folios may already be marked optimized when
+ * their vmemmap layout was prepared earlier, so skip them here.
+ */
+ if (folio_test_hugetlb_vmemmap_optimized(folio))
continue;
- }
nr_to_optimize++;
@@ -636,14 +627,7 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
}
if (!nr_to_optimize)
- /*
- * All pre-HVO folios, nothing left to do. It's ok if
- * there is a mix of pre-HVO and not yet HVO-ed folios
- * here, as __hugetlb_vmemmap_optimize_folio() will
- * skip any folios that already have the optimized flag
- * set, see vmemmap_should_optimize_folio().
- */
- goto out;
+ return;
flush_tlb_all();
@@ -668,21 +652,10 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
}
}
-out:
flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
}
-void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
-{
- __hugetlb_vmemmap_optimize_folios(h, folio_list, false);
-}
-
-void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list)
-{
- __hugetlb_vmemmap_optimize_folios(h, folio_list, true);
-}
-
void __init hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
{
struct hstate *h = m->hstate;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 0d8c88997066..2b0a85e09602 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -17,7 +17,6 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
struct list_head *non_hvo_folios);
void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
-void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m);
static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
@@ -59,11 +58,6 @@ static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list
{
}
-static inline void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h,
- struct list_head *folio_list)
-{
-}
-
static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
{
return 0;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 5d5cd5f73365..ce1cf5cdf613 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -265,29 +265,6 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
return 0;
}
-/*
- * Write protect the mirrored tail page structs for HVO. This will be
- * called from the hugetlb code when gathering and initializing the
- * memblock allocated gigantic pages. The write protect can't be
- * done earlier, since it can't be guaranteed that the reserved
- * page structures will not be written to during initialization,
- * even if CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
- *
- * The PTEs are known to exist, and nothing else should be touching
- * these pages. The caller is responsible for any TLB flushing.
- */
-void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
- int node, unsigned long headsize)
-{
- unsigned long maddr;
- pte_t *pte;
-
- for (maddr = addr + headsize; maddr < end; maddr += PAGE_SIZE) {
- pte = virt_to_kpte(maddr);
- ptep_set_wrprotect(&init_mm, maddr, pte);
- }
-}
-
struct page __ref *vmemmap_shared_tail_page(unsigned int order, struct zone *zone)
{
void *addr;
--
2.54.0
* [PATCH v2 50/69] mm/sparse: Simplify section_nr_vmemmap_pages()
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
2026-05-13 13:20 ` [PATCH v2 48/69] mm/sparse-vmemmap: Inline vmemmap_populate_address() into its caller Muchun Song
2026-05-13 13:20 ` [PATCH v2 49/69] mm/hugetlb_vmemmap: Remove vmemmap_wrprotect_hvo() Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 51/69] mm/sparse-vmemmap: Introduce vmemmap_nr_struct_pages() Muchun Song
` (18 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
section_nr_vmemmap_pages() no longer needs altmap- or pgmap-specific
state to decide whether a section uses the optimized vmemmap layout.
Now that the optimization state is recorded in struct mem_section, use
section_vmemmap_optimizable() and section_order() directly and drop the
redundant arguments from the helper and its callers.
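As an illustrative example (x86-64 with a 64-byte struct page and 2MB
compound pages, i.e. order 9): a fully populated 128MB section covers
32768 base pages, so the unoptimized case still returns
DIV_ROUND_UP(32768 * 64, 4096) = 512 vmemmap pages, while the optimized
case returns OPTIMIZED_FOLIO_VMEMMAP_PAGES * 32768 / 512, i.e.
OPTIMIZED_FOLIO_VMEMMAP_PAGES vmemmap pages for each of the 64 compound
pages in the section, exactly as the order-based arithmetic below
computes.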
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 3 +--
mm/sparse-vmemmap.c | 7 +++----
mm/sparse.c | 19 ++++++-------------
3 files changed, 10 insertions(+), 19 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 18276cd15622..06022074ebcb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -997,8 +997,7 @@ static inline void __section_mark_present(struct mem_section *ms,
ms->section_mem_map |= SECTION_MARKED_PRESENT;
}
-int section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
+int section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages);
#else
static inline void sparse_memblocks_present(void) {}
static inline void sparse_init(void) {}
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index ce1cf5cdf613..793fd4ce1393 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -468,7 +468,7 @@ static struct page * __meminit populate_section_memmap(unsigned long pfn,
struct page *page = __populate_section_memmap(pfn, nr_pages, nid, altmap,
pgmap);
- memmap_pages_add(section_nr_vmemmap_pages(pfn, nr_pages, altmap, pgmap));
+ memmap_pages_add(section_nr_vmemmap_pages(pfn, nr_pages));
return page;
}
@@ -479,7 +479,7 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
unsigned long start = (unsigned long) pfn_to_page(pfn);
unsigned long end = start + nr_pages * sizeof(struct page);
- memmap_pages_add(-section_nr_vmemmap_pages(pfn, nr_pages, altmap, pgmap));
+ memmap_pages_add(-section_nr_vmemmap_pages(pfn, nr_pages));
vmemmap_free(start, end, altmap);
}
@@ -489,8 +489,7 @@ static void free_map_bootmem(struct page *memmap)
unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
unsigned long pfn = page_to_pfn(memmap);
- memmap_boot_pages_add(-section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION,
- NULL, NULL));
+ memmap_boot_pages_add(-section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION));
vmemmap_free(start, end, NULL);
}
diff --git a/mm/sparse.c b/mm/sparse.c
index 6878f8941b4c..3390cb82f114 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -237,32 +237,26 @@ void __weak __meminit vmemmap_populate_print_last(void)
{
}
-int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages)
{
- const struct mem_section *ms = __pfn_to_section(pfn);
- const unsigned int order = pgmap ? pgmap->vmemmap_shift : section_order(ms);
+ const unsigned int order = pfn_to_section_order(pfn);
const unsigned long pages_per_compound = 1UL << order;
- unsigned int vmemmap_pages = OPTIMIZED_FOLIO_VMEMMAP_PAGES;
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
- if (vmemmap_can_optimize(altmap, pgmap))
- vmemmap_pages = VMEMMAP_RESERVE_NR;
-
- if (!section_vmemmap_optimizable(ms))
+ if (!order_vmemmap_optimizable(order))
return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
if (order < PFN_SECTION_SHIFT) {
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
- return vmemmap_pages * nr_pages / pages_per_compound;
+ return OPTIMIZED_FOLIO_VMEMMAP_PAGES * nr_pages / pages_per_compound;
}
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
if (IS_ALIGNED(pfn, pages_per_compound))
- return vmemmap_pages;
+ return OPTIMIZED_FOLIO_VMEMMAP_PAGES;
return 0;
}
@@ -294,8 +288,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
nid, NULL, NULL);
if (!map)
panic("Failed to allocate memmap for section %lu\n", pnum);
- memmap_boot_pages_add(section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION,
- NULL, NULL));
+ memmap_boot_pages_add(section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION));
sparse_init_one_section(__nr_to_section(pnum), pnum, map, usage,
SECTION_IS_EARLY);
usage = (void *)usage + mem_section_usage_size();
--
2.54.0
* [PATCH v2 51/69] mm/sparse-vmemmap: Introduce vmemmap_nr_struct_pages()
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (2 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 50/69] mm/sparse: Simplify section_nr_vmemmap_pages() Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 52/69] powerpc/mm: Drop powerpc vmemmap_can_optimize() Muchun Song
` (17 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
compound_nr_pages() exposes sparse vmemmap optimization details to the
core memory initialization code.
Introduce vmemmap_nr_struct_pages() to report how many struct pages are
actually allocated and need initialization for an optimized vmemmap
mapping. This gives memmap_init_zone_device() the information it needs
without depending on sparse-vmemmap internals.
With this helper in place, drop compound_nr_pages() and keep the
vmemmap-specific logic inside sparse-vmemmap code.
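For instance (an illustrative case), a device-DAX range mapped with 2MB
compound pages has pfns_per_compound = 512; on an optimized section
vmemmap_nr_struct_pages() reports
OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES instead of 512, so
memmap_init_compound() only has to walk and initialize that many struct
pages per compound page rather than all 512.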
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 11 ++++++++++-
mm/mm_init.c | 21 +--------------------
mm/sparse.c | 13 ++++++-------
3 files changed, 17 insertions(+), 28 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 06022074ebcb..9597a703bc73 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -997,7 +997,16 @@ static inline void __section_mark_present(struct mem_section *ms,
ms->section_mem_map |= SECTION_MARKED_PRESENT;
}
-int section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages);
+int vmemmap_nr_struct_pages(unsigned long pfn, unsigned long nr_pages);
+
+static inline int section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages)
+{
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
+ VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
+
+ return DIV_ROUND_UP(vmemmap_nr_struct_pages(pfn, nr_pages) * sizeof(struct page),
+ PAGE_SIZE);
+}
#else
static inline void sparse_memblocks_present(void) {}
static inline void sparse_init(void) {}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9ff118e35641..4ea39392993b 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1062,25 +1062,6 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
}
}
-/*
- * With compound page geometry and when struct pages are stored in ram most
- * tail pages are reused. Consequently, the amount of unique struct pages to
- * initialize is a lot smaller that the total amount of struct pages being
- * mapped. This is a paired / mild layering violation with explicit knowledge
- * of how the sparse_vmemmap internals handle compound pages in the lack
- * of an altmap.
- */
-static inline unsigned long compound_nr_pages(unsigned long pfn,
- struct dev_pagemap *pgmap)
-{
- const struct mem_section *ms = __pfn_to_section(pfn);
-
- if (!section_vmemmap_optimizable(ms))
- return pgmap_vmemmap_nr(pgmap);
-
- return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
-}
-
static void __ref memmap_init_compound(struct page *head,
unsigned long head_pfn,
unsigned long zone_idx, int nid,
@@ -1145,7 +1126,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
continue;
memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
- compound_nr_pages(pfn, pgmap));
+ vmemmap_nr_struct_pages(pfn, pfns_per_compound));
}
pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false, false);
diff --git a/mm/sparse.c b/mm/sparse.c
index 3390cb82f114..f314b9babc4a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -237,26 +237,25 @@ void __weak __meminit vmemmap_populate_print_last(void)
{
}
-int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages)
+int __meminit vmemmap_nr_struct_pages(unsigned long pfn, unsigned long nr_pages)
{
const unsigned int order = pfn_to_section_order(pfn);
const unsigned long pages_per_compound = 1UL << order;
- VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
- VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
-
if (!order_vmemmap_optimizable(order))
- return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
+ return nr_pages;
if (order < PFN_SECTION_SHIFT) {
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
- return OPTIMIZED_FOLIO_VMEMMAP_PAGES * nr_pages / pages_per_compound;
+ return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES * nr_pages / pages_per_compound;
}
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
+ /* Ensure the requested range does not cross a compound page boundary. */
+ VM_WARN_ON_ONCE((pfn % pages_per_compound) + nr_pages > pages_per_compound);
if (IS_ALIGNED(pfn, pages_per_compound))
- return OPTIMIZED_FOLIO_VMEMMAP_PAGES;
+ return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES;
return 0;
}
--
2.54.0
* [PATCH v2 52/69] powerpc/mm: Drop powerpc vmemmap_can_optimize()
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (3 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 51/69] mm/sparse-vmemmap: Introduce vmemmap_nr_struct_pages() Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 53/69] mm/sparse-vmemmap: Drop vmemmap_can_optimize() Muchun Song
` (16 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
PowerPC no longer needs an architecture-specific vmemmap_can_optimize()
override for DAX vmemmap optimization.
Whether the optimized mapping can be used is now decided in the
architecture-specific vmemmap_populate() path. When PowerPC has to fall
back, such as on the Hash MMU, it simply clears the section order there,
which disables the optimization for that section.
Drop the radix-specific vmemmap_can_optimize() override and rely on the
generic checks instead.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/include/asm/book3s/64/radix.h | 5 -----
arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ----------
arch/powerpc/mm/init_64.c | 1 +
3 files changed, 1 insertion(+), 15 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 8452a2714cb1..df67209b0c5b 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -351,10 +351,5 @@ int radix__create_section_mapping(unsigned long start, unsigned long end,
int nid, pgprot_t prot);
int radix__remove_section_mapping(unsigned long start, unsigned long end);
#endif /* CONFIG_MEMORY_HOTPLUG */
-
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
-#define vmemmap_can_optimize vmemmap_can_optimize
-bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
-#endif
#endif /* __ASSEMBLER__ */
#endif
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 4c3d027c823c..2f8783b3f678 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -977,16 +977,6 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
return 0;
}
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
-bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
-{
- if (radix_enabled())
- return __vmemmap_can_optimize(altmap, pgmap);
-
- return false;
-}
-#endif
-
int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
unsigned long addr, unsigned long next)
{
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index b6f3ae03ca9e..8e18ed427fdd 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -283,6 +283,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
return radix__vmemmap_populate(start, end, node, altmap);
#endif
+ section_set_order(__pfn_to_section(page_to_pfn((struct page *)start)), 0);
return __vmemmap_populate(start, end, node, altmap);
}
--
2.54.0
* [PATCH v2 53/69] mm/sparse-vmemmap: Drop vmemmap_can_optimize()
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (4 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 52/69] powerpc/mm: Drop powerpc vmemmap_can_optimize() Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 54/69] mm/sparse-vmemmap: Drop @pgmap from vmemmap population APIs Muchun Song
` (15 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
vmemmap_can_optimize() no longer needs to gate section activation.
section_activate() can use pgmap->vmemmap_shift directly to record the
requested section order and leave support checks to the vmemmap
population path. That keeps the policy local to the code that actually
instantiates the mapping, instead of requiring callers to pre-filter
unsupported cases.
In particular, altmap-backed memmap allocation cannot support HVO, so
__populate_section_memmap() clears any inherited optimized section
order for full-section adds and rejects subsection re-adds. Unsupported
optimized mappings are therefore rejected where the vmemmap backing is
set up, and callers no longer have to care about that restriction.
With that handling in place, vmemmap_can_optimize() becomes redundant
and can be removed.
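Concretely (an illustrative scenario): a ZONE_DEVICE range is added as a
full section with a compound order and is later partially removed, which
leaves that order recorded in the mem_section. If a subsequent subsection
re-add arrives with an altmap, __populate_section_memmap() sees
nr_pages < PAGES_PER_SECTION together with section_vmemmap_optimizable()
and fails the add, while a full-section altmap add clears the stale order
and falls back to the regular vmemmap layout.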
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/mm.h | 34 ----------------------------------
mm/sparse-vmemmap.c | 14 +++++++++++++-
2 files changed, 13 insertions(+), 35 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e38c9a16a0a..5f45de90972d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4896,40 +4896,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif
-#define VMEMMAP_RESERVE_NR OPTIMIZED_FOLIO_VMEMMAP_PAGES
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
-static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
-{
- unsigned long nr_pages;
- unsigned long nr_vmemmap_pages;
-
- if (!pgmap || !is_power_of_2(sizeof(struct page)))
- return false;
-
- nr_pages = pgmap_vmemmap_nr(pgmap);
- nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT);
- /*
- * For vmemmap optimization with DAX we need minimum 2 vmemmap
- * pages. See layout diagram in Documentation/mm/vmemmap_dedup.rst
- */
- return !altmap && (nr_vmemmap_pages > VMEMMAP_RESERVE_NR);
-}
-/*
- * If we don't have an architecture override, use the generic rule
- */
-#ifndef vmemmap_can_optimize
-#define vmemmap_can_optimize __vmemmap_can_optimize
-#endif
-
-#else
-static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
-{
- return false;
-}
-#endif
-
enum mf_flags {
MF_COUNT_INCREASED = 1 << 0,
MF_ACTION_REQUIRED = 1 << 1,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 793fd4ce1393..549be01d90f8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -384,11 +384,23 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
{
unsigned long start = (unsigned long) pfn_to_page(pfn);
unsigned long end = start + nr_pages * sizeof(struct page);
+ struct mem_section *ms = __pfn_to_section(pfn);
if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
return NULL;
+ /* HVO is not supported now when memmap pages are backed by an altmap. */
+ if (altmap && section_vmemmap_optimizable(ms)) {
+ /*
+ * A subsection re-add can inherit order left by a partial
+ * remove after full add.
+ */
+ if (nr_pages < PAGES_PER_SECTION)
+ return NULL;
+ section_set_order(ms, 0);
+ }
+
return vmemmap_populate(start, end, nid, altmap) ? NULL : (void *)start;
}
@@ -613,7 +625,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
unsigned int order;
int rc;
- order = vmemmap_can_optimize(altmap, pgmap) ? pgmap->vmemmap_shift : 0;
+ order = pgmap ? pgmap->vmemmap_shift : 0;
/* All sub-sections within a section must share the same order. */
if (nr_pages < PAGES_PER_SECTION && section_order(ms) && section_order(ms) != order)
return ERR_PTR(-ENOTSUPP);
--
2.54.0
* [PATCH v2 54/69] mm/sparse-vmemmap: Drop @pgmap from vmemmap population APIs
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (5 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 53/69] mm/sparse-vmemmap: Drop vmemmap_can_optimize() Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 55/69] mm/sparse: Decouple section activation from ZONE_DEVICE Muchun Song
` (14 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
The vmemmap population and memory hotplug paths no longer need @pgmap
to decide whether a mapping can be optimized. That state is now carried
in mem_section, and the architecture-specific population code can make
the remaining decisions internally.
Drop the @pgmap parameter from the vmemmap population helpers and the
related memory hotplug interfaces, and remove the remaining
dev_pagemap-specific coupling from those call chains.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/arm64/mm/mmu.c | 5 ++---
arch/loongarch/mm/init.c | 5 ++---
arch/powerpc/include/asm/book3s/64/radix.h | 1 -
arch/powerpc/mm/mem.c | 5 ++---
arch/riscv/mm/init.c | 5 ++---
arch/s390/mm/init.c | 5 ++---
arch/x86/mm/init_64.c | 5 ++---
include/linux/memory_hotplug.h | 8 +++-----
include/linux/mm.h | 3 +--
mm/memory_hotplug.c | 13 ++++++------
mm/memremap.c | 4 ++--
mm/sparse-vmemmap.c | 23 ++++++++++------------
mm/sparse.c | 6 ++----
13 files changed, 36 insertions(+), 52 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e5a42b7a0160..dd85e093ffdb 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -2024,13 +2024,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
return ret;
}
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- __remove_pages(start_pfn, nr_pages, altmap, pgmap);
+ __remove_pages(start_pfn, nr_pages, altmap);
__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
}
diff --git a/arch/loongarch/mm/init.c b/arch/loongarch/mm/init.c
index 055ecd2c8fd9..3f9ab54114c5 100644
--- a/arch/loongarch/mm/init.c
+++ b/arch/loongarch/mm/init.c
@@ -119,8 +119,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
return ret;
}
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -129,7 +128,7 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
/* With altmap the first mapped page is offset from @start */
if (altmap)
page += vmem_altmap_offset(altmap);
- __remove_pages(start_pfn, nr_pages, altmap, pgmap);
+ __remove_pages(start_pfn, nr_pages, altmap);
}
#endif
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index df67209b0c5b..0c9195dd50c9 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -316,7 +316,6 @@ static inline int radix__has_transparent_pud_hugepage(void)
#endif
struct vmem_altmap;
-struct dev_pagemap;
extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
unsigned long page_size,
unsigned long phys);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 4c1afab91996..648d0c5602ec 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -158,13 +158,12 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
return rc;
}
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- __remove_pages(start_pfn, nr_pages, altmap, pgmap);
+ __remove_pages(start_pfn, nr_pages, altmap);
arch_remove_linear_mapping(start, size);
}
#endif
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 885f1db4e9bf..fa8d2f6f554b 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1742,10 +1742,9 @@ int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *param
return ret;
}
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
- __remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap, pgmap);
+ __remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
remove_linear_mapping(start, size);
flush_tlb_all();
}
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 11a689423440..1f72efc2a579 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -276,13 +276,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
return rc;
}
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- __remove_pages(start_pfn, nr_pages, altmap, pgmap);
+ __remove_pages(start_pfn, nr_pages, altmap);
vmem_remove_mapping(start, size);
}
#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 77b889b71cf3..df2261fa4f98 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1288,13 +1288,12 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
remove_pagetable(start, end, true, NULL);
}
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- __remove_pages(start_pfn, nr_pages, altmap, pgmap);
+ __remove_pages(start_pfn, nr_pages, altmap);
kernel_physical_mapping_remove(start, start + size);
}
#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7c9d66729c60..815e908c4135 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -135,10 +135,9 @@ static inline bool movable_node_is_enabled(void)
return movable_node_enabled;
}
-extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap);
+extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap);
extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
+ struct vmem_altmap *altmap);
/* reasonably generic interface to expand the physical pages */
extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
@@ -308,8 +307,7 @@ extern int sparse_add_section(int nid, unsigned long pfn,
unsigned long nr_pages, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap);
extern void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap);
+ struct vmem_altmap *altmap);
extern struct zone *zone_for_pfn_range(enum mmop online_type,
int nid, struct memory_group *group, unsigned long start_pfn,
unsigned long nr_pages);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5f45de90972d..87e98bdb0417 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4846,8 +4846,7 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
#endif
struct page * __populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap);
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
void *vmemmap_alloc_block(unsigned long size, int node);
struct vmem_altmap;
void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c9c69f827efa..5c60533677a1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -577,7 +577,6 @@ void remove_pfn_range_from_zone(struct zone *zone,
* @pfn: starting pageframe (must be aligned to start of a section)
* @nr_pages: number of pages to remove (must be multiple of section size)
* @altmap: alternative device page map or %NULL if default memmap is used
- * @pgmap: device page map or %NULL if not ZONE_DEVICE
*
* Generic helper function to remove section mappings and sysfs entries
* for the section of the memory we are removing. Caller needs to make
@@ -585,7 +584,7 @@ void remove_pfn_range_from_zone(struct zone *zone,
* calling offline_pages().
*/
void __remove_pages(unsigned long pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+ struct vmem_altmap *altmap)
{
const unsigned long end_pfn = pfn + nr_pages;
unsigned long cur_nr_pages;
@@ -600,7 +599,7 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages,
/* Select all remaining pages up to the next section boundary */
cur_nr_pages = min(end_pfn - pfn,
SECTION_ALIGN_UP(pfn + 1) - pfn);
- sparse_remove_section(pfn, cur_nr_pages, altmap, pgmap);
+ sparse_remove_section(pfn, cur_nr_pages, altmap);
}
}
@@ -1429,7 +1428,7 @@ static void remove_memory_blocks_and_altmaps(u64 start, u64 size)
remove_memory_block_devices(cur_start, memblock_size);
- arch_remove_memory(cur_start, memblock_size, altmap, NULL);
+ arch_remove_memory(cur_start, memblock_size, altmap);
/* Verify that all vmemmap pages have actually been freed. */
WARN(altmap->alloc, "Altmap not fully unmapped");
@@ -1472,7 +1471,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
ret = create_memory_block_devices(cur_start, memblock_size, nid,
params.altmap, group);
if (ret) {
- arch_remove_memory(cur_start, memblock_size, params.altmap, NULL);
+ arch_remove_memory(cur_start, memblock_size, params.altmap);
kfree(params.altmap);
goto out;
}
@@ -1558,7 +1557,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/* create memory block devices after memory was added */
ret = create_memory_block_devices(start, size, nid, NULL, group);
if (ret) {
- arch_remove_memory(start, size, params.altmap, NULL);
+ arch_remove_memory(start, size, params.altmap);
goto error;
}
}
@@ -2270,7 +2269,7 @@ static int try_remove_memory(u64 start, u64 size)
* No altmaps present, do the removal directly
*/
remove_memory_block_devices(start, size);
- arch_remove_memory(start, size, NULL, NULL);
+ arch_remove_memory(start, size, NULL);
} else {
/* all memblocks in the range have altmaps */
remove_memory_blocks_and_altmaps(start, size);
diff --git a/mm/memremap.c b/mm/memremap.c
index 81766d822400..053842d45cb1 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -97,10 +97,10 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
PHYS_PFN(range_len(range)));
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
__remove_pages(PHYS_PFN(range->start),
- PHYS_PFN(range_len(range)), NULL, pgmap);
+ PHYS_PFN(range_len(range)), NULL);
} else {
arch_remove_memory(range->start, range_len(range),
- pgmap_altmap(pgmap), pgmap);
+ pgmap_altmap(pgmap));
kasan_remove_zero_shadow(__va(range->start), range_len(range));
}
mem_hotplug_done();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 549be01d90f8..a807210fe9e1 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -379,8 +379,7 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
}
struct page * __meminit __populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
unsigned long start = (unsigned long) pfn_to_page(pfn);
unsigned long end = start + nr_pages * sizeof(struct page);
@@ -474,11 +473,9 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
}
static struct page * __meminit populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
- struct page *page = __populate_section_memmap(pfn, nr_pages, nid, altmap,
- pgmap);
+ struct page *page = __populate_section_memmap(pfn, nr_pages, nid, altmap);
memmap_pages_add(section_nr_vmemmap_pages(pfn, nr_pages));
@@ -486,7 +483,7 @@ static struct page * __meminit populate_section_memmap(unsigned long pfn,
}
static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+ struct vmem_altmap *altmap)
{
unsigned long start = (unsigned long) pfn_to_page(pfn);
unsigned long end = start + nr_pages * sizeof(struct page);
@@ -567,7 +564,7 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
* usage map, but still need to free the vmemmap range.
*/
static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+ struct vmem_altmap *altmap)
{
struct mem_section *ms = __pfn_to_section(pfn);
bool section_is_early = early_section(ms);
@@ -605,7 +602,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
* section_activate() and pfn_valid() .
*/
if (!section_is_early)
- depopulate_section_memmap(pfn, nr_pages, altmap, pgmap);
+ depopulate_section_memmap(pfn, nr_pages, altmap);
else if (memmap)
free_map_bootmem(memmap);
@@ -656,9 +653,9 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
return pfn_to_page(pfn);
section_set_order_range(pfn, nr_pages, order);
- memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
+ memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
if (!memmap) {
- section_deactivate(pfn, nr_pages, altmap, pgmap);
+ section_deactivate(pfn, nr_pages, altmap);
return ERR_PTR(-ENOMEM);
}
@@ -720,13 +717,13 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
}
void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+ struct vmem_altmap *altmap)
{
struct mem_section *ms = __pfn_to_section(pfn);
if (WARN_ON_ONCE(!valid_section(ms)))
return;
- section_deactivate(pfn, nr_pages, altmap, pgmap);
+ section_deactivate(pfn, nr_pages, altmap);
}
#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/mm/sparse.c b/mm/sparse.c
index f314b9babc4a..bdf23709a1c7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -224,8 +224,7 @@ size_t mem_section_usage_size(void)
#ifndef CONFIG_SPARSEMEM_VMEMMAP
struct page __init *__populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
unsigned long size = PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
@@ -283,8 +282,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
if (pnum >= pnum_end)
break;
- map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
- nid, NULL, NULL);
+ map = __populate_section_memmap(pfn, PAGES_PER_SECTION, nid, NULL);
if (!map)
panic("Failed to allocate memmap for section %lu\n", pnum);
memmap_boot_pages_add(section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION));
--
2.54.0
* [PATCH v2 55/69] mm/sparse: Decouple section activation from ZONE_DEVICE
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (6 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 54/69] mm/sparse-vmemmap: Drop @pgmap from vmemmap population APIs Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 56/69] mm: Redefine HVO as Hugepage Vmemmap Optimization Muchun Song
` (13 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
sparse_add_section()/section_activate() currently take struct
dev_pagemap only to obtain the compound page order.
Pass the order explicitly instead of routing it through a
ZONE_DEVICE-specific structure. This removes the dev_pagemap dependency
from the generic sparse memory population path and keeps the interface
usable for possible future callers outside ZONE_DEVICE.
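With the explicit parameter, the two expected call patterns look roughly
like this (illustrative only; 'nid', 'start_pfn' and 'ret' are
placeholders, and PMD_ORDER simply stands for the desired compound order,
i.e. what pgmap->vmemmap_shift provides in the hotplug path):
	/* plain hotplugged memory: no compound geometry, no altmap */
	ret = sparse_add_section(nid, start_pfn, PAGES_PER_SECTION, 0, NULL);
	/* compound (2MB) geometry, as a ZONE_DEVICE-style caller would request */
	ret = sparse_add_section(nid, start_pfn, PAGES_PER_SECTION,
				 PMD_ORDER, NULL);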
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/memory_hotplug.h | 4 ++--
mm/memory_hotplug.c | 4 ++--
mm/sparse-vmemmap.c | 14 ++++++--------
3 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 815e908c4135..083f0abea62d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -304,8 +304,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
unsigned long start_pfn,
unsigned long nr_pages);
extern int sparse_add_section(int nid, unsigned long pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap);
+ unsigned long nr_pages, unsigned int order,
+ struct vmem_altmap *altmap);
extern void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
struct vmem_altmap *altmap);
extern struct zone *zone_for_pfn_range(enum mmop online_type,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5c60533677a1..ef1595bdfd3a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -385,6 +385,7 @@ int __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
unsigned long cur_nr_pages;
int err;
struct vmem_altmap *altmap = params->altmap;
+ unsigned int order = params->pgmap ? params->pgmap->vmemmap_shift : 0;
if (WARN_ON_ONCE(!pgprot_val(params->pgprot)))
return -EINVAL;
@@ -412,8 +413,7 @@ int __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
/* Select all remaining pages up to the next section boundary */
cur_nr_pages = min(end_pfn - pfn,
SECTION_ALIGN_UP(pfn + 1) - pfn);
- err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
- params->pgmap);
+ err = sparse_add_section(nid, pfn, cur_nr_pages, order, altmap);
if (err)
break;
cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a807210fe9e1..667424aadd6b 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -613,16 +613,14 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
}
static struct page * __meminit section_activate(int nid, unsigned long pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+ unsigned long nr_pages, unsigned int order,
+ struct vmem_altmap *altmap)
{
struct mem_section *ms = __pfn_to_section(pfn);
struct mem_section_usage *usage = NULL;
struct page *memmap;
- unsigned int order;
int rc;
- order = pgmap ? pgmap->vmemmap_shift : 0;
/* All sub-sections within a section must share the same order. */
if (nr_pages < PAGES_PER_SECTION && section_order(ms) && section_order(ms) != order)
return ERR_PTR(-ENOTSUPP);
@@ -667,8 +665,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
* @nid: The node to add section on
* @start_pfn: start pfn of the memory range
* @nr_pages: number of pfns to add in the section
+ * @order: section order
* @altmap: alternate pfns to allocate the memmap backing store
- * @pgmap: alternate compound page geometry for devmap mappings
*
* This is only intended for hotplug.
*
@@ -682,8 +680,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
* * -ENOMEM - Out of memory.
*/
int __meminit sparse_add_section(int nid, unsigned long start_pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
+ unsigned long nr_pages, unsigned int order,
+ struct vmem_altmap *altmap)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
struct mem_section *ms;
@@ -694,7 +692,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
if (ret < 0)
return ret;
- memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
+ memmap = section_activate(nid, start_pfn, nr_pages, order, altmap);
if (IS_ERR(memmap))
return PTR_ERR(memmap);
--
2.54.0
* [PATCH v2 56/69] mm: Redefine HVO as Hugepage Vmemmap Optimization
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (7 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 55/69] mm/sparse: Decouple section activation from ZONE_DEVICE Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 57/69] mm/sparse-vmemmap: Consolidate HVO enable checks Muchun Song
` (12 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
HVO no longer refers only to HugeTLB vmemmap optimization. The same
optimization is now used more broadly for large compound-page mappings,
so the old expansion is too narrow.
Redefine HVO as Hugepage Vmemmap Optimization and update the generic
documentation, Kconfig text, and comments accordingly.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
Documentation/admin-guide/mm/hugetlbpage.rst | 4 ++--
Documentation/admin-guide/mm/memory-hotplug.rst | 2 +-
Documentation/admin-guide/sysctl/vm.rst | 3 ++-
Documentation/mm/vmemmap_dedup.rst | 2 +-
fs/Kconfig | 4 ++--
include/linux/mmzone.h | 2 +-
mm/Kconfig | 2 +-
mm/hugetlb_vmemmap.c | 2 +-
mm/hugetlb_vmemmap.h | 2 +-
mm/memory-failure.c | 6 +++---
11 files changed, 16 insertions(+), 15 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0eb64aab3685..2d4cfdcb7535 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2114,7 +2114,7 @@ Kernel parameters
hugetlb_free_vmemmap=
[KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
- Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
+ Control if Hugepage Vmemmap Optimization (HVO) for HugeTLB is enabled.
Allows heavy hugetlb users to free up some more
memory (7 * PAGE_SIZE for each 2MB hugetlb page).
Format: { on | off (default) }
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 67a941903fd2..3f98ca1d7ce1 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -172,8 +172,8 @@ default_hugepagesz
will all result in 256 2M huge pages being allocated. Valid default
huge page size is architecture dependent.
hugetlb_free_vmemmap
- When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HugeTLB
- Vmemmap Optimization (HVO).
+ When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables Hugepage
+ Vmemmap Optimization (HVO) for HugeTLB.
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 0207f8725142..d5e350607baa 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -682,7 +682,7 @@ block might fail:
ZONE_MOVABLE for increasing the reliability of gigantic page allocation
against the potential loss of hot-unplug reliability.
-- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
+- Out of memory when dissolving huge pages, especially when Hugepage Vmemmap
Optimization (HVO) is enabled.
Offlining code may be able to migrate huge page contents, but may not be able
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c..9f333970fdb2 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -665,7 +665,8 @@ This knob is not available when the size of 'struct page' (a structure defined
in include/linux/mm_types.h) is not power of two (an unusual system config could
result in this).
-Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).
+Enable (set to 1) or disable (set to 0) Hugepage Vmemmap Optimization (HVO) for
+HugeTLB.
Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 9fa8642ded48..44e80bd2e398 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -8,7 +8,7 @@ A vmemmap diet for HugeTLB and Device DAX
HugeTLB
=======
-This section is to explain how HugeTLB Vmemmap Optimization (HVO) works.
+This section is to explain how Hugepage Vmemmap Optimization (HVO) for HugeTLB works.
The ``struct page`` structures are used to describe a physical page frame. By
default, there is a one-to-one mapping from a page frame to its corresponding
diff --git a/fs/Kconfig b/fs/Kconfig
index f6cee1bbb1fc..496cfa2379e5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -261,11 +261,11 @@ menuconfig HUGETLBFS
if HUGETLBFS
config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
- bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
+ bool "Hugepage Vmemmap Optimization (HVO) for HugeTLB defaults to on"
default n
depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
help
- The HugeTLB Vmemmap Optimization (HVO) defaults to off. Say Y here to
+ The Hugepage Vmemmap Optimization (HVO) for HugeTLB defaults to off. Say Y here to
enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
(boot command line) or hugetlb_optimize_vmemmap (sysctl).
endif # HUGETLBFS
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7484e7be7b6d..efb37f2ffec4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -97,7 +97,7 @@
#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
/*
- * HugeTLB Vmemmap Optimization (HVO) requires struct pages of the head page to
+ * Hugepage Vmemmap Optimization (HVO) requires struct pages of the head page to
* be naturally aligned with regard to the folio size.
*
* HVO which is only active if the size of struct page is a power of 2.
diff --git a/mm/Kconfig b/mm/Kconfig
index ddd10cb4d0a3..c85ed7d7f37d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -416,7 +416,7 @@ config SPARSEMEM_VMEMMAP_OPTIMIZATION
#
# Select this config option from the architecture Kconfig, if it is preferred
-# to enable the feature of HugeTLB/dev_dax vmemmap optimization.
+# to enable the feature of HVO.
#
config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
bool
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index fce772e95adc..6f6f1740f540 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * HugeTLB Vmemmap Optimization (HVO)
+ * Hugepage Vmemmap Optimization (HVO) for HugeTLB
*
* Copyright (c) 2020, ByteDance. All rights reserved.
*
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 2b0a85e09602..b4d0ba27b42c 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * HugeTLB Vmemmap Optimization (HVO)
+ * Hugepage Vmemmap Optimization (HVO) for HugeTLB
*
* Copyright (c) 2020, ByteDance. All rights reserved.
*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 866c4428ac7e..ad6416145667 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -175,9 +175,9 @@ static int __page_handle_poison(struct page *page)
/*
* zone_pcp_disable() can't be used here. It will
* hold pcp_batch_high_lock and dissolve_free_hugetlb_folio() might hold
- * cpu_hotplug_lock via static_key_slow_dec() when hugetlb vmemmap
- * optimization is enabled. This will break current lock dependency
- * chain and leads to deadlock.
+ * cpu_hotplug_lock via static_key_slow_dec() when HVO for HugeTLB
+ * is enabled. This will break current lock dependency chain and leads
+ * to deadlock.
* Disabling pcp before dissolving the page was a deterministic
* approach because we made sure that those pages cannot end up in any
* PCP list. Draining PCP lists expels those pages to the buddy system,
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 57/69] mm/sparse-vmemmap: Consolidate HVO enable checks
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (8 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 56/69] mm: Redefine HVO as Hugepage Vmemmap Optimization Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 58/69] mm/hugetlb: Make HVO optimizable checks depend on generic logic Muchun Song
` (11 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
HVO depends on build-time conditions that are not fully expressible in
Kconfig, including whether sizeof(struct page) is a power of two and
whether the supported folio order range can use the optimized layout.
Those checks are currently duplicated in several places. Define
SPARSEMEM_VMEMMAP_OPTIMIZATION in bounds.c when the build-time
requirements are met, and use that generated constant to guard the
generic HVO code.
This centralizes the build-time checks instead of repeating them
throughout the HVO paths.
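For reference, here is a minimal userspace model of the generated gate; it is
not part of the patch, and the 64-byte struct page, 4 KiB base page and
MAX_FOLIO_ORDER of 18 are assumptions for a typical x86-64 build. In the
kernel the result is emitted by the DEFINE() in bounds.c into
<generated/bounds.h>:

  /* hvo_gate.c: standalone sketch of the bounds.c-time check */
  #include <stdbool.h>
  #include <stdio.h>

  #define PAGE_SIZE          4096UL
  #define STRUCT_PAGE_SIZE     64UL   /* assumption */
  #define MAX_FOLIO_ORDER      18U    /* assumption: 1 GiB folios, 4 KiB pages */

  static bool is_power_of_2(unsigned long n)
  {
      return n && !(n & (n - 1));
  }

  static unsigned int ilog2_ul(unsigned long n)
  {
      unsigned int r = 0;

      while (n >>= 1)
          r++;
      return r;
  }

  int main(void)
  {
      /* struct pages covered by the single vmemmap page that is kept */
      unsigned long nr_struct_pages = PAGE_SIZE / STRUCT_PAGE_SIZE;
      /* smallest folio order whose vmemmap exceeds that one page */
      unsigned int min_order = ilog2_ul(nr_struct_pages) + 1;
      bool gate = is_power_of_2(STRUCT_PAGE_SIZE) &&
                  MAX_FOLIO_ORDER >= min_order;

      printf("OPTIMIZABLE_FOLIO_MIN_ORDER = %u\n", min_order);
      printf("SPARSEMEM_VMEMMAP_OPTIMIZATION would%s be defined\n",
             gate ? "" : " not");
      return 0;
  }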
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/x86/entry/vdso/vdso32/fake_32bit_build.h | 1 -
drivers/dax/Kconfig | 2 +-
fs/Kconfig | 2 +-
include/linux/mm_types.h | 3 +-
include/linux/mmzone.h | 38 ++++++++-----------
include/linux/page-flags-layout.h | 2 +
include/linux/page-flags.h | 28 ++------------
kernel/bounds.c | 5 +++
mm/Kconfig | 2 +-
mm/hugetlb_vmemmap.c | 2 +
mm/hugetlb_vmemmap.h | 4 +-
mm/internal.h | 3 --
mm/sparse.c | 6 +--
mm/util.c | 2 +-
14 files changed, 38 insertions(+), 62 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso32/fake_32bit_build.h b/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
index 5f8424eade2b..db1b15f686e3 100644
--- a/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
+++ b/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
@@ -11,7 +11,6 @@
#undef CONFIG_PGTABLE_LEVELS
#undef CONFIG_ILLEGAL_POINTER_VALUE
#undef CONFIG_SPARSEMEM_VMEMMAP
-#undef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
#undef CONFIG_NR_CPUS
#undef CONFIG_PARAVIRT_XXL
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 60cb05dce53d..cb7710c29885 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -8,7 +8,7 @@ if DAX
config DEV_DAX
tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
- select SPARSEMEM_VMEMMAP_OPTIMIZATION if ARCH_WANT_OPTIMIZE_DAX_VMEMMAP && SPARSEMEM_VMEMMAP
+ select SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLE if ARCH_WANT_OPTIMIZE_DAX_VMEMMAP && SPARSEMEM_VMEMMAP
help
Support raw access to differentiated (persistence, bandwidth,
latency...) memory via an mmap(2) capable character
diff --git a/fs/Kconfig b/fs/Kconfig
index 496cfa2379e5..ab3937abe07f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -278,7 +278,7 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
def_bool HUGETLB_PAGE
depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP
- select SPARSEMEM_VMEMMAP_OPTIMIZATION
+ select SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLE
config HUGETLB_PMD_PAGE_TABLE_SHARING
def_bool HUGETLB_PAGE
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..9a7cd7575f3a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -546,6 +546,7 @@ FOLIO_MATCH(flags, _flags_3);
FOLIO_MATCH(compound_info, _head_3);
#undef FOLIO_MATCH
+#ifndef __GENERATING_BOUNDS_H
/**
* struct ptdesc - Memory descriptor for page tables.
* @pt_flags: enum pt_flags plus zone/node/section.
@@ -1990,5 +1991,5 @@ static inline unsigned long mmf_init_legacy_flags(unsigned long flags)
(1UL << MMF_HAS_MDWE_NO_INHERIT));
return flags & MMF_INIT_LEGACY_MASK;
}
-
+#endif /* __GENERATING_BOUNDS_H */
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index efb37f2ffec4..0d49d6e163ff 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -3,8 +3,6 @@
#define _LINUX_MMZONE_H
#ifndef __ASSEMBLY__
-#ifndef __GENERATING_BOUNDS_H
-
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/list_nulls.h>
@@ -96,33 +94,32 @@
#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
-/*
- * Hugepage Vmemmap Optimization (HVO) requires struct pages of the head page to
- * be naturally aligned with regard to the folio size.
- *
- * HVO which is only active if the size of struct page is a power of 2.
- */
-#define MAX_FOLIO_VMEMMAP_ALIGN \
- (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION) && \
- is_power_of_2(sizeof(struct page)) ? \
- MAX_FOLIO_NR_PAGES * sizeof(struct page) : 0)
-
/* The number of vmemmap pages required by a vmemmap-optimized folio. */
#define OPTIMIZED_FOLIO_VMEMMAP_PAGES 1
#define OPTIMIZED_FOLIO_VMEMMAP_SIZE (OPTIMIZED_FOLIO_VMEMMAP_PAGES * PAGE_SIZE)
#define OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES (OPTIMIZED_FOLIO_VMEMMAP_SIZE / sizeof(struct page))
#define OPTIMIZABLE_FOLIO_MIN_ORDER (ilog2(OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES) + 1)
-#define __NR_OPTIMIZABLE_FOLIO_ORDERS (MAX_FOLIO_ORDER - OPTIMIZABLE_FOLIO_MIN_ORDER + 1)
-#define NR_OPTIMIZABLE_FOLIO_ORDERS \
- ((__NR_OPTIMIZABLE_FOLIO_ORDERS > 0 && \
- IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION)) ? __NR_OPTIMIZABLE_FOLIO_ORDERS : 0)
+#ifdef SPARSEMEM_VMEMMAP_OPTIMIZATION
+/*
+ * Hugepage Vmemmap Optimization (HVO) requires the struct page of the head page
+ * to be naturally aligned with regard to the vmemmap size of the maximal folio.
+ */
+#define MAX_FOLIO_VMEMMAP_ALIGN (MAX_FOLIO_NR_PAGES * sizeof(struct page))
+#define NR_OPTIMIZABLE_FOLIO_ORDERS (MAX_FOLIO_ORDER - OPTIMIZABLE_FOLIO_MIN_ORDER + 1)
+#else
+#define MAX_FOLIO_VMEMMAP_ALIGN 0
+#define NR_OPTIMIZABLE_FOLIO_ORDERS 0
+#endif
static inline bool order_vmemmap_optimizable(unsigned int order)
{
+ if (!IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION))
+ return false;
return order >= OPTIMIZABLE_FOLIO_MIN_ORDER;
}
+#ifndef __GENERATING_BOUNDS_H
enum migratetype {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
@@ -2044,7 +2041,7 @@ struct mem_section {
*/
struct page_ext *page_ext;
#endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
+#ifdef SPARSEMEM_VMEMMAP_OPTIMIZATION
/*
* The order of compound pages in this section. Typically, the section
* holds compound pages of this order; a larger compound page will span
@@ -2236,7 +2233,7 @@ static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long
}
#endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
+#ifdef SPARSEMEM_VMEMMAP_OPTIMIZATION
static inline void section_set_order(struct mem_section *section, unsigned int order)
{
VM_WARN_ON(section->order && order && section->order != order);
@@ -2277,9 +2274,6 @@ static inline unsigned int pfn_to_section_order(unsigned long pfn)
static inline bool section_vmemmap_optimizable(const struct mem_section *section)
{
- if (!is_power_of_2(sizeof(struct page)))
- return false;
-
return order_vmemmap_optimizable(section_order(section));
}
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 760006b1c480..6a7e7f3dbb93 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -2,6 +2,7 @@
#ifndef PAGE_FLAGS_LAYOUT_H
#define PAGE_FLAGS_LAYOUT_H
+#ifndef __GENERATING_BOUNDS_H
#include <linux/numa.h>
#include <generated/bounds.h>
@@ -121,4 +122,5 @@
(NR_NON_PAGEFLAG_BITS + NR_PAGEFLAGS))
#endif
+#endif /* __GENERATING_BOUNDS_H */
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 12665b34586c..df7f6dea2e5b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,32 +198,12 @@ enum pageflags {
#ifndef __GENERATING_BOUNDS_H
-/*
- * For tail pages, if the size of struct page is power-of-2 ->compound_info
- * encodes the mask that converts the address of the tail page address to
- * the head page address.
- *
- * Otherwise, ->compound_info has direct pointer to head pages.
- */
-static __always_inline bool compound_info_has_mask(void)
-{
- /*
- * The approach with mask would work in the wider set of conditions,
- * but it requires validating that struct pages are naturally aligned
- * for all orders up to the MAX_FOLIO_ORDER, which can be tricky.
- */
- if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION))
- return false;
-
- return is_power_of_2(sizeof(struct page));
-}
-
static __always_inline unsigned long _compound_head(const struct page *page)
{
unsigned long info = READ_ONCE(page->compound_info);
unsigned long mask;
- if (!compound_info_has_mask()) {
+ if (!IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION)) {
/* Bit 0 encodes PageTail() */
if (info & 1)
return info - 1;
@@ -232,8 +212,8 @@ static __always_inline unsigned long _compound_head(const struct page *page)
}
/*
- * If compound_info_has_mask() is true the rest of the info encodes
- * the mask that converts the address of the tail page to the head page.
+ * If HVO is enabled the rest of the info encodes the mask that converts
+ * the address of the tail page to the head page.
*
* No need to clear bit 0 in the mask as 'page' always has it clear.
*
@@ -257,7 +237,7 @@ static __always_inline void set_compound_head(struct page *tail,
unsigned int shift;
unsigned long mask;
- if (!compound_info_has_mask()) {
+ if (!IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION)) {
WRITE_ONCE(tail->compound_info, (unsigned long)head | 1);
return;
}
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 02b619eb6106..9638260d67f8 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -8,6 +8,7 @@
#define __GENERATING_BOUNDS_H
#define COMPILE_OFFSETS
/* Include headers that define the enum constants of interest */
+#include <linux/mm_types.h>
#include <linux/page-flags.h>
#include <linux/mmzone.h>
#include <linux/kbuild.h>
@@ -30,6 +31,10 @@ int main(void)
DEFINE(LRU_GEN_WIDTH, 0);
DEFINE(__LRU_REFS_WIDTH, 0);
#endif
+ if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLE) &&
+ is_power_of_2(sizeof(struct page)) &&
+ MAX_FOLIO_ORDER >= OPTIMIZABLE_FOLIO_MIN_ORDER)
+ DEFINE(SPARSEMEM_VMEMMAP_OPTIMIZATION, 1);
/* End of constants */
return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index c85ed7d7f37d..52d9d69a95ff 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,7 +410,7 @@ config SPARSEMEM_VMEMMAP
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.
-config SPARSEMEM_VMEMMAP_OPTIMIZATION
+config SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLE
bool
depends on SPARSEMEM_VMEMMAP
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 6f6f1740f540..1305bee1195a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -22,6 +22,7 @@
#include "hugetlb_vmemmap.h"
#include "internal.h"
+#ifdef SPARSEMEM_VMEMMAP_OPTIMIZATION
/**
* struct vmemmap_remap_walk - walk vmemmap page table
*
@@ -693,3 +694,4 @@ static int __init hugetlb_vmemmap_init(void)
return 0;
}
late_initcall(hugetlb_vmemmap_init);
+#endif
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index b4d0ba27b42c..dfd48be6b231 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -10,7 +10,7 @@
#define _LINUX_HUGETLB_VMEMMAP_H
#include <linux/hugetlb.h>
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+#if defined(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) && defined(SPARSEMEM_VMEMMAP_OPTIMIZATION)
int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);
long hugetlb_vmemmap_restore_folios(const struct hstate *h,
struct list_head *folio_list,
@@ -32,8 +32,6 @@ static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate
{
int size = hugetlb_vmemmap_size(h) - OPTIMIZED_FOLIO_VMEMMAP_SIZE;
- if (!is_power_of_2(sizeof(struct page)))
- return 0;
return size > 0 ? size : 0;
}
#else
diff --git a/mm/internal.h b/mm/internal.h
index 9597a703bc73..afdae79640b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1023,9 +1023,6 @@ static inline bool vmemmap_page_optimizable(const struct page *page)
unsigned long pfn = page_to_pfn(page);
unsigned long nr_pages = 1UL << pfn_to_section_order(pfn);
- if (!is_power_of_2(sizeof(struct page)))
- return false;
-
return (pfn & (nr_pages - 1)) >= OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES;
}
#else
diff --git a/mm/sparse.c b/mm/sparse.c
index bdf23709a1c7..598da1651e49 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -301,10 +301,8 @@ void __init sparse_init(void)
unsigned long pnum_end, pnum_begin, map_count = 1;
int nid_begin;
- if (compound_info_has_mask()) {
- VM_WARN_ON_ONCE(!IS_ALIGNED((unsigned long) pfn_to_page(0),
- MAX_FOLIO_VMEMMAP_ALIGN));
- }
+ VM_WARN_ON(IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION) &&
+ !IS_ALIGNED((unsigned long)pfn_to_page(0), MAX_FOLIO_VMEMMAP_ALIGN));
pnum_begin = first_present_section_nr();
nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
diff --git a/mm/util.c b/mm/util.c
index 3cc949a0b7ed..4543f2b6ffa1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1338,7 +1338,7 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
foliop = (struct folio *)page;
} else {
/* See compound_head() */
- if (compound_info_has_mask()) {
+ if (IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION)) {
unsigned long p = (unsigned long)page;
foliop = (struct folio *)(p & info);
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 58/69] mm/hugetlb: Make HVO optimizable checks depend on generic logic
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (9 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 57/69] mm/sparse-vmemmap: Consolidate HVO enable checks Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 59/69] mm/sparse-vmemmap: Localize init_compound_tail() Muchun Song
` (10 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Make hugetlb_vmemmap_optimizable() reuse the generic
order_vmemmap_optimizable() logic, and switch HugeTLB call sites that only
need a yes/no answer to the dedicated helper instead of testing
hugetlb_vmemmap_optimizable_size().
This keeps HugeTLB-specific optimizable checks aligned with the generic
vmemmap optimization rules and avoids open-coding the size-based test.
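For intuition, a small userspace sketch of the size arithmetic behind the old
test; this is not kernel code, and it assumes a 64-byte struct page and 4 KiB
base pages. With those assumptions it reproduces the 7-page (2MB) and
4095-page (1GB) savings quoted in the existing sysctl documentation:

  /* hvo_savings.c: model of hugetlb_vmemmap_optimizable_size() */
  #include <stdio.h>

  #define PAGE_SIZE                     4096UL
  #define STRUCT_PAGE_SIZE                64UL   /* assumption */
  #define OPTIMIZED_FOLIO_VMEMMAP_SIZE  PAGE_SIZE

  static unsigned long optimizable_size(unsigned long huge_page_size)
  {
      unsigned long pages_per_huge_page = huge_page_size / PAGE_SIZE;
      unsigned long vmemmap_size = pages_per_huge_page * STRUCT_PAGE_SIZE;

      return vmemmap_size > OPTIMIZED_FOLIO_VMEMMAP_SIZE ?
             vmemmap_size - OPTIMIZED_FOLIO_VMEMMAP_SIZE : 0;
  }

  int main(void)
  {
      printf("2MB hstate: %lu vmemmap pages freed\n",
             optimizable_size(2UL << 20) / PAGE_SIZE);    /* 7 */
      printf("1GB hstate: %lu vmemmap pages freed\n",
             optimizable_size(1UL << 30) / PAGE_SIZE);    /* 4095 */
      return 0;
  }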
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 4 ++--
mm/hugetlb_vmemmap.h | 43 ++++++++++++++++++++---------------------
3 files changed, 24 insertions(+), 25 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 82dbb9ebead8..2383adb22ce1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -778,7 +778,7 @@ static inline unsigned long huge_page_mask(struct hstate *h)
return h->mask;
}
-static inline unsigned int huge_page_order(struct hstate *h)
+static inline unsigned int huge_page_order(const struct hstate *h)
{
return h->order;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54ef7d12c585..bd136fc6aec0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3351,7 +3351,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
&node_states[N_MEMORY], NULL);
if (!folio && !list_empty(&folio_list) &&
- hugetlb_vmemmap_optimizable_size(h)) {
+ hugetlb_vmemmap_optimizable(h)) {
prep_and_add_allocated_folios(h, &folio_list);
INIT_LIST_HEAD(&folio_list);
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
@@ -3420,7 +3420,7 @@ static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned l
for (i = 0; i < num; ++i) {
struct folio *folio;
- if (hugetlb_vmemmap_optimizable_size(h) &&
+ if (hugetlb_vmemmap_optimizable(h) &&
(si_mem_available() == 0) && !list_empty(&folio_list)) {
prep_and_add_allocated_folios(h, &folio_list);
INIT_LIST_HEAD(&folio_list);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index dfd48be6b231..1765f8274220 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -18,22 +18,6 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m);
-
-static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
-{
- return pages_per_huge_page(h) * sizeof(struct page);
-}
-
-/*
- * Return how many vmemmap size associated with a HugeTLB page that can be
- * optimized and can be freed to the buddy allocator.
- */
-static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
-{
- int size = hugetlb_vmemmap_size(h) - OPTIMIZED_FOLIO_VMEMMAP_SIZE;
-
- return size > 0 ? size : 0;
-}
#else
static inline int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
{
@@ -56,11 +40,6 @@ static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list
{
}
-static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
-{
- return 0;
-}
-
static inline void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
{
}
@@ -68,6 +47,26 @@ static inline void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_pag
static inline bool hugetlb_vmemmap_optimizable(const struct hstate *h)
{
- return hugetlb_vmemmap_optimizable_size(h) != 0;
+ if (!IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP))
+ return false;
+
+ return order_vmemmap_optimizable(huge_page_order(h));
+}
+
+static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
+{
+ return pages_per_huge_page(h) * sizeof(struct page);
+}
+
+/*
+ * Return the size of the vmemmap area associated with a HugeTLB page
+ * that can be optimized.
+ */
+static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
+{
+ if (!hugetlb_vmemmap_optimizable(h))
+ return 0;
+
+ return hugetlb_vmemmap_size(h) - OPTIMIZED_FOLIO_VMEMMAP_SIZE;
}
#endif /* _LINUX_HUGETLB_VMEMMAP_H */
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 59/69] mm/sparse-vmemmap: Localize init_compound_tail()
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (10 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 58/69] mm/hugetlb: Make HVO optimizable checks depend on generic logic Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 60/69] mm/mm_init: Check zone consistency on optimized vmemmap sections Muchun Song
` (9 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
init_compound_tail() is only used in mm/sparse-vmemmap.c, so there is no
need to keep it in mm/internal.h.
The helper is only used under SPARSEMEM_VMEMMAP_OPTIMIZATION, where passing
NULL as the compound head is valid. Keeping it visible outside that file
makes the NULL-head usage look more generally applicable than it really is,
which invites misuse.
Move it into mm/sparse-vmemmap.c so the helper stays tied to the only
context where its NULL head argument is valid.
No functional change intended.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 9 ---------
mm/sparse-vmemmap.c | 12 +++++++++++-
2 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index afdae79640b5..aff7cebb1da4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -907,15 +907,6 @@ static inline void prep_compound_tail(struct page *tail,
set_page_private(tail, 0);
}
-static inline void init_compound_tail(struct page *tail,
- const struct page *head, unsigned int order, struct zone *zone)
-{
- atomic_set(&tail->_mapcount, -1);
- set_page_node(tail, zone_to_nid(zone));
- set_page_zone(tail, zone_idx(zone));
- prep_compound_tail(tail, head, order);
-}
-
void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
extern bool free_pages_prepare(struct page *page, unsigned int order);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 667424aadd6b..38777e4952e1 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -265,6 +265,16 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
return 0;
}
+static void init_compound_tail(struct page *page, unsigned int order, struct zone *zone)
+{
+ BUILD_BUG_ON(!IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION));
+
+ atomic_set(&page->_mapcount, -1);
+ set_page_node(page, zone_to_nid(zone));
+ set_page_zone(page, zone_idx(zone));
+ prep_compound_tail(page, NULL, order);
+}
+
struct page __ref *vmemmap_shared_tail_page(unsigned int order, struct zone *zone)
{
void *addr;
@@ -286,7 +296,7 @@ struct page __ref *vmemmap_shared_tail_page(unsigned int order, struct zone *zon
page = (struct page *)addr + i;
if (zone_is_zone_device(zone))
__SetPageReserved(page);
- init_compound_tail(page, NULL, order, zone);
+ init_compound_tail(page, order, zone);
}
page = virt_to_page(addr);
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 60/69] mm/mm_init: Check zone consistency on optimized vmemmap sections
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (11 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 59/69] mm/sparse-vmemmap: Localize init_compound_tail() Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 61/69] mm/hugetlb: Drop boot-time HVO handling for gigantic folios Muchun Song
` (8 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
For vmemmap-optimized sections, the shared tail struct pages are reused
across compound pages and should already carry the expected zone and
node.
Warn in __init_single_page() if such a shared tail page is seen with a
different zone or node, which would indicate inconsistent initialization.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/mm_init.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4ea39392993b..95422e92ede8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -605,6 +605,9 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
if (!is_highmem_idx(zone))
set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
+ VM_WARN_ON_ONCE(order_vmemmap_optimizable(pfn_to_section_order(pfn)) &&
+ page_zone_id(page + OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES) !=
+ page_zone_id(page));
}
#ifdef CONFIG_NUMA
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 61/69] mm/hugetlb: Drop boot-time HVO handling for gigantic folios
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (12 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 60/69] mm/mm_init: Check zone consistency on optimized vmemmap sections Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 62/69] mm/hugetlb: Simplify hugetlb_folio_init_vmemmap() Muchun Song
` (7 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
HugeTLB HVO is currently supported on x86-64, riscv64, and LoongArch.
On x86-64 and riscv64, gigantic HugeTLB pages are larger than the
section size, so the existing section-based vmemmap optimization
infrastructure is already sufficient to cover the whole folio. On
LoongArch, HugeTLB HVO is supported without gigantic HugeTLB pages.
Therefore, boot-time HugeTLB HVO folios can rely on the section-based
vmemmap optimization infrastructure directly, without the extra bulk
optimization and fallback handling.
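As a back-of-the-envelope check (illustrative only; PAGE_SHIFT = 12 and
SECTION_SIZE_BITS = 27 are the usual x86-64 values and are config dependent),
a gigantic folio is strictly larger than a section, so the section-based path
always sees it as a run of whole sections:

  #include <stdio.h>

  #define PAGE_SHIFT          12
  #define SECTION_SIZE_BITS   27                                /* assumption */
  #define PFN_SECTION_SHIFT   (SECTION_SIZE_BITS - PAGE_SHIFT)  /* 15 */
  #define GIGANTIC_ORDER      (30 - PAGE_SHIFT)                 /* 18: 1 GiB */

  int main(void)
  {
      printf("section order %d, gigantic folio order %d\n",
             PFN_SECTION_SHIFT, GIGANTIC_ORDER);
      printf("a gigantic folio spans %d full sections\n",
             1 << (GIGANTIC_ORDER - PFN_SECTION_SHIFT));        /* 8 */
      return 0;
  }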
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 25 ++++++-------------------
mm/hugetlb_vmemmap.c | 21 ++++++---------------
mm/internal.h | 25 +++++++++++++++++++++++--
mm/sparse.c | 23 -----------------------
4 files changed, 35 insertions(+), 59 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bd136fc6aec0..3cb8fffb9e3e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3201,21 +3201,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
unsigned long flags;
struct folio *folio, *tmp_f;
- /* Send list for bulk vmemmap optimization processing */
- hugetlb_vmemmap_optimize_folios(h, folio_list);
-
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
- if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
- /*
- * If HVO fails, initialize all tail struct pages
- * We do not worry about potential long lock hold
- * time as this is early in boot and there should
- * be no contention.
- */
- hugetlb_folio_init_tail_vmemmap(folio, h,
- OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES,
- pages_per_huge_page(h));
- }
hugetlb_bootmem_init_migratetype(folio, h);
/* Subdivide locks to achieve better parallel performance */
spin_lock_irqsave(&hugetlb_lock, flags);
@@ -3238,6 +3224,8 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
struct page *page = virt_to_page(m);
struct folio *folio = (void *)page;
+ unsigned long pfn = PHYS_PFN(__pa(m));
+ unsigned long nr_pages = pages_per_huge_page(m->hstate);
h = m->hstate;
/*
@@ -3251,13 +3239,12 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(folio_ref_count(folio) != 1);
- hugetlb_folio_init_vmemmap(folio, h,
- OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES);
+ hugetlb_folio_init_vmemmap(folio, h, vmemmap_nr_struct_pages(pfn, nr_pages));
init_new_hugetlb_folio(folio);
- if (order_vmemmap_optimizable(pfn_to_section_order(folio_pfn(folio)))) {
+ if (order_vmemmap_optimizable(pfn_to_section_order(pfn))) {
folio_set_hugetlb_vmemmap_optimized(folio);
- section_set_order_range(folio_pfn(folio), folio_nr_pages(folio), 0);
+ section_set_order_range(pfn, nr_pages, 0);
}
if (hugetlb_early_cma(h))
@@ -3274,7 +3261,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
* (via hugetlb_bootmem_init_migratetype), so skip it here.
*/
if (!folio_test_hugetlb_cma(folio))
- adjust_managed_page_count(page, pages_per_huge_page(h));
+ adjust_managed_page_count(page, nr_pages);
cond_resched();
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 1305bee1195a..d20d2ce13906 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -599,23 +599,17 @@ static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *fol
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
struct folio *folio;
- unsigned long nr_to_optimize = 0;
LIST_HEAD(vmemmap_pages);
unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
- list_for_each_entry(folio, folio_list, lru) {
- int ret;
-
- /*
- * Bootmem gigantic folios may already be marked optimized when
- * their vmemmap layout was prepared earlier, so skip them here.
- */
- if (folio_test_hugetlb_vmemmap_optimized(folio))
- continue;
+ if (!vmemmap_should_optimize(h))
+ return;
- nr_to_optimize++;
+ if (list_empty(folio_list))
+ return;
- ret = hugetlb_vmemmap_split_folio(h, folio);
+ list_for_each_entry(folio, folio_list, lru) {
+ int ret = hugetlb_vmemmap_split_folio(h, folio);
/*
* Splitting the PMD requires allocating a page, thus let's fail
@@ -627,9 +621,6 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
break;
}
- if (!nr_to_optimize)
- return;
-
flush_tlb_all();
list_for_each_entry(folio, folio_list, lru) {
diff --git a/mm/internal.h b/mm/internal.h
index aff7cebb1da4..416afdf7b2ec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -949,6 +949,29 @@ void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
unsigned long, enum meminit_context, struct vmem_altmap *, int,
bool);
+static inline int vmemmap_nr_struct_pages(unsigned long pfn, unsigned long nr_pages)
+{
+ const unsigned int order = pfn_to_section_order(pfn);
+ const unsigned long pages_per_compound = 1UL << order;
+
+ if (!order_vmemmap_optimizable(order))
+ return nr_pages;
+
+ if (order < PFN_SECTION_SHIFT) {
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
+ return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES * nr_pages / pages_per_compound;
+ }
+
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
+ /* Ensure the requested range does not cross a compound page boundary. */
+ VM_WARN_ON_ONCE((pfn % pages_per_compound) + nr_pages > pages_per_compound);
+
+ if (IS_ALIGNED(pfn, pages_per_compound))
+ return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES;
+
+ return 0;
+}
+
/*
* mm/sparse.c
*/
@@ -988,8 +1011,6 @@ static inline void __section_mark_present(struct mem_section *ms,
ms->section_mem_map |= SECTION_MARKED_PRESENT;
}
-int vmemmap_nr_struct_pages(unsigned long pfn, unsigned long nr_pages);
-
static inline int section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages)
{
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
diff --git a/mm/sparse.c b/mm/sparse.c
index 598da1651e49..21a0eb636fea 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -236,29 +236,6 @@ void __weak __meminit vmemmap_populate_print_last(void)
{
}
-int __meminit vmemmap_nr_struct_pages(unsigned long pfn, unsigned long nr_pages)
-{
- const unsigned int order = pfn_to_section_order(pfn);
- const unsigned long pages_per_compound = 1UL << order;
-
- if (!order_vmemmap_optimizable(order))
- return nr_pages;
-
- if (order < PFN_SECTION_SHIFT) {
- VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
- return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES * nr_pages / pages_per_compound;
- }
-
- VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
- /* Ensure the requested range does not cross a compound page boundary. */
- VM_WARN_ON_ONCE((pfn % pages_per_compound) + nr_pages > pages_per_compound);
-
- if (IS_ALIGNED(pfn, pages_per_compound))
- return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES;
-
- return 0;
-}
-
/*
* Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end)
* And number of present sections in this node is map_count.
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 62/69] mm/hugetlb: Simplify hugetlb_folio_init_vmemmap()
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (13 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 61/69] mm/hugetlb: Drop boot-time HVO handling for gigantic folios Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 63/69] mm/hugetlb: Initialize the full bootmem hugepage in hugetlb code Muchun Song
` (6 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
hugetlb_folio_init_vmemmap() currently splits the open-coded
compound-page setup across two helpers even though the tail-page
initialization helper has no other callers.
Fold the tail-page initialization into the main helper and pass
the precomputed page metadata in from the caller. This makes
the initialization flow easier to follow.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 50 +++++++++++++++++---------------------------------
1 file changed, 17 insertions(+), 33 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3cb8fffb9e3e..950b0fa3bc27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3126,33 +3126,8 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
return true;
}
-/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
-static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
- struct hstate *h,
- unsigned long start_page_number,
- unsigned long end_page_number)
-{
- enum zone_type zone = folio_zonenum(folio);
- int nid = folio_nid(folio);
- struct page *page = folio_page(folio, start_page_number);
- unsigned long head_pfn = folio_pfn(folio);
- unsigned long pfn, end_pfn = head_pfn + end_page_number;
- unsigned int order = huge_page_order(h);
-
- /*
- * As we marked all tail pages with memblock_reserved_mark_noinit(),
- * we must initialize them ourselves here.
- */
- for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) {
- __init_single_page(page, pfn, zone, nid);
- prep_compound_tail(page, &folio->page, order);
- set_page_count(page, 0);
- }
-}
-
-static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
- struct hstate *h,
- unsigned long nr_pages)
+static void __init hugetlb_folio_init_vmemmap(struct page *head, unsigned long pfn,
+ enum zone_type zone, int nid, unsigned int order, unsigned int nr_pages)
{
int ret;
@@ -3161,12 +3136,19 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
* walking pages twice by initializing/preparing+freezing them in the
* same go.
*/
- __folio_clear_reserved(folio);
- __folio_set_head(folio);
- ret = folio_ref_freeze(folio, 1);
+ __ClearPageReserved(head);
+ ret = page_ref_freeze(head, 1);
VM_BUG_ON(!ret);
- hugetlb_folio_init_tail_vmemmap(folio, h, 1, nr_pages);
- prep_compound_head(&folio->page, huge_page_order(h));
+
+ __SetPageHead(head);
+ for (int i = 1; i < nr_pages; i++) {
+ struct page *page = head + i;
+
+ __init_single_page(page, pfn + i, zone, nid);
+ prep_compound_tail(page, head, order);
+ set_page_count(page, 0);
+ }
+ prep_compound_head(head, order);
}
/*
@@ -3226,6 +3208,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
struct folio *folio = (void *)page;
unsigned long pfn = PHYS_PFN(__pa(m));
unsigned long nr_pages = pages_per_huge_page(m->hstate);
+ enum zone_type zone = folio_zonenum(folio);
h = m->hstate;
/*
@@ -3239,7 +3222,8 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(folio_ref_count(folio) != 1);
- hugetlb_folio_init_vmemmap(folio, h, vmemmap_nr_struct_pages(pfn, nr_pages));
+ hugetlb_folio_init_vmemmap(page, pfn, zone, nid, huge_page_order(h),
+ vmemmap_nr_struct_pages(pfn, nr_pages));
init_new_hugetlb_folio(folio);
if (order_vmemmap_optimizable(pfn_to_section_order(pfn))) {
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 63/69] mm/hugetlb: Initialize the full bootmem hugepage in hugetlb code
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (14 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 62/69] mm/hugetlb: Simplify hugetlb_folio_init_vmemmap() Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 64/69] mm/mm_init: Factor out compound page initialization Muchun Song
` (5 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Boot-time gigantic hugepages currently leave the head struct page to
the generic memmap initialization path while the HugeTLB code
initializes the remaining struct pages itself.
Mark the full hugepage noinit and initialize the head struct page in
hugetlb_folio_init_vmemmap() as well, so the whole compound-page setup
is handled in one place.
This can also reduce memblock metadata overhead when many boot-time
HugeTLB pages are reserved, because physically contiguous hugepages
can be covered by fewer noinit regions.
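A rough model of that effect (illustrative only; the 64 physically contiguous
1GB pages are an assumption, and memblock merges adjacent reserved regions
only when their flags match):

  #include <stdio.h>

  int main(void)
  {
      unsigned long n = 64;   /* assumption: contiguous 1GB bootmem hugepages */

      /*
       * Old scheme: the head page of each hugepage is left out of the
       * noinit range, so flags differ at every hugepage boundary and
       * nothing merges: roughly one head region plus one tail region
       * per hugepage.
       */
      unsigned long old_regions = 2 * n;

      /*
       * New scheme: whole hugepages are marked noinit, so a contiguous
       * run collapses into a single reserved region.
       */
      unsigned long new_regions = 1;

      printf("reserved regions: old ~%lu, new ~%lu\n",
             old_regions, new_regions);
      return 0;
  }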
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 20 ++++----------------
1 file changed, 4 insertions(+), 16 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 950b0fa3bc27..10f04fa95d43 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3112,15 +3112,7 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
list_add_tail(&m->list, &huge_boot_pages[nid]);
m->flags |= HUGE_BOOTMEM_ZONES_VALID;
hugetlb_vmemmap_optimize_bootmem_page(m);
- /*
- * Only initialize the head struct page in memmap_init_reserved_pages,
- * rest of the struct pages will be initialized by the HugeTLB
- * subsystem itself.
- * The head struct page is used to get folio information by the HugeTLB
- * subsystem like zone id and node id.
- */
- memblock_reserved_mark_noinit(__pa((void *)m + PAGE_SIZE),
- huge_page_size(h) - PAGE_SIZE);
+ memblock_reserved_mark_noinit(__pa(m), huge_page_size(h));
}
return true;
@@ -3129,16 +3121,13 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
static void __init hugetlb_folio_init_vmemmap(struct page *head, unsigned long pfn,
enum zone_type zone, int nid, unsigned int order, unsigned int nr_pages)
{
- int ret;
-
/*
* This is an open-coded prep_compound_page() whereby we avoid
* walking pages twice by initializing/preparing+freezing them in the
* same go.
*/
- __ClearPageReserved(head);
- ret = page_ref_freeze(head, 1);
- VM_BUG_ON(!ret);
+ __init_single_page(head, pfn, zone, nid);
+ set_page_count(head, 0);
__SetPageHead(head);
for (int i = 1; i < nr_pages; i++) {
@@ -3208,7 +3197,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
struct folio *folio = (void *)page;
unsigned long pfn = PHYS_PFN(__pa(m));
unsigned long nr_pages = pages_per_huge_page(m->hstate);
- enum zone_type zone = folio_zonenum(folio);
+ enum zone_type zone = zone_idx(pfn_to_zone(pfn, nid));
h = m->hstate;
/*
@@ -3220,7 +3209,6 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
prev_h = h;
VM_BUG_ON(!hstate_is_gigantic(h));
- WARN_ON(folio_ref_count(folio) != 1);
hugetlb_folio_init_vmemmap(page, pfn, zone, nid, huge_page_order(h),
vmemmap_nr_struct_pages(pfn, nr_pages));
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 64/69] mm/mm_init: Factor out compound page initialization
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (15 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 63/69] mm/hugetlb: Initialize the full bootmem hugepage in hugetlb code Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 65/69] mm/mm_init: Make __init_single_page() static Muchun Song
` (4 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
The compound struct page initialization needed by boot-time gigantic
HugeTLB folios is currently open-coded in HugeTLB code, while ZONE_DEVICE
has its own separate initialization path in mm_init.c.
Factor the common compound memmap setup into memmap_init_compound_page_frozen()
so both paths can share the same frozen page initialization logic. This removes
duplicated open-coded compound page setup and keeps the initialization rules
in one place.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 25 +-----------
mm/internal.h | 2 +
mm/mm_init.c | 111 +++++++++++++++++++-------------------------------
3 files changed, 45 insertions(+), 93 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 10f04fa95d43..7e9f49882395 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3118,28 +3118,6 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
return true;
}
-static void __init hugetlb_folio_init_vmemmap(struct page *head, unsigned long pfn,
- enum zone_type zone, int nid, unsigned int order, unsigned int nr_pages)
-{
- /*
- * This is an open-coded prep_compound_page() whereby we avoid
- * walking pages twice by initializing/preparing+freezing them in the
- * same go.
- */
- __init_single_page(head, pfn, zone, nid);
- set_page_count(head, 0);
-
- __SetPageHead(head);
- for (int i = 1; i < nr_pages; i++) {
- struct page *page = head + i;
-
- __init_single_page(page, pfn + i, zone, nid);
- prep_compound_tail(page, head, order);
- set_page_count(page, 0);
- }
- prep_compound_head(head, order);
-}
-
/*
* memblock-allocated pageblocks might not have the migrate type set
* if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
@@ -3210,8 +3188,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
VM_BUG_ON(!hstate_is_gigantic(h));
- hugetlb_folio_init_vmemmap(page, pfn, zone, nid, huge_page_order(h),
- vmemmap_nr_struct_pages(pfn, nr_pages));
+ memmap_init_compound_page_frozen(page, pfn, zone, nid, huge_page_order(h));
init_new_hugetlb_folio(folio);
if (order_vmemmap_optimizable(pfn_to_section_order(pfn))) {
diff --git a/mm/internal.h b/mm/internal.h
index 416afdf7b2ec..2c67ae25124b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1793,6 +1793,8 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte
void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid);
+void __meminit memmap_init_compound_page_frozen(struct page *head, unsigned long pfn,
+ enum zone_type zone, int nid, unsigned int order);
/* shrinker related functions */
unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 95422e92ede8..9b23c31db8c6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1018,79 +1018,46 @@ static void __init memmap_init(void)
init_unavailable_range(hole_pfn, end_pfn, zone_id, nid);
}
-#ifdef CONFIG_ZONE_DEVICE
-static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
- unsigned long zone_idx, int nid,
- struct dev_pagemap *pgmap)
+static void __meminit init_single_page_frozen(struct page *page, unsigned long pfn,
+ enum zone_type zone, int nid)
{
+ __init_single_page(page, pfn, zone, nid);
+ if (zone_is_zone_device(&NODE_DATA(nid)->node_zones[zone])) {
+ /*
+ * ZONE_DEVICE pages are not managed by the page allocator, mark
+ * them reserved to prevent them from being touched elsewhere.
+ *
+ * We can use the non-atomic __set_bit operation for setting
+ * the flag as we are still initializing the pages.
+ */
+ __SetPageReserved(page);
- __init_single_page(page, pfn, zone_idx, nid);
-
- /*
- * Mark page reserved as it will need to wait for onlining
- * phase for it to be fully associated with a zone.
- *
- * We can use the non-atomic __set_bit operation for setting
- * the flag as we are still initializing the pages.
- */
- __SetPageReserved(page);
-
- /*
- * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
- * and zone_device_data. It is a bug if a ZONE_DEVICE page is
- * ever freed or placed on a driver-private list.
- */
- page_folio(page)->pgmap = pgmap;
- page->zone_device_data = NULL;
-
- /*
- * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
- * directly to the driver page allocator which will set the page count
- * to 1 when allocating the page.
- *
- * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
- * their refcount reset to one whenever they are freed (ie. after
- * their refcount drops to 0).
- */
- switch (pgmap->type) {
- case MEMORY_DEVICE_FS_DAX:
- case MEMORY_DEVICE_PRIVATE:
- case MEMORY_DEVICE_COHERENT:
- case MEMORY_DEVICE_PCI_P2PDMA:
- set_page_count(page, 0);
- break;
-
- case MEMORY_DEVICE_GENERIC:
- break;
+ /*
+ * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+ * and zone_device_data. It is a bug if a ZONE_DEVICE page is
+ * ever freed or placed on a driver-private list.
+ */
+ page->zone_device_data = NULL;
}
+ set_page_count(page, 0);
}
-static void __ref memmap_init_compound(struct page *head,
- unsigned long head_pfn,
- unsigned long zone_idx, int nid,
- struct dev_pagemap *pgmap,
- unsigned long nr_pages)
+void __meminit memmap_init_compound_page_frozen(struct page *head, unsigned long pfn,
+ enum zone_type zone, int nid, unsigned int order)
{
- unsigned long pfn, end_pfn = head_pfn + nr_pages;
- unsigned int order = pgmap->vmemmap_shift;
+ int nr_pages = vmemmap_nr_struct_pages(pfn, 1UL << order);
- /*
- * We have to initialize the pages, including setting up page links.
- * prep_compound_page() does not take care of that, so instead we
- * open-code prep_compound_page() so we can take care of initializing
- * the pages in the same go.
- */
- __SetPageHead(head);
- for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
- struct page *page = pfn_to_page(pfn);
+ init_single_page_frozen(head, pfn, zone, nid);
- __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
- prep_compound_tail(page, head, order);
- set_page_count(page, 0);
+ __SetPageHead(head);
+ for (int i = 1; i < nr_pages; i++) {
+ init_single_page_frozen(head + i, pfn + i, zone, nid);
+ prep_compound_tail(head + i, head, order);
}
prep_compound_head(head, order);
}
+#ifdef CONFIG_ZONE_DEVICE
void __ref memmap_init_zone_device(struct zone *zone,
unsigned long start_pfn,
unsigned long nr_pages,
@@ -1118,18 +1085,24 @@ void __ref memmap_init_zone_device(struct zone *zone,
}
for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
- struct page *page = pfn_to_page(pfn);
-
- __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+ struct page *head = pfn_to_page(pfn);
if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
cond_resched();
- if (pfns_per_compound == 1)
- continue;
-
- memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
- vmemmap_nr_struct_pages(pfn, pfns_per_compound));
+ if (pgmap->vmemmap_shift)
+ memmap_init_compound_page_frozen(head, pfn, zone_idx, nid,
+ pgmap->vmemmap_shift);
+ else
+ init_single_page_frozen(head, pfn, zone_idx, nid);
+ /*
+ * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
+ * directly to the driver page allocator which will set the page
+ * count to 1 when allocating the page.
+ */
+ if (pgmap->type == MEMORY_DEVICE_GENERIC)
+ init_page_count(head);
+ ((struct folio *)head)->pgmap = pgmap;
}
pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false, false);
--
2.54.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 65/69] mm/mm_init: Make __init_single_page() static
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (16 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 64/69] mm/mm_init: Factor out compound page initialization Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 66/69] mm/cma: Move CMA pageblock initialization into cma_activate_area() Muchun Song
` (3 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
__init_single_page() is only used within mm/mm_init.c, so make it
static.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 2 --
mm/mm_init.c | 2 +-
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 2c67ae25124b..80b9ab594dc5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1791,8 +1791,6 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte
return vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte);
}
-void __meminit __init_single_page(struct page *page, unsigned long pfn,
- unsigned long zone, int nid);
void __meminit memmap_init_compound_page_frozen(struct page *head, unsigned long pfn,
enum zone_type zone, int nid, unsigned int order);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9b23c31db8c6..1e11fd683292 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -589,7 +589,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
node_states[N_MEMORY] = saved_node_state;
}
-void __meminit __init_single_page(struct page *page, unsigned long pfn,
+static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
mm_zero_struct_page(page);
--
2.54.0
* [PATCH v2 66/69] mm/cma: Move CMA pageblock initialization into cma_activate_area()
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (17 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 65/69] mm/mm_init: Make __init_single_page() static Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 67/69] mm/cma: Move init_cma_pageblock() into cma.c Muchun Song
` (2 subsequent siblings)
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Move CMA pageblock initialization for early-reserved pages into
cma_activate_area() so CMA pageblock setup is handled in one place.
This keeps the handling of early CMA reservations inside the CMA core instead
of pushing that special-casing into callers such as the HugeTLB bootmem code.
As a side effect, this also fixes the zone->cma_pages accounting race for
early-reserved HugeTLB CMA pages. The accounting is no longer updated from
parallel hugetlb_struct_page_init() workers and is instead performed
serially from cma_activate_area().
Fixes: d2d786714080 ("mm/hugetlb: enable bootmem allocation from CMA areas")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/cma.c | 7 +++++--
mm/hugetlb.c | 8 +++-----
2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index 0369f04c7ba5..c1896c0db63d 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -162,6 +162,10 @@ static void __init cma_activate_area(struct cma *cma)
count = early_pfn[r] - cmr->base_pfn;
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
bitmap_set(cmr->bitmap, 0, bitmap_count);
+
+ for (pfn = cmr->base_pfn; pfn < early_pfn[r];
+ pfn += pageblock_nr_pages)
+ init_cma_pageblock(pfn_to_page(pfn));
}
WARN_ON_ONCE(!pfn_valid(cmr->base_pfn));
@@ -1098,8 +1102,7 @@ bool cma_intersects(struct cma *cma, unsigned long start, unsigned long end)
*
* The caller is responsible for initializing the page structures
* in the area properly, since this just points to memblock-allocated
- * memory. The caller should subsequently use init_cma_pageblock to
- * set the migrate type and CMA stats the pageblocks that were reserved.
+ * memory.
*
* If the CMA area fails to activate later, memory obtained through
* this interface is not handed to the page allocator, this is
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7e9f49882395..df798f9386d6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3136,9 +3136,7 @@ static void __init hugetlb_bootmem_init_migratetype(struct folio *folio,
WARN_ON_ONCE(!pageblock_aligned(folio_pfn(folio)));
for (i = 0; i < nr_pages; i += pageblock_nr_pages) {
- if (folio_test_hugetlb_cma(folio))
- init_cma_pageblock(folio_page(folio, i));
- else
+ if (!folio_test_hugetlb_cma(folio))
init_pageblock_migratetype(folio_page(folio, i),
MIGRATE_MOVABLE, false);
}
@@ -3206,8 +3204,8 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
* in order to fix confusing memory reports from free(1) and
* other side-effects, like CommitLimit going negative.
*
- * For CMA pages, this is done in init_cma_pageblock
- * (via hugetlb_bootmem_init_migratetype), so skip it here.
+ * For CMA pages, this is done in cma_activate_area(), so skip
+ * it here.
*/
if (!folio_test_hugetlb_cma(folio))
adjust_managed_page_count(page, nr_pages);
--
2.54.0
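To picture the new cma_activate_area() behaviour: the early-reserved head of each CMA range is walked one pageblock at a time and each pageblock is initialized once, serially. Below is a minimal standalone C sketch of that walk; the pfn values and pageblock size are assumptions chosen only for the example, not values taken from the patch.

#include <stdio.h>

int main(void)
{
	/* Assumed: order-9 pageblocks with 4 KiB base pages. */
	unsigned long pageblock_nr_pages = 512;
	/* Assumed early-reserved CMA range: 4 pageblocks starting at pfn 0x100000. */
	unsigned long base_pfn = 0x100000;
	unsigned long early_pfn = base_pfn + 4 * pageblock_nr_pages;
	unsigned long pfn;

	for (pfn = base_pfn; pfn < early_pfn; pfn += pageblock_nr_pages)
		printf("init CMA pageblock starting at pfn %#lx\n", pfn);
	return 0;
}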
* [PATCH v2 67/69] mm/cma: Move init_cma_pageblock() into cma.c
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (18 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 66/69] mm/cma: Move CMA pageblock initialization into cma_activate_area() Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 68/69] mm/mm_init: Initialize pageblock migratetype in memmap init helpers Muchun Song
2026-05-13 13:20 ` [PATCH v2 69/69] Documentation/mm: Rewrite vmemmap_dedup.rst for unified HVO Muchun Song
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Move init_cma_pageblock() from mm_init.c into cma.c and make it static, so it
lives alongside the CMA code that uses it.
No functional change intended.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/cma.c | 8 ++++++++
mm/internal.h | 4 ----
mm/mm_init.c | 9 ---------
3 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index c1896c0db63d..2843c4f59c4e 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -30,6 +30,7 @@
#include <linux/io.h>
#include <linux/kmemleak.h>
#include <trace/events/cma.h>
+#include <linux/page-isolation.h>
#include "internal.h"
#include "cma.h"
@@ -137,6 +138,13 @@ bool cma_validate_zones(struct cma *cma)
return true;
}
+static void __init init_cma_pageblock(struct page *page)
+{
+ init_pageblock_migratetype(page, MIGRATE_CMA, false);
+ adjust_managed_page_count(page, pageblock_nr_pages);
+ page_zone(page)->cma_pages += pageblock_nr_pages;
+}
+
static void __init cma_activate_area(struct cma *cma)
{
unsigned long pfn, end_pfn, early_pfn[CMA_MAX_RANGES];
diff --git a/mm/internal.h b/mm/internal.h
index 80b9ab594dc5..25b6e767cea0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1126,7 +1126,6 @@ struct cma;
#ifdef CONFIG_CMA
bool cma_validate_zones(struct cma *cma);
void *cma_reserve_early(struct cma *cma, unsigned long size);
-void init_cma_pageblock(struct page *page);
#else
static inline bool cma_validate_zones(struct cma *cma)
{
@@ -1136,9 +1135,6 @@ static inline void *cma_reserve_early(struct cma *cma, unsigned long size)
{
return NULL;
}
-static inline void init_cma_pageblock(struct page *page)
-{
-}
#endif
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 1e11fd683292..ff6e9fb468bd 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2200,15 +2200,6 @@ void __init init_cma_reserved_pageblock(struct page *page)
adjust_managed_page_count(page, pageblock_nr_pages);
page_zone(page)->cma_pages += pageblock_nr_pages;
}
-/*
- * Similar to above, but only set the migrate type and stats.
- */
-void __init init_cma_pageblock(struct page *page)
-{
- init_pageblock_migratetype(page, MIGRATE_CMA, false);
- adjust_managed_page_count(page, pageblock_nr_pages);
- page_zone(page)->cma_pages += pageblock_nr_pages;
-}
#endif
void set_zone_contiguous(struct zone *zone)
--
2.54.0
* [PATCH v2 68/69] mm/mm_init: Initialize pageblock migratetype in memmap init helpers
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (19 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 67/69] mm/cma: Move init_cma_pageblock() into cma.c Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
2026-05-13 13:20 ` [PATCH v2 69/69] Documentation/mm: Rewrite vmemmap_dedup.rst for unified HVO Muchun Song
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
Move MIGRATE_MOVABLE pageblock initialization into the memmap init
helpers in mm/mm_init.c.
Let init_single_page_frozen() initialize the pageblock migratetype for
single-page folios, and let memmap_init_compound_page_frozen() handle
the whole range for compound pages. With pageblock initialization
centralized there, drop the duplicate hugetlb bootmem-specific
initialization in mm/hugetlb.c.
The old hugetlb_bootmem_init_migratetype() skipped CMA folios (via
folio_test_hugetlb_cma()), but the new helpers always set
MIGRATE_MOVABLE. This is safe because cma_activate_area() will later
override the migratetype for CMA pageblocks, so the initial
MIGRATE_MOVABLE setting does not matter for CMA pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 25 -------------------------
mm/mm_init.c | 20 ++++++++++++--------
2 files changed, 12 insertions(+), 33 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df798f9386d6..fa269560f657 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3118,30 +3118,6 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
return true;
}
-/*
- * memblock-allocated pageblocks might not have the migrate type set
- * if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
- * here, or MIGRATE_CMA if this was a page allocated through an early CMA
- * reservation.
- *
- * In case of vmemmap optimized folios, the tail vmemmap pages are mapped
- * read-only, but that's ok - for sparse vmemmap this does not write to
- * the page structure.
- */
-static void __init hugetlb_bootmem_init_migratetype(struct folio *folio,
- struct hstate *h)
-{
- unsigned long nr_pages = pages_per_huge_page(h), i;
-
- WARN_ON_ONCE(!pageblock_aligned(folio_pfn(folio)));
-
- for (i = 0; i < nr_pages; i += pageblock_nr_pages) {
- if (!folio_test_hugetlb_cma(folio))
- init_pageblock_migratetype(folio_page(folio, i),
- MIGRATE_MOVABLE, false);
- }
-}
-
static void __init prep_and_add_bootmem_folios(struct hstate *h,
struct list_head *folio_list)
{
@@ -3149,7 +3125,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
struct folio *folio, *tmp_f;
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
- hugetlb_bootmem_init_migratetype(folio, h);
/* Subdivide locks to achieve better parallel performance */
spin_lock_irqsave(&hugetlb_lock, flags);
account_new_hugetlb_folio(h, folio);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index ff6e9fb468bd..17ae0eb1ccfb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1018,7 +1018,7 @@ static void __init memmap_init(void)
init_unavailable_range(hole_pfn, end_pfn, zone_id, nid);
}
-static void __meminit init_single_page_frozen(struct page *page, unsigned long pfn,
+static void __meminit __init_single_page_frozen(struct page *page, unsigned long pfn,
enum zone_type zone, int nid)
{
__init_single_page(page, pfn, zone, nid);
@@ -1042,19 +1042,28 @@ static void __meminit init_single_page_frozen(struct page *page, unsigned long p
set_page_count(page, 0);
}
+static void __meminit init_single_page_frozen(struct page *page, unsigned long pfn,
+ enum zone_type zone, int nid)
+{
+ __init_single_page_frozen(page, pfn, zone, nid);
+ pageblock_migratetype_init_range(pfn, 1, MIGRATE_MOVABLE, false, false);
+}
+
void __meminit memmap_init_compound_page_frozen(struct page *head, unsigned long pfn,
enum zone_type zone, int nid, unsigned int order)
{
int nr_pages = vmemmap_nr_struct_pages(pfn, 1UL << order);
- init_single_page_frozen(head, pfn, zone, nid);
+ __init_single_page_frozen(head, pfn, zone, nid);
__SetPageHead(head);
for (int i = 1; i < nr_pages; i++) {
- init_single_page_frozen(head + i, pfn + i, zone, nid);
+ __init_single_page_frozen(head + i, pfn + i, zone, nid);
prep_compound_tail(head + i, head, order);
}
prep_compound_head(head, order);
+
+ pageblock_migratetype_init_range(pfn, 1UL << order, MIGRATE_MOVABLE, false, false);
}
#ifdef CONFIG_ZONE_DEVICE
@@ -1087,9 +1096,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
struct page *head = pfn_to_page(pfn);
- if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
- cond_resched();
-
if (pgmap->vmemmap_shift)
memmap_init_compound_page_frozen(head, pfn, zone_idx, nid,
pgmap->vmemmap_shift);
@@ -1105,8 +1111,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
((struct folio *)head)->pgmap = pgmap;
}
- pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false, false);
-
pr_debug("%s initialised %lu pages in %ums\n", __func__,
nr_pages, jiffies_to_msecs(jiffies - start));
}
--
2.54.0
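The split this patch introduces can be read as: the double-underscore helper only initializes the struct pages, while the public helpers also set the pageblock migratetype for the whole range they cover. The sketch below is a simplified, standalone C illustration of that layering; the function names and print statements are stand-ins, not the kernel implementations, and the compound case ignores the HVO trimming of tail struct pages.

#include <stdio.h>

/* Stand-in for __init_single_page_frozen(): struct-page init only. */
static void core_init_one(unsigned long pfn)
{
	printf("init struct page for pfn %lu\n", pfn);
}

/* Stand-in for pageblock_migratetype_init_range(). */
static void migratetype_init_range(unsigned long pfn, unsigned long nr)
{
	printf("set MIGRATE_MOVABLE for pfns [%lu, %lu)\n", pfn, pfn + nr);
}

/* Analogue of init_single_page_frozen(): core init plus migratetype. */
static void init_single(unsigned long pfn)
{
	core_init_one(pfn);
	migratetype_init_range(pfn, 1);
}

/* Analogue of memmap_init_compound_page_frozen(): whole range, one migratetype call. */
static void init_compound(unsigned long pfn, unsigned int order)
{
	unsigned long i, nr = 1UL << order;

	for (i = 0; i < nr; i++)
		core_init_one(pfn + i);
	migratetype_init_range(pfn, nr);
}

int main(void)
{
	init_single(100);		/* made-up pfn */
	init_compound(512, 3);		/* made-up pfn and order */
	return 0;
}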
* [PATCH v2 69/69] Documentation/mm: Rewrite vmemmap_dedup.rst for unified HVO
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
` (20 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 68/69] mm/mm_init: Initialize pageblock migratetype in memmap init helpers Muchun Song
@ 2026-05-13 13:20 ` Muchun Song
21 siblings, 0 replies; 72+ messages in thread
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
HVO is no longer specific to HugeTLB. The optimization has been
generalized for other large compound-page users, including device DAX,
but vmemmap_dedup.rst still describes the old split model.
Rewrite the document around the shared HVO design and behavior, and
drop the obsolete powerpc-specific document that only covered the old
device DAX path.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
Documentation/arch/powerpc/index.rst | 1 -
Documentation/arch/powerpc/vmemmap_dedup.rst | 101 ---------
Documentation/mm/vmemmap_dedup.rst | 217 ++++---------------
3 files changed, 42 insertions(+), 277 deletions(-)
delete mode 100644 Documentation/arch/powerpc/vmemmap_dedup.rst
diff --git a/Documentation/arch/powerpc/index.rst b/Documentation/arch/powerpc/index.rst
index 40419bea8e10..4dcf6b0f218c 100644
--- a/Documentation/arch/powerpc/index.rst
+++ b/Documentation/arch/powerpc/index.rst
@@ -36,7 +36,6 @@ powerpc
ultravisor
vas-api
vcpudispatch_stats
- vmemmap_dedup
vpa-dtl
features
diff --git a/Documentation/arch/powerpc/vmemmap_dedup.rst b/Documentation/arch/powerpc/vmemmap_dedup.rst
deleted file mode 100644
index dc4db59fdf87..000000000000
--- a/Documentation/arch/powerpc/vmemmap_dedup.rst
+++ /dev/null
@@ -1,101 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==========
-Device DAX
-==========
-
-The device-dax interface uses the tail deduplication technique explained in
-Documentation/mm/vmemmap_dedup.rst
-
-On powerpc, vmemmap deduplication is only used with radix MMU translation. Also
-with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap
-deduplication.
-
-With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap
-page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no
-vmemmap deduplication possible.
-
-With 1G PUD level mapping, we require 16384 struct pages and a single 64K
-vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
-require 16 64K pages in vmemmap to map the struct page for 1G PUD level mapping.
-
-Here's how things look like on device-dax after the sections are populated::
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PUD | +-----------+ | | |
- | level | | . | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | . | ------------------------+ |
- | | +-----------+ |
- | | | 15 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
-
-
-With 4K page size, 2M PMD level mapping requires 512 struct pages and a single
-4K vmemmap page contains 64 struct pages(4K/sizeof(struct page)). Hence we
-require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping.
-
-Here's how things look like on device-dax after the sections are populated::
-
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PMD | +-----------+ | | |
- | level | | 5 | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | 6 | ------------------------+ |
- | | +-----------+ |
- | | | 7 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
-
-With 1G PUD level mapping, we require 262144 struct pages and a single 4K
-vmemmap page can contain 64 struct pages (4K/sizeof(struct page)). Hence we
-require 4096 4K pages in vmemmap to map the struct pages for 1G PUD level
-mapping.
-
-Here's how things look like on device-dax after the sections are populated::
-
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PUD | +-----------+ | | |
- | level | | . | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | . | ------------------------+ |
- | | +-----------+ |
- | | | 4095 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 44e80bd2e398..c3a68a923b0d 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -1,107 +1,34 @@
.. SPDX-License-Identifier: GPL-2.0
-=========================================
-A vmemmap diet for HugeTLB and Device DAX
-=========================================
+===================================================
+Fundamentals of Hugepage Vmemmap Optimization (HVO)
+===================================================
-HugeTLB
-=======
-
-This section is to explain how Hugepage Vmemmap Optimization (HVO) for HugeTLB works.
-
-The ``struct page`` structures are used to describe a physical page frame. By
-default, there is a one-to-one mapping from a page frame to its corresponding
+The ``struct page`` structures are used to describe a physical base page frame.
+By default, there is a one-to-one mapping from a page frame to its corresponding
``struct page``.
-HugeTLB pages consist of multiple base page size pages and is supported by many
-architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
-details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
-currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
-consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
-For each base page, there is a corresponding ``struct page``.
-
-Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
-contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
-this upper limit. The only 'useful' information in the remaining ``struct page``
-is the compound_info field, and this field is the same for all tail pages.
-
-By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
-to the buddy allocator for other uses.
-
-Different architectures support different HugeTLB pages. For example, the
-following table is the HugeTLB page size supported by x86 and arm64
-architectures. Because arm64 supports 4k, 16k, and 64k base pages and
-supports contiguous entries, so it supports many kinds of sizes of HugeTLB
-page.
-
-+--------------+-----------+-----------------------------------------------+
-| Architecture | Page Size | HugeTLB Page Size |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-| x86-64 | 4KB | 2MB | 1GB | | |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-| | 4KB | 64KB | 2MB | 32MB | 1GB |
-| +-----------+-----------+-----------+-----------+-----------+
-| arm64 | 16KB | 2MB | 32MB | 1GB | |
-| +-----------+-----------+-----------+-----------+-----------+
-| | 64KB | 2MB | 512MB | 16GB | |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-
-When the system boot up, every HugeTLB page has more than one ``struct page``
-structs which size is (unit: pages)::
-
- struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
-
-Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
-of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
-relationship::
-
- HugeTLB_Size = n * PAGE_SIZE
-
-Then::
-
- struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- = n * sizeof(struct page) / PAGE_SIZE
+When huge pages (large compound pages) are used, they consist of multiple base
+page size pages. For each base page, there is a corresponding ``struct page``.
+However, only a few ``struct page`` structures are actually used to contain
+unique information about the huge page. The only 'useful' information in the
+remaining tail ``struct page`` structures is the ``->compound_info`` field used
+to reach the head page structure, and this field is the same for all tail
+pages.
-We can use huge mapping at the pud/pmd level for the HugeTLB page.
+We can remove redundant ``struct page`` structures for huge pages to save memory.
+This optimization is referred to as Hugepage Vmemmap Optimization (HVO).
-For the HugeTLB page of the pmd level mapping, then::
+The optimization is only applied when the size of the ``struct page`` is a
+power of 2. In this case, all tail pages of the same order are identical. See
+``compound_head()``. This allows us to remap the tail pages of the vmemmap to a
+shared page.
- struct_size = n * sizeof(struct page) / PAGE_SIZE
- = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
- = sizeof(struct page) / sizeof(pte_t)
- = 64 / 8
- = 8 (pages)
+Let’s take a system with a 2 MB huge page and a base page size of 4 KB as an
+example for illustration. Here is how things look before optimization::
-Where n is how many pte entries which one page can contains. So the value of
-n is (PAGE_SIZE / sizeof(pte_t)).
-
-This optimization only supports 64-bit system, so the value of sizeof(pte_t)
-is 8. And this optimization also applicable only when the size of ``struct page``
-is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g.
-x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
-size of ``struct page`` structs of it is 8 page frames which size depends on the
-size of the base page.
-
-For the HugeTLB page of the pud level mapping, then::
-
- struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
- = PAGE_SIZE / 8 * 8 (pages)
- = PAGE_SIZE (pages)
-
-Where the struct_size(pmd) is the size of the ``struct page`` structs of a
-HugeTLB page of the pmd level mapping.
-
-E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
-HugeTLB page consists in 4096.
-
-Next, we take the pmd level mapping of the HugeTLB page as an example to
-show the internal implementation of this optimization. There are 8 pages
-``struct page`` structs associated with a HugeTLB page which is pmd mapped.
-
-Here is how things look before optimization::
-
- HugeTLB struct pages(8 pages) page frame(8 pages)
+ 2MB Hugepage struct pages (8 pages) page frame (8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
@@ -112,9 +39,9 @@ Here is how things look before optimization::
| | | 3 | -------------> | 3 |
| | +-----------+ +-----------+
| | | 4 | -------------> | 4 |
- | PMD | +-----------+ +-----------+
- | level | | 5 | -------------> | 5 |
- | mapping | +-----------+ +-----------+
+ | | +-----------+ +-----------+
+ | | | 5 | -------------> | 5 |
+ | | +-----------+ +-----------+
| | | 6 | -------------> | 6 |
| | +-----------+ +-----------+
| | | 7 | -------------> | 7 |
@@ -124,34 +51,27 @@ Here is how things look before optimization::
| |
+-----------+
-The first page of ``struct page`` (page 0) associated with the HugeTLB page
-contains the 4 ``struct page`` necessary to describe the HugeTLB. The remaining
-pages of ``struct page`` (page 1 to page 7) are tail pages.
-
-The optimization is only applied when the size of the struct page is a power
-of 2. In this case, all tail pages of the same order are identical. See
-compound_head(). This allows us to remap the tail pages of the vmemmap to a
-shared, read-only page. The head page is also remapped to a new page. This
-allows the original vmemmap pages to be freed.
+We remap the tail pages (page 1 to page 7) of the vmemmap to a shared, read-only
+page (per-zone).
Here is how things look after remapping::
- HugeTLB struct pages(8 pages) page frame (new)
+ 2MB Hugepage struct pages (8 pages) page frame (1 page)
+-----------+ ---virt_to_page---> +-----------+ mapping to +----------------+
| | | 0 | -------------> | 0 |
| | +-----------+ +----------------+
| | | 1 | ------┐
| | +-----------+ |
- | | | 2 | ------┼ +----------------------------+
+ | | | 2 | ------┼
+ | | +-----------+ |
+ | | | 3 | ------┼ +----------------------------+
| | +-----------+ | | A single, per-zone page |
- | | | 3 | ------┼------> | frame shared among all |
+ | | | 4 | ------┼------> | frame shared among all |
| | +-----------+ | | hugepages of the same size |
- | | | 4 | ------┼ +----------------------------+
+ | | | 5 | ------┼ +----------------------------+
+ | | +-----------+ |
+ | | | 6 | ------┼
| | +-----------+ |
- | | | 5 | ------┼
- | PMD | +-----------+ |
- | level | | 6 | ------┼
- | mapping | +-----------+ |
| | | 7 | ------┘
| | +-----------+
| |
@@ -159,65 +79,12 @@ Here is how things look after remapping::
| |
+-----------+
-When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
-vmemmap pages and restore the previous mapping relationship.
-
-For the HugeTLB page of the pud level mapping. It is similar to the former.
-We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
-
-Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
-(e.g. aarch64) provides a contiguous bit in the translation table entries
-that hints to the MMU to indicate that it is one of a contiguous set of
-entries that can be cached in a single TLB entry.
-
-The contiguous bit is used to increase the mapping size at the pmd and pte
-(last) level. So this type of HugeTLB page can be optimized only when its
-size of the ``struct page`` structs is greater than **1** page.
-
-Device DAX
-==========
-
-The device-dax interface uses the same tail deduplication technique explained
-in the previous chapter, except when used with the vmemmap in
-the device (altmap).
-
-The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
-PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
-For powerpc equivalent details see Documentation/arch/powerpc/vmemmap_dedup.rst
-
-The differences with HugeTLB are relatively minor.
-
-It only use 3 ``struct page`` for storing all information as opposed
-to 4 on HugeTLB pages.
-
-There's no remapping of vmemmap given that device-dax memory is not part of
-System RAM ranges initialized at boot. Thus the tail page deduplication
-happens at a later stage when we populate the sections. HugeTLB reuses the
-the head vmemmap page representing, whereas device-dax reuses the tail
-vmemmap page. This results in only half of the savings compared to HugeTLB.
-
-Deduplicated tail pages are not mapped read-only.
+Therefore, HVO can be applied to any hugepage whose ``struct page`` structures
+together occupy at least two base pages. In the example above (4 KB base pages
+and a 64-byte ``struct page``), the smallest hugepage that can be optimized is
+512 KB; its order corresponds to ``OPTIMIZABLE_FOLIO_MIN_ORDER``. Any hugepage
+with an order greater than or equal to ``OPTIMIZABLE_FOLIO_MIN_ORDER`` can
+therefore be optimized with HVO.
-Here's how things look like on device-dax after the sections are populated::
-
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PMD | +-----------+ | | |
- | level | | 5 | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | 6 | ------------------------+ |
- | | +-----------+ |
- | | | 7 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
+Meanwhile, each optimized hugepage still has ``OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES``
+``struct page`` structures available for storing its unique information.
--
2.54.0
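To double-check the 512 KB threshold in the rewritten document: with a 4 KiB base page and a 64-byte struct page (assumed here, typical on x86-64 and arm64, though configuration-dependent), order 7 is the smallest order whose vmemmap spans at least two pages. The standalone sketch below prints, for each qualifying order, how many vmemmap pages the hugepage has and how many of them are shareable tails; the shared tail page itself is per-zone and therefore amortized across hugepages.

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;		/* assumed base page size */
	unsigned long struct_page_size = 64;	/* assumed sizeof(struct page) */
	unsigned int order;

	for (order = 0; order <= 18; order++) {
		unsigned long nr_base_pages = 1UL << order;
		unsigned long vmemmap_pages =
			nr_base_pages * struct_page_size / page_size;

		if (vmemmap_pages < 2)	/* no full tail page to share */
			continue;
		printf("order %2u (%7lu KiB): %4lu vmemmap pages, %4lu shareable tails\n",
		       order, nr_base_pages * page_size / 1024,
		       vmemmap_pages, vmemmap_pages - 1);
	}
	return 0;
}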
* Re: [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
` (46 preceding siblings ...)
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
@ 2026-05-13 17:46 ` Andrew Morton
2026-05-13 18:26 ` Oscar Salvador
47 siblings, 1 reply; 72+ messages in thread
From: Andrew Morton @ 2026-05-13 17:46 UTC (permalink / raw)
To: Muchun Song
Cc: David Hildenbrand, Muchun Song, Oscar Salvador, Michael Ellerman,
Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Nicholas Piggin, Christophe Leroy, Ackerley Tng,
Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
linuxppc-dev, linux-kernel
On Wed, 13 May 2026 21:04:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
> In this series, HVO is redefined as Hugepage Vmemmap Optimization: a
> general vmemmap optimization model for large hugepage-backed mappings,
> rather than a HugeTLB-only implementation detail.
>
> The existing code grew around the original HugeTLB-specific HVO path,
> while device DAX developed similar but separate vmemmap optimization
> handling. As a result, the current implementation carries duplicated
> logic, boot-time special cases, and subsystem-specific interfaces around
> what is fundamentally the same sparse-vmemmap optimization.
>
> This series generalizes that optimization into a common framework used
> by both HugeTLB and device DAX.
>
> The first few patches include some minor bug fixes found during AI-aided
> review of the current code. These fixes are not the main goal of the
> series, but the later refactoring and unification work depends on them,
> so they are included here as preparatory changes.
>
> The series then reworks the relevant early boot and sparse
> initialization paths, introduces a generic section-based sparse-vmemmap
> optimization infrastructure, switches HugeTLB and device DAX over to the
> shared implementation, and removes the old special-case code.
>
> ...
>
> 46 files changed, 743 insertions(+), 1812 deletions(-)
Gulp.
I think the first 15ish patches (little fixes and cleanups and
refactorings) are ready to go in immediately?
Perhaps you could prepare such things as a separate series. Or tell me
which ones are suitable and I'll fudge up a [0/N]?
* Re: [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX
2026-05-13 17:46 ` [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Andrew Morton
@ 2026-05-13 18:26 ` Oscar Salvador
0 siblings, 0 replies; 72+ messages in thread
From: Oscar Salvador @ 2026-05-13 18:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Muchun Song, David Hildenbrand, Muchun Song, Michael Ellerman,
Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Nicholas Piggin, Christophe Leroy, Ackerley Tng,
Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
linuxppc-dev, linux-kernel
On Wed, May 13, 2026 at 10:46:40AM -0700, Andrew Morton wrote:
> On Wed, 13 May 2026 21:04:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>
> > In this series, HVO is redefined as Hugepage Vmemmap Optimization: a
> > general vmemmap optimization model for large hugepage-backed mappings,
> > rather than a HugeTLB-only implementation detail.
> >
> > The existing code grew around the original HugeTLB-specific HVO path,
> > while device DAX developed similar but separate vmemmap optimization
> > handling. As a result, the current implementation carries duplicated
> > logic, boot-time special cases, and subsystem-specific interfaces around
> > what is fundamentally the same sparse-vmemmap optimization.
> >
> > This series generalizes that optimization into a common framework used
> > by both HugeTLB and device DAX.
> >
> > The first few patches include some minor bug fixes found during AI-aided
> > review of the current code. These fixes are not the main goal of the
> > series, but the later refactoring and unification work depends on them,
> > so they are included here as preparatory changes.
> >
> > The series then reworks the relevant early boot and sparse
> > initialization paths, introduces a generic section-based sparse-vmemmap
> > optimization infrastructure, switches HugeTLB and device DAX over to the
> > shared implementation, and removes the old special-case code.
> >
> > ...
> >
> > 46 files changed, 743 insertions(+), 1812 deletions(-)
>
> Gulp.
>
> I think the first 15ish patches (little fixes and cleanups and
> refactorings) are ready to go in immediately?
I plan to have a (partial) look at this tomorrow/Friday, but splitting this
series into fixes-that-can-go-straight-away and the feature itself would make
more sense and help ease the review.
Head tends to spin a bit when a patchset grows beyond a certain number of patches :-D.
Would that be possible, Muchun?
--
Oscar Salvador
SUSE Labs
Thread overview: 72+ messages
2026-05-13 13:04 [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Muchun Song
2026-05-13 13:04 ` [PATCH v2 01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages Muchun Song
2026-05-13 13:04 ` [PATCH v2 02/69] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios() Muchun Song
2026-05-13 13:04 ` [PATCH v2 03/69] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population Muchun Song
2026-05-13 13:04 ` [PATCH v2 04/69] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier Muchun Song
2026-05-13 13:04 ` [PATCH v2 05/69] mm/mm_init: Simplify deferred_free_pages() migratetype init Muchun Song
2026-05-13 13:04 ` [PATCH v2 06/69] mm/sparse: Panic on memmap and usemap allocation failure Muchun Song
2026-05-13 13:04 ` [PATCH v2 07/69] mm/sparse: Move subsection_map_init() into sparse_init() Muchun Song
2026-05-13 13:04 ` [PATCH v2 08/69] mm/mm_init: Defer sparse_init() until after zone initialization Muchun Song
2026-05-13 13:04 ` [PATCH v2 09/69] mm/mm_init: Defer hugetlb reservation " Muchun Song
2026-05-13 13:04 ` [PATCH v2 10/69] mm/mm_init: Remove set_pageblock_order() call from sparse_init() Muchun Song
2026-05-13 13:04 ` [PATCH v2 11/69] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
2026-05-13 13:04 ` [PATCH v2 12/69] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time Muchun Song
2026-05-13 13:04 ` [PATCH v2 13/69] mm/hugetlb: Refactor early boot gigantic hugepage allocation Muchun Song
2026-05-13 13:04 ` [PATCH v2 14/69] mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation Muchun Song
2026-05-13 13:04 ` [PATCH v2 15/69] mm/hugetlb_vmemmap: Move bootmem HVO setup to early init Muchun Song
2026-05-13 13:04 ` [PATCH v2 16/69] mm/hugetlb: Remove obsolete bootmem cross-zone checks Muchun Song
2026-05-13 13:04 ` [PATCH v2 17/69] mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late() Muchun Song
2026-05-13 13:04 ` [PATCH v2 18/69] mm/hugetlb: Remove unused bootmem cma field Muchun Song
2026-05-13 13:04 ` [PATCH v2 19/69] mm/mm_init: Make __init_page_from_nid() static Muchun Song
2026-05-13 13:04 ` [PATCH v2 20/69] mm/sparse-vmemmap: Drop VMEMMAP_POPULATE_PAGEREF Muchun Song
2026-05-13 13:04 ` [PATCH v2 21/69] mm: Rename vmemmap optimization macros around folio semantics Muchun Song
2026-05-13 13:04 ` [PATCH v2 22/69] mm/sparse: Drop power-of-2 size requirement for struct mem_section Muchun Song
2026-05-13 13:04 ` [PATCH v2 23/69] mm/sparse-vmemmap: track compound page order in " Muchun Song
2026-05-13 13:04 ` [PATCH v2 24/69] mm/mm_init: Skip initializing shared vmemmap tail pages Muchun Song
2026-05-13 13:04 ` [PATCH v2 25/69] mm/sparse-vmemmap: Initialize shared tail vmemmap pages on allocation Muchun Song
2026-05-13 13:04 ` [PATCH v2 26/69] mm/sparse-vmemmap: Support section-based vmemmap accounting Muchun Song
2026-05-13 13:04 ` [PATCH v2 27/69] mm/sparse-vmemmap: Support section-based vmemmap optimization Muchun Song
2026-05-13 13:04 ` [PATCH v2 28/69] mm/hugetlb: Use generic vmemmap optimization macros Muchun Song
2026-05-13 13:04 ` [PATCH v2 29/69] mm/sparse: Mark memblocks present earlier Muchun Song
2026-05-13 13:04 ` [PATCH v2 30/69] mm/hugetlb: Switch HugeTLB to section-based vmemmap optimization Muchun Song
2026-05-13 13:04 ` [PATCH v2 31/69] mm/sparse: Remove section_map_size() Muchun Song
2026-05-13 13:05 ` [PATCH v2 32/69] mm/mm_init: Factor out pfn_to_zone() as a shared helper Muchun Song
2026-05-13 13:05 ` [PATCH v2 33/69] mm/sparse: Remove SPARSEMEM_VMEMMAP_PREINIT Muchun Song
2026-05-13 13:05 ` [PATCH v2 34/69] mm/sparse: Inline usemap allocation into sparse_init_nid() Muchun Song
2026-05-13 13:05 ` [PATCH v2 35/69] mm/hugetlb: Remove HUGE_BOOTMEM_HVO Muchun Song
2026-05-13 13:05 ` [PATCH v2 36/69] mm/hugetlb: Remove HUGE_BOOTMEM_CMA Muchun Song
2026-05-13 13:05 ` [PATCH v2 37/69] mm/sparse-vmemmap: Factor out shared vmemmap page allocation Muchun Song
2026-05-13 13:05 ` [PATCH v2 38/69] mm/sparse-vmemmap: Introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION Muchun Song
2026-05-13 13:05 ` [PATCH v2 39/69] mm/sparse-vmemmap: Switch DAX to vmemmap_shared_tail_page() Muchun Song
2026-05-13 13:05 ` [PATCH v2 40/69] powerpc/mm: " Muchun Song
2026-05-13 13:05 ` [PATCH v2 41/69] mm/sparse-vmemmap: Drop the extra tail page from DAX reservation Muchun Song
2026-05-13 13:05 ` [PATCH v2 42/69] mm/sparse-vmemmap: Switch DAX to section-based vmemmap optimization Muchun Song
2026-05-13 13:05 ` [PATCH v2 43/69] mm/sparse-vmemmap: Unify DAX and HugeTLB population paths Muchun Song
2026-05-13 13:05 ` [PATCH v2 44/69] mm/sparse-vmemmap: Remove the unused ptpfn argument Muchun Song
2026-05-13 13:05 ` [PATCH v2 45/69] powerpc/mm: Make vmemmap_populate_compound_pages() static Muchun Song
2026-05-13 13:05 ` [PATCH v2 46/69] mm/sparse-vmemmap: Map shared vmemmap tail pages read-only Muchun Song
2026-05-13 13:20 ` [PATCH v2 47/69] powerpc/mm: " Muchun Song
2026-05-13 13:20 ` [PATCH v2 48/69] mm/sparse-vmemmap: Inline vmemmap_populate_address() into its caller Muchun Song
2026-05-13 13:20 ` [PATCH v2 49/69] mm/hugetlb_vmemmap: Remove vmemmap_wrprotect_hvo() Muchun Song
2026-05-13 13:20 ` [PATCH v2 50/69] mm/sparse: Simplify section_nr_vmemmap_pages() Muchun Song
2026-05-13 13:20 ` [PATCH v2 51/69] mm/sparse-vmemmap: Introduce vmemmap_nr_struct_pages() Muchun Song
2026-05-13 13:20 ` [PATCH v2 52/69] powerpc/mm: Drop powerpc vmemmap_can_optimize() Muchun Song
2026-05-13 13:20 ` [PATCH v2 53/69] mm/sparse-vmemmap: Drop vmemmap_can_optimize() Muchun Song
2026-05-13 13:20 ` [PATCH v2 54/69] mm/sparse-vmemmap: Drop @pgmap from vmemmap population APIs Muchun Song
2026-05-13 13:20 ` [PATCH v2 55/69] mm/sparse: Decouple section activation from ZONE_DEVICE Muchun Song
2026-05-13 13:20 ` [PATCH v2 56/69] mm: Redefine HVO as Hugepage Vmemmap Optimization Muchun Song
2026-05-13 13:20 ` [PATCH v2 57/69] mm/sparse-vmemmap: Consolidate HVO enable checks Muchun Song
2026-05-13 13:20 ` [PATCH v2 58/69] mm/hugetlb: Make HVO optimizable checks depend on generic logic Muchun Song
2026-05-13 13:20 ` [PATCH v2 59/69] mm/sparse-vmemmap: Localize init_compound_tail() Muchun Song
2026-05-13 13:20 ` [PATCH v2 60/69] mm/mm_init: Check zone consistency on optimized vmemmap sections Muchun Song
2026-05-13 13:20 ` [PATCH v2 61/69] mm/hugetlb: Drop boot-time HVO handling for gigantic folios Muchun Song
2026-05-13 13:20 ` [PATCH v2 62/69] mm/hugetlb: Simplify hugetlb_folio_init_vmemmap() Muchun Song
2026-05-13 13:20 ` [PATCH v2 63/69] mm/hugetlb: Initialize the full bootmem hugepage in hugetlb code Muchun Song
2026-05-13 13:20 ` [PATCH v2 64/69] mm/mm_init: Factor out compound page initialization Muchun Song
2026-05-13 13:20 ` [PATCH v2 65/69] mm/mm_init: Make __init_single_page() static Muchun Song
2026-05-13 13:20 ` [PATCH v2 66/69] mm/cma: Move CMA pageblock initialization into cma_activate_area() Muchun Song
2026-05-13 13:20 ` [PATCH v2 67/69] mm/cma: Move init_cma_pageblock() into cma.c Muchun Song
2026-05-13 13:20 ` [PATCH v2 68/69] mm/mm_init: Initialize pageblock migratetype in memmap init helpers Muchun Song
2026-05-13 13:20 ` [PATCH v2 69/69] Documentation/mm: Rewrite vmemmap_dedup.rst for unified HVO Muchun Song
2026-05-13 17:46 ` [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX Andrew Morton
2026-05-13 18:26 ` Oscar Salvador