public inbox for linux-kernel@vger.kernel.org
* [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB
@ 2026-04-05 12:51 Muchun Song
  2026-04-05 12:51 ` [PATCH 01/49] mm/sparse: fix vmemmap accounting imbalance on memory hotplug error Muchun Song
                   ` (49 more replies)
  0 siblings, 50 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Overview:
This patch series generalizes the HugeTLB Vmemmap Optimization (HVO)
into a generic vmemmap optimization framework that can be used by both
HugeTLB and DAX.

Background:
Currently, the vmemmap optimization feature is highly coupled with
HugeTLB. However, DAX also has similar requirements for optimizing vmemmap
pages to save memory. The current implementation has separate vmemmap
optimization paths for HugeTLB and DAX, leading to duplicated logic,
complex initialization sequences, and architecture-specific flags.

Implementation:
This series breaks down the optimization into a generic framework:
- Patches 1-6: Fix bugs related to sparse vmemmap initialization and DAX.
- Patches 7-13: Refactor the existing sparse vmemmap initialization.
- Patches 14-26: Decouple the vmemmap optimization from HugeTLB and
  introduce generic optimization macros and functions.
- Patches 27-39: Switch HugeTLB and DAX to use the generic framework.
- Patches 40-49: Clean up and simplify the old HVO-specific code.

Benefits:
- When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, all struct pages
  utilizing HVO (HugeTLB Vmemmap Optimization) skip initialization in
  memmap_init, significantly accelerating boot times.
- All architectures supporting HVO benefit from the optimizations 
  provided by SPARSEMEM_VMEMMAP_PREINIT without requiring 
  architecture-specific adaptations.
- Device DAX struct page savings are further improved, saving an 
  additional 4KB of struct page memory for every 2MB huge page.
- Vmemmap tail pages used for Device DAX shared mappings are changed 
  from read-write to read-only, enhancing system security.
- HugeTLB and Device DAX now share a unified vmemmap optimization 
  framework, reducing long-term maintenance overhead.

Testing:
- Verification: Compiled and tested on the x86 architecture.


Chengkaitao (1):
  mm: Convert vmemmap_p?d_populate() to static functions

Muchun Song (48):
  mm/sparse: fix vmemmap accounting imbalance on memory hotplug error
  mm/sparse: add a @pgmap argument to memory deactivation paths
  mm/sparse: fix vmemmap page accounting for HVOed DAX
  mm/sparse: add a @pgmap parameter to arch vmemmap_populate()
  mm/sparse: fix missing architecture-specific page table sync for HVO
    DAX
  mm/mm_init: fix uninitialized pageblock migratetype for ZONE_DEVICE
    compound pages
  mm/mm_init: use pageblock_migratetype_init_range() in
    deferred_free_pages()
  mm: panic on memory allocation failure in sparse_init_nid()
  mm: move subsection_map_init() into sparse_init()
  mm: defer sparse_init() until after zone initialization
  mm: make set_pageblock_order() static
  mm: integrate sparse_vmemmap_init_nid_late() into sparse_init_nid()
  mm/cma: validate hugetlb CMA range by zone at reserve time
  mm/hugetlb: free cross-zone bootmem gigantic pages after allocation
  mm/hugetlb: initialize vmemmap optimization in early stage
  mm: remove sparse_vmemmap_init_nid_late()
  mm/mm_init: make __init_page_from_nid() static
  mm/sparse-vmemmap: remove the VMEMMAP_POPULATE_PAGEREF flag
  mm: rename vmemmap optimization macros to generic names
  mm/sparse: drop power-of-2 size requirement for struct mem_section
  mm/sparse: introduce compound page order to mem_section
  mm/mm_init: skip initializing shared tail pages for compound pages
  mm/sparse-vmemmap: initialize shared tail vmemmap page upon allocation
  mm/sparse-vmemmap: support vmemmap-optimizable compound page
    population
  mm/hugetlb: use generic vmemmap optimization macros
  mm: call memblocks_present() before HugeTLB initialization
  mm/hugetlb: switch HugeTLB to use generic vmemmap optimization
  mm: extract pfn_to_zone() helper
  mm/sparse-vmemmap: remove unused SPARSEMEM_VMEMMAP_PREINIT feature
  mm/hugetlb: remove HUGE_BOOTMEM_HVO flag and simplify pre-HVO logic
  mm/sparse-vmemmap: consolidate shared tail page allocation
  mm: introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
  mm/sparse-vmemmap: switch DAX to use generic vmemmap optimization
  mm/sparse-vmemmap: introduce section zone to struct mem_section
  powerpc/mm: use generic vmemmap_shared_tail_page() in compound vmemmap
  mm/sparse-vmemmap: unify DAX and HugeTLB vmemmap optimization
  mm/sparse-vmemmap: remap the shared tail pages as read-only
  mm/sparse-vmemmap: remove unused ptpfn argument
  mm/hugetlb_vmemmap: remove vmemmap_wrprotect_hvo() and related code
  mm/sparse: simplify section_vmemmap_pages()
  mm/sparse-vmemmap: introduce section_vmemmap_page_structs()
  powerpc/mm: rely on generic vmemmap_can_optimize() to simplify code
  mm/sparse-vmemmap: drop ARCH_WANT_OPTIMIZE_DAX_VMEMMAP and simplify
    checks
  mm/sparse-vmemmap: drop @pgmap parameter from vmemmap populate APIs
  mm/sparse: replace pgmap with order and zone in sparse_add_section()
  mm: redefine HVO as Hugepage Vmemmap Optimization
  Documentation/mm: restructure vmemmap_dedup.rst to reflect generalized
    HVO
  mm: consolidate struct page power-of-2 size checks for HVO

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/admin-guide/sysctl/vm.rst       |   2 +-
 Documentation/mm/vmemmap_dedup.rst            | 218 ++--------
 arch/powerpc/Kconfig                          |   1 -
 arch/powerpc/include/asm/book3s/64/radix.h    |  12 -
 arch/powerpc/mm/book3s64/radix_pgtable.c      | 114 +----
 arch/powerpc/mm/init_64.c                     |   1 +
 arch/riscv/Kconfig                            |   1 -
 arch/x86/Kconfig                              |   2 -
 fs/Kconfig                                    |   6 +-
 include/linux/hugetlb.h                       |   7 +-
 include/linux/memory_hotplug.h                |   2 +-
 include/linux/mm.h                            |  50 +--
 include/linux/mm_types.h                      |   2 +
 include/linux/mm_types_task.h                 |   4 +
 include/linux/mmzone.h                        | 143 ++++---
 include/linux/page-flags.h                    |  31 +-
 kernel/bounds.c                               |   2 +
 mm/Kconfig                                    |  20 +-
 mm/bootmem_info.c                             |   5 +-
 mm/cma.c                                      |   3 +-
 mm/hugetlb.c                                  | 143 +++----
 mm/hugetlb_cma.c                              |   3 +-
 mm/hugetlb_vmemmap.c                          | 237 +----------
 mm/hugetlb_vmemmap.h                          |  35 +-
 mm/internal.h                                 |  26 +-
 mm/memory_hotplug.c                           |  15 +-
 mm/mm_init.c                                  | 138 +++---
 mm/sparse-vmemmap.c                           | 392 +++++-------------
 mm/sparse.c                                   |  85 ++--
 mm/util.c                                     |   2 +-
 scripts/gdb/linux/mm.py                       |   6 +-
 32 files changed, 513 insertions(+), 1197 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 01/49] mm/sparse: fix vmemmap accounting imbalance on memory hotplug error
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
@ 2026-04-05 12:51 ` Muchun Song
  2026-04-05 12:51 ` [PATCH 02/49] mm/sparse: add a @pgmap argument to memory deactivation paths Muchun Song
                   ` (48 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

In section_activate(), if populate_section_memmap() fails, the error
handling path calls section_deactivate() to roll back the state. This
approach introduces an accounting imbalance.

Since commit c3576889d87b ("mm: fix accounting of memmap pages"),
memmap pages are accounted for only after populate_section_memmap()
succeeds. However, section_deactivate() unconditionally decrements the
vmemmap accounting. Consequently, a failure in populate_section_memmap()
leads to an underflow in the system's vmemmap page statistics.

Fix this by incrementing the vmemmap accounting unconditionally, before
checking whether populate_section_memmap() succeeded. If it fails, the
subsequent call to section_deactivate() decrements the accounting,
exactly offsetting the increment and keeping the counters balanced.

Fixes: c3576889d87b ("mm: fix accounting of memmap pages")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/sparse-vmemmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 6eadb9d116e4..ee27d0c0efe2 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -822,11 +822,11 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 		return pfn_to_page(pfn);
 
 	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
+	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
 	}
-	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
 
 	return memmap;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 02/49] mm/sparse: add a @pgmap argument to memory deactivation paths
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
  2026-04-05 12:51 ` [PATCH 01/49] mm/sparse: fix vmemmap accounting imbalance on memory hotplug error Muchun Song
@ 2026-04-05 12:51 ` Muchun Song
  2026-04-05 12:51 ` [PATCH 03/49] mm/sparse: fix vmemmap page accounting for HVOed DAX Muchun Song
                   ` (47 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Currently, memory hot-remove paths do not pass the struct dev_pagemap
pointer down to section_deactivate(). This prevents the lower levels
from knowing whether the section was originally populated with vmemmap
optimizations (e.g., DAX with HVO enabled).

Without this information, we cannot call vmemmap_can_optimize() to
determine if the vmemmap pages were optimized. As a result, the vmemmap
page accounting during teardown will mistakenly assume a non-optimized
allocation, leading to incorrect page statistics.

To lay the groundwork for fixing the vmemmap page accounting, pass the
@pgmap pointer down to section_deactivate(). Plumb the @pgmap argument
through arch_remove_memory(), __remove_pages() and
sparse_remove_section(), mirroring the corresponding *_activate()
paths.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/arm64/mm/mmu.c            |  5 +++--
 arch/loongarch/mm/init.c       |  5 +++--
 arch/powerpc/mm/mem.c          |  5 +++--
 arch/riscv/mm/init.c           |  5 +++--
 arch/s390/mm/init.c            |  5 +++--
 arch/x86/mm/init_64.c          |  5 +++--
 include/linux/memory_hotplug.h |  8 +++++---
 mm/memory_hotplug.c            | 12 ++++++------
 mm/memremap.c                  |  4 ++--
 mm/sparse-vmemmap.c            |  8 ++++----
 10 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ec1c6971a561..dc8a8281888c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1994,12 +1994,13 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return ret;
 }
 
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
+			struct dev_pagemap *pgmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
 	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
 }
 
diff --git a/arch/loongarch/mm/init.c b/arch/loongarch/mm/init.c
index 00f3822b6e47..c9c57f08fa2c 100644
--- a/arch/loongarch/mm/init.c
+++ b/arch/loongarch/mm/init.c
@@ -86,7 +86,8 @@ int arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
 	return ret;
 }
 
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
+			struct dev_pagemap *pgmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -95,7 +96,7 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 	/* With altmap the first mapped page is offset from @start */
 	if (altmap)
 		page += vmem_altmap_offset(altmap);
-	__remove_pages(start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
 }
 #endif
 
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 648d0c5602ec..4c1afab91996 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -158,12 +158,13 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
 	return rc;
 }
 
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
+			      struct dev_pagemap *pgmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
 	arch_remove_linear_mapping(start, size);
 }
 #endif
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 5142ca80be6f..980f693e6b19 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1810,9 +1810,10 @@ int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *param
 	return ret;
 }
 
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
+			      struct dev_pagemap *pgmap)
 {
-	__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
+	__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap, pgmap);
 	remove_linear_mapping(start, size);
 	flush_tlb_all();
 }
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 1f72efc2a579..11a689423440 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -276,12 +276,13 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return rc;
 }
 
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
+			struct dev_pagemap *pgmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
 	vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index df2261fa4f98..77b889b71cf3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1288,12 +1288,13 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	remove_pagetable(start, end, true, NULL);
 }
 
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
+			      struct dev_pagemap *pgmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
 	kernel_physical_mapping_remove(start, start + size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 815e908c4135..7c9d66729c60 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -135,9 +135,10 @@ static inline bool movable_node_is_enabled(void)
 	return movable_node_enabled;
 }
 
-extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap);
+extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
+			       struct dev_pagemap *pgmap);
 extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
-			   struct vmem_altmap *altmap);
+			   struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
@@ -307,7 +308,8 @@ extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
-				  struct vmem_altmap *altmap);
+				  struct vmem_altmap *altmap,
+				  struct dev_pagemap *pgmap);
 extern struct zone *zone_for_pfn_range(enum mmop online_type,
 		int nid, struct memory_group *group, unsigned long start_pfn,
 		unsigned long nr_pages);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8b18ddd1e7d5..05f5df12d843 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -583,7 +583,7 @@ void remove_pfn_range_from_zone(struct zone *zone,
  * calling offline_pages().
  */
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
-		    struct vmem_altmap *altmap)
+		    struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 	const unsigned long end_pfn = pfn + nr_pages;
 	unsigned long cur_nr_pages;
@@ -598,7 +598,7 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		sparse_remove_section(pfn, cur_nr_pages, altmap);
+		sparse_remove_section(pfn, cur_nr_pages, altmap, pgmap);
 	}
 }
 
@@ -1418,7 +1418,7 @@ static void remove_memory_blocks_and_altmaps(u64 start, u64 size)
 
 		remove_memory_block_devices(cur_start, memblock_size);
 
-		arch_remove_memory(cur_start, memblock_size, altmap);
+		arch_remove_memory(cur_start, memblock_size, altmap, NULL);
 
 		/* Verify that all vmemmap pages have actually been freed. */
 		WARN(altmap->alloc, "Altmap not fully unmapped");
@@ -1461,7 +1461,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
 		ret = create_memory_block_devices(cur_start, memblock_size, nid,
 						  params.altmap, group);
 		if (ret) {
-			arch_remove_memory(cur_start, memblock_size, NULL);
+			arch_remove_memory(cur_start, memblock_size, NULL, NULL);
 			kfree(params.altmap);
 			goto out;
 		}
@@ -1547,7 +1547,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		/* create memory block devices after memory was added */
 		ret = create_memory_block_devices(start, size, nid, NULL, group);
 		if (ret) {
-			arch_remove_memory(start, size, params.altmap);
+			arch_remove_memory(start, size, params.altmap, NULL);
 			goto error;
 		}
 	}
@@ -2246,7 +2246,7 @@ static int try_remove_memory(u64 start, u64 size)
 		 * No altmaps present, do the removal directly
 		 */
 		remove_memory_block_devices(start, size);
-		arch_remove_memory(start, size, NULL);
+		arch_remove_memory(start, size, NULL, NULL);
 	} else {
 		/* all memblocks in the range have altmaps */
 		remove_memory_blocks_and_altmaps(start, size);
diff --git a/mm/memremap.c b/mm/memremap.c
index ac7be07e3361..c45b90f334ea 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -97,10 +97,10 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 				   PHYS_PFN(range_len(range)));
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		__remove_pages(PHYS_PFN(range->start),
-			       PHYS_PFN(range_len(range)), NULL);
+			       PHYS_PFN(range_len(range)), NULL, pgmap);
 	} else {
 		arch_remove_memory(range->start, range_len(range),
-				pgmap_altmap(pgmap));
+				pgmap_altmap(pgmap), pgmap);
 		kasan_remove_zero_shadow(__va(range->start), range_len(range));
 	}
 	mem_hotplug_done();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index ee27d0c0efe2..7aa9a97498eb 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -737,7 +737,7 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
  * usage map, but still need to free the vmemmap range.
  */
 static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	bool section_is_early = early_section(ms);
@@ -824,7 +824,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
 	if (!memmap) {
-		section_deactivate(pfn, nr_pages, altmap);
+		section_deactivate(pfn, nr_pages, altmap, pgmap);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -885,13 +885,13 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 }
 
 void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
-			   struct vmem_altmap *altmap)
+			   struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 
 	if (WARN_ON_ONCE(!valid_section(ms)))
 		return;
 
-	section_deactivate(pfn, nr_pages, altmap);
+	section_deactivate(pfn, nr_pages, altmap, pgmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 03/49] mm/sparse: fix vmemmap page accounting for HVOed DAX
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
  2026-04-05 12:51 ` [PATCH 01/49] mm/sparse: fix vmemmap accounting imbalance on memory hotplug error Muchun Song
  2026-04-05 12:51 ` [PATCH 02/49] mm/sparse: add a @pgmap argument to memory deactivation paths Muchun Song
@ 2026-04-05 12:51 ` Muchun Song
  2026-04-05 12:51 ` [PATCH 04/49] mm/sparse: add a @pgmap parameter to arch vmemmap_populate() Muchun Song
                   ` (46 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

When HVO is enabled for DAX, the vmemmap page accounting is wrong
because it always assumes the full, non-optimized vmemmap size.

Fix the accounting by introducing section_vmemmap_pages(), which
returns the exact number of vmemmap pages actually used for the given
pfn range.

Fixes: 15995a352474 ("mm: report per-page metadata information")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/sparse-vmemmap.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 7aa9a97498eb..0ef96b1afbcc 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -724,6 +724,27 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 	return rc;
 }
 
+static int __meminit section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
+					   struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+{
+	unsigned int order = pgmap ? pgmap->vmemmap_shift : 0;
+	unsigned long pages_per_compound = 1L << order;
+
+	VM_BUG_ON(!IS_ALIGNED(pfn | nr_pages, min(pages_per_compound, PAGES_PER_SECTION)));
+	VM_BUG_ON(pfn_to_section_nr(pfn) != pfn_to_section_nr(pfn + nr_pages - 1));
+
+	if (!vmemmap_can_optimize(altmap, pgmap))
+		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
+
+	if (order < PFN_SECTION_SHIFT)
+		return VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;
+
+	if (IS_ALIGNED(pfn, pages_per_compound))
+		return VMEMMAP_RESERVE_NR;
+
+	return 0;
+}
+
 /*
  * To deactivate a memory region, there are 3 cases to handle:
  *
@@ -775,11 +796,12 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 	 * section_activate() and pfn_valid() .
 	 */
 	if (!section_is_early) {
-		memmap_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE)));
+		memmap_pages_add(-section_vmemmap_pages(pfn, nr_pages, altmap,
+							pgmap));
 		depopulate_section_memmap(pfn, nr_pages, altmap);
 	} else if (memmap) {
-		memmap_boot_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page),
-							  PAGE_SIZE)));
+		memmap_pages_add(-section_vmemmap_pages(pfn, nr_pages, altmap,
+							pgmap));
 		free_map_bootmem(memmap);
 	}
 
@@ -822,7 +844,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 		return pfn_to_page(pfn);
 
 	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
-	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
+	memmap_pages_add(section_vmemmap_pages(pfn, nr_pages, altmap, pgmap));
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap, pgmap);
 		return ERR_PTR(-ENOMEM);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 04/49] mm/sparse: add a @pgmap parameter to arch vmemmap_populate()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (2 preceding siblings ...)
  2026-04-05 12:51 ` [PATCH 03/49] mm/sparse: fix vmemmap page accounting for HVOed DAX Muchun Song
@ 2026-04-05 12:51 ` Muchun Song
  2026-04-05 12:51 ` [PATCH 05/49] mm/sparse: fix missing architecture-specific page table sync for HVO DAX Muchun Song
                   ` (45 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Add a struct dev_pagemap pointer parameter to the architecture-specific
vmemmap_populate(), vmemmap_populate_hugepages() and
vmemmap_populate_basepages() functions.

Currently, the vmemmap optimization for DAX is handled mostly in an
architecture-agnostic way via vmemmap_populate_compound_pages().
However, this approach skips crucial architecture-specific initialization
steps. For example, the x86 path must call sync_global_pgds() after
populating the vmemmap, which is currently being bypassed.

To fix this, we need to push the awareness of device memory optimization
(via the pgmap) down into the architecture-specific vmemmap_populate()
paths. This will allow each architecture to handle the optimization while
ensuring their specific initialization routines (like page directory
synchronization) are correctly invoked.

This is a preparatory patch only; it changes no behavior. The actual
architecture-specific implementations and fixes will follow.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/arm64/mm/mmu.c                        |  6 +++---
 arch/loongarch/mm/init.c                   |  7 ++++---
 arch/powerpc/include/asm/book3s/64/radix.h |  3 ++-
 arch/powerpc/mm/book3s64/radix_pgtable.c   |  2 +-
 arch/powerpc/mm/init_64.c                  |  4 ++--
 arch/riscv/mm/init.c                       |  4 ++--
 arch/s390/mm/vmem.c                        |  2 +-
 arch/sparc/mm/init_64.c                    |  5 +++--
 arch/x86/mm/init_64.c                      |  8 ++++----
 include/linux/mm.h                         |  8 +++++---
 mm/hugetlb_vmemmap.c                       |  4 ++--
 mm/sparse-vmemmap.c                        | 10 ++++++----
 12 files changed, 35 insertions(+), 28 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index dc8a8281888c..86162aab5185 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1760,7 +1760,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
 	/* [start, end] should be within one section */
@@ -1768,9 +1768,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 
 	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
 	    (end - start < PAGES_PER_SECTION * sizeof(struct page)))
-		return vmemmap_populate_basepages(start, end, node, altmap);
+		return vmemmap_populate_basepages(start, end, node, altmap, pgmap);
 	else
-		return vmemmap_populate_hugepages(start, end, node, altmap);
+		return vmemmap_populate_hugepages(start, end, node, altmap, pgmap);
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/arch/loongarch/mm/init.c b/arch/loongarch/mm/init.c
index c9c57f08fa2c..d61c2e09caae 100644
--- a/arch/loongarch/mm/init.c
+++ b/arch/loongarch/mm/init.c
@@ -123,12 +123,13 @@ int __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end,
-			       int node, struct vmem_altmap *altmap)
+			       int node, struct vmem_altmap *altmap,
+			       struct dev_pagemap *pgmap)
 {
 #if CONFIG_PGTABLE_LEVELS == 2
-	return vmemmap_populate_basepages(start, end, node, NULL);
+	return vmemmap_populate_basepages(start, end, node, NULL, pgmap);
 #else
-	return vmemmap_populate_hugepages(start, end, node, NULL);
+	return vmemmap_populate_hugepages(start, end, node, NULL, pgmap);
 #endif
 }
 
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index da954e779744..bde07c6f900f 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -321,7 +321,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
 					     unsigned long page_size,
 					     unsigned long phys);
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
-				      int node, struct vmem_altmap *altmap);
+				      int node, struct vmem_altmap *altmap,
+				      struct dev_pagemap *pgmap);
 void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
 			       struct vmem_altmap *altmap);
 extern void radix__vmemmap_remove_mapping(unsigned long start,
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 10aced261cff..568500343e5f 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1112,7 +1112,7 @@ static inline pte_t *vmemmap_pte_alloc(pmd_t *pmdp, int node,
 
 
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node,
-				      struct vmem_altmap *altmap)
+				      struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 	unsigned long addr;
 	unsigned long next;
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index b6f3ae03ca9e..8f4aa5b32186 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -275,12 +275,12 @@ static int __meminit __vmemmap_populate(unsigned long start, unsigned long end,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-			       struct vmem_altmap *altmap)
+			       struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	if (radix_enabled())
-		return radix__vmemmap_populate(start, end, node, altmap);
+		return radix__vmemmap_populate(start, end, node, altmap, pgmap);
 #endif
 
 	return __vmemmap_populate(start, end, node, altmap);
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 980f693e6b19..277c89661dff 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1443,7 +1443,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-			       struct vmem_altmap *altmap)
+			       struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 	/*
 	 * Note that SPARSEMEM_VMEMMAP is only selected for rv64 and that we
@@ -1451,7 +1451,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	 * memory hotplug, we are not able to update all the page tables with
 	 * the new PMDs.
 	 */
-	return vmemmap_populate_hugepages(start, end, node, altmap);
+	return vmemmap_populate_hugepages(start, end, node, altmap, pgmap);
 }
 #endif
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index eeadff45e0e1..a7bf8d3d5601 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -506,7 +506,7 @@ static void vmem_remove_range(unsigned long start, unsigned long size)
  * Add a backed mem_map array to the virtual mem_map array.
  */
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-			       struct vmem_altmap *altmap)
+			       struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 	int ret;
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 367c269305e5..f870ca330f9e 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2591,9 +2591,10 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
-			       int node, struct vmem_altmap *altmap)
+			       int node, struct vmem_altmap *altmap,
+			       struct dev_pagemap *pgmap)
 {
-	return vmemmap_populate_hugepages(vstart, vend, node, NULL);
+	return vmemmap_populate_hugepages(vstart, vend, node, NULL, pgmap);
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 77b889b71cf3..e18cc81a30b4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1557,7 +1557,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
 {
 	int err;
 
@@ -1565,15 +1565,15 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	VM_BUG_ON(!PAGE_ALIGNED(end));
 
 	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
-		err = vmemmap_populate_basepages(start, end, node, NULL);
+		err = vmemmap_populate_basepages(start, end, node, NULL, pgmap);
 	else if (boot_cpu_has(X86_FEATURE_PSE))
-		err = vmemmap_populate_hugepages(start, end, node, altmap);
+		err = vmemmap_populate_hugepages(start, end, node, altmap, pgmap);
 	else if (altmap) {
 		pr_err_once("%s: no cpu support for altmap allocations\n",
 				__func__);
 		err = -ENOMEM;
 	} else
-		err = vmemmap_populate_basepages(start, end, node, NULL);
+		err = vmemmap_populate_basepages(start, end, node, NULL, pgmap);
 	if (!err)
 		sync_global_pgds(start, end - 1);
 	return err;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0b776907152e..bebc5f892f81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4877,11 +4877,13 @@ void vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
 int vmemmap_check_pmd(pmd_t *pmd, int node,
 		      unsigned long addr, unsigned long next);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
-			       int node, struct vmem_altmap *altmap);
+			       int node, struct vmem_altmap *altmap,
+			       struct dev_pagemap *pgmap);
 int vmemmap_populate_hugepages(unsigned long start, unsigned long end,
-			       int node, struct vmem_altmap *altmap);
+			       int node, struct vmem_altmap *altmap,
+			       struct dev_pagemap *pgmap);
 int vmemmap_populate(unsigned long start, unsigned long end, int node,
-		struct vmem_altmap *altmap);
+		struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
 int vmemmap_populate_hvo(unsigned long start, unsigned long end,
 			 unsigned int order, struct zone *zone,
 			 unsigned long headsize);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4a077d231d3a..50b7123f3bdd 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -829,7 +829,7 @@ void __init hugetlb_vmemmap_init_late(int nid)
 			 */
 			list_del(&m->list);
 
-			vmemmap_populate(start, end, nid, NULL);
+			vmemmap_populate(start, end, nid, NULL, NULL);
 			nr_mmap = end - start;
 			memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
 
@@ -845,7 +845,7 @@ void __init hugetlb_vmemmap_init_late(int nid)
 		if (vmemmap_populate_hvo(start, end, huge_page_order(h), zone,
 					 HUGETLB_VMEMMAP_RESERVE_SIZE) < 0) {
 			/* Fallback if HVO population fails */
-			vmemmap_populate(start, end, nid, NULL);
+			vmemmap_populate(start, end, nid, NULL, NULL);
 			nr_mmap = end - start;
 		} else {
 			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 0ef96b1afbcc..387337bba05e 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -297,7 +297,8 @@ static int __meminit vmemmap_populate_range(unsigned long start,
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+					 int node, struct vmem_altmap *altmap,
+					 struct dev_pagemap *pgmap)
 {
 	return vmemmap_populate_range(start, end, node, altmap, -1, 0);
 }
@@ -400,7 +401,8 @@ int __weak __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
 }
 
 int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+					 int node, struct vmem_altmap *altmap,
+					 struct dev_pagemap *pgmap)
 {
 	unsigned long addr;
 	unsigned long next;
@@ -445,7 +447,7 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 			}
 		} else if (vmemmap_check_pmd(pmd, node, addr, next))
 			continue;
-		if (vmemmap_populate_basepages(addr, next, node, altmap))
+		if (vmemmap_populate_basepages(addr, next, node, altmap, pgmap))
 			return -ENOMEM;
 	}
 	return 0;
@@ -559,7 +561,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 	if (vmemmap_can_optimize(altmap, pgmap))
 		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
 	else
-		r = vmemmap_populate(start, end, nid, altmap);
+		r = vmemmap_populate(start, end, nid, altmap, pgmap);
 
 	if (r < 0)
 		return NULL;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 05/49] mm/sparse: fix missing architecture-specific page table sync for HVO DAX
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (3 preceding siblings ...)
  2026-04-05 12:51 ` [PATCH 04/49] mm/sparse: add a @pgmap parameter to arch vmemmap_populate() Muchun Song
@ 2026-04-05 12:51 ` Muchun Song
  2026-04-05 12:51 ` [PATCH 06/49] mm/mm_init: fix uninitialized pageblock migratetype for ZONE_DEVICE compound pages Muchun Song
                   ` (44 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

On x86-64, vmemmap_populate() normally calls sync_global_pgds() to keep
the kernel page tables in sync. However, when DAX HVO is enabled,
__populate_section_memmap() calls vmemmap_populate_compound_pages()
directly and bypasses the architecture's vmemmap_populate(), so this
architecture-specific sync step is skipped and later accesses to the
vmemmap can fault on x86-64.

Fix this by pushing the HVO DAX decision down into the architecture's
vmemmap_populate() path:

- Architectures that do not use the generic vmemmap_populate_basepages()
  or vmemmap_populate_hugepages() helpers (e.g. powerpc) implement
  HVO DAX directly in their own vmemmap_populate().

- Architectures that rely on the generic helpers inherit the check
  automatically, since vmemmap_populate_basepages() and
  vmemmap_populate_hugepages() now handle the compound-page case
  themselves, and thus support HVO DAX without extra work.

With this change, x86-64 always enters its own vmemmap_populate() and
therefore never skips sync_global_pgds().

Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h |  6 ------
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 15 +++++++++-----
 mm/sparse-vmemmap.c                        | 24 +++++++++++-----------
 3 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index bde07c6f900f..2600defa2dc2 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -357,11 +357,5 @@ int radix__remove_section_mapping(unsigned long start, unsigned long end);
 #define vmemmap_can_optimize vmemmap_can_optimize
 bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
 #endif
-
-#define vmemmap_populate_compound_pages vmemmap_populate_compound_pages
-int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
-					      unsigned long start,
-					      unsigned long end, int node,
-					      struct dev_pagemap *pgmap);
 #endif /* __ASSEMBLER__ */
 #endif
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 568500343e5f..dfa2f7dc7e15 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1109,7 +1109,10 @@ static inline pte_t *vmemmap_pte_alloc(pmd_t *pmdp, int node,
 	return pte_offset_kernel(pmdp, address);
 }
 
-
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						    unsigned long start,
+						    unsigned long end, int node,
+						    struct dev_pagemap *pgmap);
 
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node,
 				      struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
@@ -1122,6 +1125,8 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (vmemmap_can_optimize(altmap, pgmap))
+		return vmemmap_populate_compound_pages(page_to_pfn((struct page *)start), start, end, node, pgmap);
 	/*
 	 * If altmap is present, Make sure we align the start vmemmap addr
 	 * to PAGE_SIZE so that we calculate the correct start_pfn in
@@ -1303,10 +1308,10 @@ static pte_t * __meminit vmemmap_compound_tail_page(unsigned long addr,
 	return pte;
 }
 
-int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
-					      unsigned long start,
-					      unsigned long end, int node,
-					      struct dev_pagemap *pgmap)
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
 {
 	/*
 	 * we want to map things as base page size mapping so that
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 387337bba05e..d3096de04cc6 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -296,10 +296,16 @@ static int __meminit vmemmap_populate_range(unsigned long start,
 	return 0;
 }
 
+static int __meminit vmemmap_populate_compound_pages(unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap);
+
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 					 int node, struct vmem_altmap *altmap,
 					 struct dev_pagemap *pgmap)
 {
+	if (vmemmap_can_optimize(altmap, pgmap))
+		return vmemmap_populate_compound_pages(start, end, node, pgmap);
 	return vmemmap_populate_range(start, end, node, altmap, -1, 0);
 }
 
@@ -411,6 +417,9 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 	pud_t *pud;
 	pmd_t *pmd;
 
+	if (vmemmap_can_optimize(altmap, pgmap))
+		return vmemmap_populate_compound_pages(start, end, node, pgmap);
+
 	for (addr = start; addr < end; addr = next) {
 		next = pmd_addr_end(addr, end);
 
@@ -453,7 +462,6 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 	return 0;
 }
 
-#ifndef vmemmap_populate_compound_pages
 /*
  * For compound pages bigger than section size (e.g. x86 1G compound
  * pages with 2M subsection size) fill the rest of sections as tail
@@ -491,14 +499,14 @@ static pte_t * __meminit compound_section_tail_page(unsigned long addr)
 	return pte;
 }
 
-static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
-						     unsigned long start,
+static int __meminit vmemmap_populate_compound_pages(unsigned long start,
 						     unsigned long end, int node,
 						     struct dev_pagemap *pgmap)
 {
 	unsigned long size, addr;
 	pte_t *pte;
 	int rc;
+	unsigned long start_pfn = page_to_pfn((struct page *)start);
 
 	if (reuse_compound_section(start_pfn, pgmap)) {
 		pte = compound_section_tail_page(start);
@@ -544,26 +552,18 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	return 0;
 }
 
-#endif
-
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
-	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_can_optimize(altmap, pgmap))
-		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
-	else
-		r = vmemmap_populate(start, end, nid, altmap, pgmap);
-
-	if (r < 0)
+	if (vmemmap_populate(start, end, nid, altmap, pgmap))
 		return NULL;
 
 	return pfn_to_page(pfn);
-- 
2.20.1



* [PATCH 06/49] mm/mm_init: fix uninitialized pageblock migratetype for ZONE_DEVICE compound pages
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (4 preceding siblings ...)
  2026-04-05 12:51 ` [PATCH 05/49] mm/sparse: fix missing architecture-specific page table sync for HVO DAX Muchun Song
@ 2026-04-05 12:51 ` Muchun Song
  2026-04-05 12:51 ` [PATCH 07/49] mm/mm_init: use pageblock_migratetype_init_range() in deferred_free_pages() Muchun Song
                   ` (43 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Previously, memmap_init_zone_device() initialized the migratetype of
only the first pageblock of each compound page. If a compound page
spans more than one pageblock (e.g. 1GB huge pages with 2MB
pageblocks), the subsequent pageblocks were left uninitialized.

Move the migratetype initialization out of __init_zone_device_page()
and into a new helper, pageblock_migratetype_init_range(), which
iterates over the entire PFN range so that every pageblock is
initialized.

Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/mm_init.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9a44e8458fed..4936ca78966c 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -674,6 +674,18 @@ static inline void fixup_hashdist(void)
 static inline void fixup_hashdist(void) {}
 #endif /* CONFIG_NUMA */
 
+static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
+						       unsigned long nr_pages,
+						       int migratetype)
+{
+	unsigned long end = pfn + nr_pages;
+
+	for (pfn = pageblock_align(pfn); pfn < end; pfn += pageblock_nr_pages) {
+		init_pageblock_migratetype(pfn_to_page(pfn), migratetype, false);
+		cond_resched();
+	}
+}
+
 /*
  * Initialize a reserved page unconditionally, finding its zone first.
  */
@@ -1011,21 +1023,6 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	page_folio(page)->pgmap = pgmap;
 	page->zone_device_data = NULL;
 
-	/*
-	 * Mark the block movable so that blocks are reserved for
-	 * movable at startup. This will force kernel allocations
-	 * to reserve their blocks rather than leaking throughout
-	 * the address space during boot when many long-lived
-	 * kernel allocations are made.
-	 *
-	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-	 * because this is done early in section_activate()
-	 */
-	if (pageblock_aligned(pfn)) {
-		init_pageblock_migratetype(page, MIGRATE_MOVABLE, false);
-		cond_resched();
-	}
-
 	/*
 	 * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
 	 * directly to the driver page allocator which will set the page count
@@ -1122,6 +1119,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 
+		cond_resched();
+
 		if (pfns_per_compound == 1)
 			continue;
 
@@ -1129,6 +1128,18 @@ void __ref memmap_init_zone_device(struct zone *zone,
 				     compound_nr_pages(altmap, pgmap));
 	}
 
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+	 * because this is done early in section_activate()
+	 */
+	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);
+
 	pr_debug("%s initialised %lu pages in %ums\n", __func__,
 		nr_pages, jiffies_to_msecs(jiffies - start));
 }
-- 
2.20.1



* [PATCH 07/49] mm/mm_init: use pageblock_migratetype_init_range() in deferred_free_pages()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (5 preceding siblings ...)
  2026-04-05 12:51 ` [PATCH 06/49] mm/mm_init: fix uninitialized pageblock migratetype for ZONE_DEVICE compound pages Muchun Song
@ 2026-04-05 12:51 ` Muchun Song
  2026-04-05 12:51 ` [PATCH 08/49] mm: Convert vmemmap_p?d_populate() to static functions Muchun Song
                   ` (42 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Simplify deferred_free_pages() by replacing the duplicated pageblock
migratetype initialization loops with a single call to
pageblock_migratetype_init_range().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/mm_init.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4936ca78966c..a92c5053f63d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1974,13 +1974,12 @@ static void __init deferred_free_pages(unsigned long pfn,
 	if (!nr_pages)
 		return;
 
+	pageblock_migratetype_init_range(pfn, nr_pages, MIGRATE_MOVABLE);
+
 	page = pfn_to_page(pfn);
 
 	/* Free a large naturally-aligned chunk if possible */
 	if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
-		for (i = 0; i < nr_pages; i += pageblock_nr_pages)
-			init_pageblock_migratetype(page + i, MIGRATE_MOVABLE,
-					false);
 		__free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
 		return;
 	}
@@ -1988,12 +1987,8 @@ static void __init deferred_free_pages(unsigned long pfn,
 	/* Accept chunks smaller than MAX_PAGE_ORDER upfront */
 	accept_memory(PFN_PHYS(pfn), nr_pages * PAGE_SIZE);
 
-	for (i = 0; i < nr_pages; i++, page++, pfn++) {
-		if (pageblock_aligned(pfn))
-			init_pageblock_migratetype(page, MIGRATE_MOVABLE,
-					false);
-		__free_pages_core(page, 0, MEMINIT_EARLY);
-	}
+	for (i = 0; i < nr_pages; i++)
+		__free_pages_core(page + i, 0, MEMINIT_EARLY);
 }
 
 /* Completion tracking for deferred_init_memmap() threads */
-- 
2.20.1



* [PATCH 08/49] mm: Convert vmemmap_p?d_populate() to static functions
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (6 preceding siblings ...)
  2026-04-05 12:51 ` [PATCH 07/49] mm/mm_init: use pageblock_migratetype_init_range() in deferred_free_pages() Muchun Song
@ 2026-04-05 12:51 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 09/49] mm: panic on memory allocation failure in sparse_init_nid() Muchun Song
                   ` (41 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:51 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Chengkaitao

From: Chengkaitao <chengkaitao@kylinos.cn>

Since the vmemmap_p?d_populate() functions are not used outside
mm/sparse-vmemmap.c, remove their declarations from <linux/mm.h> and
make them static.

Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn>
---
 include/linux/mm.h  |  7 -------
 mm/sparse-vmemmap.c | 10 +++++-----
 2 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bebc5f892f81..aa8c05de7585 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4860,13 +4860,6 @@ unsigned long section_map_size(void);
 struct page * __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap);
-pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
-p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
-pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
-pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap, unsigned long ptpfn,
-			    unsigned long flags);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d3096de04cc6..0ee03db0b22f 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -151,7 +151,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 			start, end - 1);
 }
 
-pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
+static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 				       struct vmem_altmap *altmap,
 				       unsigned long ptpfn, unsigned long flags)
 {
@@ -195,7 +195,7 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+static pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
@@ -208,7 +208,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 	return pmd;
 }
 
-pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
+static pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
@@ -221,7 +221,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 	return pud;
 }
 
-p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
+static p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
@@ -234,7 +234,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 	return p4d;
 }
 
-pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
+static pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-- 
2.20.1



* [PATCH 09/49] mm: panic on memory allocation failure in sparse_init_nid()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (7 preceding siblings ...)
  2026-04-05 12:51 ` [PATCH 08/49] mm: Convert vmemmap_p?d_populate() to static functions Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 10/49] mm: move subsection_map_init() into sparse_init() Muchun Song
                   ` (40 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

When vmemmap page allocation or usemap allocation fails,
sparse_init_nid() currently only marks the corresponding section as
non-present. However, subsequent code such as memmap_init() iterates
over PFNs without checking for non-present sections, leading to
invalid memory accesses (additionally, subsection_map_init() accesses
the unallocated usemap).

It is complex to audit and fix all boot-time PFN iterators to handle these
partially initialized sections correctly. Since vmemmap and usemap allocation
failures are extremely rare during early boot, the more appropriate approach
is to expose the problem as early as possible.

Therefore, use BUG_ON() to panic immediately if allocation fails, instead of
attempting a partial recovery that leads to obscure crashes later.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/sparse.c | 37 ++++++++-----------------------------
 1 file changed, 8 insertions(+), 29 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index effdac6b0ab1..5c12b979a618 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -354,19 +354,15 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 				   unsigned long map_count)
 {
 	unsigned long pnum;
-	struct page *map;
-	struct mem_section *ms;
-
-	if (sparse_usage_init(nid, map_count)) {
-		pr_err("%s: node[%d] usemap allocation failed", __func__, nid);
-		goto failed;
-	}
 
+	if (sparse_usage_init(nid, map_count))
+		panic("The node[%d] usemap allocation failed\n", nid);
 	sparse_buffer_init(map_count * section_map_size(), nid);
 
 	sparse_vmemmap_init_nid_early(nid);
 
 	for_each_present_section_nr(pnum_begin, pnum) {
+		struct mem_section *ms;
 		unsigned long pfn = section_nr_to_pfn(pnum);
 
 		if (pnum >= pnum_end)
@@ -374,16 +370,12 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 
 		ms = __nr_to_section(pnum);
 		if (!preinited_vmemmap_section(ms)) {
+			struct page *map;
+
 			map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-					nid, NULL, NULL);
-			if (!map) {
-				pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
-				       __func__, nid);
-				pnum_begin = pnum;
-				sparse_usage_fini();
-				sparse_buffer_fini();
-				goto failed;
-			}
+							nid, NULL, NULL);
+			if (!map)
+				panic("Populate section (%ld) on node[%d] failed\n", pnum, nid);
 			memmap_boot_pages_add(DIV_ROUND_UP(PAGES_PER_SECTION * sizeof(struct page),
 							   PAGE_SIZE));
 			sparse_init_early_section(nid, map, pnum, 0);
@@ -391,19 +383,6 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 	}
 	sparse_usage_fini();
 	sparse_buffer_fini();
-	return;
-failed:
-	/*
-	 * We failed to allocate, mark all the following pnums as not present,
-	 * except the ones already initialized earlier.
-	 */
-	for_each_present_section_nr(pnum_begin, pnum) {
-		if (pnum >= pnum_end)
-			break;
-		ms = __nr_to_section(pnum);
-		if (!preinited_vmemmap_section(ms))
-			ms->section_mem_map = 0;
-	}
 }
 
 /*
-- 
2.20.1



* [PATCH 10/49] mm: move subsection_map_init() into sparse_init()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (8 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 09/49] mm: panic on memory allocation failure in sparse_init_nid() Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 11/49] mm: defer sparse_init() until after zone initialization Muchun Song
                   ` (39 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Move the initialization of the subsection map from free_area_init()
into sparse_init(). This encapsulates the logic within the sparse
memory initialization code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/internal.h       |  5 ++---
 mm/mm_init.c        | 10 ++--------
 mm/sparse-vmemmap.c | 11 ++++++++++-
 mm/sparse.c         |  1 +
 4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index edb1c04d0617..d70075d0e788 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1004,10 +1004,9 @@ static inline void sparse_init(void) {}
  * mm/sparse-vmemmap.c
  */
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-void sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages);
+void sparse_init_subsection_map(void);
 #else
-static inline void sparse_init_subsection_map(unsigned long pfn,
-		unsigned long nr_pages)
+static inline void sparse_init_subsection_map(void)
 {
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index a92c5053f63d..5ca4503e7622 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1857,18 +1857,12 @@ static void __init free_area_init(void)
 			       (u64)zone_movable_pfn[i] << PAGE_SHIFT);
 	}
 
-	/*
-	 * Print out the early node map, and initialize the
-	 * subsection-map relative to active online memory ranges to
-	 * enable future "sub-section" extensions of the memory map.
-	 */
+	/* Print out the early node map. */
 	pr_info("Early memory node ranges\n");
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
 		pr_info("  node %3d: [mem %#018Lx-%#018Lx]\n", nid,
 			(u64)start_pfn << PAGE_SHIFT,
 			((u64)end_pfn << PAGE_SHIFT) - 1);
-		sparse_init_subsection_map(start_pfn, end_pfn - start_pfn);
-	}
 
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 0ee03db0b22f..b7201c235419 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -603,7 +603,7 @@ static void subsection_mask_set(unsigned long *map, unsigned long pfn,
 	bitmap_set(map, idx, end - idx + 1);
 }
 
-void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages)
+static void __init sparse_init_subsection_map_range(unsigned long pfn, unsigned long nr_pages)
 {
 	int end_sec_nr = pfn_to_section_nr(pfn + nr_pages - 1);
 	unsigned long nr, start_sec_nr = pfn_to_section_nr(pfn);
@@ -626,6 +626,15 @@ void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages
 	}
 }
 
+void __init sparse_init_subsection_map(void)
+{
+	int i, nid;
+	unsigned long start, end;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
+		sparse_init_subsection_map_range(start, end - start);
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 
 /* Mark all memory sections within the pfn range as online */
diff --git a/mm/sparse.c b/mm/sparse.c
index 5c12b979a618..c7f91dc2e5b5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -424,5 +424,6 @@ void __init sparse_init(void)
 	}
 	/* cover the last node */
 	sparse_init_nid(nid_begin, pnum_begin, pnum_end, map_count);
+	sparse_init_subsection_map();
 	vmemmap_populate_print_last();
 }
-- 
2.20.1



* [PATCH 11/49] mm: defer sparse_init() until after zone initialization
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (9 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 10/49] mm: move subsection_map_init() into sparse_init() Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 12/49] mm: make set_pageblock_order() static Muchun Song
                   ` (38 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

According to the comment above free_area_init(), its main goal is to
initialise all pg_data_t and zone data. However, sparse_init() and
memmap_init() are aimed at allocating vmemmap pages and initialising
struct pages respectively, which differs from the goal of free_area_init().
Therefore, it is reasonable to move them out of free_area_init().

Call sparse_init() after free_area_init() to guarantee that zone data
structures are available when sparse_init() executes. This change is a
prerequisite for integrating vmemmap initialization steps and allows
sparse_init() to safely access zone information if needed (e.g. HVO case).

Also, move the hugetlb reservation functions (hugetlb_cma_reserve() and
hugetlb_bootmem_alloc()) after free_area_init(). This allows hugetlb
reservation to access zone information and ensure that contiguous pages
are not allocated across zone boundaries, which simplifies the hugetlb
code and prepares for subsequent changes.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/mm_init.c | 15 ++++++++-------
 mm/sparse.c  |  3 ---
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5ca4503e7622..72604d02a853 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1807,7 +1807,6 @@ static void __init free_area_init(void)
 	bool descending;
 
 	arch_zone_limits_init(max_zone_pfn);
-	sparse_init();
 
 	start_pfn = PHYS_PFN(memblock_start_of_DRAM());
 	descending = arch_has_descending_max_zone_pfns();
@@ -1896,11 +1895,7 @@ static void __init free_area_init(void)
 		}
 	}
 
-	for_each_node_state(nid, N_MEMORY)
-		sparse_vmemmap_init_nid_late(nid);
-
 	calc_nr_kernel_pages();
-	memmap_init();
 
 	/* disable hash distribution for systems with a single node */
 	fixup_hashdist();
@@ -2669,10 +2664,16 @@ void __init __weak mem_init(void)
 
 void __init mm_core_init_early(void)
 {
-	hugetlb_cma_reserve();
-	hugetlb_bootmem_alloc();
+	int nid;
 
 	free_area_init();
+	/* Zone data structures are available from here. */
+	hugetlb_cma_reserve();
+	hugetlb_bootmem_alloc();
+	sparse_init();
+	for_each_node_state(nid, N_MEMORY)
+		sparse_vmemmap_init_nid_late(nid);
+	memmap_init();
 }
 
 /*
diff --git a/mm/sparse.c b/mm/sparse.c
index c7f91dc2e5b5..5fe0a7e66775 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -406,9 +406,6 @@ void __init sparse_init(void)
 	pnum_begin = first_present_section_nr();
 	nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
 
-	/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
-	set_pageblock_order();
-
 	for_each_present_section_nr(pnum_begin + 1, pnum_end) {
 		int nid = sparse_early_nid(__nr_to_section(pnum_end));
 
-- 
2.20.1



* [PATCH 12/49] mm: make set_pageblock_order() static
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (10 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 11/49] mm: defer sparse_init() until after zone initialization Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 13/49] mm: integrate sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
                   ` (37 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Since set_pageblock_order() is only used in mm/mm_init.c now, make it
static and remove its declaration from mm/internal.h.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/internal.h | 1 -
 mm/mm_init.c  | 4 ++--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index d70075d0e788..8232084f0c5e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1437,7 +1437,6 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
         unsigned long, unsigned long);
 
-extern void set_pageblock_order(void);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					    struct list_head *folio_list);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 72604d02a853..64363f35ad0d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1489,7 +1489,7 @@ static inline void setup_usemap(struct zone *zone) {}
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
 
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
-void __init set_pageblock_order(void)
+static void __init set_pageblock_order(void)
 {
 	unsigned int order = PAGE_BLOCK_MAX_ORDER;
 
@@ -1515,7 +1515,7 @@ void __init set_pageblock_order(void)
  * include/linux/pageblock-flags.h for the values of pageblock_order based on
  * the kernel config
  */
-void __init set_pageblock_order(void)
+static inline void __init set_pageblock_order(void)
 {
 }
 
-- 
2.20.1



* [PATCH 13/49] mm: integrate sparse_vmemmap_init_nid_late() into sparse_init_nid()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (11 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 12/49] mm: make set_pageblock_order() static Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 14/49] mm/cma: validate hugetlb CMA range by zone at reserve time Muchun Song
                   ` (36 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Move the call to sparse_vmemmap_init_nid_late() from mm_core_init_early()
into sparse_init_nid().

Since sparse_init() has been deferred until after zone initialization,
the zone data structures are now available during sparse_init(). This
satisfies the requirements of sparse_vmemmap_init_nid_late(), allowing
it to be moved safely.

This change unifies the vmemmap initialization steps by placing both
sparse_vmemmap_init_nid_early() and sparse_vmemmap_init_nid_late()
within the sparse memory initialization logic, making the code structure
clearer.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/mm_init.c | 4 ----
 mm/sparse.c  | 2 ++
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 64363f35ad0d..7a710fcbe3c8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2664,15 +2664,11 @@ void __init __weak mem_init(void)
 
 void __init mm_core_init_early(void)
 {
-	int nid;
-
 	free_area_init();
 	/* Zone data structures are available from here. */
 	hugetlb_cma_reserve();
 	hugetlb_bootmem_alloc();
 	sparse_init();
-	for_each_node_state(nid, N_MEMORY)
-		sparse_vmemmap_init_nid_late(nid);
 	memmap_init();
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 5fe0a7e66775..d940b973df66 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -383,6 +383,8 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 	}
 	sparse_usage_fini();
 	sparse_buffer_fini();
+
+	sparse_vmemmap_init_nid_late(nid);
 }
 
 /*
-- 
2.20.1



* [PATCH 14/49] mm/cma: validate hugetlb CMA range by zone at reserve time
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (12 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 13/49] mm: integrate sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 15/49] mm/hugetlb: free cross-zone bootmem gigantic pages after allocation Muchun Song
                   ` (35 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

During hugetlb_cma_reserve() we already have access to zone information, so we
can validate that the reserved CMA range does not span multiple zones.

Doing this check up front allows future hugetlb allocations from CMA to assume
zone-valid CMA areas, avoiding additional validity checks and potential
fallback/rollback paths, greatly simplifying the code.

The pfn_valid() check is removed from cma_validate_zones() because
mem_section is not yet initialized at that stage and the check can trigger
false warnings; the sanity check is moved to cma_activate_area() instead.
This is preparatory work for the follow-up simplification.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/cma.c         | 3 ++-
 mm/hugetlb_cma.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 15cc0ae76c8e..dd046a23f467 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -125,7 +125,6 @@ bool cma_validate_zones(struct cma *cma)
 		 * to be in the same zone. Simplify by forcing the entire
 		 * CMA resv range to be in the same zone.
 		 */
-		WARN_ON_ONCE(!pfn_valid(base_pfn));
 		if (pfn_range_intersects_zones(cma->nid, base_pfn, cmr->count)) {
 			set_bit(CMA_ZONES_INVALID, &cma->flags);
 			return false;
@@ -164,6 +163,8 @@ static void __init cma_activate_area(struct cma *cma)
 			bitmap_set(cmr->bitmap, 0, bitmap_count);
 		}
 
+		WARN_ON_ONCE(!pfn_valid(cmr->base_pfn));
+
 		for (pfn = early_pfn[r]; pfn < cmr->base_pfn + cmr->count;
 		     pfn += pageblock_nr_pages)
 			init_cma_reserved_pageblock(pfn_to_page(pfn));
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index f83ae4998990..b068e9bf6537 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -233,9 +233,10 @@ void __init hugetlb_cma_reserve(void)
 		res = cma_declare_contiguous_multi(size, PAGE_SIZE << order,
 					HUGETLB_PAGE_ORDER, name,
 					&hugetlb_cma[nid], nid);
-		if (res) {
+		if (res || !cma_validate_zones(hugetlb_cma[nid])) {
 			pr_warn("hugetlb_cma: reservation failed: err %d, node %d",
 				res, nid);
+			hugetlb_cma[nid] = NULL;
 			continue;
 		}
 
-- 
2.20.1



* [PATCH 15/49] mm/hugetlb: free cross-zone bootmem gigantic pages after allocation
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (13 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 14/49] mm/cma: validate hugetlb CMA range by zone at reserve time Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 16/49] mm/hugetlb: initialize vmemmap optimization in early stage Muchun Song
                   ` (34 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Now that hugetlb reservation happens after free_area_init(), zone
information is available during bootmem huge page allocation. This allows
cross-zone gigantic pages to be identified and handled at allocation time.

During alloc_bootmem(), pages that intersect multiple zones are added to
the head of huge_boot_pages[nid] list (without ZONES_VALID flag), while
pages with valid zones are added to the tail (with ZONES_VALID flag).

After allocation completes, hugetlb_free_cross_zone_pages() iterates
through the list and frees those cross-zone pages (entries without
HUGE_BOOTMEM_ZONES_VALID flag). The count of freed pages is subtracted
from the allocated count to ensure the final number reflects only valid
huge pages.

This applies to both the per-node allocation path and the global gigantic
allocation path, simplifying the code by avoiding cross-zone checks at
later stages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 47 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d6ea11113f1d..238495fd04e4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3049,6 +3049,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	return ERR_PTR(-ENOSPC);
 }
 
+static bool __init hugetlb_bootmem_page_earlycma(struct huge_bootmem_page *m)
+{
+	return m->flags & HUGE_BOOTMEM_CMA;
+}
+
 static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
 {
 	struct huge_bootmem_page *m;
@@ -3092,7 +3097,14 @@ static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
 		 * is not up yet.
 		 */
 		INIT_LIST_HEAD(&m->list);
-		list_add(&m->list, &huge_boot_pages[listnode]);
+		if (pfn_range_intersects_zones(listnode, PHYS_PFN(virt_to_phys(m)),
+					       pages_per_huge_page(h))) {
+			VM_BUG_ON(hugetlb_bootmem_page_earlycma(m));
+			list_add(&m->list, &huge_boot_pages[listnode]);
+		} else {
+			list_add_tail(&m->list, &huge_boot_pages[listnode]);
+			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+		}
 		m->hstate = h;
 	}
 
@@ -3186,11 +3198,6 @@ static bool __init hugetlb_bootmem_page_prehvo(struct huge_bootmem_page *m)
 	return m->flags & HUGE_BOOTMEM_HVO;
 }
 
-static bool __init hugetlb_bootmem_page_earlycma(struct huge_bootmem_page *m)
-{
-	return m->flags & HUGE_BOOTMEM_CMA;
-}
-
 /*
  * memblock-allocated pageblocks might not have the migrate type set
  * if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
@@ -3393,6 +3400,34 @@ static void __init gather_bootmem_prealloc(void)
 	padata_do_multithreaded(&job);
 }
 
+static unsigned long __init hugetlb_free_cross_zone_pages(struct hstate *h, int nid)
+{
+	unsigned long freed = 0;
+	struct huge_bootmem_page *m, *tmp;
+
+	if (!hstate_is_gigantic(h))
+		return freed;
+
+	list_for_each_entry_safe(m, tmp, &huge_boot_pages[nid], list) {
+		if (m->flags & HUGE_BOOTMEM_ZONES_VALID)
+			break;
+
+		list_del(&m->list);
+		memblock_free(m, huge_page_size(h));
+		freed++;
+	}
+
+	if (freed) {
+		char buf[32];
+
+		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, sizeof(buf));
+		pr_warn("HugeTLB: freed %lu cross-zone hugepage(s) of page size %s on node %d.\n",
+			freed, buf, nid);
+	}
+
+	return freed;
+}
+
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 {
 	unsigned long i;
@@ -3423,6 +3458,8 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 		cond_resched();
 	}
 
+	i -= hugetlb_free_cross_zone_pages(h, nid);
+
 	if (!list_empty(&folio_list))
 		prep_and_add_allocated_folios(h, &folio_list);
 
@@ -3496,6 +3533,7 @@ static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned l
 
 static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 {
+	int nid;
 	unsigned long i;
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
@@ -3504,6 +3542,9 @@ static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 		cond_resched();
 	}
 
+	for_each_node(nid)
+		i -= hugetlb_free_cross_zone_pages(h, nid);
+
 	return i;
 }
 
-- 
2.20.1



* [PATCH 16/49] mm/hugetlb: initialize vmemmap optimization in early stage
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (14 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 15/49] mm/hugetlb: free cross-zone bootmem gigantic pages after allocation Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 17/49] mm: remove sparse_vmemmap_init_nid_late() Muchun Song
                   ` (33 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Move pfn_to_zone() above hugetlb_vmemmap_init_early() so it is available
there, and populate the vmemmap HVO-style in hugetlb_vmemmap_init_early()
for bootmem-allocated huge pages.

The zone information is already available in hugetlb_vmemmap_init_early(),
so there is no need to wait for hugetlb_vmemmap_init_late() to access it.
This prepares for the removal of hugetlb_vmemmap_init_late().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 38 ++++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 50b7123f3bdd..e25c70453928 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -745,6 +745,20 @@ static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
 	return true;
 }
 
+static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
+{
+	struct zone *zone;
+	enum zone_type zone_type;
+
+	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+		zone = &NODE_DATA(nid)->node_zones[zone_type];
+		if (zone_spans_pfn(zone, pfn))
+			return zone;
+	}
+
+	return NULL;
+}
+
 /*
  * Initialize memmap section for a gigantic page, HVO-style.
  */
@@ -752,6 +766,7 @@ void __init hugetlb_vmemmap_init_early(int nid)
 {
 	unsigned long psize, paddr, section_size;
 	unsigned long ns, i, pnum, pfn, nr_pages;
+	unsigned long start, end;
 	struct huge_bootmem_page *m = NULL;
 	void *map;
 
@@ -761,6 +776,8 @@ void __init hugetlb_vmemmap_init_early(int nid)
 	section_size = (1UL << PA_SECTION_SHIFT);
 
 	list_for_each_entry(m, &huge_boot_pages[nid], list) {
+		struct zone *zone;
+
 		if (!vmemmap_should_optimize_bootmem_page(m))
 			continue;
 
@@ -769,6 +786,13 @@ void __init hugetlb_vmemmap_init_early(int nid)
 		paddr = virt_to_phys(m);
 		pfn = PHYS_PFN(paddr);
 		map = pfn_to_page(pfn);
+		start = (unsigned long)map;
+		end = start + nr_pages * sizeof(struct page);
+		zone = pfn_to_zone(nid, pfn);
+
+		BUG_ON(vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
+					    zone, HUGETLB_VMEMMAP_RESERVE_SIZE));
+		memmap_boot_pages_add(HUGETLB_VMEMMAP_RESERVE_SIZE / PAGE_SIZE);
 
 		pnum = pfn_to_section_nr(pfn);
 		ns = psize / section_size;
@@ -784,20 +808,6 @@ void __init hugetlb_vmemmap_init_early(int nid)
 	}
 }
 
-static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
-{
-	struct zone *zone;
-	enum zone_type zone_type;
-
-	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
-		zone = &NODE_DATA(nid)->node_zones[zone_type];
-		if (zone_spans_pfn(zone, pfn))
-			return zone;
-	}
-
-	return NULL;
-}
-
 void __init hugetlb_vmemmap_init_late(int nid)
 {
 	struct huge_bootmem_page *m, *tm;
-- 
2.20.1



* [PATCH 17/49] mm: remove sparse_vmemmap_init_nid_late()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (15 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 16/49] mm/hugetlb: initialize vmemmap optimization in early stage Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 18/49] mm/mm_init: make __init_page_from_nid() static Muchun Song
                   ` (32 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

After deferring hugetlb bootmem allocation until after free_area_init()
and checking cross-zone pages during allocation, the hugetlb_vmemmap_init_late()
function is no longer needed:

1. hugetlb_bootmem_alloc() is now called after free_area_init(), so zone
   information is available during bootmem huge page allocation.
2. During alloc_bootmem(), zone-valid pages are marked with the
   HUGE_BOOTMEM_ZONES_VALID flag, while cross-zone pages are placed at
   the head of the list without it.
3. After allocation, hugetlb_free_cross_zone_pages() frees those pages that
   intersect multiple zones.

Since cross-zone pages are already handled in the allocation path, the late-stage
validation in hugetlb_vmemmap_init_late() is redundant and can be removed.

Also, the sparse_vmemmap_init_nid_late() function is now empty and unused.
Remove it to clean up the code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h |  2 --
 include/linux/mmzone.h  |  7 -----
 mm/hugetlb.c            | 70 -----------------------------------------
 mm/hugetlb_vmemmap.c    | 58 ----------------------------------
 mm/hugetlb_vmemmap.h    |  5 ---
 mm/sparse-vmemmap.c     | 11 -------
 mm/sparse.c             |  2 --
 7 files changed, 155 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9c098a02a09e..23d95ed6121f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -699,8 +699,6 @@ struct huge_bootmem_page {
 #define HUGE_BOOTMEM_ZONES_VALID	0x0002
 #define HUGE_BOOTMEM_CMA		0x0004
 
-bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
-
 int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
 void wait_for_freed_hugetlb_folios(void);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a071f1a0e242..8ee9dc60120a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2153,8 +2153,6 @@ static inline int preinited_vmemmap_section(const struct mem_section *section)
 }
 
 void sparse_vmemmap_init_nid_early(int nid);
-void sparse_vmemmap_init_nid_late(int nid);
-
 #else
 static inline int preinited_vmemmap_section(const struct mem_section *section)
 {
@@ -2163,10 +2161,6 @@ static inline int preinited_vmemmap_section(const struct mem_section *section)
 static inline void sparse_vmemmap_init_nid_early(int nid)
 {
 }
-
-static inline void sparse_vmemmap_init_nid_late(int nid)
-{
-}
 #endif
 
 static inline int online_section_nr(unsigned long nr)
@@ -2371,7 +2365,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
 
 #else
 #define sparse_vmemmap_init_nid_early(_nid) do {} while (0)
-#define sparse_vmemmap_init_nid_late(_nid) do {} while (0)
 #define pfn_in_present_section pfn_valid
 #endif /* CONFIG_SPARSEMEM */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 238495fd04e4..a00c9f3672b7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -58,7 +58,6 @@ struct hstate hstates[HUGE_MAX_HSTATE];
 
 __initdata nodemask_t hugetlb_bootmem_nodes;
 __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
-static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata;
 
 /*
  * Due to ordering constraints across the init code for various
@@ -3254,57 +3253,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	}
 }
 
-bool __init hugetlb_bootmem_page_zones_valid(int nid,
-					     struct huge_bootmem_page *m)
-{
-	unsigned long start_pfn;
-	bool valid;
-
-	if (m->flags & HUGE_BOOTMEM_ZONES_VALID) {
-		/*
-		 * Already validated, skip check.
-		 */
-		return true;
-	}
-
-	if (hugetlb_bootmem_page_earlycma(m)) {
-		valid = cma_validate_zones(m->cma);
-		goto out;
-	}
-
-	start_pfn = virt_to_phys(m) >> PAGE_SHIFT;
-
-	valid = !pfn_range_intersects_zones(nid, start_pfn,
-			pages_per_huge_page(m->hstate));
-out:
-	if (!valid)
-		hstate_boot_nrinvalid[hstate_index(m->hstate)]++;
-
-	return valid;
-}
-
-/*
- * Free a bootmem page that was found to be invalid (intersecting with
- * multiple zones).
- *
- * Since it intersects with multiple zones, we can't just do a free
- * operation on all pages at once, but instead have to walk all
- * pages, freeing them one by one.
- */
-static void __init hugetlb_bootmem_free_invalid_page(int nid, struct page *page,
-					     struct hstate *h)
-{
-	unsigned long npages = pages_per_huge_page(h);
-	unsigned long pfn;
-
-	while (npages--) {
-		pfn = page_to_pfn(page);
-		__init_page_from_nid(pfn, nid);
-		free_reserved_page(page);
-		page++;
-	}
-}
-
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
@@ -3320,17 +3268,6 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
 		struct folio *folio = (void *)page;
 
 		h = m->hstate;
-		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
-			/*
-			 * Can't use this page. Initialize the
-			 * page structures if that hasn't already
-			 * been done, and give them to the page
-			 * allocator.
-			 */
-			hugetlb_bootmem_free_invalid_page(nid, page, h);
-			continue;
-		}
-
 		/*
 		 * It is possible to have multiple huge page sizes (hstates)
 		 * in this list.  If so, process each size separately.
@@ -3700,20 +3637,13 @@ static void __init hugetlb_init_hstates(void)
 static void __init report_hugepages(void)
 {
 	struct hstate *h;
-	unsigned long nrinvalid;
 
 	for_each_hstate(h) {
 		char buf[32];
 
-		nrinvalid = hstate_boot_nrinvalid[hstate_index(h)];
-		h->max_huge_pages -= nrinvalid;
-
 		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
 		pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
 			buf, h->nr_huge_pages);
-		if (nrinvalid)
-			pr_info("HugeTLB: %s page size: %lu invalid page%s discarded\n",
-					buf, nrinvalid, str_plural(nrinvalid));
 		pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n",
 			hugetlb_vmemmap_optimizable_size(h) / SZ_1K, buf);
 	}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index e25c70453928..535f0369a496 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -807,64 +807,6 @@ void __init hugetlb_vmemmap_init_early(int nid)
 		m->flags |= HUGE_BOOTMEM_HVO;
 	}
 }
-
-void __init hugetlb_vmemmap_init_late(int nid)
-{
-	struct huge_bootmem_page *m, *tm;
-	unsigned long phys, nr_pages, start, end;
-	unsigned long pfn, nr_mmap;
-	struct zone *zone = NULL;
-	struct hstate *h;
-	void *map;
-
-	if (!READ_ONCE(vmemmap_optimize_enabled))
-		return;
-
-	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
-		if (!(m->flags & HUGE_BOOTMEM_HVO))
-			continue;
-
-		phys = virt_to_phys(m);
-		h = m->hstate;
-		pfn = PHYS_PFN(phys);
-		nr_pages = pages_per_huge_page(h);
-		map = pfn_to_page(pfn);
-		start = (unsigned long)map;
-		end = start + nr_pages * sizeof(struct page);
-
-		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
-			/*
-			 * Oops, the hugetlb page spans multiple zones.
-			 * Remove it from the list, and populate it normally.
-			 */
-			list_del(&m->list);
-
-			vmemmap_populate(start, end, nid, NULL, NULL);
-			nr_mmap = end - start;
-			memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
-
-			memblock_phys_free(phys, huge_page_size(h));
-			continue;
-		}
-
-		if (!zone || !zone_spans_pfn(zone, pfn))
-			zone = pfn_to_zone(nid, pfn);
-		if (WARN_ON_ONCE(!zone))
-			continue;
-
-		if (vmemmap_populate_hvo(start, end, huge_page_order(h), zone,
-					 HUGETLB_VMEMMAP_RESERVE_SIZE) < 0) {
-			/* Fallback if HVO population fails */
-			vmemmap_populate(start, end, nid, NULL, NULL);
-			nr_mmap = end - start;
-		} else {
-			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
-			nr_mmap = HUGETLB_VMEMMAP_RESERVE_SIZE;
-		}
-
-		memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
-	}
-}
 #endif
 
 static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 18b490825215..7ac49c52457d 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -29,7 +29,6 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
 #ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
 void hugetlb_vmemmap_init_early(int nid);
-void hugetlb_vmemmap_init_late(int nid);
 #endif
 
 
@@ -81,10 +80,6 @@ static inline void hugetlb_vmemmap_init_early(int nid)
 {
 }
 
-static inline void hugetlb_vmemmap_init_late(int nid)
-{
-}
-
 static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
 {
 	return 0;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index b7201c235419..26cb55c12a83 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -581,17 +581,6 @@ void __init sparse_vmemmap_init_nid_early(int nid)
 {
 	hugetlb_vmemmap_init_early(nid);
 }
-
-/*
- * This is called just before the initialization of page structures
- * through memmap_init. Zones are now initialized, so any work that
- * needs to be done that needs zone information can be done from
- * here.
- */
-void __init sparse_vmemmap_init_nid_late(int nid)
-{
-	hugetlb_vmemmap_init_late(nid);
-}
 #endif
 
 static void subsection_mask_set(unsigned long *map, unsigned long pfn,
diff --git a/mm/sparse.c b/mm/sparse.c
index d940b973df66..5fe0a7e66775 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -383,8 +383,6 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 	}
 	sparse_usage_fini();
 	sparse_buffer_fini();
-
-	sparse_vmemmap_init_nid_late(nid);
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 18/49] mm/mm_init: make __init_page_from_nid() static
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (16 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 17/49] mm: remove sparse_vmemmap_init_nid_late() Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 19/49] mm/sparse-vmemmap: remove the VMEMMAP_POPULATE_PAGEREF flag Muchun Song
                   ` (31 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Since the previous commit removed the only external user of
__init_page_from_nid(), this function is now used only locally within
mm/mm_init.c, under the CONFIG_DEFERRED_STRUCT_PAGE_INIT block.

Make __init_page_from_nid() static and move it inside the
CONFIG_DEFERRED_STRUCT_PAGE_INIT block to clean up the code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/internal.h | 1 -
 mm/mm_init.c  | 4 ++--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8232084f0c5e..a8acabcd1d93 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1755,7 +1755,6 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte
 
 void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid);
-void __meminit __init_page_from_nid(unsigned long pfn, int nid);
 
 /* shrinker related functions */
 unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7a710fcbe3c8..977a837b7ef6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -686,10 +686,11 @@ static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
 	}
 }
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 /*
  * Initialize a reserved page unconditionally, finding its zone first.
  */
-void __meminit __init_page_from_nid(unsigned long pfn, int nid)
+static void __meminit __init_page_from_nid(unsigned long pfn, int nid)
 {
 	pg_data_t *pgdat;
 	int zid;
@@ -709,7 +710,6 @@ void __meminit __init_page_from_nid(unsigned long pfn, int nid)
 				false);
 }
 
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
 {
 	pgdat->first_deferred_pfn = ULONG_MAX;
-- 
2.20.1



* [PATCH 19/49] mm/sparse-vmemmap: remove the VMEMMAP_POPULATE_PAGEREF flag
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (17 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 18/49] mm/mm_init: make __init_page_from_nid() static Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 20/49] mm: rename vmemmap optimization macros to generic names Muchun Song
                   ` (30 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The VMEMMAP_POPULATE_PAGEREF flag is only used to ensure that we call
get_page() when slab is available, as mentioned in the comment:
"and through vmemmap_populate_compound_pages() when slab is available".

Since we can check slab_is_available() directly, the flag and the
associated argument passing can be removed to simplify the code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/sparse-vmemmap.c | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 26cb55c12a83..3fdb6808e8ab 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -33,13 +33,6 @@
 #include <asm/tlbflush.h>
 
 #include "hugetlb_vmemmap.h"
-
-/*
- * Flags for vmemmap_populate_range and friends.
- */
-/* Get a ref on the head page struct page, for ZONE_DEVICE compound pages */
-#define VMEMMAP_POPULATE_PAGEREF	0x0001
-
 #include "internal.h"
 
 /*
@@ -152,8 +145,8 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap,
-				       unsigned long ptpfn, unsigned long flags)
+					      struct vmem_altmap *altmap,
+					      unsigned long ptpfn)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(ptep_get(pte))) {
@@ -175,7 +168,7 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
 			 * and through vmemmap_populate_compound_pages() when
 			 * slab is available.
 			 */
-			if (flags & VMEMMAP_POPULATE_PAGEREF)
+			if (slab_is_available())
 				get_page(pfn_to_page(ptpfn));
 		}
 		entry = pfn_pte(ptpfn, PAGE_KERNEL);
@@ -248,8 +241,7 @@ static pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 
 static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
 					      struct vmem_altmap *altmap,
-					      unsigned long ptpfn,
-					      unsigned long flags)
+					      unsigned long ptpfn)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -269,7 +261,7 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return NULL;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap, ptpfn, flags);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, ptpfn);
 	if (!pte)
 		return NULL;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
@@ -280,15 +272,14 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
 static int __meminit vmemmap_populate_range(unsigned long start,
 					    unsigned long end, int node,
 					    struct vmem_altmap *altmap,
-					    unsigned long ptpfn,
-					    unsigned long flags)
+					    unsigned long ptpfn)
 {
 	unsigned long addr = start;
 	pte_t *pte;
 
 	for (; addr < end; addr += PAGE_SIZE) {
 		pte = vmemmap_populate_address(addr, node, altmap,
-					       ptpfn, flags);
+					       ptpfn);
 		if (!pte)
 			return -ENOMEM;
 	}
@@ -306,7 +297,7 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 {
 	if (vmemmap_can_optimize(altmap, pgmap))
 		return vmemmap_populate_compound_pages(start, end, node, pgmap);
-	return vmemmap_populate_range(start, end, node, altmap, -1, 0);
+	return vmemmap_populate_range(start, end, node, altmap, -1);
 }
 
 /*
@@ -382,7 +373,7 @@ int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
 		return -ENOMEM;
 
 	for (maddr = addr; maddr < addr + headsize; maddr += PAGE_SIZE) {
-		pte = vmemmap_populate_address(maddr, node, NULL, -1, 0);
+		pte = vmemmap_populate_address(maddr, node, NULL, -1);
 		if (!pte)
 			return -ENOMEM;
 	}
@@ -390,8 +381,7 @@ int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
 	/*
 	 * Reuse the last page struct page mapped above for the rest.
 	 */
-	return vmemmap_populate_range(maddr, end, node, NULL,
-				      page_to_pfn(tail), 0);
+	return vmemmap_populate_range(maddr, end, node, NULL, page_to_pfn(tail));
 }
 #endif
 
@@ -518,8 +508,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start,
 		 * with just tail struct pages.
 		 */
 		return vmemmap_populate_range(start, end, node, NULL,
-					      pte_pfn(ptep_get(pte)),
-					      VMEMMAP_POPULATE_PAGEREF);
+					      pte_pfn(ptep_get(pte)));
 	}
 
 	size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
@@ -527,13 +516,13 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start,
 		unsigned long next, last = addr + size;
 
 		/* Populate the head page vmemmap page */
-		pte = vmemmap_populate_address(addr, node, NULL, -1, 0);
+		pte = vmemmap_populate_address(addr, node, NULL, -1);
 		if (!pte)
 			return -ENOMEM;
 
 		/* Populate the tail pages vmemmap page */
 		next = addr + PAGE_SIZE;
-		pte = vmemmap_populate_address(next, node, NULL, -1, 0);
+		pte = vmemmap_populate_address(next, node, NULL, -1);
 		if (!pte)
 			return -ENOMEM;
 
@@ -543,8 +532,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start,
 		 */
 		next += PAGE_SIZE;
 		rc = vmemmap_populate_range(next, last, node, NULL,
-					    pte_pfn(ptep_get(pte)),
-					    VMEMMAP_POPULATE_PAGEREF);
+					    pte_pfn(ptep_get(pte)));
 		if (rc)
 			return -ENOMEM;
 	}
-- 
2.20.1



* [PATCH 20/49] mm: rename vmemmap optimization macros to generic names
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (18 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 19/49] mm/sparse-vmemmap: remove the VMEMMAP_POPULATE_PAGEREF flag Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 21/49] mm/sparse: drop power-of-2 size requirement for struct mem_section Muchun Song
                   ` (29 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

In preparation for unifying the vmemmap optimization paths for both
DAX and HugeTLB, rename the existing vmemmap tail page macros to
more generic, semantic-based names.

The original names (e.g., VMEMMAP_TAIL_MIN_ORDER) fail to clearly express
the actual requirement: that macro represents the minimum order of a folio
that can benefit from vmemmap optimization. To provide a broader and
clearer abstraction for other users such as DAX, replace them with newly
introduced macros like OPTIMIZABLE_FOLIO_MIN_ORDER and
NR_OPTIMIZABLE_FOLIO_SIZES.

These new macros, along with OPTIMIZED_FOLIO_VMEMMAP_PAGES,
OPTIMIZED_FOLIO_VMEMMAP_SIZE, and OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS,
are explicitly bound to the 'folio' concept. This systematic naming
makes it clearer to describe the properties of a vmemmap-optimized
folio rather than just raw pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mmzone.h | 18 ++++++++++--------
 mm/hugetlb_vmemmap.c   |  6 +++---
 mm/sparse-vmemmap.c    |  4 ++--
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8ee9dc60120a..378feaf4e4ed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -107,13 +107,15 @@
 	 is_power_of_2(sizeof(struct page)) ? \
 	 MAX_FOLIO_NR_PAGES * sizeof(struct page) : 0)
 
-/*
- * vmemmap optimization (like HVO) is only possible for page orders that fill
- * two or more pages with struct pages.
- */
-#define VMEMMAP_TAIL_MIN_ORDER (ilog2(2 * PAGE_SIZE / sizeof(struct page)))
-#define __NR_VMEMMAP_TAILS (MAX_FOLIO_ORDER - VMEMMAP_TAIL_MIN_ORDER + 1)
-#define NR_VMEMMAP_TAILS (__NR_VMEMMAP_TAILS > 0 ? __NR_VMEMMAP_TAILS : 0)
+/* The number of vmemmap pages required by a vmemmap-optimized folio. */
+#define OPTIMIZED_FOLIO_VMEMMAP_PAGES		1
+#define OPTIMIZED_FOLIO_VMEMMAP_SIZE		(OPTIMIZED_FOLIO_VMEMMAP_PAGES * PAGE_SIZE)
+#define OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS	(OPTIMIZED_FOLIO_VMEMMAP_SIZE / sizeof(struct page))
+#define OPTIMIZABLE_FOLIO_MIN_ORDER		(ilog2(OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS) + 1)
+
+#define __NR_OPTIMIZABLE_FOLIO_SIZES		(MAX_FOLIO_ORDER - OPTIMIZABLE_FOLIO_MIN_ORDER + 1)
+#define NR_OPTIMIZABLE_FOLIO_SIZES		\
+	(__NR_OPTIMIZABLE_FOLIO_SIZES > 0 ? __NR_OPTIMIZABLE_FOLIO_SIZES : 0)
 
 enum migratetype {
 	MIGRATE_UNMOVABLE,
@@ -1144,7 +1146,7 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 	atomic_long_t		vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-	struct page *vmemmap_tails[NR_VMEMMAP_TAILS];
+	struct page *vmemmap_tails[NR_OPTIMIZABLE_FOLIO_SIZES];
 #endif
 } ____cacheline_internodealigned_in_smp;
 
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 535f0369a496..d6dd47c232e0 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -495,7 +495,7 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
 
 static struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
 {
-	const unsigned int idx = order - VMEMMAP_TAIL_MIN_ORDER;
+	const unsigned int idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
 	struct page *tail, *p;
 	int node = zone_to_nid(zone);
 
@@ -828,7 +828,7 @@ static int __init hugetlb_vmemmap_init(void)
 	BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
 
 	for_each_zone(zone) {
-		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
+		for (int i = 0; i < NR_OPTIMIZABLE_FOLIO_SIZES; i++) {
 			struct page *tail, *p;
 			unsigned int order;
 
@@ -836,7 +836,7 @@ static int __init hugetlb_vmemmap_init(void)
 			if (!tail)
 				continue;
 
-			order = i + VMEMMAP_TAIL_MIN_ORDER;
+			order = i + OPTIMIZABLE_FOLIO_MIN_ORDER;
 			p = page_to_virt(tail);
 			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
 				init_compound_tail(p + j, NULL, order, zone);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 3fdb6808e8ab..9f70559df4e8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -330,12 +330,12 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
 	unsigned int idx;
 	int node = zone_to_nid(zone);
 
-	if (WARN_ON_ONCE(order < VMEMMAP_TAIL_MIN_ORDER))
+	if (WARN_ON_ONCE(order < OPTIMIZABLE_FOLIO_MIN_ORDER))
 		return NULL;
 	if (WARN_ON_ONCE(order > MAX_FOLIO_ORDER))
 		return NULL;
 
-	idx = order - VMEMMAP_TAIL_MIN_ORDER;
+	idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
 	tail = zone->vmemmap_tails[idx];
 	if (tail)
 		return tail;
-- 
2.20.1



* [PATCH 21/49] mm/sparse: drop power-of-2 size requirement for struct mem_section
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (19 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 20/49] mm: rename vmemmap optimization macros to generic names Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 22/49] mm/sparse: introduce compound page order to mem_section Muchun Song
                   ` (28 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Since sparsemem-extreme was introduced, struct mem_section has been
forced to a power-of-2 size so that the section-to-root lookup could
use a cheap bit-mask instead of an expensive divide:

    section = &mem_section[root][nr & SECTION_ROOT_MASK];

This is enforced at compile time with

    BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));

and forces us to add padding that grows and shrinks with every config
combination, wasting memory just to keep the structure aligned to the
next power of two.  With CONFIG_PAGE_EXTENSION enabled the padding
alone can reach 42 struct mem_section instances per section-root page.

Drop the requirement and switch to a plain modulo:

    section = &mem_section[root][nr % SECTIONS_PER_ROOT];

Modern compilers turn the divide by a compile-time constant into a
multiply-by-reciprocal sequence, so the runtime impact is negligible.
In return we get:

1. Immediate memory savings when CONFIG_PAGE_EXTENSION is enabled.
2. Freedom to extend struct mem_section in the future without having
   to fiddle with artificial padding or the power-of-2 rule.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mmzone.h  | 8 +-------
 mm/sparse.c             | 2 --
 scripts/gdb/linux/mm.py | 6 ++----
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 378feaf4e4ed..3e3755666846 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2013,12 +2013,7 @@ struct mem_section {
 	 * section. (see page_ext.h about this.)
 	 */
 	struct page_ext *page_ext;
-	unsigned long pad;
 #endif
-	/*
-	 * WARNING: mem_section must be a power-of-2 in size for the
-	 * calculation and use of SECTION_ROOT_MASK to make sense.
-	 */
 };
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -2029,7 +2024,6 @@ struct mem_section {
 
 #define SECTION_NR_TO_ROOT(sec)	((sec) / SECTIONS_PER_ROOT)
 #define NR_SECTION_ROOTS	DIV_ROUND_UP(NR_MEM_SECTIONS, SECTIONS_PER_ROOT)
-#define SECTION_ROOT_MASK	(SECTIONS_PER_ROOT - 1)
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
 extern struct mem_section **mem_section;
@@ -2053,7 +2047,7 @@ static inline struct mem_section *__nr_to_section(unsigned long nr)
 	if (!mem_section || !mem_section[root])
 		return NULL;
 #endif
-	return &mem_section[root][nr & SECTION_ROOT_MASK];
+	return &mem_section[root][nr % SECTIONS_PER_ROOT];
 }
 extern size_t mem_section_usage_size(void);
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 5fe0a7e66775..cfe4ffd89baf 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -394,8 +394,6 @@ void __init sparse_init(void)
 	unsigned long pnum_end, pnum_begin, map_count = 1;
 	int nid_begin;
 
-	/* see include/linux/mmzone.h 'struct mem_section' definition */
-	BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
 	memblocks_present();
 
 	if (compound_info_has_mask()) {
diff --git a/scripts/gdb/linux/mm.py b/scripts/gdb/linux/mm.py
index d78908f6664d..0c9eeed92064 100644
--- a/scripts/gdb/linux/mm.py
+++ b/scripts/gdb/linux/mm.py
@@ -70,7 +70,6 @@ class x86_page_ops():
             self.SECTIONS_PER_ROOT = 1
 
         self.NR_SECTION_ROOTS = DIV_ROUND_UP(self.NR_MEM_SECTIONS, self.SECTIONS_PER_ROOT)
-        self.SECTION_ROOT_MASK = self.SECTIONS_PER_ROOT - 1
 
         try:
             self.SECTION_HAS_MEM_MAP = 1 << int(gdb.parse_and_eval('SECTION_HAS_MEM_MAP_BIT'))
@@ -100,7 +99,7 @@ class x86_page_ops():
     def __nr_to_section(self, nr):
         root = self.SECTION_NR_TO_ROOT(nr)
         mem_section = gdb.parse_and_eval("mem_section")
-        return mem_section[root][nr & self.SECTION_ROOT_MASK]
+        return mem_section[root][nr % self.SECTIONS_PER_ROOT]
 
     def pfn_to_section_nr(self, pfn):
         return pfn >> self.PFN_SECTION_SHIFT
@@ -249,7 +248,6 @@ class aarch64_page_ops():
             self.SECTIONS_PER_ROOT = 1
 
         self.NR_SECTION_ROOTS = DIV_ROUND_UP(self.NR_MEM_SECTIONS, self.SECTIONS_PER_ROOT)
-        self.SECTION_ROOT_MASK = self.SECTIONS_PER_ROOT - 1
         self.SUBSECTION_SHIFT = 21
         self.SEBSECTION_SIZE = 1 << self.SUBSECTION_SHIFT
         self.PFN_SUBSECTION_SHIFT = self.SUBSECTION_SHIFT - self.PAGE_SHIFT
@@ -304,7 +302,7 @@ class aarch64_page_ops():
     def __nr_to_section(self, nr):
         root = self.SECTION_NR_TO_ROOT(nr)
         mem_section = gdb.parse_and_eval("mem_section")
-        return mem_section[root][nr & self.SECTION_ROOT_MASK]
+        return mem_section[root][nr % self.SECTIONS_PER_ROOT]
 
     def pfn_to_section_nr(self, pfn):
         return pfn >> self.PFN_SECTION_SHIFT
-- 
2.20.1



* [PATCH 22/49] mm/sparse: introduce compound page order to mem_section
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (20 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 21/49] mm/sparse: drop power-of-2 size requirement for struct mem_section Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 23/49] mm/mm_init: skip initializing shared tail pages for compound pages Muchun Song
                   ` (27 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

During early system boot and DAX device initialization, the establishment
of vmemmap page mappings via __populate_section_memmap() is done on a
per-section basis, followed by the initialization of the struct page area.

Currently, there are two scenarios utilizing HugeTLB Vmemmap Optimization
(HVO): HugeTLB and DAX.

For HugeTLB, the SPARSEMEM_VMEMMAP_PREINIT mechanism is used to apply HVO
(with Read-Write mappings), and later, vmemmap_wrprotect_hvo() is used to
enforce Read-Only mappings. HugeTLB has to manage its own related statistics
and metadata updates; the work done by hugetlb_vmemmap_init_early() is
somewhat similar to what sparse_init_nid() does. Furthermore, the shared
vmemmap tail pages allocated by vmemmap_get_tail() are left uninitialized
because they would be overwritten by the subsequent memmap_init(). We are
forced to compensate for this in hugetlb_vmemmap_init(). This limitation
also forces us to maintain two separate implementations of vmemmap_get_tail()
(one in hugetlb_vmemmap.c and another in sparse-vmemmap.c).

For DAX, HVO is already applied via __populate_section_memmap(), but it
does not employ Read-Only mappings, which introduces potential security
risks. Moreover, the fact that HugeTLB and DAX implement different logic
for what is essentially the same purpose increases code complexity and
maintenance burden.

The root cause of these issues is that a memory section is completely unaware
of the concept of compound pages. It cannot properly handle HVO or struct
page initialization for them.

To solve this, introduce the concept of compound page order to the memory
section (`struct mem_section`). Typically, a section holds compound pages
of a specific order, and a larger compound page will span multiple sections.
In the future, this order information can be utilized to unify and streamline
the aforementioned scenarios.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mmzone.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e3755666846..620503aa29ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2014,6 +2014,14 @@ struct mem_section {
 	 */
 	struct page_ext *page_ext;
 #endif
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+	/*
+	 * The order of compound pages in this section. Typically, the section
+	 * holds compound pages of this order; a larger compound page will span
+	 * multiple sections.
+	 */
+	unsigned int order;
+#endif
 };
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -2210,6 +2218,17 @@ static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long
 	*pfn = (*pfn & PAGE_SECTION_MASK) + (bit * PAGES_PER_SUBSECTION);
 	return true;
 }
+
+static inline void section_set_order(struct mem_section *section, unsigned int order)
+{
+	VM_BUG_ON(section->order && order && section->order != order);
+	section->order = order;
+}
+
+static inline unsigned int section_order(const struct mem_section *section)
+{
+	return section->order;
+}
 #else
 static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 {
@@ -2220,6 +2239,15 @@ static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long
 {
 	return true;
 }
+
+static inline void section_set_order(struct mem_section *section, unsigned int order)
+{
+}
+
+static inline unsigned int section_order(const struct mem_section *section)
+{
+	return 0;
+}
 #endif
 
 void sparse_init_early_section(int nid, struct page *map, unsigned long pnum,
-- 
2.20.1



* [PATCH 23/49] mm/mm_init: skip initializing shared tail pages for compound pages
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (21 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 22/49] mm/sparse: introduce compound page order to mem_section Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 24/49] mm/sparse-vmemmap: initialize shared tail vmemmap page upon allocation Muchun Song
                   ` (26 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Currently, memmap_init_range() unconditionally initializes all struct pages
within a section. However, when HugeTLB Vmemmap Optimization (HVO) is enabled,
shared vmemmap tail pages are allocated during the vmemmap population phase
(e.g., via vmemmap_get_tail()). These shared tail pages are left intentionally
uninitialized at that time because the subsequent memmap_init() would simply
overwrite them.

If memmap_init_range() continues to initialize these shared tail pages, it
will overwrite the carefully constructed HVO mappings and metadata. This forces
subsystems like HugeTLB to implement workarounds (like re-initializing or
compensating for the overwritten data in their own init routines, as seen
in hugetlb_vmemmap_init()).

Therefore, the primary motivation of this patch is to prevent memmap_init_range()
from incorrectly overwriting the shared vmemmap tail pages. By detecting if a
page is an optimizable compound vmemmap page (using the newly introduced section
order), we can safely skip its redundant initialization.

As a significant side-effect, skipping the initialization of these shared tail
pages also saves substantial CPU cycles during the early boot stage.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/internal.h | 11 +++++++++++
 mm/mm_init.c  | 19 +++++++++++++++----
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a8acabcd1d93..1060d7c07f5b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1011,6 +1011,17 @@ static inline void sparse_init_subsection_map(void)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+static inline bool vmemmap_page_optimizable(const struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned int order = section_order(__pfn_to_section(pfn));
+
+	if (!is_power_of_2(sizeof(struct page)))
+		return false;
+
+	return (pfn & ((1L << order) - 1)) >= OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS;
+}
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 977a837b7ef6..7f5b326e9298 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -676,12 +676,13 @@ static inline void fixup_hashdist(void) {}
 
 static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
 						       unsigned long nr_pages,
-						       int migratetype)
+						       int migratetype,
+						       bool isolate)
 {
 	unsigned long end = pfn + nr_pages;
 
 	for (pfn = pageblock_align(pfn); pfn < end; pfn += pageblock_nr_pages) {
-		init_pageblock_migratetype(pfn_to_page(pfn), migratetype, false);
+		init_pageblock_migratetype(pfn_to_page(pfn), migratetype, isolate);
 		cond_resched();
 	}
 }
@@ -912,6 +913,16 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 		}
 
 		page = pfn_to_page(pfn);
+		if (vmemmap_page_optimizable(page)) {
+			struct mem_section *ms = __pfn_to_section(pfn);
+			unsigned long start = pfn;
+
+			pfn = min(ALIGN(start, 1L << section_order(ms)), end_pfn);
+			pageblock_migratetype_init_range(start, pfn - start, migratetype,
+							 isolate_pageblock);
+			continue;
+		}
+
 		__init_single_page(page, pfn, zone, nid);
 		if (context == MEMINIT_HOTPLUG) {
 #ifdef CONFIG_ZONE_DEVICE
@@ -1138,7 +1149,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
 	 * because this is done early in section_activate()
 	 */
-	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);
+	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false);
 
 	pr_debug("%s initialised %lu pages in %ums\n", __func__,
 		nr_pages, jiffies_to_msecs(jiffies - start));
@@ -1963,7 +1974,7 @@ static void __init deferred_free_pages(unsigned long pfn,
 	if (!nr_pages)
 		return;
 
-	pageblock_migratetype_init_range(pfn, nr_pages, MIGRATE_MOVABLE);
+	pageblock_migratetype_init_range(pfn, nr_pages, MIGRATE_MOVABLE, false);
 
 	page = pfn_to_page(pfn);
 
-- 
2.20.1



* [PATCH 24/49] mm/sparse-vmemmap: initialize shared tail vmemmap page upon allocation
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (22 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 23/49] mm/mm_init: skip initializing shared tail pages for compound pages Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 25/49] mm/sparse-vmemmap: support vmemmap-optimizable compound page population Muchun Song
                   ` (25 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Previously, the shared vmemmap tail page allocated in vmemmap_get_tail()
was intentionally left uninitialized, because the subsequent
memmap_init_range() would unconditionally overwrite any initialization done
there. This forced subsystems such as HugeTLB to compensate by performing
the initialization later in their own routines (e.g. hugetlb_vmemmap_init()).

Thanks to the previous patch, memmap_init_range() is now aware of the
section's compound page order and safely skips the redundant initialization
for these optimizable compound vmemmap pages.

Because the overwrite issue is resolved, we can now fully initialize the
shared tail pages (via init_compound_tail()) immediately upon allocation
in vmemmap_get_tail(). This simplifies the initialization flow and removes
the need to defer this work to specific subsystems.

Note that the initialization logic in hugetlb_vmemmap_init() is not removed
yet. It will be completely removed once HugeTLB switches to the new memory
section compound page order mechanism.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/sparse-vmemmap.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 9f70559df4e8..2a6c3c82f9f5 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -340,18 +340,11 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
 	if (tail)
 		return tail;
 
-	/*
-	 * Only allocate the page, but do not initialize it.
-	 *
-	 * Any initialization done here will be overwritten by memmap_init().
-	 *
-	 * hugetlb_vmemmap_init() will take care of initialization after
-	 * memmap_init().
-	 */
-
 	p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 	if (!p)
 		return NULL;
+	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+		init_compound_tail(p + i, NULL, order, zone);
 
 	tail = virt_to_page(p);
 	zone->vmemmap_tails[idx] = tail;
-- 
2.20.1



* [PATCH 25/49] mm/sparse-vmemmap: support vmemmap-optimizable compound page population
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (23 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 24/49] mm/sparse-vmemmap: initialize shared tail vmemmap page upon allocation Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 26/49] mm/hugetlb: use generic vmemmap optimization macros Muchun Song
                   ` (24 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Previously, vmemmap optimization (HVO) was tightly coupled with HugeTLB
and relied on CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP. With the recent
introduction of compound page order to struct mem_section, we can now
generalize this optimization to be based on sections rather than being
HugeTLB-specific.

This patch refactors the vmemmap population logic to use the new
section-level order information: vmemmap_pte_populate() now dynamically
allocates or reuses the shared tail page when a section contains
optimizable compound pages.

These changes centralize the HVO logic within the core sparse-vmemmap
code, reducing code duplication and paving the way for unifying the vmemmap
optimization paths for both HugeTLB and DAX.
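The accounting done by the relocated section_vmemmap_pages() can be sketched
in userspace as follows. The constants are assumptions for illustration
(64-byte struct page, 128MiB sections with 4KiB pages, a 2-page reserve per
optimized folio) and the helper is a simplified stand-in, not the kernel
implementation:

```c
#include <assert.h>

#define PAGE_SZ            4096UL
#define STRUCT_PAGE_SZ       64UL	/* assumed typical sizeof(struct page) */
#define PFN_SECTION_SHIFT    15		/* assumed: 128MiB sections, 4KiB pages */
#define SECTION_PAGES      (1UL << PFN_SECTION_SHIFT)
#define RESERVE_NR            2UL	/* head vmemmap page + shared tail page */

/* simplified stand-in for section_vmemmap_pages(): vmemmap pages charged */
static unsigned long sketch_vmemmap_pages(unsigned long pfn,
					  unsigned long nr_pages,
					  unsigned int order, int optimizable)
{
	unsigned long per_compound = 1UL << order;

	if (!optimizable)	/* plain memmap: one struct page per base page */
		return (nr_pages * STRUCT_PAGE_SZ + PAGE_SZ - 1) / PAGE_SZ;

	if (order < PFN_SECTION_SHIFT)	/* many compounds per section */
		return RESERVE_NR * nr_pages / per_compound;

	if (pfn % per_compound == 0)	/* compound covers whole sections */
		return RESERVE_NR;

	return 0;	/* tail section of a gigantic folio: fully shared */
}
```

For an unoptimized 128MiB section this charges the full 512 vmemmap pages;
with 2MiB (order-9) folios it drops to 128, and a gigantic folio costs only
the reserve for its head section.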

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mmzone.h |  8 ++++-
 mm/internal.h          |  3 ++
 mm/sparse-vmemmap.c    | 66 +++++++++++++++++++++++++-----------------
 mm/sparse.c            | 30 +++++++++++++++++--
 4 files changed, 78 insertions(+), 29 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 620503aa29ba..e4d37492ca63 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1145,7 +1145,7 @@ struct zone {
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 	atomic_long_t		vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
 	struct page *vmemmap_tails[NR_OPTIMIZABLE_FOLIO_SIZES];
 #endif
 } ____cacheline_internodealigned_in_smp;
@@ -2250,6 +2250,12 @@ static inline unsigned int section_order(const struct mem_section *section)
 }
 #endif
 
+static inline bool section_vmemmap_optimizable(const struct mem_section *section)
+{
+	return is_power_of_2(sizeof(struct page)) &&
+	       section_order(section) >= OPTIMIZABLE_FOLIO_MIN_ORDER;
+}
+
 void sparse_init_early_section(int nid, struct page *map, unsigned long pnum,
 			       unsigned long flags);
 
diff --git a/mm/internal.h b/mm/internal.h
index 1060d7c07f5b..c0d0f546864c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -996,6 +996,9 @@ static inline void __section_mark_present(struct mem_section *ms,
 
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 }
+
+int section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
+			  struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
 #else
 static inline void sparse_init(void) {}
 #endif /* CONFIG_SPARSEMEM */
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 2a6c3c82f9f5..6522c36aac20 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -144,17 +144,47 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 			start, end - 1);
 }
 
+static struct zone __meminit *pfn_to_zone(unsigned long pfn, int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	for (enum zone_type zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+		struct zone *zone = &pgdat->node_zones[zone_type];
+
+		if (zone_spans_pfn(zone, pfn))
+			return zone;
+	}
+
+	return NULL;
+}
+
+static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone);
+
 static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 					      struct vmem_altmap *altmap,
 					      unsigned long ptpfn)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
+
 	if (pte_none(ptep_get(pte))) {
 		pte_t entry;
-		void *p;
+
+		if (vmemmap_page_optimizable((struct page *)addr) &&
+		    ptpfn == (unsigned long)-1) {
+			struct page *page;
+			unsigned long pfn = page_to_pfn((struct page *)addr);
+			const struct mem_section *ms = __pfn_to_section(pfn);
+
+			page = vmemmap_get_tail(section_order(ms),
+						pfn_to_zone(pfn, node));
+			if (!page)
+				return NULL;
+			ptpfn = page_to_pfn(page);
+		}
 
 		if (ptpfn == (unsigned long)-1) {
-			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+
 			if (!p)
 				return NULL;
 			ptpfn = PHYS_PFN(__pa(p));
@@ -323,7 +353,6 @@ void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
 	}
 }
 
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
 {
 	struct page *p, *tail;
@@ -352,6 +381,7 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
 	return tail;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
 				       unsigned int order, struct zone *zone,
 				       unsigned long headsize)
@@ -404,6 +434,9 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 		return vmemmap_populate_compound_pages(start, end, node, pgmap);
 
 	for (addr = start; addr < end; addr = next) {
+		unsigned long pfn = page_to_pfn((struct page *)addr);
+		const struct mem_section *ms = __pfn_to_section(pfn);
+
 		next = pmd_addr_end(addr, end);
 
 		pgd = vmemmap_pgd_populate(addr, node);
@@ -419,7 +452,7 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
-		if (pmd_none(pmdp_get(pmd))) {
+		if (pmd_none(pmdp_get(pmd)) && !section_vmemmap_optimizable(ms)) {
 			void *p;
 
 			p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
@@ -437,8 +470,10 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 				 */
 				return -ENOMEM;
 			}
-		} else if (vmemmap_check_pmd(pmd, node, addr, next))
+		} else if (vmemmap_check_pmd(pmd, node, addr, next)) {
+			VM_BUG_ON(section_vmemmap_optimizable(ms));
 			continue;
+		}
 		if (vmemmap_populate_basepages(addr, next, node, altmap, pgmap))
 			return -ENOMEM;
 	}
@@ -705,27 +740,6 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 	return rc;
 }
 
-static int __meminit section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
-					   struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
-{
-	unsigned int order = pgmap ? pgmap->vmemmap_shift : 0;
-	unsigned long pages_per_compound = 1L << order;
-
-	VM_BUG_ON(!IS_ALIGNED(pfn | nr_pages, min(pages_per_compound, PAGES_PER_SECTION)));
-	VM_BUG_ON(pfn_to_section_nr(pfn) != pfn_to_section_nr(pfn + nr_pages - 1));
-
-	if (!vmemmap_can_optimize(altmap, pgmap))
-		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
-
-	if (order < PFN_SECTION_SHIFT)
-		return VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;
-
-	if (IS_ALIGNED(pfn, pages_per_compound))
-		return VMEMMAP_RESERVE_NR;
-
-	return 0;
-}
-
 /*
  * To deactivate a memory region, there are 3 cases to handle:
  *
diff --git a/mm/sparse.c b/mm/sparse.c
index cfe4ffd89baf..62659752980e 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -345,6 +345,32 @@ static void __init sparse_usage_fini(void)
 	sparse_usagebuf = sparse_usagebuf_end = NULL;
 }
 
+int __meminit section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
+				    struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+{
+	const struct mem_section *ms = __pfn_to_section(pfn);
+	unsigned int order = pgmap ? pgmap->vmemmap_shift : section_order(ms);
+	unsigned long pages_per_compound = 1L << order;
+	unsigned int vmemmap_pages = OPTIMIZED_FOLIO_VMEMMAP_PAGES;
+
+	if (vmemmap_can_optimize(altmap, pgmap))
+		vmemmap_pages = VMEMMAP_RESERVE_NR;
+
+	VM_BUG_ON(!IS_ALIGNED(pfn | nr_pages, min(pages_per_compound, PAGES_PER_SECTION)));
+	VM_BUG_ON(pfn_to_section_nr(pfn) != pfn_to_section_nr(pfn + nr_pages - 1));
+
+	if (!vmemmap_can_optimize(altmap, pgmap) && !section_vmemmap_optimizable(ms))
+		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
+
+	if (order < PFN_SECTION_SHIFT)
+		return vmemmap_pages * nr_pages / pages_per_compound;
+
+	if (IS_ALIGNED(pfn, pages_per_compound))
+		return vmemmap_pages;
+
+	return 0;
+}
+
 /*
  * Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end)
  * And number of present sections in this node is map_count.
@@ -376,8 +402,8 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 							nid, NULL, NULL);
 			if (!map)
 				panic("Populate section (%ld) on node[%d] failed\n", pnum, nid);
-			memmap_boot_pages_add(DIV_ROUND_UP(PAGES_PER_SECTION * sizeof(struct page),
-							   PAGE_SIZE));
+			memmap_boot_pages_add(section_vmemmap_pages(pfn, PAGES_PER_SECTION,
+								    NULL, NULL));
 			sparse_init_early_section(nid, map, pnum, 0);
 		}
 	}
-- 
2.20.1



* [PATCH 26/49] mm/hugetlb: use generic vmemmap optimization macros
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (24 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 25/49] mm/sparse-vmemmap: support vmemmap-optimizable compound page population Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 27/49] mm: call memblocks_present() before HugeTLB initialization Muchun Song
                   ` (23 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Use the generic macros OPTIMIZED_FOLIO_VMEMMAP_SIZE,
OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS and OPTIMIZED_FOLIO_VMEMMAP_PAGES
instead of the hugetlb-specific HUGETLB_VMEMMAP_RESERVE_SIZE and
HUGETLB_VMEMMAP_RESERVE_PAGES to describe the vmemmap-optimized folio.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c         |  4 ++--
 mm/hugetlb_vmemmap.c | 14 +++++++-------
 mm/hugetlb_vmemmap.h |  9 +--------
 3 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a00c9f3672b7..a7e0599802cb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3241,7 +3241,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 			 * be no contention.
 			 */
 			hugetlb_folio_init_tail_vmemmap(folio, h,
-					HUGETLB_VMEMMAP_RESERVE_PAGES,
+					OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS,
 					pages_per_huge_page(h));
 		}
 		hugetlb_bootmem_init_migratetype(folio, h);
@@ -3280,7 +3280,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
 		WARN_ON(folio_ref_count(folio) != 1);
 
 		hugetlb_folio_init_vmemmap(folio, h,
-					   HUGETLB_VMEMMAP_RESERVE_PAGES);
+					   OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS);
 		init_new_hugetlb_folio(folio);
 
 		if (hugetlb_bootmem_page_prehvo(m))
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index d6dd47c232e0..0af528c0e229 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -407,7 +407,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	vmemmap_start	= (unsigned long)&folio->page;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
 
-	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
+	vmemmap_start	+= OPTIMIZED_FOLIO_VMEMMAP_SIZE;
 
 	/*
 	 * The pages which the vmemmap virtual address range [@vmemmap_start,
@@ -637,10 +637,10 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 			spfn = (unsigned long)&folio->page;
 			epfn = spfn + pages_per_huge_page(h);
 			vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
-					HUGETLB_VMEMMAP_RESERVE_SIZE);
+					OPTIMIZED_FOLIO_VMEMMAP_SIZE);
 			register_page_bootmem_memmap(pfn_to_section_nr(spfn),
 					&folio->page,
-					HUGETLB_VMEMMAP_RESERVE_SIZE);
+					OPTIMIZED_FOLIO_VMEMMAP_SIZE);
 			continue;
 		}
 
@@ -791,8 +791,8 @@ void __init hugetlb_vmemmap_init_early(int nid)
 		zone = pfn_to_zone(nid, pfn);
 
 		BUG_ON(vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
-					    zone, HUGETLB_VMEMMAP_RESERVE_SIZE));
-		memmap_boot_pages_add(HUGETLB_VMEMMAP_RESERVE_SIZE / PAGE_SIZE);
+					    zone, OPTIMIZED_FOLIO_VMEMMAP_SIZE));
+		memmap_boot_pages_add(OPTIMIZED_FOLIO_VMEMMAP_PAGES);
 
 		pnum = pfn_to_section_nr(pfn);
 		ns = psize / section_size;
@@ -824,8 +824,8 @@ static int __init hugetlb_vmemmap_init(void)
 	const struct hstate *h;
 	struct zone *zone;
 
-	/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
-	BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
+	/* OPTIMIZED_FOLIO_VMEMMAP_SIZE should cover all used struct pages */
+	BUILD_BUG_ON(__NR_USED_SUBPAGE > OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS);
 
 	for_each_zone(zone) {
 		for (int i = 0; i < NR_OPTIMIZABLE_FOLIO_SIZES; i++) {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 7ac49c52457d..66e11893d076 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -12,13 +12,6 @@
 #include <linux/io.h>
 #include <linux/memblock.h>
 
-/*
- * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
- * Documentation/mm/vmemmap_dedup.rst.
- */
-#define HUGETLB_VMEMMAP_RESERVE_SIZE	PAGE_SIZE
-#define HUGETLB_VMEMMAP_RESERVE_PAGES	(HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page))
-
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);
 long hugetlb_vmemmap_restore_folios(const struct hstate *h,
@@ -43,7 +36,7 @@ static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
  */
 static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
 {
-	int size = hugetlb_vmemmap_size(h) - HUGETLB_VMEMMAP_RESERVE_SIZE;
+	int size = hugetlb_vmemmap_size(h) - OPTIMIZED_FOLIO_VMEMMAP_SIZE;
 
 	if (!is_power_of_2(sizeof(struct page)))
 		return 0;
-- 
2.20.1



* [PATCH 27/49] mm: call memblocks_present() before HugeTLB initialization
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (25 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 26/49] mm/hugetlb: use generic vmemmap optimization macros Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 28/49] mm/hugetlb: switch HugeTLB to use generic vmemmap optimization Muchun Song
                   ` (22 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Extract memblocks_present() from sparse_init() and call it earlier in
mm_core_init_early().

This ensures that the struct mem_section array is properly allocated
and marked as present before HugeTLB bootmem allocation. This is a
necessary preparation for the subsequent patches, which need to set the
section order for HugeTLB pages early.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/internal.h | 2 ++
 mm/mm_init.c  | 1 +
 mm/sparse.c   | 4 +---
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c0d0f546864c..27c06250d6b8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -963,6 +963,7 @@ void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
  * mm/sparse.c
  */
 #ifdef CONFIG_SPARSEMEM
+void memblocks_present(void);
 void sparse_init(void);
 int sparse_index_init(unsigned long section_nr, int nid);
 
@@ -1000,6 +1001,7 @@ static inline void __section_mark_present(struct mem_section *ms,
 int section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
 			  struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
 #else
+static inline void memblocks_present(void) {}
 static inline void sparse_init(void) {}
 #endif /* CONFIG_SPARSEMEM */
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f5b326e9298..b47f65425bc1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2675,6 +2675,7 @@ void __init __weak mem_init(void)
 
 void __init mm_core_init_early(void)
 {
+	memblocks_present();
 	free_area_init();
 	/* Zone data structures are available from here. */
 	hugetlb_cma_reserve();
diff --git a/mm/sparse.c b/mm/sparse.c
index 62659752980e..7779554c5a0c 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -195,7 +195,7 @@ static void __init memory_present(int nid, unsigned long start, unsigned long en
  * This is a convenience function that is useful to mark all of the systems
  * memory as present during initialization.
  */
-static void __init memblocks_present(void)
+void __init memblocks_present(void)
 {
 	unsigned long start, end;
 	int i, nid;
@@ -420,8 +420,6 @@ void __init sparse_init(void)
 	unsigned long pnum_end, pnum_begin, map_count = 1;
 	int nid_begin;
 
-	memblocks_present();
-
 	if (compound_info_has_mask()) {
 		VM_WARN_ON_ONCE(!IS_ALIGNED((unsigned long) pfn_to_page(0),
 				    MAX_FOLIO_VMEMMAP_ALIGN));
-- 
2.20.1



* [PATCH 28/49] mm/hugetlb: switch HugeTLB to use generic vmemmap optimization
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (26 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 27/49] mm: call memblocks_present() before HugeTLB initialization Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 29/49] mm: extract pfn_to_zone() helper Muchun Song
                   ` (21 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Switch the HugeTLB vmemmap optimization to use the new infrastructure
introduced in the previous patches (specifically, the compound page
support in sparse-vmemmap).

Previously, optimizing bootmem HugeTLB pages required dedicated and
complex pre-initialization logic, such as hugetlb_vmemmap_init_early()
and vmemmap_populate_hvo(). This approach manually handled page mapping
and initialization at a very early stage.

This patch removes all those special-cased functions and instead calls
hugetlb_vmemmap_optimize_bootmem_page() directly from alloc_bootmem().
By explicitly setting the compound page order in the mem_section
(via section_set_order_pfn_range()), the generic sparse-vmemmap
initialization code now handles the shared tail page mapping for these
bootmem pages automatically.

This significantly simplifies the code, eliminates duplicate logic, and
seamlessly integrates bootmem vmemmap optimization with the generic
vmemmap optimization flow.
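The alignment guard in section_set_order_pfn_range() is what keeps unaligned
bootmem pages on the unoptimized path. A userspace sketch of that guard
(SECTION_PAGES and the flat array of orders are illustrative stand-ins for
PAGES_PER_SECTION and the per-mem_section order field):

```c
#include <assert.h>

#define SECTION_PAGES (1UL << 15)	/* assumed: 128MiB sections, 4KiB pages */
#define NR_SECTIONS   8

static unsigned int section_orders[NR_SECTIONS]; /* stand-in for mem_section state */

/* mirrors section_set_order_pfn_range(): only whole, section-aligned
 * pfn ranges record an order; anything else is silently left at 0 */
static void sketch_set_order_pfn_range(unsigned long pfn,
				       unsigned long nr_pages,
				       unsigned int order)
{
	if ((pfn | nr_pages) % SECTION_PAGES)
		return;	/* unaligned bootmem page: generic path cannot optimize it */

	for (unsigned long i = 0; i < nr_pages / SECTION_PAGES; i++)
		section_orders[pfn / SECTION_PAGES + i] = order;
}
```

This is why the caller then re-checks section_vmemmap_optimizable() before
setting HUGE_BOOTMEM_HVO: the set may have been a no-op.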

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mm.h     |   3 -
 include/linux/mmzone.h |  13 +++++
 mm/bootmem_info.c      |   5 +-
 mm/hugetlb.c           |   8 ++-
 mm/hugetlb_vmemmap.c   | 121 +++--------------------------------------
 mm/hugetlb_vmemmap.h   |  11 ++--
 mm/sparse-vmemmap.c    |  29 ----------
 7 files changed, 32 insertions(+), 158 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index aa8c05de7585..93e447468131 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4877,9 +4877,6 @@ int vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 			       struct dev_pagemap *pgmap);
 int vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
-int vmemmap_populate_hvo(unsigned long start, unsigned long end,
-			 unsigned int order, struct zone *zone,
-			 unsigned long headsize);
 void vmemmap_wrprotect_hvo(unsigned long start, unsigned long end, int node,
 			  unsigned long headsize);
 void vmemmap_populate_print_last(void);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e4d37492ca63..0bd20efac427 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2250,6 +2250,19 @@ static inline unsigned int section_order(const struct mem_section *section)
 }
 #endif
 
+static inline void section_set_order_pfn_range(unsigned long pfn,
+					       unsigned long nr_pages,
+					       unsigned int order)
+{
+	unsigned long section_nr = pfn_to_section_nr(pfn);
+
+	if (!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION))
+		return;
+
+	for (int i = 0; i < nr_pages / PAGES_PER_SECTION; i++)
+		section_set_order(__nr_to_section(section_nr + i), order);
+}
+
 static inline bool section_vmemmap_optimizable(const struct mem_section *section)
 {
 	return is_power_of_2(sizeof(struct page)) &&
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
index 3d7675a3ae04..24f45d86ffb3 100644
--- a/mm/bootmem_info.c
+++ b/mm/bootmem_info.c
@@ -51,9 +51,8 @@ static void __init register_page_bootmem_info_section(unsigned long start_pfn)
 	section_nr = pfn_to_section_nr(start_pfn);
 	ms = __nr_to_section(section_nr);
 
-	if (!preinited_vmemmap_section(ms))
-		register_page_bootmem_memmap(section_nr, pfn_to_page(start_pfn),
-					     PAGES_PER_SECTION);
+	register_page_bootmem_memmap(section_nr, pfn_to_page(start_pfn),
+				     PAGES_PER_SECTION);
 
 	usage = ms->usage;
 	page = virt_to_page(usage);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a7e0599802cb..dff94ab7040a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3096,6 +3096,7 @@ static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
 		 * is not up yet.
 		 */
 		INIT_LIST_HEAD(&m->list);
+		m->hstate = h;
 		if (pfn_range_intersects_zones(listnode, PHYS_PFN(virt_to_phys(m)),
 					       pages_per_huge_page(h))) {
 			VM_BUG_ON(hugetlb_bootmem_page_earlycma(m));
@@ -3103,8 +3104,8 @@ static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
 		} else {
 			list_add_tail(&m->list, &huge_boot_pages[listnode]);
 			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+			hugetlb_vmemmap_optimize_bootmem_page(m);
 		}
-		m->hstate = h;
 	}
 
 	return m;
@@ -3283,13 +3284,16 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
 					   OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS);
 		init_new_hugetlb_folio(folio);
 
-		if (hugetlb_bootmem_page_prehvo(m))
+		if (hugetlb_bootmem_page_prehvo(m)) {
 			/*
 			 * If pre-HVO was done, just set the
 			 * flag, the HVO code will then skip
 			 * this folio.
 			 */
 			folio_set_hugetlb_vmemmap_optimized(folio);
+			section_set_order_pfn_range(folio_pfn(folio),
+						    pages_per_huge_page(h), 0);
+		}
 
 		if (hugetlb_bootmem_page_earlycma(m))
 			folio_set_hugetlb_cma(folio);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 0af528c0e229..8c567b8c67cc 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -638,9 +638,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 			epfn = spfn + pages_per_huge_page(h);
 			vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
 					OPTIMIZED_FOLIO_VMEMMAP_SIZE);
-			register_page_bootmem_memmap(pfn_to_section_nr(spfn),
-					&folio->page,
-					OPTIMIZED_FOLIO_VMEMMAP_SIZE);
 			continue;
 		}
 
@@ -706,108 +703,21 @@ void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head
 	__hugetlb_vmemmap_optimize_folios(h, folio_list, true);
 }
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-
-/* Return true of a bootmem allocated HugeTLB page should be pre-HVO-ed */
-static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
-{
-	unsigned long section_size, psize, pmd_vmemmap_size;
-	phys_addr_t paddr;
-
-	if (!READ_ONCE(vmemmap_optimize_enabled))
-		return false;
-
-	if (!hugetlb_vmemmap_optimizable(m->hstate))
-		return false;
-
-	psize = huge_page_size(m->hstate);
-	paddr = virt_to_phys(m);
-
-	/*
-	 * Pre-HVO only works if the bootmem huge page
-	 * is aligned to the section size.
-	 */
-	section_size = (1UL << PA_SECTION_SHIFT);
-	if (!IS_ALIGNED(paddr, section_size) ||
-	    !IS_ALIGNED(psize, section_size))
-		return false;
-
-	/*
-	 * The pre-HVO code does not deal with splitting PMDS,
-	 * so the bootmem page must be aligned to the number
-	 * of base pages that can be mapped with one vmemmap PMD.
-	 */
-	pmd_vmemmap_size = (PMD_SIZE / (sizeof(struct page))) << PAGE_SHIFT;
-	if (!IS_ALIGNED(paddr, pmd_vmemmap_size) ||
-	    !IS_ALIGNED(psize, pmd_vmemmap_size))
-		return false;
-
-	return true;
-}
-
-static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
+void __init hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
 {
-	struct zone *zone;
-	enum zone_type zone_type;
-
-	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
-		zone = &NODE_DATA(nid)->node_zones[zone_type];
-		if (zone_spans_pfn(zone, pfn))
-			return zone;
-	}
-
-	return NULL;
-}
-
-/*
- * Initialize memmap section for a gigantic page, HVO-style.
- */
-void __init hugetlb_vmemmap_init_early(int nid)
-{
-	unsigned long psize, paddr, section_size;
-	unsigned long ns, i, pnum, pfn, nr_pages;
-	unsigned long start, end;
-	struct huge_bootmem_page *m = NULL;
-	void *map;
+	struct hstate *h = m->hstate;
+	unsigned long pfn = PHYS_PFN(virt_to_phys(m));
 
 	if (!READ_ONCE(vmemmap_optimize_enabled))
 		return;
 
-	section_size = (1UL << PA_SECTION_SHIFT);
-
-	list_for_each_entry(m, &huge_boot_pages[nid], list) {
-		struct zone *zone;
-
-		if (!vmemmap_should_optimize_bootmem_page(m))
-			continue;
-
-		nr_pages = pages_per_huge_page(m->hstate);
-		psize = nr_pages << PAGE_SHIFT;
-		paddr = virt_to_phys(m);
-		pfn = PHYS_PFN(paddr);
-		map = pfn_to_page(pfn);
-		start = (unsigned long)map;
-		end = start + nr_pages * sizeof(struct page);
-		zone = pfn_to_zone(nid, pfn);
-
-		BUG_ON(vmemmap_populate_hvo(start, end, huge_page_order(m->hstate),
-					    zone, OPTIMIZED_FOLIO_VMEMMAP_SIZE));
-		memmap_boot_pages_add(OPTIMIZED_FOLIO_VMEMMAP_PAGES);
-
-		pnum = pfn_to_section_nr(pfn);
-		ns = psize / section_size;
-
-		for (i = 0; i < ns; i++) {
-			sparse_init_early_section(nid, map, pnum,
-					SECTION_IS_VMEMMAP_PREINIT);
-			map += section_map_size();
-			pnum++;
-		}
+	if (!hugetlb_vmemmap_optimizable(h))
+		return;
 
+	section_set_order_pfn_range(pfn, pages_per_huge_page(h), huge_page_order(h));
+	if (section_vmemmap_optimizable(__pfn_to_section(pfn)))
 		m->flags |= HUGE_BOOTMEM_HVO;
-	}
 }
-#endif
 
 static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
 	{
@@ -822,27 +732,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
 static int __init hugetlb_vmemmap_init(void)
 {
 	const struct hstate *h;
-	struct zone *zone;
 
 	/* OPTIMIZED_FOLIO_VMEMMAP_SIZE should cover all used struct pages */
 	BUILD_BUG_ON(__NR_USED_SUBPAGE > OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS);
 
-	for_each_zone(zone) {
-		for (int i = 0; i < NR_OPTIMIZABLE_FOLIO_SIZES; i++) {
-			struct page *tail, *p;
-			unsigned int order;
-
-			tail = zone->vmemmap_tails[i];
-			if (!tail)
-				continue;
-
-			order = i + OPTIMIZABLE_FOLIO_MIN_ORDER;
-			p = page_to_virt(tail);
-			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
-				init_compound_tail(p + j, NULL, order, zone);
-		}
-	}
-
 	for_each_hstate(h) {
 		if (hugetlb_vmemmap_optimizable(h)) {
 			register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 66e11893d076..ff8e4c6e9833 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -20,10 +20,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
 void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-void hugetlb_vmemmap_init_early(int nid);
-#endif
-
+void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m);
 
 static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
 {
@@ -69,13 +66,13 @@ static inline void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h,
 {
 }
 
-static inline void hugetlb_vmemmap_init_early(int nid)
+static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
 {
+	return 0;
 }
 
-static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
+static inline void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
 {
-	return 0;
 }
 #endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
 
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 6522c36aac20..d266bcf45b5c 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -32,7 +32,6 @@
 #include <asm/dma.h>
 #include <asm/tlbflush.h>
 
-#include "hugetlb_vmemmap.h"
 #include "internal.h"
 
 /*
@@ -381,33 +380,6 @@ static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *
 	return tail;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
-				       unsigned int order, struct zone *zone,
-				       unsigned long headsize)
-{
-	unsigned long maddr;
-	struct page *tail;
-	pte_t *pte;
-	int node = zone_to_nid(zone);
-
-	tail = vmemmap_get_tail(order, zone);
-	if (!tail)
-		return -ENOMEM;
-
-	for (maddr = addr; maddr < addr + headsize; maddr += PAGE_SIZE) {
-		pte = vmemmap_populate_address(maddr, node, NULL, -1);
-		if (!pte)
-			return -ENOMEM;
-	}
-
-	/*
-	 * Reuse the last page struct page mapped above for the rest.
-	 */
-	return vmemmap_populate_range(maddr, end, node, NULL, page_to_pfn(tail));
-}
-#endif
-
 void __weak __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
 				      unsigned long addr, unsigned long next)
 {
@@ -595,7 +567,6 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
  */
 void __init sparse_vmemmap_init_nid_early(int nid)
 {
-	hugetlb_vmemmap_init_early(nid);
 }
 #endif
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 29/49] mm: extract pfn_to_zone() helper
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (27 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 28/49] mm/hugetlb: switch HugeTLB to use generic vmemmap optimization Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 30/49] mm/sparse-vmemmap: remove unused SPARSEMEM_VMEMMAP_PREINIT feature Muchun Song
                   ` (20 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Move pfn_to_zone() from mm/sparse-vmemmap.c to mm/mm_init.c, declare it
in mm/internal.h, and use it in __init_page_from_nid(). This removes the
duplicated loop that finds the zone spanning a given PFN.
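
The consolidated helper is just a linear scan over a node's zones. A
minimal userspace sketch of the same lookup pattern, using simplified
stand-in structures (the layouts below are assumptions for illustration,
not the kernel's exact definitions):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_ZONES 4

/* Simplified stand-ins for the kernel's zone/pgdat structures. */
struct zone {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

struct pgdat {
	struct zone node_zones[MAX_NR_ZONES];
};

static int zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
	return pfn >= zone->start_pfn &&
	       pfn < zone->start_pfn + zone->spanned_pages;
}

/*
 * Mirror of the consolidated helper: linearly scan the node's zones and
 * return the first one whose span covers @pfn, or NULL if none does.
 */
static struct zone *pfn_to_zone(struct pgdat *pgdat, unsigned long pfn)
{
	for (int i = 0; i < MAX_NR_ZONES; i++) {
		struct zone *zone = &pgdat->node_zones[i];

		if (zone_spans_pfn(zone, pfn))
			return zone;
	}
	return NULL;
}
```

Unlike the pre-patch __init_page_from_nid(), which fell out of the loop
with a possibly out-of-range zone index, returning NULL makes the
"no zone spans this PFN" case explicit to callers.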

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/internal.h       |  1 +
 mm/mm_init.c        | 28 ++++++++++++++++------------
 mm/sparse-vmemmap.c | 14 --------------
 3 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 27c06250d6b8..b569d8309f4d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1350,6 +1350,7 @@ static inline bool deferred_pages_enabled(void)
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
+struct zone *pfn_to_zone(unsigned long pfn, int nid);
 void init_deferred_page(unsigned long pfn, int nid);
 
 enum mminit_level {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index b47f65425bc1..e47d08b63154 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -687,24 +687,28 @@ static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
 	}
 }
 
+struct zone __meminit *pfn_to_zone(unsigned long pfn, int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	for (enum zone_type zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+		struct zone *zone = &pgdat->node_zones[zone_type];
+
+		if (zone_spans_pfn(zone, pfn))
+			return zone;
+	}
+
+	return NULL;
+}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 /*
  * Initialize a reserved page unconditionally, finding its zone first.
  */
 static void __meminit __init_page_from_nid(unsigned long pfn, int nid)
 {
-	pg_data_t *pgdat;
-	int zid;
-
-	pgdat = NODE_DATA(nid);
-
-	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-		struct zone *zone = &pgdat->node_zones[zid];
-
-		if (zone_spans_pfn(zone, pfn))
-			break;
-	}
-	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	__init_single_page(pfn_to_page(pfn), pfn,
+			   zone_idx(pfn_to_zone(pfn, nid)), nid);
 
 	if (pageblock_aligned(pfn))
 		init_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_MOVABLE,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d266bcf45b5c..9da49b0d03f0 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -143,20 +143,6 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 			start, end - 1);
 }
 
-static struct zone __meminit *pfn_to_zone(unsigned long pfn, int nid)
-{
-	pg_data_t *pgdat = NODE_DATA(nid);
-
-	for (enum zone_type zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
-		struct zone *zone = &pgdat->node_zones[zone_type];
-
-		if (zone_spans_pfn(zone, pfn))
-			return zone;
-	}
-
-	return NULL;
-}
-
 static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone);
 
 static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-- 
2.20.1



* [PATCH 30/49] mm/sparse-vmemmap: remove unused SPARSEMEM_VMEMMAP_PREINIT feature
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (28 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 29/49] mm: extract pfn_to_zone() helper Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 31/49] mm/hugetlb: remove HUGE_BOOTMEM_HVO flag and simplify pre-HVO logic Muchun Song
                   ` (19 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Since the bootmem vmemmap optimization has been reimplemented to use
the new early compound vmemmap infrastructure, the old
SPARSEMEM_VMEMMAP_PREINIT feature and its related code (e.g.,
sparse_vmemmap_init_nid_early(), preinited_vmemmap_section()) are no
longer used.

Remove them to clean up the code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/x86/Kconfig       |  1 -
 fs/Kconfig             |  1 -
 include/linux/mmzone.h | 25 -------------------------
 mm/Kconfig             |  5 -----
 mm/sparse-vmemmap.c    | 13 -------------
 mm/sparse.c            | 23 ++++++++---------------
 6 files changed, 8 insertions(+), 60 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 99bb5217649a..f19625648f0f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -148,7 +148,6 @@ config X86
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP	if X86_64
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
-	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
diff --git a/fs/Kconfig b/fs/Kconfig
index 43cb06de297f..e70aa5f0429a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -278,7 +278,6 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	def_bool HUGETLB_PAGE
 	depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	depends on SPARSEMEM_VMEMMAP
-	select SPARSEMEM_VMEMMAP_PREINIT if ARCH_WANT_HUGETLB_VMEMMAP_PREINIT
 
 config HUGETLB_PMD_PAGE_TABLE_SHARING
 	def_bool HUGETLB_PAGE
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0bd20efac427..75425407e0c4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2078,9 +2078,6 @@ enum {
 	SECTION_IS_EARLY_BIT,
 #ifdef CONFIG_ZONE_DEVICE
 	SECTION_TAINT_ZONE_DEVICE_BIT,
-#endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-	SECTION_IS_VMEMMAP_PREINIT_BIT,
 #endif
 	SECTION_MAP_LAST_BIT,
 };
@@ -2092,9 +2089,6 @@ enum {
 #ifdef CONFIG_ZONE_DEVICE
 #define SECTION_TAINT_ZONE_DEVICE	BIT(SECTION_TAINT_ZONE_DEVICE_BIT)
 #endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-#define SECTION_IS_VMEMMAP_PREINIT	BIT(SECTION_IS_VMEMMAP_PREINIT_BIT)
-#endif
 #define SECTION_MAP_MASK		(~(BIT(SECTION_MAP_LAST_BIT) - 1))
 #define SECTION_NID_SHIFT		SECTION_MAP_LAST_BIT
 
@@ -2149,24 +2143,6 @@ static inline int online_device_section(const struct mem_section *section)
 }
 #endif
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-static inline int preinited_vmemmap_section(const struct mem_section *section)
-{
-	return (section &&
-		(section->section_mem_map & SECTION_IS_VMEMMAP_PREINIT));
-}
-
-void sparse_vmemmap_init_nid_early(int nid);
-#else
-static inline int preinited_vmemmap_section(const struct mem_section *section)
-{
-	return 0;
-}
-static inline void sparse_vmemmap_init_nid_early(int nid)
-{
-}
-#endif
-
 static inline int online_section_nr(unsigned long nr)
 {
 	return online_section(__nr_to_section(nr));
@@ -2407,7 +2383,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
 #endif
 
 #else
-#define sparse_vmemmap_init_nid_early(_nid) do {} while (0)
 #define pfn_in_present_section pfn_valid
 #endif /* CONFIG_SPARSEMEM */
 
diff --git a/mm/Kconfig b/mm/Kconfig
index e8bf1e9e6ad9..3cce862088f1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,8 +410,6 @@ config SPARSEMEM_VMEMMAP
 	  pfn_to_page and page_to_pfn operations.  This is the most
 	  efficient option when sufficient kernel resources are available.
 
-config SPARSEMEM_VMEMMAP_PREINIT
-	bool
 #
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB/dev_dax vmemmap optimization.
@@ -422,9 +420,6 @@ config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	bool
 
-config ARCH_WANT_HUGETLB_VMEMMAP_PREINIT
-	bool
-
 config HAVE_MEMBLOCK_PHYS_MAP
 	bool
 
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 9da49b0d03f0..c35d912a1fef 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -543,19 +543,6 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 	return pfn_to_page(pfn);
 }
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
-/*
- * This is called just before initializing sections for a NUMA node.
- * Any special initialization that needs to be done before the
- * generic initialization can be done from here. Sections that
- * are initialized in hooks called from here will be skipped by
- * the generic initialization.
- */
-void __init sparse_vmemmap_init_nid_early(int nid)
-{
-}
-#endif
-
 static void subsection_mask_set(unsigned long *map, unsigned long pfn,
 		unsigned long nr_pages)
 {
diff --git a/mm/sparse.c b/mm/sparse.c
index 7779554c5a0c..04c641b97325 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -385,27 +385,20 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 		panic("The node[%d] usemap allocation failed\n", nid);
 	sparse_buffer_init(map_count * section_map_size(), nid);
 
-	sparse_vmemmap_init_nid_early(nid);
-
 	for_each_present_section_nr(pnum_begin, pnum) {
-		struct mem_section *ms;
 		unsigned long pfn = section_nr_to_pfn(pnum);
+		struct page *map;
 
 		if (pnum >= pnum_end)
 			break;
 
-		ms = __nr_to_section(pnum);
-		if (!preinited_vmemmap_section(ms)) {
-			struct page *map;
-
-			map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-							nid, NULL, NULL);
-			if (!map)
-				panic("Populate section (%ld) on node[%d] failed\n", pnum, nid);
-			memmap_boot_pages_add(section_vmemmap_pages(pfn, PAGES_PER_SECTION,
-								    NULL, NULL));
-			sparse_init_early_section(nid, map, pnum, 0);
-		}
+		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
+						nid, NULL, NULL);
+		if (!map)
+			panic("Populate section (%ld) on node[%d] failed\n", pnum, nid);
+		memmap_boot_pages_add(section_vmemmap_pages(pfn, PAGES_PER_SECTION,
+							    NULL, NULL));
+		sparse_init_early_section(nid, map, pnum, 0);
 	}
 	sparse_usage_fini();
 	sparse_buffer_fini();
-- 
2.20.1



* [PATCH 31/49] mm/hugetlb: remove HUGE_BOOTMEM_HVO flag and simplify pre-HVO logic
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (29 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 30/49] mm/sparse-vmemmap: remove unused SPARSEMEM_VMEMMAP_PREINIT feature Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 32/49] mm/sparse-vmemmap: consolidate shared tail page allocation Muchun Song
                   ` (18 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The pre-HVO feature is used to optimize the vmemmap pages of HugeTLB
bootmem pages. Previously, the HUGE_BOOTMEM_HVO flag was used to
indicate whether a bootmem page has been pre-optimized.

However, we can directly determine if a huge page is pre-optimized by
checking its section's optimization status using
section_vmemmap_optimizable(). The pre-initialization mechanism of
vmemmap has been completely removed in previous patches, making the
HUGE_BOOTMEM_HVO flag and its related checks redundant.

By directly using section_vmemmap_optimizable(), we can safely remove
the HUGE_BOOTMEM_HVO flag, clean up the associated state maintenance in
struct huge_bootmem_page, and simplify the bootmem page optimization
checks in the hugetlb initialization path.
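
The underlying idea can be sketched in isolation: derive the answer from
the single authoritative record (the section) instead of caching it in a
flag that must be kept in sync. The structures below are hypothetical
stand-ins — the real struct huge_bootmem_page does not hold a section
pointer, and encoding "optimizable" as a nonzero order is likewise only
an illustration:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified stand-in for struct mem_section: here a nonzero recorded
 * compound-page order is what marks the section's memmap as
 * vmemmap-optimized (hypothetical encoding for illustration).
 */
struct mem_section {
	unsigned int order;
};

static bool section_vmemmap_optimizable(const struct mem_section *ms)
{
	return ms->order != 0;
}

/*
 * Bootmem page descriptor: rather than carrying a cached pre-HVO flag,
 * the "was this pre-optimized?" question is answered by consulting the
 * section directly, so the state cannot go stale.
 */
struct huge_bootmem_page {
	struct mem_section *section;
};

static bool bootmem_page_prehvo(const struct huge_bootmem_page *m)
{
	return section_vmemmap_optimizable(m->section);
}
```

The design choice mirrors the patch: once the section records enough
information, a derived predicate replaces duplicated per-page state.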

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h |  5 ++---
 mm/hugetlb.c            | 16 ++--------------
 mm/hugetlb_vmemmap.c    |  5 -----
 3 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 23d95ed6121f..6bedeaee9b79 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -695,9 +695,8 @@ struct huge_bootmem_page {
 	struct cma *cma;
 };
 
-#define HUGE_BOOTMEM_HVO		0x0001
-#define HUGE_BOOTMEM_ZONES_VALID	0x0002
-#define HUGE_BOOTMEM_CMA		0x0004
+#define HUGE_BOOTMEM_ZONES_VALID	BIT(0)
+#define HUGE_BOOTMEM_CMA		BIT(1)
 
 int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dff94ab7040a..59728e942384 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3193,11 +3193,6 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
 	prep_compound_head(&folio->page, huge_page_order(h));
 }
 
-static bool __init hugetlb_bootmem_page_prehvo(struct huge_bootmem_page *m)
-{
-	return m->flags & HUGE_BOOTMEM_HVO;
-}
-
 /*
  * memblock-allocated pageblocks might not have the migrate type set
  * if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
@@ -3284,16 +3279,9 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
 					   OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS);
 		init_new_hugetlb_folio(folio);
 
-		if (hugetlb_bootmem_page_prehvo(m)) {
-			/*
-			 * If pre-HVO was done, just set the
-			 * flag, the HVO code will then skip
-			 * this folio.
-			 */
+		if (section_vmemmap_optimizable(__pfn_to_section(folio_pfn(folio))))
 			folio_set_hugetlb_vmemmap_optimized(folio);
-			section_set_order_pfn_range(folio_pfn(folio),
-						    pages_per_huge_page(h), 0);
-		}
+		section_set_order_pfn_range(folio_pfn(folio), folio_nr_pages(folio), 0);
 
 		if (hugetlb_bootmem_page_earlycma(m))
 			folio_set_hugetlb_cma(folio);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 8c567b8c67cc..a190b9b94346 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -711,12 +711,7 @@ void __init hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
 	if (!READ_ONCE(vmemmap_optimize_enabled))
 		return;
 
-	if (!hugetlb_vmemmap_optimizable(h))
-		return;
-
 	section_set_order_pfn_range(pfn, pages_per_huge_page(h), huge_page_order(h));
-	if (section_vmemmap_optimizable(__pfn_to_section(pfn)))
-		m->flags |= HUGE_BOOTMEM_HVO;
 }
 
 static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
-- 
2.20.1



* [PATCH 32/49] mm/sparse-vmemmap: consolidate shared tail page allocation
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (30 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 31/49] mm/hugetlb: remove HUGE_BOOTMEM_HVO flag and simplify pre-HVO logic Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 33/49] mm: introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION Muchun Song
                   ` (17 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Currently, both HugeTLB and sparse-vmemmap have their own logic to get
or allocate the shared tail page for vmemmap optimization. The HugeTLB
version handles runtime concurrency using cmpxchg, while the
sparse-vmemmap version (used only at boot time) is simpler.

This patch unifies them into a single function in mm/sparse-vmemmap.c.

A new function, vmemmap_shared_tail_page(), is introduced: it returns
the shared page frame used to map the tail vmemmap pages of a compound
page.

Furthermore, vmemmap_alloc_block_zero() is used as an allocation method
that is safe in both situations:

1. It calls alloc_pages_node() (via vmemmap_alloc_block()) once slab is
   available.

2. It falls back to the bootmem allocator during early boot.

This makes the function suitable for use in both early boot
(sparse-vmemmap init) and runtime (HugeTLB HVO) contexts.

This reduces code duplication and ensures consistent behavior.
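
The once-only publication scheme the unified function relies on can be
modeled in userspace with C11 atomics. READ_ONCE/cmpxchg and the page
allocator are replaced by stand-ins here, so this is a sketch of the
concurrency pattern, not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/*
 * Allocate a candidate zeroed page, try to install it with
 * compare-and-swap, and free the duplicate if another caller won the
 * race. Every caller ends up with the same shared page.
 */
static _Atomic(void *) shared_tail;
static int dup_frees;	/* counts losing allocations, for illustration */

static void *get_shared_tail(void)
{
	void *page = atomic_load(&shared_tail);
	void *expected = NULL;

	if (page)		/* fast path: already published */
		return page;

	page = calloc(1, 4096);	/* stand-in for a zeroed page frame */
	if (!page)
		return NULL;

	if (!atomic_compare_exchange_strong(&shared_tail, &expected, page)) {
		/* Somebody else installed theirs first: drop ours. */
		free(page);
		dup_frees++;
		page = atomic_load(&shared_tail);
	}
	return page;
}
```

The losing allocation is freed rather than leaked, which is why the
kernel version asserts slab availability before calling __free_page() on
the duplicate.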

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mm.h   |  1 +
 mm/hugetlb_vmemmap.c | 28 +---------------------------
 mm/sparse-vmemmap.c  | 42 +++++++++++++++++++++---------------------
 3 files changed, 23 insertions(+), 48 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 93e447468131..15841829b7eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4880,6 +4880,7 @@ int vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_wrprotect_hvo(unsigned long start, unsigned long end, int node,
 			  unsigned long headsize);
 void vmemmap_populate_print_last(void);
+struct page *vmemmap_shared_tail_page(unsigned int order, struct zone *zone);
 #ifdef CONFIG_MEMORY_HOTPLUG
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a190b9b94346..a7ea98fcc18e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -493,32 +493,6 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
 	return true;
 }
 
-static struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
-{
-	const unsigned int idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
-	struct page *tail, *p;
-	int node = zone_to_nid(zone);
-
-	tail = READ_ONCE(zone->vmemmap_tails[idx]);
-	if (likely(tail))
-		return tail;
-
-	tail = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
-	if (!tail)
-		return NULL;
-
-	p = page_to_virt(tail);
-	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
-		init_compound_tail(p + i, NULL, order, zone);
-
-	if (cmpxchg(&zone->vmemmap_tails[idx], NULL, tail)) {
-		__free_page(tail);
-		tail = READ_ONCE(zone->vmemmap_tails[idx]);
-	}
-
-	return tail;
-}
-
 static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 					    struct folio *folio,
 					    struct list_head *vmemmap_pages,
@@ -535,7 +509,7 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 		return ret;
 
 	nid = folio_nid(folio);
-	vmemmap_tail = vmemmap_get_tail(h->order, folio_zone(folio));
+	vmemmap_tail = vmemmap_shared_tail_page(h->order, folio_zone(folio));
 	if (!vmemmap_tail)
 		return -ENOMEM;
 
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c35d912a1fef..309d935fb05e 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -143,8 +143,6 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 			start, end - 1);
 }
 
-static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone);
-
 static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 					      struct vmem_altmap *altmap,
 					      unsigned long ptpfn)
@@ -160,8 +158,8 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
 			unsigned long pfn = page_to_pfn((struct page *)addr);
 			const struct mem_section *ms = __pfn_to_section(pfn);
 
-			page = vmemmap_get_tail(section_order(ms),
-						pfn_to_zone(pfn, node));
+			page = vmemmap_shared_tail_page(section_order(ms),
+							pfn_to_zone(pfn, node));
 			if (!page)
 				return NULL;
 			ptpfn = page_to_pfn(page);
@@ -338,32 +336,34 @@ void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
 	}
 }
 
-static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
+struct page *vmemmap_shared_tail_page(unsigned int order, struct zone *zone)
 {
-	struct page *p, *tail;
-	unsigned int idx;
-	int node = zone_to_nid(zone);
+	void *addr;
+	struct page *page;
+	unsigned int idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
 
-	if (WARN_ON_ONCE(order < OPTIMIZABLE_FOLIO_MIN_ORDER))
-		return NULL;
-	if (WARN_ON_ONCE(order > MAX_FOLIO_ORDER))
+	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(zone->vmemmap_tails)))
 		return NULL;
 
-	idx = order - OPTIMIZABLE_FOLIO_MIN_ORDER;
-	tail = zone->vmemmap_tails[idx];
-	if (tail)
-		return tail;
+	page = READ_ONCE(zone->vmemmap_tails[idx]);
+	if (likely(page))
+		return page;
 
-	p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-	if (!p)
+	addr = vmemmap_alloc_block_zero(PAGE_SIZE, zone_to_nid(zone));
+	if (!addr)
 		return NULL;
+
 	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
-		init_compound_tail(p + i, NULL, order, zone);
+		init_compound_tail((struct page *)addr + i, NULL, order, zone);
 
-	tail = virt_to_page(p);
-	zone->vmemmap_tails[idx] = tail;
+	page = virt_to_page(addr);
+	if (cmpxchg(&zone->vmemmap_tails[idx], NULL, page) != NULL) {
+		VM_BUG_ON(!slab_is_available());
+		__free_page(page);
+		page = READ_ONCE(zone->vmemmap_tails[idx]);
+	}
 
-	return tail;
+	return page;
 }
 
 void __weak __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
-- 
2.20.1



* [PATCH 33/49] mm: introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (31 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 32/49] mm/sparse-vmemmap: consolidate shared tail page allocation Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 34/49] mm/sparse-vmemmap: switch DAX to use generic vmemmap optimization Muchun Song
                   ` (16 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Previously, the vmemmap optimization logic in mm/sparse-vmemmap.c was
closely tied to HugeTLB via CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP.
With recent refactoring (e.g., introducing compound page order to struct
mem_section), the core vmemmap optimization machinery has become more
generic and can be utilized by other subsystems like DAX.

To reflect this generalization and decouple the core optimization logic
from HugeTLB-specific configurations, this patch introduces a new common
Kconfig option: CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION.

Both HugeTLB and DAX now select this generic option, ensuring that the
shared optimization infrastructure is enabled whenever either subsystem
requires it.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/Kconfig                 |  1 +
 include/linux/mmzone.h     | 33 ++++++++++++++++++---------------
 include/linux/page-flags.h |  5 +----
 mm/Kconfig                 |  5 +++++
 4 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index e70aa5f0429a..9b56a90e13db 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -278,6 +278,7 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	def_bool HUGETLB_PAGE
 	depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	depends on SPARSEMEM_VMEMMAP
+	select SPARSEMEM_VMEMMAP_OPTIMIZATION
 
 config HUGETLB_PMD_PAGE_TABLE_SHARING
 	def_bool HUGETLB_PAGE
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75425407e0c4..6edcb0cc46c4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -102,9 +102,9 @@
  *
  * HVO which is only active if the size of struct page is a power of 2.
  */
-#define MAX_FOLIO_VMEMMAP_ALIGN \
-	(IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) && \
-	 is_power_of_2(sizeof(struct page)) ? \
+#define MAX_FOLIO_VMEMMAP_ALIGN					\
+	(IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION) &&	\
+	 is_power_of_2(sizeof(struct page)) ?			\
 	 MAX_FOLIO_NR_PAGES * sizeof(struct page) : 0)
 
 /* The number of vmemmap pages required by a vmemmap-optimized folio. */
@@ -115,7 +115,8 @@
 
 #define __NR_OPTIMIZABLE_FOLIO_SIZES		(MAX_FOLIO_ORDER - OPTIMIZABLE_FOLIO_MIN_ORDER + 1)
 #define NR_OPTIMIZABLE_FOLIO_SIZES		\
-	(__NR_OPTIMIZABLE_FOLIO_SIZES > 0 ? __NR_OPTIMIZABLE_FOLIO_SIZES : 0)
+	((__NR_OPTIMIZABLE_FOLIO_SIZES > 0 &&	\
+	  IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION)) ? __NR_OPTIMIZABLE_FOLIO_SIZES : 0)
 
 enum migratetype {
 	MIGRATE_UNMOVABLE,
@@ -2014,7 +2015,7 @@ struct mem_section {
 	 */
 	struct page_ext *page_ext;
 #endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
 	/*
 	 * The order of compound pages in this section. Typically, the section
 	 * holds compound pages of this order; a larger compound page will span
@@ -2194,7 +2195,19 @@ static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long
 	*pfn = (*pfn & PAGE_SECTION_MASK) + (bit * PAGES_PER_SUBSECTION);
 	return true;
 }
+#else
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+	return 1;
+}
+
+static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn)
+{
+	return true;
+}
+#endif
 
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
 static inline void section_set_order(struct mem_section *section, unsigned int order)
 {
 	VM_BUG_ON(section->order && order && section->order != order);
@@ -2206,16 +2219,6 @@ static inline unsigned int section_order(const struct mem_section *section)
 	return section->order;
 }
 #else
-static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
-{
-	return 1;
-}
-
-static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn)
-{
-	return true;
-}
-
 static inline void section_set_order(struct mem_section *section, unsigned int order)
 {
 }
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0e03d816e8b9..12665b34586c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -208,14 +208,11 @@ enum pageflags {
 static __always_inline bool compound_info_has_mask(void)
 {
 	/*
-	 * Limit mask usage to HugeTLB vmemmap optimization (HVO) where it
-	 * makes a difference.
-	 *
 	 * The approach with mask would work in the wider set of conditions,
 	 * but it requires validating that struct pages are naturally aligned
 	 * for all orders up to the MAX_FOLIO_ORDER, which can be tricky.
 	 */
-	if (!IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP))
+	if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION))
 		return false;
 
 	return is_power_of_2(sizeof(struct page));
diff --git a/mm/Kconfig b/mm/Kconfig
index 3cce862088f1..e81aa77182b2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,12 +410,17 @@ config SPARSEMEM_VMEMMAP
 	  pfn_to_page and page_to_pfn operations.  This is the most
 	  efficient option when sufficient kernel resources are available.
 
+config SPARSEMEM_VMEMMAP_OPTIMIZATION
+	bool
+	depends on SPARSEMEM_VMEMMAP
+
 #
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB/dev_dax vmemmap optimization.
 #
 config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 	bool
+	select SPARSEMEM_VMEMMAP_OPTIMIZATION if SPARSEMEM_VMEMMAP
 
 config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	bool
-- 
2.20.1



* [PATCH 34/49] mm/sparse-vmemmap: switch DAX to use generic vmemmap optimization
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (32 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 33/49] mm: introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 35/49] mm/sparse-vmemmap: introduce section zone to struct mem_section Muchun Song
                   ` (15 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Recent refactoring introduced common vmemmap optimization logic via
CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION. While HugeTLB already uses it,
DAX requires slightly different handling because it needs to preserve
2 vmemmap pages instead of the 1 page HugeTLB preserves.

This patch updates DAX vmemmap optimization to manually allocate the
second vmemmap page, and integrates DAX memory setup to correctly set
the compound order and allocate/reuse the shared vmemmap tail page.

Note that manually allocating the vmemmap page is a temporary solution
and will be unified with the logic that HugeTLB relies on in the future.
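
The memory arithmetic behind the 1-page vs. 2-page difference is easy to
sketch. Assuming 4 KiB base pages and a 64-byte struct page (common
x86-64 values, but assumptions here), an order-9 (2 MiB) compound page
needs 8 vmemmap pages; an optimization that preserves 1 head page saves
7 of them, while the DAX variant that preserves 2 saves 6:

```c
#include <assert.h>

#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL	/* assumed x86-64 size */

/* vmemmap pages needed to describe one compound page of @order. */
static unsigned long vmemmap_pages(unsigned int order)
{
	return ((1UL << order) * STRUCT_PAGE_SIZE) / PAGE_SIZE;
}

/*
 * Pages saved when optimization preserves @preserved head vmemmap
 * pages and maps every remaining tail page to one shared frame.
 */
static unsigned long vmemmap_saved(unsigned int order,
				   unsigned long preserved)
{
	return vmemmap_pages(order) - preserved;
}
```

Under the same assumptions, an order-18 (1 GiB) page needs 4096 vmemmap
pages, so the extra page DAX preserves is negligible at large orders.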

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c |  5 +-
 mm/memory_hotplug.c                      |  5 +-
 mm/mm_init.c                             |  8 ++-
 mm/sparse-vmemmap.c                      | 82 ++++++++++++++----------
 4 files changed, 58 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index dfa2f7dc7e15..ad44883b1030 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1124,9 +1124,10 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	unsigned long pfn = page_to_pfn((struct page *)start);
 
-	if (vmemmap_can_optimize(altmap, pgmap))
-		return vmemmap_populate_compound_pages(page_to_pfn((struct page *)start), start, end, node, pgmap);
+	if (vmemmap_can_optimize(altmap, pgmap) && section_vmemmap_optimizable(__pfn_to_section(pfn)))
+		return vmemmap_populate_compound_pages(pfn, start, end, node, pgmap);
 	/*
 	 * If altmap is present, Make sure we align the start vmemmap addr
 	 * to PAGE_SIZE so that we calculate the correct start_pfn in
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 05f5df12d843..28306196c0fe 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -551,8 +551,9 @@ void remove_pfn_range_from_zone(struct zone *zone,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages =
 			min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
-		page_init_poison(pfn_to_page(pfn),
-				 sizeof(struct page) * cur_nr_pages);
+		if (!section_vmemmap_optimizable(__pfn_to_section(pfn)))
+			page_init_poison(pfn_to_page(pfn),
+					 sizeof(struct page) * cur_nr_pages);
 	}
 
 	/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index e47d08b63154..636a0f9644f6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1069,9 +1069,10 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
  * of an altmap. See vmemmap_populate_compound_pages().
  */
 static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
-					      struct dev_pagemap *pgmap)
+					      struct dev_pagemap *pgmap,
+					      const struct mem_section *ms)
 {
-	if (!vmemmap_can_optimize(altmap, pgmap))
+	if (!section_vmemmap_optimizable(ms))
 		return pgmap_vmemmap_nr(pgmap);
 
 	return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
@@ -1140,7 +1141,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     compound_nr_pages(altmap, pgmap));
+				     compound_nr_pages(altmap, pgmap,
+						       __pfn_to_section(pfn)));
 	}
 
 	/*
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 309d935fb05e..6f959a999d5b 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -353,8 +353,12 @@ struct page *vmemmap_shared_tail_page(unsigned int order, struct zone *zone)
 	if (!addr)
 		return NULL;
 
-	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
-		init_compound_tail((struct page *)addr + i, NULL, order, zone);
+	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++) {
+		page = (struct page *)addr + i;
+		if (zone_is_zone_device(zone))
+			__SetPageReserved(page);
+		init_compound_tail(page, NULL, order, zone);
+	}
 
 	page = virt_to_page(addr);
 	if (cmpxchg(&zone->vmemmap_tails[idx], NULL, page) != NULL) {
@@ -458,23 +462,6 @@ static bool __meminit reuse_compound_section(unsigned long start_pfn,
 	return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
 }
 
-static pte_t * __meminit compound_section_tail_page(unsigned long addr)
-{
-	pte_t *pte;
-
-	addr -= PAGE_SIZE;
-
-	/*
-	 * Assuming sections are populated sequentially, the previous section's
-	 * page data can be reused.
-	 */
-	pte = pte_offset_kernel(pmd_off_k(addr), addr);
-	if (!pte)
-		return NULL;
-
-	return pte;
-}
-
 static int __meminit vmemmap_populate_compound_pages(unsigned long start,
 						     unsigned long end, int node,
 						     struct dev_pagemap *pgmap)
@@ -483,42 +470,62 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start,
 	pte_t *pte;
 	int rc;
 	unsigned long start_pfn = page_to_pfn((struct page *)start);
+	const struct mem_section *ms = __pfn_to_section(start_pfn);
+	struct page *tail = NULL;
 
-	if (reuse_compound_section(start_pfn, pgmap)) {
-		pte = compound_section_tail_page(start);
-		if (!pte)
-			return -ENOMEM;
+	/* This may occur in sub-section scenarios. */
+	if (!section_vmemmap_optimizable(ms))
+		return vmemmap_populate_range(start, end, node, NULL, -1);
 
-		/*
-		 * Reuse the page that was populated in the prior iteration
-		 * with just tail struct pages.
-		 */
+#ifdef CONFIG_ZONE_DEVICE
+	tail = vmemmap_shared_tail_page(section_order(ms),
+					&NODE_DATA(node)->node_zones[ZONE_DEVICE]);
+#endif
+	if (!tail)
+		return -ENOMEM;
+
+	if (reuse_compound_section(start_pfn, pgmap))
 		return vmemmap_populate_range(start, end, node, NULL,
-					      pte_pfn(ptep_get(pte)));
-	}
+					      page_to_pfn(tail));
 
 	size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
 	for (addr = start; addr < end; addr += size) {
 		unsigned long next, last = addr + size;
+		void *p;
 
 		/* Populate the head page vmemmap page */
 		pte = vmemmap_populate_address(addr, node, NULL, -1);
 		if (!pte)
 			return -ENOMEM;
 
+		/*
+		 * Allocate manually since vmemmap_populate_address() will assume DAX
+		 * only needs 1 vmemmap page to be reserved, however DAX now needs 2
+		 * vmemmap pages. This is a temporary solution and will be unified
+		 * with HugeTLB in the future.
+		 */
+		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL);
+		if (!p)
+			return -ENOMEM;
+
 		/* Populate the tail pages vmemmap page */
 		next = addr + PAGE_SIZE;
-		pte = vmemmap_populate_address(next, node, NULL, -1);
+		pte = vmemmap_populate_address(next, node, NULL, PHYS_PFN(__pa(p)));
+		/*
+		 * get_page() is called above. Since we are not actually
+		 * reusing it, to avoid a memory leak, we call put_page() here.
+		 */
+		put_page(virt_to_page(p));
 		if (!pte)
 			return -ENOMEM;
 
 		/*
-		 * Reuse the previous page for the rest of tail pages
+		 * Reuse the shared vmemmap page for the rest of tail pages
 		 * See layout diagram in Documentation/mm/vmemmap_dedup.rst
 		 */
 		next += PAGE_SIZE;
 		rc = vmemmap_populate_range(next, last, node, NULL,
-					    pte_pfn(ptep_get(pte)));
+					    page_to_pfn(tail));
 		if (rc)
 			return -ENOMEM;
 	}
@@ -744,8 +751,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 		free_map_bootmem(memmap);
 	}
 
-	if (empty)
+	if (empty) {
 		ms->section_mem_map = (unsigned long)NULL;
+		section_set_order(ms, 0);
+	}
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
@@ -824,6 +833,9 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
+	ms = __nr_to_section(section_nr);
+	if (vmemmap_can_optimize(altmap, pgmap) && nr_pages == PAGES_PER_SECTION)
+		section_set_order(ms, pgmap->vmemmap_shift);
 	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
@@ -832,9 +844,9 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	 * Poison uninitialized struct pages in order to catch invalid flags
 	 * combinations.
 	 */
-	page_init_poison(memmap, sizeof(struct page) * nr_pages);
+	if (!section_vmemmap_optimizable(ms))
+		page_init_poison(memmap, sizeof(struct page) * nr_pages);
 
-	ms = __nr_to_section(section_nr);
 	__section_mark_present(ms, section_nr);
 
 	/* Align memmap to section boundary in the subsection case */
-- 
2.20.1



* [PATCH 35/49] mm/sparse-vmemmap: introduce section zone to struct mem_section
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (33 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 34/49] mm/sparse-vmemmap: switch DAX to use generic vmemmap optimization Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 36/49] powerpc/mm: use generic vmemmap_shared_tail_page() in compound vmemmap Muchun Song
                   ` (14 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Currently, HugeTLB obtains zone information for vmemmap optimization
through early pfn_to_zone(). However, ZONE_DEVICE cannot utilize this
approach because its zone information is updated after vmemmap population.

To pave the way for unifying DAX and HugeTLB vmemmap optimization,
this patch introduces the 'zone' member to struct mem_section. This
allows both DAX and HugeTLB to reliably obtain zone information
directly from the memory section.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mmzone.h | 31 +++++++++++++++++++++++++++----
 mm/hugetlb.c           |  2 +-
 mm/hugetlb_vmemmap.c   |  4 +++-
 mm/sparse-vmemmap.c    | 19 +++++++++++++------
 4 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6edcb0cc46c4..846a7ee1334f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2022,6 +2022,7 @@ struct mem_section {
 	 * multiple sections.
 	 */
 	unsigned int order;
+	enum zone_type zone;
 #endif
 };
 
@@ -2214,32 +2215,54 @@ static inline void section_set_order(struct mem_section *section, unsigned int o
 	section->order = order;
 }
 
+static inline void section_set_zone(struct mem_section *section, enum zone_type zone)
+{
+	section->zone = zone;
+}
+
 static inline unsigned int section_order(const struct mem_section *section)
 {
 	return section->order;
 }
+
+static inline enum zone_type section_zone(const struct mem_section *section)
+{
+	return section->zone;
+}
 #else
 static inline void section_set_order(struct mem_section *section, unsigned int order)
 {
 }
 
+static inline void section_set_zone(struct mem_section *section, enum zone_type zone)
+{
+}
+
 static inline unsigned int section_order(const struct mem_section *section)
 {
 	return 0;
 }
+
+static inline enum zone_type section_zone(const struct mem_section *section)
+{
+	return 0;
+}
 #endif
 
-static inline void section_set_order_pfn_range(unsigned long pfn,
-					       unsigned long nr_pages,
-					       unsigned int order)
+static inline void section_set_compound_range(unsigned long pfn,
+					      unsigned long nr_pages,
+					      unsigned int order,
+					      enum zone_type zone)
 {
 	unsigned long section_nr = pfn_to_section_nr(pfn);
 
 	if (!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION))
 		return;
 
-	for (int i = 0; i < nr_pages / PAGES_PER_SECTION; i++)
+	for (int i = 0; i < nr_pages / PAGES_PER_SECTION; i++) {
 		section_set_order(__nr_to_section(section_nr + i), order);
+		section_set_zone(__nr_to_section(section_nr + i), zone);
+	}
 }
 
 static inline bool section_vmemmap_optimizable(const struct mem_section *section)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59728e942384..ce5a58aab5c3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3281,7 +3281,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
 
 		if (section_vmemmap_optimizable(__pfn_to_section(folio_pfn(folio))))
 			folio_set_hugetlb_vmemmap_optimized(folio);
-		section_set_order_pfn_range(folio_pfn(folio), folio_nr_pages(folio), 0);
+		section_set_compound_range(folio_pfn(folio), folio_nr_pages(folio), 0, 0);
 
 		if (hugetlb_bootmem_page_earlycma(m))
 			folio_set_hugetlb_cma(folio);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a7ea98fcc18e..92c95ebdbb9a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -681,11 +681,13 @@ void __init hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
 {
 	struct hstate *h = m->hstate;
 	unsigned long pfn = PHYS_PFN(virt_to_phys(m));
+	int nid = early_pfn_to_nid(PHYS_PFN(__pa(m)));
 
 	if (!READ_ONCE(vmemmap_optimize_enabled))
 		return;
 
-	section_set_order_pfn_range(pfn, pages_per_huge_page(h), huge_page_order(h));
+	section_set_compound_range(pfn, pages_per_huge_page(h), huge_page_order(h),
+				   zone_idx(pfn_to_zone(pfn, nid)));
 }
 
 static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 6f959a999d5b..1867b5dcc73c 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -143,6 +143,11 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 			start, end - 1);
 }
 
+static inline struct zone *section_to_zone(const struct mem_section *ms, int nid)
+{
+	return &NODE_DATA(nid)->node_zones[section_zone(ms)];
+}
+
 static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 					      struct vmem_altmap *altmap,
 					      unsigned long ptpfn)
@@ -159,7 +164,7 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
 			const struct mem_section *ms = __pfn_to_section(pfn);
 
 			page = vmemmap_shared_tail_page(section_order(ms),
-							pfn_to_zone(pfn, node));
+							section_to_zone(ms, node));
 			if (!page)
 				return NULL;
 			ptpfn = page_to_pfn(page);
@@ -471,16 +476,14 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start,
 	int rc;
 	unsigned long start_pfn = page_to_pfn((struct page *)start);
 	const struct mem_section *ms = __pfn_to_section(start_pfn);
-	struct page *tail = NULL;
+	struct page *tail;
 
 	/* This may occur in sub-section scenarios. */
 	if (!section_vmemmap_optimizable(ms))
 		return vmemmap_populate_range(start, end, node, NULL, -1);
 
-#ifdef CONFIG_ZONE_DEVICE
 	tail = vmemmap_shared_tail_page(section_order(ms),
-					&NODE_DATA(node)->node_zones[ZONE_DEVICE]);
-#endif
+					section_to_zone(ms, node));
 	if (!tail)
 		return -ENOMEM;
 
@@ -834,8 +837,12 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 		return ret;
 
 	ms = __nr_to_section(section_nr);
-	if (vmemmap_can_optimize(altmap, pgmap) && nr_pages == PAGES_PER_SECTION)
+	if (vmemmap_can_optimize(altmap, pgmap) && nr_pages == PAGES_PER_SECTION) {
 		section_set_order(ms, pgmap->vmemmap_shift);
+#ifdef CONFIG_ZONE_DEVICE
+		section_set_zone(ms, ZONE_DEVICE);
+#endif
+	}
 	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
-- 
2.20.1



* [PATCH 36/49] powerpc/mm: use generic vmemmap_shared_tail_page() in compound vmemmap
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (34 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 35/49] mm/sparse-vmemmap: introduce section zone to struct mem_section Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 37/49] mm/sparse-vmemmap: unify DAX and HugeTLB vmemmap optimization Muchun Song
                   ` (13 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The ultimate goal is to unify the vmemmap optimization logic for both
DAX and HugeTLB. To achieve this, all platforms need to align with the
standard HugeTLB approach of using vmemmap_shared_tail_page() for tail
page mappings.

This patch updates PowerPC to utilize vmemmap_shared_tail_page() to
retrieve the pre-allocated and initialized shared tail page, instead of
dynamically looking up and constructing the tail page mapping via
vmemmap_compound_tail_page().

As a byproduct, this alignment greatly simplifies the compound vmemmap
mapping logic in radix_pgtable.c by removing
vmemmap_compound_tail_page() entirely.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 81 +++---------------------
 1 file changed, 9 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index ad44883b1030..5ce3deb464d5 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1256,59 +1256,6 @@ static pte_t * __meminit radix__vmemmap_populate_address(unsigned long addr, int
 	return pte;
 }
 
-static pte_t * __meminit vmemmap_compound_tail_page(unsigned long addr,
-						    unsigned long pfn_offset, int node)
-{
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-	unsigned long map_addr;
-
-	/* the second vmemmap page which we use for duplication */
-	map_addr = addr - pfn_offset * sizeof(struct page) + PAGE_SIZE;
-	pgd = pgd_offset_k(map_addr);
-	p4d = p4d_offset(pgd, map_addr);
-	pud = vmemmap_pud_alloc(p4d, node, map_addr);
-	if (!pud)
-		return NULL;
-	pmd = vmemmap_pmd_alloc(pud, node, map_addr);
-	if (!pmd)
-		return NULL;
-	if (pmd_leaf(*pmd))
-		/*
-		 * The second page is mapped as a hugepage due to a nearby request.
-		 * Force our mapping to page size without deduplication
-		 */
-		return NULL;
-	pte = vmemmap_pte_alloc(pmd, node, map_addr);
-	if (!pte)
-		return NULL;
-	/*
-	 * Check if there exist a mapping to the left
-	 */
-	if (pte_none(*pte)) {
-		/*
-		 * Populate the head page vmemmap page.
-		 * It can fall in different pmd, hence
-		 * vmemmap_populate_address()
-		 */
-		pte = radix__vmemmap_populate_address(map_addr - PAGE_SIZE, node, NULL, NULL);
-		if (!pte)
-			return NULL;
-		/*
-		 * Populate the tail pages vmemmap page
-		 */
-		pte = radix__vmemmap_pte_populate(pmd, map_addr, node, NULL, NULL);
-		if (!pte)
-			return NULL;
-		vmemmap_verify(pte, node, map_addr, map_addr + PAGE_SIZE);
-		return pte;
-	}
-	return pte;
-}
-
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
 						     unsigned long end, int node,
@@ -1327,6 +1274,13 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	const struct mem_section *ms = __pfn_to_section(start_pfn);
+	struct page *tail_page;
+
+	tail_page = vmemmap_shared_tail_page(section_order(ms),
+					     &NODE_DATA(node)->node_zones[section_zone(ms)]);
+	if (!tail_page)
+		return -ENOMEM;
 
 	for (addr = start; addr < end; addr = next) {
 
@@ -1358,9 +1312,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 			next = addr + PAGE_SIZE;
 			continue;
 		} else {
-			unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
+			unsigned long nr_pages = 1L << section_order(ms);
 			unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages);
-			pte_t *tail_page_pte;
 
 			/*
 			 * if the address is aligned to huge page size it is the
@@ -1386,24 +1339,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 				next = addr + 2 * PAGE_SIZE;
 				continue;
 			}
-			/*
-			 * get the 2nd mapping details
-			 * Also create it if that doesn't exist
-			 */
-			tail_page_pte = vmemmap_compound_tail_page(addr, pfn_offset, node);
-			if (!tail_page_pte) {
-
-				pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL);
-				if (!pte)
-					return -ENOMEM;
-				vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
-
-				addr_pfn += 1;
-				next = addr + PAGE_SIZE;
-				continue;
-			}
 
-			pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, pte_page(*tail_page_pte));
+			pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, tail_page);
 			if (!pte)
 				return -ENOMEM;
 			vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
-- 
2.20.1



* [PATCH 37/49] mm/sparse-vmemmap: unify DAX and HugeTLB vmemmap optimization
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (35 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 36/49] powerpc/mm: use generic vmemmap_shared_tail_page() in compound vmemmap Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 38/49] mm/sparse-vmemmap: remap the shared tail pages as read-only Muchun Song
                   ` (12 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The ultimate goal of the recent refactoring series is to unify the vmemmap
optimization logic for both DAX and HugeTLB under a common framework
(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION).

A key breakthrough in this unification is that DAX now requires only one
vmemmap page (the head page) to be preserved, aligning its requirements
exactly with HugeTLB's. Previously, the DAX optimization relied on a
dedicated upper-level function, vmemmap_populate_compound_pages(), which
manually allocated both the head page and the first tail page before
reusing the shared tail page for the rest.

Because DAX and HugeTLB are now perfectly aligned in their optimization
requirements (one reserved page plus reused shared tail pages), this
patch eliminates the dedicated compound page mapping loop entirely.
Instead, it pushes the optimization decision down to the lowest level,
vmemmap_pte_populate(). Now, all mapping requests flow through the
standard vmemmap_populate_basepages().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c |  13 +-
 include/linux/mm.h                       |   2 +-
 mm/mm_init.c                             |   2 +-
 mm/sparse-vmemmap.c                      | 185 +++++------------------
 4 files changed, 40 insertions(+), 162 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 5ce3deb464d5..714d5cdc10ec 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1326,17 +1326,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 					return -ENOMEM;
 				vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 
-				/*
-				 * Populate the tail pages vmemmap page
-				 * It can fall in different pmd, hence
-				 * vmemmap_populate_address()
-				 */
-				pte = radix__vmemmap_populate_address(addr + PAGE_SIZE, node, NULL, NULL);
-				if (!pte)
-					return -ENOMEM;
-
-				addr_pfn += 2;
-				next = addr + 2 * PAGE_SIZE;
+				addr_pfn += 1;
+				next = addr + PAGE_SIZE;
 				continue;
 			}
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15841829b7eb..bceef0dc578b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4912,7 +4912,7 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif
 
-#define VMEMMAP_RESERVE_NR	2
+#define VMEMMAP_RESERVE_NR	OPTIMIZED_FOLIO_VMEMMAP_PAGES
 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
 					  struct dev_pagemap *pgmap)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 636a0f9644f6..6b23b5f02544 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1066,7 +1066,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
  * initialize is a lot smaller that the total amount of struct pages being
  * mapped. This is a paired / mild layering violation with explicit knowledge
  * of how the sparse_vmemmap internals handle compound pages in the lack
- * of an altmap. See vmemmap_populate_compound_pages().
+ * of an altmap.
  */
 static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
 					      struct dev_pagemap *pgmap,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 1867b5dcc73c..fd7b0e1e5aba 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -152,46 +152,40 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
 					      struct vmem_altmap *altmap,
 					      unsigned long ptpfn)
 {
-	pte_t *pte = pte_offset_kernel(pmd, addr);
-
-	if (pte_none(ptep_get(pte))) {
-		pte_t entry;
-
-		if (vmemmap_page_optimizable((struct page *)addr) &&
-		    ptpfn == (unsigned long)-1) {
-			struct page *page;
-			unsigned long pfn = page_to_pfn((struct page *)addr);
-			const struct mem_section *ms = __pfn_to_section(pfn);
-
-			page = vmemmap_shared_tail_page(section_order(ms),
-							section_to_zone(ms, node));
-			if (!page)
-				return NULL;
-			ptpfn = page_to_pfn(page);
-		}
+	pte_t entry, *pte = pte_offset_kernel(pmd, addr);
 
-		if (ptpfn == (unsigned long)-1) {
-			void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-
-			if (!p)
-				return NULL;
-			ptpfn = PHYS_PFN(__pa(p));
-		} else {
-			/*
-			 * When a PTE/PMD entry is freed from the init_mm
-			 * there's a free_pages() call to this page allocated
-			 * above. Thus this get_page() is paired with the
-			 * put_page_testzero() on the freeing path.
-			 * This can only called by certain ZONE_DEVICE path,
-			 * and through vmemmap_populate_compound_pages() when
-			 * slab is available.
-			 */
-			if (slab_is_available())
-				get_page(pfn_to_page(ptpfn));
-		}
-		entry = pfn_pte(ptpfn, PAGE_KERNEL);
-		set_pte_at(&init_mm, addr, pte, entry);
+	if (!pte_none(ptep_get(pte)))
+		return pte;
+
+	/* See layout diagram in Documentation/mm/vmemmap_dedup.rst. */
+	if (vmemmap_page_optimizable((struct page *)addr)) {
+		struct page *page;
+		unsigned long pfn = page_to_pfn((struct page *)addr);
+		const struct mem_section *ms = __pfn_to_section(pfn);
+
+		page = vmemmap_shared_tail_page(section_order(ms),
+						section_to_zone(ms, node));
+		if (!page)
+			return NULL;
+
+		/*
+		 * When a PTE entry is freed, a free_pages() call occurs. This
+		 * get_page() pairs with put_page_testzero() on the freeing
+		 * path. This can only occur when slab is available.
+		 */
+		if (slab_is_available())
+			get_page(page);
+		ptpfn = page_to_pfn(page);
+	} else {
+		void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+
+		if (!p)
+			return NULL;
+		ptpfn = PHYS_PFN(__pa(p));
 	}
+	entry = pfn_pte(ptpfn, PAGE_KERNEL);
+	set_pte_at(&init_mm, addr, pte, entry);
+
 	return pte;
 }
 
@@ -287,17 +281,15 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
 	return pte;
 }
 
-static int __meminit vmemmap_populate_range(unsigned long start,
-					    unsigned long end, int node,
-					    struct vmem_altmap *altmap,
-					    unsigned long ptpfn)
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap,
+					 struct dev_pagemap *pgmap)
 {
 	unsigned long addr = start;
 	pte_t *pte;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		pte = vmemmap_populate_address(addr, node, altmap,
-					       ptpfn);
+		pte = vmemmap_populate_address(addr, node, altmap, -1);
 		if (!pte)
 			return -ENOMEM;
 	}
@@ -305,19 +297,6 @@ static int __meminit vmemmap_populate_range(unsigned long start,
 	return 0;
 }
 
-static int __meminit vmemmap_populate_compound_pages(unsigned long start,
-						     unsigned long end, int node,
-						     struct dev_pagemap *pgmap);
-
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap,
-					 struct dev_pagemap *pgmap)
-{
-	if (vmemmap_can_optimize(altmap, pgmap))
-		return vmemmap_populate_compound_pages(start, end, node, pgmap);
-	return vmemmap_populate_range(start, end, node, altmap, -1);
-}
-
 /*
  * Write protect the mirrored tail page structs for HVO. This will be
  * called from the hugetlb code when gathering and initializing the
@@ -397,9 +376,6 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 	pud_t *pud;
 	pmd_t *pmd;
 
-	if (vmemmap_can_optimize(altmap, pgmap))
-		return vmemmap_populate_compound_pages(start, end, node, pgmap);
-
 	for (addr = start; addr < end; addr = next) {
 		unsigned long pfn = page_to_pfn((struct page *)addr);
 		const struct mem_section *ms = __pfn_to_section(pfn);
@@ -447,95 +423,6 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 	return 0;
 }
 
-/*
- * For compound pages bigger than section size (e.g. x86 1G compound
- * pages with 2M subsection size) fill the rest of sections as tail
- * pages.
- *
- * Note that memremap_pages() resets @nr_range value and will increment
- * it after each range successful onlining. Thus the value or @nr_range
- * at section memmap populate corresponds to the in-progress range
- * being onlined here.
- */
-static bool __meminit reuse_compound_section(unsigned long start_pfn,
-					     struct dev_pagemap *pgmap)
-{
-	unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
-	unsigned long offset = start_pfn -
-		PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
-
-	return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
-}
-
-static int __meminit vmemmap_populate_compound_pages(unsigned long start,
-						     unsigned long end, int node,
-						     struct dev_pagemap *pgmap)
-{
-	unsigned long size, addr;
-	pte_t *pte;
-	int rc;
-	unsigned long start_pfn = page_to_pfn((struct page *)start);
-	const struct mem_section *ms = __pfn_to_section(start_pfn);
-	struct page *tail;
-
-	/* This may occur in sub-section scenarios. */
-	if (!section_vmemmap_optimizable(ms))
-		return vmemmap_populate_range(start, end, node, NULL, -1);
-
-	tail = vmemmap_shared_tail_page(section_order(ms),
-					section_to_zone(ms, node));
-	if (!tail)
-		return -ENOMEM;
-
-	if (reuse_compound_section(start_pfn, pgmap))
-		return vmemmap_populate_range(start, end, node, NULL,
-					      page_to_pfn(tail));
-
-	size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
-	for (addr = start; addr < end; addr += size) {
-		unsigned long next, last = addr + size;
-		void *p;
-
-		/* Populate the head page vmemmap page */
-		pte = vmemmap_populate_address(addr, node, NULL, -1);
-		if (!pte)
-			return -ENOMEM;
-
-		/*
-		 * Allocate manually since vmemmap_populate_address() will assume DAX
-		 * only needs 1 vmemmap page to be reserved, however DAX now needs 2
-		 * vmemmap pages. This is a temporary solution and will be unified
-		 * with HugeTLB in the future.
-		 */
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL);
-		if (!p)
-			return -ENOMEM;
-
-		/* Populate the tail pages vmemmap page */
-		next = addr + PAGE_SIZE;
-		pte = vmemmap_populate_address(next, node, NULL, PHYS_PFN(__pa(p)));
-		/*
-		 * get_page() is called above. Since we are not actually
-		 * reusing it, to avoid a memory leak, we call put_page() here.
-		 */
-		put_page(virt_to_page(p));
-		if (!pte)
-			return -ENOMEM;
-
-		/*
-		 * Reuse the shared vmemmap page for the rest of tail pages
-		 * See layout diagram in Documentation/mm/vmemmap_dedup.rst
-		 */
-		next += PAGE_SIZE;
-		rc = vmemmap_populate_range(next, last, node, NULL,
-					    page_to_pfn(tail));
-		if (rc)
-			return -ENOMEM;
-	}
-
-	return 0;
-}
-
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 38/49] mm/sparse-vmemmap: remap the shared tail pages as read-only
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (36 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 37/49] mm/sparse-vmemmap: unify DAX and HugeTLB vmemmap optimization Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 39/49] mm/sparse-vmemmap: remove unused ptpfn argument Muchun Song
                   ` (11 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

HugeTLB maps the shared HVO tail pages read-only to prevent illegal
write operations, whereas DAX currently does not, leaving a potential
security gap.

Now that we are unifying the HVO logic for HugeTLB and DAX, we can
remap the shared tail pages as read-only directly in
vmemmap_pte_populate(). This ensures that both HugeTLB and DAX
benefit from the read-only protection of vmemmap tail pages right
from the point of mapping establishment.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/sparse-vmemmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index fd7b0e1e5aba..c70275717054 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -176,14 +176,17 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
 		if (slab_is_available())
 			get_page(page);
 		ptpfn = page_to_pfn(page);
+
+		/* Remap shared tail page read-only to catch illegal writes. */
+		entry = pfn_pte(ptpfn, PAGE_KERNEL_RO);
 	} else {
 		void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
 
 		if (!p)
 			return NULL;
 		ptpfn = PHYS_PFN(__pa(p));
+		entry = pfn_pte(ptpfn, PAGE_KERNEL);
 	}
-	entry = pfn_pte(ptpfn, PAGE_KERNEL);
 	set_pte_at(&init_mm, addr, pte, entry);
 
 	return pte;
-- 
2.20.1



* [PATCH 39/49] mm/sparse-vmemmap: remove unused ptpfn argument
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (37 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 38/49] mm/sparse-vmemmap: remap the shared tail pages as read-only Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 40/49] mm/hugetlb_vmemmap: remove vmemmap_wrprotect_hvo() and related code Muchun Song
                   ` (10 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The ptpfn argument of vmemmap_pte_populate() is effectively dead: it is
unconditionally reassigned before ever being read. Remove it, along
with the corresponding argument of vmemmap_populate_address().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/sparse-vmemmap.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c70275717054..36e5bcb5ba9b 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -149,8 +149,7 @@ static inline struct zone *section_to_zone(const struct mem_section *ms, int nid
 }
 
 static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      unsigned long ptpfn)
+					      struct vmem_altmap *altmap)
 {
 	pte_t entry, *pte = pte_offset_kernel(pmd, addr);
 
@@ -175,17 +174,15 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
 		 */
 		if (slab_is_available())
 			get_page(page);
-		ptpfn = page_to_pfn(page);
 
 		/* Remap shared tail page read-only to catch illegal writes. */
-		entry = pfn_pte(ptpfn, PAGE_KERNEL_RO);
+		entry = pfn_pte(page_to_pfn(page), PAGE_KERNEL_RO);
 	} else {
 		void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
 
 		if (!p)
 			return NULL;
-		ptpfn = PHYS_PFN(__pa(p));
-		entry = pfn_pte(ptpfn, PAGE_KERNEL);
+		entry = pfn_pte(PHYS_PFN(__pa(p)), PAGE_KERNEL);
 	}
 	set_pte_at(&init_mm, addr, pte, entry);
 
@@ -255,8 +252,7 @@ static pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      unsigned long ptpfn)
+						  struct vmem_altmap *altmap)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -276,7 +272,7 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return NULL;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap, ptpfn);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
 	if (!pte)
 		return NULL;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
@@ -292,7 +288,7 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	pte_t *pte;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		pte = vmemmap_populate_address(addr, node, altmap, -1);
+		pte = vmemmap_populate_address(addr, node, altmap);
 		if (!pte)
 			return -ENOMEM;
 	}
-- 
2.20.1



* [PATCH 40/49] mm/hugetlb_vmemmap: remove vmemmap_wrprotect_hvo() and related code
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (38 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 39/49] mm/sparse-vmemmap: remove unused ptpfn argument Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 41/49] mm/sparse: simplify section_vmemmap_pages() Muchun Song
                   ` (9 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Since we have already remapped the shared tail pages as read-only in
vmemmap_pte_populate() right at the point of mapping establishment,
the separate pass of read-only mapping enforcement via
vmemmap_wrprotect_hvo() for HugeTLB bootmem folios is no longer
necessary.

Remove vmemmap_wrprotect_hvo() and the associated wrapper
hugetlb_vmemmap_optimize_bootmem_folios(), simplifying the code by
directly using hugetlb_vmemmap_optimize_folios() for bootmem folios
as well.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mm.h   |  2 --
 mm/hugetlb.c         |  2 +-
 mm/hugetlb_vmemmap.c | 31 ++++---------------------------
 mm/hugetlb_vmemmap.h |  6 ------
 mm/sparse-vmemmap.c  | 23 -----------------------
 5 files changed, 5 insertions(+), 59 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bceef0dc578b..c36001c9d571 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4877,8 +4877,6 @@ int vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 			       struct dev_pagemap *pgmap);
 int vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
-void vmemmap_wrprotect_hvo(unsigned long start, unsigned long end, int node,
-			  unsigned long headsize);
 void vmemmap_populate_print_last(void);
 struct page *vmemmap_shared_tail_page(unsigned int order, struct zone *zone);
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ce5a58aab5c3..84f095a23ef2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3226,7 +3226,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	struct folio *folio, *tmp_f;
 
 	/* Send list for bulk vmemmap optimization processing */
-	hugetlb_vmemmap_optimize_bootmem_folios(h, folio_list);
+	hugetlb_vmemmap_optimize_folios(h, folio_list);
 
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
 		if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 92c95ebdbb9a..d595ef759bc2 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -589,31 +589,18 @@ static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *fol
 	return vmemmap_remap_split(vmemmap_start, vmemmap_end);
 }
 
-static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
-					      struct list_head *folio_list,
-					      bool boot)
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
 {
 	struct folio *folio;
-	int nr_to_optimize;
+	unsigned long nr_to_optimize = 0;
 	LIST_HEAD(vmemmap_pages);
 	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
 
-	nr_to_optimize = 0;
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret;
-		unsigned long spfn, epfn;
-
-		if (boot && folio_test_hugetlb_vmemmap_optimized(folio)) {
-			/*
-			 * Already optimized by pre-HVO, just map the
-			 * mirrored tail page structs RO.
-			 */
-			spfn = (unsigned long)&folio->page;
-			epfn = spfn + pages_per_huge_page(h);
-			vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
-					OPTIMIZED_FOLIO_VMEMMAP_SIZE);
+
+		if (folio_test_hugetlb_vmemmap_optimized(folio))
 			continue;
-		}
 
 		nr_to_optimize++;
 
@@ -667,16 +654,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
-void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
-{
-	__hugetlb_vmemmap_optimize_folios(h, folio_list, false);
-}
-
-void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list)
-{
-	__hugetlb_vmemmap_optimize_folios(h, folio_list, true);
-}
-
 void __init hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
 {
 	struct hstate *h = m->hstate;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index ff8e4c6e9833..0022f9c5a101 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -19,7 +19,6 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 					struct list_head *non_hvo_folios);
 void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
-void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
 void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m);
 
 static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
@@ -61,11 +60,6 @@ static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list
 {
 }
 
-static inline void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h,
-						struct list_head *folio_list)
-{
-}
-
 static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
 {
 	return 0;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 36e5bcb5ba9b..ba8c0c64f160 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -296,29 +296,6 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	return 0;
 }
 
-/*
- * Write protect the mirrored tail page structs for HVO. This will be
- * called from the hugetlb code when gathering and initializing the
- * memblock allocated gigantic pages. The write protect can't be
- * done earlier, since it can't be guaranteed that the reserved
- * page structures will not be written to during initialization,
- * even if CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
- *
- * The PTEs are known to exist, and nothing else should be touching
- * these pages. The caller is responsible for any TLB flushing.
- */
-void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
-				    int node, unsigned long headsize)
-{
-	unsigned long maddr;
-	pte_t *pte;
-
-	for (maddr = addr + headsize; maddr < end; maddr += PAGE_SIZE) {
-		pte = virt_to_kpte(maddr);
-		ptep_set_wrprotect(&init_mm, maddr, pte);
-	}
-}
-
 struct page *vmemmap_shared_tail_page(unsigned int order, struct zone *zone)
 {
 	void *addr;
-- 
2.20.1



* [PATCH 41/49] mm/sparse: simplify section_vmemmap_pages()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (39 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 40/49] mm/hugetlb_vmemmap: remove vmemmap_wrprotect_hvo() and related code Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 42/49] mm/sparse-vmemmap: introduce section_vmemmap_page_structs() Muchun Song
                   ` (8 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

After unifying DAX and HugeTLB vmemmap optimizations, we can now
simplify section_vmemmap_pages().

Previously, section_vmemmap_pages() needed to take altmap and pgmap
arguments to determine if vmemmap optimization was enabled. However,
sparse_add_section() already sets the section order using
section_set_order(ms, pgmap->vmemmap_shift) if vmemmap_can_optimize()
is true and the size is aligned to PAGES_PER_SECTION.

As a result, section_vmemmap_optimizable(ms) is sufficient to determine
if the section can be optimized, and section_order(ms) can directly
provide the order, making the altmap and pgmap arguments redundant.

Remove the unused altmap and pgmap arguments from
section_vmemmap_pages().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/internal.h       |  3 +--
 mm/sparse-vmemmap.c |  8 +++-----
 mm/sparse.c         | 18 ++++++------------
 3 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b569d8309f4d..7f0731e5c84f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -998,8 +998,7 @@ static inline void __section_mark_present(struct mem_section *ms,
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 }
 
-int section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
-			  struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
+int section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages);
 #else
 static inline void memblocks_present(void) {}
 static inline void sparse_init(void) {}
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index ba8c0c64f160..ac2efba9ef92 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -608,12 +608,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 	 * section_activate() and pfn_valid() .
 	 */
 	if (!section_is_early) {
-		memmap_pages_add(-section_vmemmap_pages(pfn, nr_pages, altmap,
-							pgmap));
+		memmap_pages_add(-section_vmemmap_pages(pfn, nr_pages));
 		depopulate_section_memmap(pfn, nr_pages, altmap);
 	} else if (memmap) {
-		memmap_pages_add(-section_vmemmap_pages(pfn, nr_pages, altmap,
-							pgmap));
+		memmap_pages_add(-section_vmemmap_pages(pfn, nr_pages));
 		free_map_bootmem(memmap);
 	}
 
@@ -658,7 +656,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 		return pfn_to_page(pfn);
 
 	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
-	memmap_pages_add(section_vmemmap_pages(pfn, nr_pages, altmap, pgmap));
+	memmap_pages_add(section_vmemmap_pages(pfn, nr_pages));
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap, pgmap);
 		return ERR_PTR(-ENOMEM);
diff --git a/mm/sparse.c b/mm/sparse.c
index 04c641b97325..163bb17bba96 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -345,28 +345,23 @@ static void __init sparse_usage_fini(void)
 	sparse_usagebuf = sparse_usagebuf_end = NULL;
 }
 
-int __meminit section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
-				    struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+int __meminit section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages)
 {
 	const struct mem_section *ms = __pfn_to_section(pfn);
-	unsigned int order = pgmap ? pgmap->vmemmap_shift : section_order(ms);
+	unsigned int order = section_order(ms);
 	unsigned long pages_per_compound = 1L << order;
-	unsigned int vmemmap_pages = OPTIMIZED_FOLIO_VMEMMAP_PAGES;
-
-	if (vmemmap_can_optimize(altmap, pgmap))
-		vmemmap_pages = VMEMMAP_RESERVE_NR;
 
 	VM_BUG_ON(!IS_ALIGNED(pfn | nr_pages, min(pages_per_compound, PAGES_PER_SECTION)));
 	VM_BUG_ON(pfn_to_section_nr(pfn) != pfn_to_section_nr(pfn + nr_pages - 1));
 
-	if (!vmemmap_can_optimize(altmap, pgmap) && !section_vmemmap_optimizable(ms))
+	if (!section_vmemmap_optimizable(ms))
 		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
 
 	if (order < PFN_SECTION_SHIFT)
-		return vmemmap_pages * nr_pages / pages_per_compound;
+		return OPTIMIZED_FOLIO_VMEMMAP_PAGES * nr_pages / pages_per_compound;
 
 	if (IS_ALIGNED(pfn, pages_per_compound))
-		return vmemmap_pages;
+		return OPTIMIZED_FOLIO_VMEMMAP_PAGES;
 
 	return 0;
 }
@@ -396,8 +391,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 						nid, NULL, NULL);
 		if (!map)
 			panic("Populate section (%ld) on node[%d] failed\n", pnum, nid);
-		memmap_boot_pages_add(section_vmemmap_pages(pfn, PAGES_PER_SECTION,
-							    NULL, NULL));
+		memmap_boot_pages_add(section_vmemmap_pages(pfn, PAGES_PER_SECTION));
 		sparse_init_early_section(nid, map, pnum, 0);
 	}
 	sparse_usage_fini();
-- 
2.20.1



* [PATCH 42/49] mm/sparse-vmemmap: introduce section_vmemmap_page_structs()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (40 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 41/49] mm/sparse: simplify section_vmemmap_pages() Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 43/49] powerpc/mm: rely on generic vmemmap_can_optimize() to simplify code Muchun Song
                   ` (7 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The function compound_nr_pages() in mm_init.c was introduced to determine
how many unique struct pages need to be initialized when vmemmap
optimization is enabled. However, it leaks sparse-vmemmap internals into
mm_init.c.

Now that DAX and HugeTLB vmemmap optimizations are unified and simplified,
we can expose a cleaner API from sparse.c to calculate the exact number
of struct page structures needed.

Introduce section_vmemmap_page_structs() which returns the number of
page structs that require initialization, rather than the number of physical
vmemmap pages to allocate. This perfectly aligns with the requirements
of memmap_init_zone_device().

As a result:
1. compound_nr_pages() is removed entirely.
2. section_vmemmap_pages() is rewritten as a static inline wrapper in
   mm/internal.h that derives the number of physical vmemmap pages from
   section_vmemmap_page_structs().
3. A restrictive VM_BUG_ON spanning sections is removed, safely allowing
   compound pages (like 1G DAX pages) to cross section boundaries during
   device memory initialization.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/internal.h |  8 +++++++-
 mm/mm_init.c  | 21 +--------------------
 mm/sparse.c   |  9 ++++-----
 3 files changed, 12 insertions(+), 26 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 7f0731e5c84f..02064f21bfe1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -998,7 +998,13 @@ static inline void __section_mark_present(struct mem_section *ms,
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 }
 
-int section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages);
+int section_vmemmap_page_structs(unsigned long pfn, unsigned long nr_pages);
+
+static inline int section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages)
+{
+	return DIV_ROUND_UP(section_vmemmap_page_structs(pfn, nr_pages) *
+			    sizeof(struct page), PAGE_SIZE);
+}
 #else
 static inline void memblocks_present(void) {}
 static inline void sparse_init(void) {}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 6b23b5f02544..74ccc556bf6e 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1060,24 +1060,6 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
-/*
- * With compound page geometry and when struct pages are stored in ram most
- * tail pages are reused. Consequently, the amount of unique struct pages to
- * initialize is a lot smaller that the total amount of struct pages being
- * mapped. This is a paired / mild layering violation with explicit knowledge
- * of how the sparse_vmemmap internals handle compound pages in the lack
- * of an altmap.
- */
-static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
-					      struct dev_pagemap *pgmap,
-					      const struct mem_section *ms)
-{
-	if (!section_vmemmap_optimizable(ms))
-		return pgmap_vmemmap_nr(pgmap);
-
-	return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
-}
-
 static void __ref memmap_init_compound(struct page *head,
 				       unsigned long head_pfn,
 				       unsigned long zone_idx, int nid,
@@ -1141,8 +1123,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     compound_nr_pages(altmap, pgmap,
-						       __pfn_to_section(pfn)));
+				     section_vmemmap_page_structs(pfn, pfns_per_compound));
 	}
 
 	/*
diff --git a/mm/sparse.c b/mm/sparse.c
index 163bb17bba96..400542302ad4 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -345,23 +345,22 @@ static void __init sparse_usage_fini(void)
 	sparse_usagebuf = sparse_usagebuf_end = NULL;
 }
 
-int __meminit section_vmemmap_pages(unsigned long pfn, unsigned long nr_pages)
+int __meminit section_vmemmap_page_structs(unsigned long pfn, unsigned long nr_pages)
 {
 	const struct mem_section *ms = __pfn_to_section(pfn);
 	unsigned int order = section_order(ms);
 	unsigned long pages_per_compound = 1L << order;
 
 	VM_BUG_ON(!IS_ALIGNED(pfn | nr_pages, min(pages_per_compound, PAGES_PER_SECTION)));
-	VM_BUG_ON(pfn_to_section_nr(pfn) != pfn_to_section_nr(pfn + nr_pages - 1));
 
 	if (!section_vmemmap_optimizable(ms))
-		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
+		return nr_pages;
 
 	if (order < PFN_SECTION_SHIFT)
-		return OPTIMIZED_FOLIO_VMEMMAP_PAGES * nr_pages / pages_per_compound;
+		return OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS * nr_pages / pages_per_compound;
 
 	if (IS_ALIGNED(pfn, pages_per_compound))
-		return OPTIMIZED_FOLIO_VMEMMAP_PAGES;
+		return OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS;
 
 	return 0;
 }
-- 
2.20.1



* [PATCH 43/49] powerpc/mm: rely on generic vmemmap_can_optimize() to simplify code
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (41 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 42/49] mm/sparse-vmemmap: introduce section_vmemmap_page_structs() Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 44/49] mm/sparse-vmemmap: drop ARCH_WANT_OPTIMIZE_DAX_VMEMMAP and simplify checks Muchun Song
                   ` (6 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The goal of this patch is to simplify the code by removing unnecessary
architecture-specific overrides.

After unifying DAX and HugeTLB vmemmap optimizations, we can rely on
the generic rule of vmemmap_can_optimize() instead of keeping
architecture-specific overrides.

In radix__vmemmap_populate(), we can directly depend on
section_vmemmap_optimizable(__pfn_to_section(pfn)) because the upper
layer (sparse_add_section()) has already set the section order correctly
if the optimization condition was met.

In the fallback case for Hash MMU (!radix_enabled) inside vmemmap_populate(),
we reset the section order to 0. This is necessary because sparse_add_section()
may have optimistically set the section order assuming optimization could
be enabled, but Hash MMU does not support it. This ensures that
section_vmemmap_pages() calculates the unoptimized page count accurately.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h |  5 -----
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 12 +-----------
 arch/powerpc/mm/init_64.c                  |  1 +
 3 files changed, 2 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 2600defa2dc2..18e28deba255 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -352,10 +352,5 @@ int radix__create_section_mapping(unsigned long start, unsigned long end,
 				  int nid, pgprot_t prot);
 int radix__remove_section_mapping(unsigned long start, unsigned long end);
 #endif /* CONFIG_MEMORY_HOTPLUG */
-
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
-#define vmemmap_can_optimize vmemmap_can_optimize
-bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
-#endif
 #endif /* __ASSEMBLER__ */
 #endif
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 714d5cdc10ec..36a69589fae4 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -977,16 +977,6 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
 	return 0;
 }
 
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
-bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
-{
-	if (radix_enabled())
-		return __vmemmap_can_optimize(altmap, pgmap);
-
-	return false;
-}
-#endif
-
 int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 				unsigned long addr, unsigned long next)
 {
@@ -1126,7 +1116,7 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
 	pte_t *pte;
 	unsigned long pfn = page_to_pfn((struct page *)start);
 
-	if (vmemmap_can_optimize(altmap, pgmap) && section_vmemmap_optimizable(__pfn_to_section(pfn)))
+	if (section_vmemmap_optimizable(__pfn_to_section(pfn)))
 		return vmemmap_populate_compound_pages(pfn, start, end, node, pgmap);
 	/*
 	 * If altmap is present, Make sure we align the start vmemmap addr
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 8f4aa5b32186..56cbea89d304 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -283,6 +283,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		return radix__vmemmap_populate(start, end, node, altmap, pgmap);
 #endif
 
+	section_set_order(__pfn_to_section(page_to_pfn((struct page *)start)), 0);
 	return __vmemmap_populate(start, end, node, altmap);
 }
 
-- 
2.20.1



* [PATCH 44/49] mm/sparse-vmemmap: drop ARCH_WANT_OPTIMIZE_DAX_VMEMMAP and simplify checks
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (42 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 43/49] powerpc/mm: rely on generic vmemmap_can_optimize() to simplify code Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 45/49] mm/sparse-vmemmap: drop @pgmap parameter from vmemmap populate APIs Muchun Song
                   ` (5 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Historically, when device DAX vmemmap optimization was introduced, it was
initially implemented as a generic feature within sparse-vmemmap.c. However,
it was later discovered that architectures with specific page table formats
(such as PowerPC with hash translation) would crash because the generic
vmemmap_populate_compound_pages() was unaware of their specific page table
setup (e.g., bolted table entries).

To address this, commit 87a7ae75d738 ("mm/vmemmap/devdax: fix kernel crash
when probing devdax devices") introduced a restrictive config option,
which eventually evolved into ARCH_WANT_OPTIMIZE_DAX_VMEMMAP (via commits
0b376f1e0ff5 and 0b6f15824cc7). This effectively turned a generic
optimization into an opt-in architectural feature.

However, the architecture landscape has evolved. The decision of whether
to apply DAX vmemmap optimization techniques for specific page table formats
is now fully delegated to the architecture-specific implementations (e.g.,
within vmemmap_populate()). The upper-level Kconfig restrictions and the
rigid generic wrapper functions are no longer necessary to prevent crashes,
as the architectures themselves handle the viability of the mappings. If an
architecture does not support DAX vmemmap optimization, it can simply
implement fallback logic similar to what PowerPC does in its
vmemmap_populate() routines.

If the architecture supports neither HugeTLB vmemmap optimization nor DAX
vmemmap optimization, but still wants to reduce code size and disable this
feature entirely, it is now possible to turn off SPARSEMEM_VMEMMAP_OPTIMIZATION.
It is no longer a hidden option, but rather a user-configurable boolean under
the SPARSEMEM_VMEMMAP umbrella.

Therefore, this patch removes the redundant ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
and drops the complicated vmemmap_can_optimize() helper. Instead, we
unify SPARSEMEM_VMEMMAP_OPTIMIZATION as a fundamental core capability that
is enabled by default whenever SPARSEMEM_VMEMMAP is selected.

The check in sparse_add_section() is safely simplified to:

	if (!altmap && pgmap && nr_pages == PAGES_PER_SECTION)

which succinctly reflects the prerequisites for the optimization without
unnecessary boilerplate.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/powerpc/Kconfig |  1 -
 arch/riscv/Kconfig   |  1 -
 arch/x86/Kconfig     |  1 -
 include/linux/mm.h   | 34 ----------------------------------
 mm/Kconfig           | 14 ++++++++------
 mm/sparse-vmemmap.c  |  2 +-
 6 files changed, 9 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index da4e2ec2af20..8158d5d0c226 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -184,7 +184,6 @@ config PPC
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WANT_LD_ORPHAN_WARN
-	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP	if PPC_RADIX_MMU
 	select ARCH_WANTS_MODULES_DATA_IN_VMALLOC	if PPC_BOOK3S_32 || PPC_8xx
 	select ARCH_WEAK_RELEASE_ACQUIRE
 	select BINFMT_ELF
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 61a9d8d3ea64..a8eccb828e7b 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -85,7 +85,6 @@ config RISCV
 	select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
 	select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
 	select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL
-	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f19625648f0f..83c55e286b40 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -146,7 +146,6 @@ config X86
 	select ARCH_WANT_GENERAL_HUGETLB
 	select ARCH_WANT_HUGE_PMD_SHARE		if X86_64
 	select ARCH_WANT_LD_ORPHAN_WARN
-	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP	if X86_64
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c36001c9d571..8baa224444be 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4910,40 +4910,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif
 
-#define VMEMMAP_RESERVE_NR	OPTIMIZED_FOLIO_VMEMMAP_PAGES
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
-static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
-					  struct dev_pagemap *pgmap)
-{
-	unsigned long nr_pages;
-	unsigned long nr_vmemmap_pages;
-
-	if (!pgmap || !is_power_of_2(sizeof(struct page)))
-		return false;
-
-	nr_pages = pgmap_vmemmap_nr(pgmap);
-	nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT);
-	/*
-	 * For vmemmap optimization with DAX we need minimum 2 vmemmap
-	 * pages. See layout diagram in Documentation/mm/vmemmap_dedup.rst
-	 */
-	return !altmap && (nr_vmemmap_pages > VMEMMAP_RESERVE_NR);
-}
-/*
- * If we don't have an architecture override, use the generic rule
- */
-#ifndef vmemmap_can_optimize
-#define vmemmap_can_optimize __vmemmap_can_optimize
-#endif
-
-#else
-static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
-					   struct dev_pagemap *pgmap)
-{
-	return false;
-}
-#endif
-
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
 	MF_ACTION_REQUIRED = 1 << 1,
diff --git a/mm/Kconfig b/mm/Kconfig
index e81aa77182b2..166552d5d69a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -411,17 +411,19 @@ config SPARSEMEM_VMEMMAP
 	  efficient option when sufficient kernel resources are available.
 
 config SPARSEMEM_VMEMMAP_OPTIMIZATION
-	bool
+	bool "Enable Vmemmap Optimization Infrastructure"
+	default y
 	depends on SPARSEMEM_VMEMMAP
+	help
+	  This allows features like HugeTLB and DAX to map multiple contiguous
+	  vmemmap pages to a single underlying physical page to save memory.
+
+	  If unsure, say Y.
 
 #
 # Select this config option from the architecture Kconfig, if it is preferred
-# to enable the feature of HugeTLB/dev_dax vmemmap optimization.
+# to enable the feature of HugeTLB vmemmap optimization.
 #
-config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
-	bool
-	select SPARSEMEM_VMEMMAP_OPTIMIZATION if SPARSEMEM_VMEMMAP
-
 config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	bool
 
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index ac2efba9ef92..752a48112504 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -698,7 +698,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 		return ret;
 
 	ms = __nr_to_section(section_nr);
-	if (vmemmap_can_optimize(altmap, pgmap) && nr_pages == PAGES_PER_SECTION) {
+	if (!altmap && pgmap && nr_pages == PAGES_PER_SECTION) {
 		section_set_order(ms, pgmap->vmemmap_shift);
 #ifdef CONFIG_ZONE_DEVICE
 		section_set_zone(ms, ZONE_DEVICE);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 45/49] mm/sparse-vmemmap: drop @pgmap parameter from vmemmap populate APIs
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (43 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 44/49] mm/sparse-vmemmap: drop ARCH_WANT_OPTIMIZE_DAX_VMEMMAP and simplify checks Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 46/49] mm/sparse: replace pgmap with order and zone in sparse_add_section() Muchun Song
                   ` (4 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

Since architecture-specific decisions about vmemmap optimization are now
made directly inside the vmemmap_populate() implementations, the @pgmap
parameter is no longer needed in the core memory hotplug APIs and most
sparse section routines.

Remove the pgmap parameter entirely from:
- sparse_remove_section()
- __remove_pages()
- arch_remove_memory()
- vmemmap_populate() and related functions

This simplifies the API a little.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/arm64/mm/mmu.c                        | 11 ++++----
 arch/loongarch/mm/init.c                   | 12 ++++----
 arch/powerpc/include/asm/book3s/64/radix.h |  4 +--
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 10 +++----
 arch/powerpc/mm/init_64.c                  |  4 +--
 arch/powerpc/mm/mem.c                      |  5 ++--
 arch/riscv/mm/init.c                       |  9 +++---
 arch/s390/mm/init.c                        |  5 ++--
 arch/s390/mm/vmem.c                        |  2 +-
 arch/sparc/mm/init_64.c                    |  5 ++--
 arch/x86/mm/init_64.c                      | 13 ++++-----
 include/linux/memory_hotplug.h             |  8 ++----
 include/linux/mm.h                         | 11 +++-----
 mm/memory_hotplug.c                        | 12 ++++----
 mm/memremap.c                              |  4 +--
 mm/sparse-vmemmap.c                        | 33 +++++++++-------------
 mm/sparse.c                                |  6 ++--
 17 files changed, 65 insertions(+), 89 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 86162aab5185..ec1c6971a561 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1760,7 +1760,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+		struct vmem_altmap *altmap)
 {
 	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
 	/* [start, end] should be within one section */
@@ -1768,9 +1768,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 
 	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
 	    (end - start < PAGES_PER_SECTION * sizeof(struct page)))
-		return vmemmap_populate_basepages(start, end, node, altmap, pgmap);
+		return vmemmap_populate_basepages(start, end, node, altmap);
 	else
-		return vmemmap_populate_hugepages(start, end, node, altmap, pgmap);
+		return vmemmap_populate_hugepages(start, end, node, altmap);
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
@@ -1994,13 +1994,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return ret;
 }
 
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
-			struct dev_pagemap *pgmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
 }
 
diff --git a/arch/loongarch/mm/init.c b/arch/loongarch/mm/init.c
index d61c2e09caae..00f3822b6e47 100644
--- a/arch/loongarch/mm/init.c
+++ b/arch/loongarch/mm/init.c
@@ -86,8 +86,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
 	return ret;
 }
 
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
-			struct dev_pagemap *pgmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -96,7 +95,7 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
 	/* With altmap the first mapped page is offset from @start */
 	if (altmap)
 		page += vmem_altmap_offset(altmap);
-	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
 
@@ -123,13 +122,12 @@ int __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end,
-			       int node, struct vmem_altmap *altmap,
-			       struct dev_pagemap *pgmap)
+			       int node, struct vmem_altmap *altmap)
 {
 #if CONFIG_PGTABLE_LEVELS == 2
-	return vmemmap_populate_basepages(start, end, node, NULL, pgmap);
+	return vmemmap_populate_basepages(start, end, node, NULL);
 #else
-	return vmemmap_populate_hugepages(start, end, node, NULL, pgmap);
+	return vmemmap_populate_hugepages(start, end, node, NULL);
 #endif
 }
 
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 18e28deba255..0c9195dd50c9 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -316,13 +316,11 @@ static inline int radix__has_transparent_pud_hugepage(void)
 #endif
 
 struct vmem_altmap;
-struct dev_pagemap;
 extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
 					     unsigned long page_size,
 					     unsigned long phys);
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
-				      int node, struct vmem_altmap *altmap,
-				      struct dev_pagemap *pgmap);
+				      int node, struct vmem_altmap *altmap);
 void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
 			       struct vmem_altmap *altmap);
 extern void radix__vmemmap_remove_mapping(unsigned long start,
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 36a69589fae4..190448a17119 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1101,11 +1101,10 @@ static inline pte_t *vmemmap_pte_alloc(pmd_t *pmdp, int node,
 
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						    unsigned long start,
-						    unsigned long end, int node,
-						    struct dev_pagemap *pgmap);
+						    unsigned long end, int node);
 
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node,
-				      struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+				      struct vmem_altmap *altmap)
 {
 	unsigned long addr;
 	unsigned long next;
@@ -1117,7 +1116,7 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
 	unsigned long pfn = page_to_pfn((struct page *)start);
 
 	if (section_vmemmap_optimizable(__pfn_to_section(pfn)))
-		return vmemmap_populate_compound_pages(pfn, start, end, node, pgmap);
+		return vmemmap_populate_compound_pages(pfn, start, end, node);
 	/*
 	 * If altmap is present, Make sure we align the start vmemmap addr
 	 * to PAGE_SIZE so that we calculate the correct start_pfn in
@@ -1248,8 +1247,7 @@ static pte_t * __meminit radix__vmemmap_populate_address(unsigned long addr, int
 
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
-						     unsigned long end, int node,
-						     struct dev_pagemap *pgmap)
+						     unsigned long end, int node)
 {
 	/*
 	 * we want to map things as base page size mapping so that
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 56cbea89d304..8e18ed427fdd 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -275,12 +275,12 @@ static int __meminit __vmemmap_populate(unsigned long start, unsigned long end,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-			       struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+			       struct vmem_altmap *altmap)
 {
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	if (radix_enabled())
-		return radix__vmemmap_populate(start, end, node, altmap, pgmap);
+		return radix__vmemmap_populate(start, end, node, altmap);
 #endif
 
 	section_set_order(__pfn_to_section(page_to_pfn((struct page *)start)), 0);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 4c1afab91996..648d0c5602ec 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -158,13 +158,12 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
 	return rc;
 }
 
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
-			      struct dev_pagemap *pgmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 	arch_remove_linear_mapping(start, size);
 }
 #endif
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 277c89661dff..5142ca80be6f 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1443,7 +1443,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-			       struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+			       struct vmem_altmap *altmap)
 {
 	/*
 	 * Note that SPARSEMEM_VMEMMAP is only selected for rv64 and that we
@@ -1451,7 +1451,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	 * memory hotplug, we are not able to update all the page tables with
 	 * the new PMDs.
 	 */
-	return vmemmap_populate_hugepages(start, end, node, altmap, pgmap);
+	return vmemmap_populate_hugepages(start, end, node, altmap);
 }
 #endif
 
@@ -1810,10 +1810,9 @@ int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *param
 	return ret;
 }
 
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
-			      struct dev_pagemap *pgmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
-	__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap, pgmap);
+	__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
 	remove_linear_mapping(start, size);
 	flush_tlb_all();
 }
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 11a689423440..1f72efc2a579 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -276,13 +276,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return rc;
 }
 
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
-			struct dev_pagemap *pgmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 	vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index a7bf8d3d5601..eeadff45e0e1 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -506,7 +506,7 @@ static void vmem_remove_range(unsigned long start, unsigned long size)
  * Add a backed mem_map array to the virtual mem_map array.
  */
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-			       struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+			       struct vmem_altmap *altmap)
 {
 	int ret;
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index f870ca330f9e..367c269305e5 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2591,10 +2591,9 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
-			       int node, struct vmem_altmap *altmap,
-			       struct dev_pagemap *pgmap)
+			       int node, struct vmem_altmap *altmap)
 {
-	return vmemmap_populate_hugepages(vstart, vend, node, NULL, pgmap);
+	return vmemmap_populate_hugepages(vstart, vend, node, NULL);
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e18cc81a30b4..df2261fa4f98 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1288,13 +1288,12 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	remove_pagetable(start, end, true, NULL);
 }
 
-void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
-			      struct dev_pagemap *pgmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap, pgmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 	kernel_physical_mapping_remove(start, start + size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -1557,7 +1556,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+		struct vmem_altmap *altmap)
 {
 	int err;
 
@@ -1565,15 +1564,15 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	VM_BUG_ON(!PAGE_ALIGNED(end));
 
 	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
-		err = vmemmap_populate_basepages(start, end, node, NULL, pgmap);
+		err = vmemmap_populate_basepages(start, end, node, NULL);
 	else if (boot_cpu_has(X86_FEATURE_PSE))
-		err = vmemmap_populate_hugepages(start, end, node, altmap, pgmap);
+		err = vmemmap_populate_hugepages(start, end, node, altmap);
 	else if (altmap) {
 		pr_err_once("%s: no cpu support for altmap allocations\n",
 				__func__);
 		err = -ENOMEM;
 	} else
-		err = vmemmap_populate_basepages(start, end, node, NULL, pgmap);
+		err = vmemmap_populate_basepages(start, end, node, NULL);
 	if (!err)
 		sync_global_pgds(start, end - 1);
 	return err;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7c9d66729c60..815e908c4135 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -135,10 +135,9 @@ static inline bool movable_node_is_enabled(void)
 	return movable_node_enabled;
 }
 
-extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
-			       struct dev_pagemap *pgmap);
+extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap);
 extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
-			   struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
+			   struct vmem_altmap *altmap);
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
@@ -308,8 +307,7 @@ extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
-				  struct vmem_altmap *altmap,
-				  struct dev_pagemap *pgmap);
+				  struct vmem_altmap *altmap);
 extern struct zone *zone_for_pfn_range(enum mmop online_type,
 		int nid, struct memory_group *group, unsigned long start_pfn,
 		unsigned long nr_pages);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8baa224444be..adca19a4b2c7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4858,8 +4858,7 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
 void *sparse_buffer_alloc(unsigned long size);
 unsigned long section_map_size(void);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap);
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
@@ -4870,13 +4869,11 @@ void vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
 int vmemmap_check_pmd(pmd_t *pmd, int node,
 		      unsigned long addr, unsigned long next);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
-			       int node, struct vmem_altmap *altmap,
-			       struct dev_pagemap *pgmap);
+			       int node, struct vmem_altmap *altmap);
 int vmemmap_populate_hugepages(unsigned long start, unsigned long end,
-			       int node, struct vmem_altmap *altmap,
-			       struct dev_pagemap *pgmap);
+			       int node, struct vmem_altmap *altmap);
 int vmemmap_populate(unsigned long start, unsigned long end, int node,
-		struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
+		struct vmem_altmap *altmap);
 void vmemmap_populate_print_last(void);
 struct page *vmemmap_shared_tail_page(unsigned int order, struct zone *zone);
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 28306196c0fe..68dd56dd9f74 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -584,7 +584,7 @@ void remove_pfn_range_from_zone(struct zone *zone,
  * calling offline_pages().
  */
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
-		    struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+		    struct vmem_altmap *altmap)
 {
 	const unsigned long end_pfn = pfn + nr_pages;
 	unsigned long cur_nr_pages;
@@ -599,7 +599,7 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		sparse_remove_section(pfn, cur_nr_pages, altmap, pgmap);
+		sparse_remove_section(pfn, cur_nr_pages, altmap);
 	}
 }
 
@@ -1419,7 +1419,7 @@ static void remove_memory_blocks_and_altmaps(u64 start, u64 size)
 
 		remove_memory_block_devices(cur_start, memblock_size);
 
-		arch_remove_memory(cur_start, memblock_size, altmap, NULL);
+		arch_remove_memory(cur_start, memblock_size, altmap);
 
 		/* Verify that all vmemmap pages have actually been freed. */
 		WARN(altmap->alloc, "Altmap not fully unmapped");
@@ -1462,7 +1462,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
 		ret = create_memory_block_devices(cur_start, memblock_size, nid,
 						  params.altmap, group);
 		if (ret) {
-			arch_remove_memory(cur_start, memblock_size, NULL, NULL);
+			arch_remove_memory(cur_start, memblock_size, NULL);
 			kfree(params.altmap);
 			goto out;
 		}
@@ -1548,7 +1548,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		/* create memory block devices after memory was added */
 		ret = create_memory_block_devices(start, size, nid, NULL, group);
 		if (ret) {
-			arch_remove_memory(start, size, params.altmap, NULL);
+			arch_remove_memory(start, size, params.altmap);
 			goto error;
 		}
 	}
@@ -2247,7 +2247,7 @@ static int try_remove_memory(u64 start, u64 size)
 		 * No altmaps present, do the removal directly
 		 */
 		remove_memory_block_devices(start, size);
-		arch_remove_memory(start, size, NULL, NULL);
+		arch_remove_memory(start, size, NULL);
 	} else {
 		/* all memblocks in the range have altmaps */
 		remove_memory_blocks_and_altmaps(start, size);
diff --git a/mm/memremap.c b/mm/memremap.c
index c45b90f334ea..ac7be07e3361 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -97,10 +97,10 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 				   PHYS_PFN(range_len(range)));
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		__remove_pages(PHYS_PFN(range->start),
-			       PHYS_PFN(range_len(range)), NULL, pgmap);
+			       PHYS_PFN(range_len(range)), NULL);
 	} else {
 		arch_remove_memory(range->start, range_len(range),
-				pgmap_altmap(pgmap), pgmap);
+				pgmap_altmap(pgmap));
 		kasan_remove_zero_shadow(__va(range->start), range_len(range));
 	}
 	mem_hotplug_done();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 752a48112504..68dcc52591d5 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -281,8 +281,7 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap,
-					 struct dev_pagemap *pgmap)
+					 int node, struct vmem_altmap *altmap)
 {
 	unsigned long addr = start;
 	pte_t *pte;
@@ -342,8 +341,7 @@ int __weak __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
 }
 
 int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap,
-					 struct dev_pagemap *pgmap)
+					 int node, struct vmem_altmap *altmap)
 {
 	unsigned long addr;
 	unsigned long next;
@@ -393,15 +391,14 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
 			VM_BUG_ON(section_vmemmap_optimizable(ms));
 			continue;
 		}
-		if (vmemmap_populate_basepages(addr, next, node, altmap, pgmap))
+		if (vmemmap_populate_basepages(addr, next, node, altmap))
 			return -ENOMEM;
 	}
 	return 0;
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
@@ -410,7 +407,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap, pgmap))
+	if (vmemmap_populate(start, end, nid, altmap))
 		return NULL;
 
 	return pfn_to_page(pfn);
@@ -486,10 +483,9 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 }
 
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -570,7 +566,7 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
  * usage map, but still need to free the vmemmap range.
  */
 static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+		struct vmem_altmap *altmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	bool section_is_early = early_section(ms);
@@ -622,8 +618,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -655,10 +650,10 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
 	memmap_pages_add(section_vmemmap_pages(pfn, nr_pages));
 	if (!memmap) {
-		section_deactivate(pfn, nr_pages, altmap, pgmap);
+		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -704,7 +699,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 		section_set_zone(ms, ZONE_DEVICE);
 #endif
 	}
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
@@ -726,13 +721,13 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 }
 
 void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
-			   struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+			   struct vmem_altmap *altmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 
 	if (WARN_ON_ONCE(!valid_section(ms)))
 		return;
 
-	section_deactivate(pfn, nr_pages, altmap, pgmap);
+	section_deactivate(pfn, nr_pages, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/mm/sparse.c b/mm/sparse.c
index 400542302ad4..77bb0113bac5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -237,8 +237,7 @@ unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -386,8 +385,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 		if (pnum >= pnum_end)
 			break;
 
-		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-						nid, NULL, NULL);
+		map = __populate_section_memmap(pfn, PAGES_PER_SECTION, nid, NULL);
 		if (!map)
 			panic("Populate section (%ld) on node[%d] failed\n", pnum, nid);
 		memmap_boot_pages_add(section_vmemmap_pages(pfn, PAGES_PER_SECTION));
-- 
2.20.1



* [PATCH 46/49] mm/sparse: replace pgmap with order and zone in sparse_add_section()
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (44 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 45/49] mm/sparse-vmemmap: drop @pgmap parameter from vmemmap populate APIs Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 47/49] mm: redefine HVO as Hugepage Vmemmap Optimization Muchun Song
                   ` (3 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The sparse_add_section() function uses its struct dev_pagemap argument
only to extract the vmemmap_shift value, which sets the compound page
order for vmemmap optimization, and to mark the section as ZONE_DEVICE.

Since the full struct dev_pagemap is not needed here and is being
removed from the rest of the vmemmap APIs, replace the pgmap parameter
with explicit order and zone parameters in sparse_add_section().

This cleanly decouples the sparse memory infrastructure from the
ZONE_DEVICE struct dev_pagemap. The main motivation behind this
decoupling is to make sparse_add_section() a more generic memory
population interface that can easily be reused for non-ZONE_DEVICE
population use cases in the future.
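The caller-side translation in __add_pages() can be modeled in plain,
self-contained C. This is only an illustrative sketch: dev_pagemap and
the zone enum below are minimal stand-ins, not the kernel's real types.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel types involved (illustrative only). */
enum zone_type { ZONE_NORMAL = 0, ZONE_DEVICE = 5 };

struct dev_pagemap {
	unsigned int vmemmap_shift;	/* compound page order of the mapping */
};

/*
 * Model of the logic __add_pages() performs before calling
 * sparse_add_section(): derive order and zone from an optional pgmap.
 * With no pgmap (regular memory hotplug), order is 0 and the zone
 * value is irrelevant, since sparse_add_section() ignores the zone
 * whenever order is 0.
 */
static void derive_order_and_zone(const struct dev_pagemap *pgmap,
				  unsigned int *order, enum zone_type *zone)
{
	*order = pgmap ? pgmap->vmemmap_shift : 0;
	*zone = pgmap ? ZONE_DEVICE : ZONE_NORMAL;
}
```

This keeps the pgmap knowledge confined to the one boundary function
instead of threading the structure through the sparse code.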

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 10 ++++++++--
 mm/sparse-vmemmap.c            | 14 +++++++-------
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 815e908c4135..089052d64b01 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -305,7 +305,7 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap);
+		unsigned int order, enum zone_type zone);
 extern void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
 				  struct vmem_altmap *altmap);
 extern struct zone *zone_for_pfn_range(enum mmop online_type,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 68dd56dd9f74..0f7707f3d4bb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -385,10 +385,17 @@ int __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	unsigned long cur_nr_pages;
 	int err;
 	struct vmem_altmap *altmap = params->altmap;
+	unsigned int order = params->pgmap ? params->pgmap->vmemmap_shift : 0;
+	enum zone_type zid = 0;
 
 	if (WARN_ON_ONCE(!pgprot_val(params->pgprot)))
 		return -EINVAL;
 
+#ifdef CONFIG_ZONE_DEVICE
+	if (params->pgmap)
+		zid = ZONE_DEVICE;
+#endif
+
 	VM_BUG_ON(!mhp_range_allowed(PFN_PHYS(pfn), nr_pages * PAGE_SIZE, false));
 
 	if (altmap) {
@@ -412,8 +419,7 @@ int __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
-					 params->pgmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap, order, zid);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 68dcc52591d5..894352cb8957 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -666,7 +666,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * @start_pfn: start pfn of the memory range
  * @nr_pages: number of pfns to add in the section
  * @altmap: alternate pfns to allocate the memmap backing store
- * @pgmap: alternate compound page geometry for devmap mappings
+ * @order: section order
+ * @zone: section zone. Note that it is ignored when @order is 0.
  *
  * This is only intended for hotplug.
  *
@@ -681,7 +682,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
+		unsigned int order, enum zone_type zone)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -693,11 +694,10 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 		return ret;
 
 	ms = __nr_to_section(section_nr);
-	if (!altmap && pgmap && nr_pages == PAGES_PER_SECTION) {
-		section_set_order(ms, pgmap->vmemmap_shift);
-#ifdef CONFIG_ZONE_DEVICE
-		section_set_zone(ms, ZONE_DEVICE);
-#endif
+	/* HVO is not supported when memmap pages are backed by an altmap. */
+	if (!altmap && nr_pages == PAGES_PER_SECTION && order) {
+		section_set_order(ms, order);
+		section_set_zone(ms, zone);
 	}
 	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
 	if (IS_ERR(memmap))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 47/49] mm: redefine HVO as Hugepage Vmemmap Optimization
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (45 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 46/49] mm/sparse: replace pgmap with order and zone in sparse_add_section() Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 48/49] Documentation/mm: restructure vmemmap_dedup.rst to reflect generalized HVO Muchun Song
                   ` (2 subsequent siblings)
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The vmemmap optimization is a generic method used to save struct page
overhead. Currently, HVO stands for "HugeTLB Vmemmap Optimization",
which strictly ties the concept to the HugeTLB subsystem.

To reflect the general applicability of this technique, redefine HVO
as "Hugepage Vmemmap Optimization" in generalized contexts, and
"Hugepage Vmemmap Optimization for HugeTLB" in contexts strictly
related to the HugeTLB subsystem.

Update all generic references and comments in the codebase to use the
new terminology "Hugepage Vmemmap Optimization", and modify the
HugeTLB-specific ones to "Hugepage Vmemmap Optimization (HVO) for
HugeTLB".

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 2 +-
 Documentation/admin-guide/sysctl/vm.rst         | 2 +-
 Documentation/mm/vmemmap_dedup.rst              | 2 +-
 fs/Kconfig                                      | 4 ++--
 include/linux/mmzone.h                          | 2 +-
 mm/Kconfig                                      | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1c8a16309270..ae711cd7887d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2125,7 +2125,7 @@ Kernel parameters
 	hugetlb_free_vmemmap=
 			[KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 			enabled.
-			Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
+			Control if Hugepage Vmemmap Optimization (HVO) for HugeTLB is enabled.
 			Allows heavy hugetlb users to free up some more
 			memory (7 * PAGE_SIZE for each 2MB hugetlb page).
 			Format: { on | off (default) }
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c..886f5e78686f 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -665,7 +665,7 @@ This knob is not available when the size of 'struct page' (a structure defined
 in include/linux/mm_types.h) is not power of two (an unusual system config could
 result in this).
 
-Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).
+Enable (set to 1) or disable (set to 0) Hugepage Vmemmap Optimization (HVO) for HugeTLB.
 
 Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
 buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 9fa8642ded48..44e80bd2e398 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -8,7 +8,7 @@ A vmemmap diet for HugeTLB and Device DAX
 HugeTLB
 =======
 
-This section is to explain how HugeTLB Vmemmap Optimization (HVO) works.
+This section is to explain how Hugepage Vmemmap Optimization (HVO) for HugeTLB works.
 
 The ``struct page`` structures are used to describe a physical page frame. By
 default, there is a one-to-one mapping from a page frame to its corresponding
diff --git a/fs/Kconfig b/fs/Kconfig
index 9b56a90e13db..0bcd5b5721a8 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -261,11 +261,11 @@ menuconfig HUGETLBFS
 
 if HUGETLBFS
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
-	bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
+	bool "Hugepage Vmemmap Optimization (HVO) for HugeTLB defaults to on"
 	default n
 	depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	help
-	  The HugeTLB Vmemmap Optimization (HVO) defaults to off. Say Y here to
+	  The Hugepage Vmemmap Optimization (HVO) for HugeTLB defaults to off. Say Y here to
 	  enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
 	  (boot command line) or hugetlb_optimize_vmemmap (sysctl).
 endif # HUGETLBFS
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 846a7ee1334f..a6900f585f9b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -97,7 +97,7 @@
 #define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
 
 /*
- * HugeTLB Vmemmap Optimization (HVO) requires struct pages of the head page to
+ * Hugepage Vmemmap Optimization (HVO) requires struct pages of the head page to
  * be naturally aligned with regard to the folio size.
  *
  * HVO which is only active if the size of struct page is a power of 2.
diff --git a/mm/Kconfig b/mm/Kconfig
index 166552d5d69a..33a36e20db3a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -411,7 +411,7 @@ config SPARSEMEM_VMEMMAP
 	  efficient option when sufficient kernel resources are available.
 
 config SPARSEMEM_VMEMMAP_OPTIMIZATION
-	bool "Enable Vmemmap Optimization Infrastructure"
+	bool "Enable Hugepage Vmemmap Optimization (HVO) Infrastructure"
 	default y
 	depends on SPARSEMEM_VMEMMAP
 	help
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 48/49] Documentation/mm: restructure vmemmap_dedup.rst to reflect generalized HVO
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (46 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 47/49] mm: redefine HVO as Hugepage Vmemmap Optimization Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 12:52 ` [PATCH 49/49] mm: consolidate struct page power-of-2 size checks for HVO Muchun Song
  2026-04-05 13:34 ` [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Mike Rapoport
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The documentation Documentation/mm/vmemmap_dedup.rst is slightly
outdated and poorly structured. It treats HugeTLB pages as the primary
context, which fails to clearly communicate the general principles of
Hugepage Vmemmap Optimization (HVO).

Refine the document to make it more logical and readable. Specifically,
introduce the general HVO concepts, principles, and calculations that
are agnostic to any specific subsystem. Remove the outdated,
subsystem-specific material (such as the explicit HugeTLB and Device
DAX sections) to better reflect the universal applicability of HVO to
any large compound page.

This reorganization makes the documentation much easier to read and
understand, and aligns it with the recent renaming and generalization
of the HVO mechanism.
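The arithmetic the restructured document walks through can be checked
with a short, self-contained C sketch. It assumes a 4 KB base page and
a 64-byte struct page (the common x86-64/arm64 configuration); these
constants are illustrative stand-ins, not the kernel's definitions.

```c
#include <assert.h>

#define PAGE_SIZE	4096UL
#define STRUCT_PAGE	64UL	/* sizeof(struct page) on typical 64-bit configs */

/* Number of vmemmap pages backing the struct pages of a hugepage. */
static unsigned long vmemmap_pages(unsigned long hugepage_size)
{
	return hugepage_size / PAGE_SIZE * STRUCT_PAGE / PAGE_SIZE;
}

/*
 * Pages freed by HVO: all tail vmemmap pages are remapped to one
 * shared page, so dedup only helps when at least two pages back the
 * struct pages of the hugepage.
 */
static unsigned long hvo_saved_pages(unsigned long hugepage_size)
{
	unsigned long total = vmemmap_pages(hugepage_size);

	return total >= 2 ? total - 1 : 0;
}
```

Under these assumptions a 2 MB hugepage is backed by 8 vmemmap pages
(7 freed), a 1 GB hugepage by 4096 (4095 freed), and 512 KB is the
smallest hugepage whose struct pages span two base pages.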

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 Documentation/mm/vmemmap_dedup.rst | 218 ++++++-----------------------
 1 file changed, 42 insertions(+), 176 deletions(-)

diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 44e80bd2e398..a21d84fcbe24 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -1,107 +1,33 @@
-
 .. SPDX-License-Identifier: GPL-2.0
 
-=========================================
-A vmemmap diet for HugeTLB and Device DAX
-=========================================
-
-HugeTLB
-=======
-
-This section is to explain how Hugepage Vmemmap Optimization (HVO) for HugeTLB works.
+===================================================
+Fundamentals of Hugepage Vmemmap Optimization (HVO)
+===================================================
 
-The ``struct page`` structures are used to describe a physical page frame. By
-default, there is a one-to-one mapping from a page frame to its corresponding
+The ``struct page`` structures are used to describe a physical base page frame.
+By default, there is a one-to-one mapping from a page frame to its corresponding
 ``struct page``.
 
-HugeTLB pages consist of multiple base page size pages and is supported by many
-architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
-details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
-currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
-consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
-For each base page, there is a corresponding ``struct page``.
-
-Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
-contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
-this upper limit. The only 'useful' information in the remaining ``struct page``
-is the compound_info field, and this field is the same for all tail pages.
-
-By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
-to the buddy allocator for other uses.
-
-Different architectures support different HugeTLB pages. For example, the
-following table is the HugeTLB page size supported by x86 and arm64
-architectures. Because arm64 supports 4k, 16k, and 64k base pages and
-supports contiguous entries, so it supports many kinds of sizes of HugeTLB
-page.
-
-+--------------+-----------+-----------------------------------------------+
-| Architecture | Page Size |                HugeTLB Page Size              |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
-|              +-----------+-----------+-----------+-----------+-----------+
-|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
-|              +-----------+-----------+-----------+-----------+-----------+
-|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-
-When the system boot up, every HugeTLB page has more than one ``struct page``
-structs which size is (unit: pages)::
-
-   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
-
-Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
-of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
-relationship::
-
-   HugeTLB_Size = n * PAGE_SIZE
-
-Then::
+When huge pages (large compound pages) are used, they consist of multiple
+base-page-size pages. For each base page, there is a corresponding
+``struct page``. However, only a few ``struct page`` structures are actually
+used to contain unique information about the huge page. The only 'useful'
+information in the remaining tail ``struct page`` structures is the
+``->compound_info`` field to locate the head page structure, and this field
+is the same for all tail pages.
 
-   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
-               = n * sizeof(struct page) / PAGE_SIZE
+We can remove redundant ``struct page`` structures for huge pages to save memory.
+This optimization is referred to as Hugepage Vmemmap Optimization (HVO).
 
-We can use huge mapping at the pud/pmd level for the HugeTLB page.
+The optimization is only applied when the size of the ``struct page`` is a
+power-of-2. In this case, all tail pages of the same order are identical. See
+``compound_head()``. This allows us to remap the tail pages of the vmemmap to a
+shared page.
 
-For the HugeTLB page of the pmd level mapping, then::
+Let’s take a system with a 2 MB huge page and a base page size of 4 KB as an
+example for illustration. Here is how things look before optimization::
 
-   struct_size = n * sizeof(struct page) / PAGE_SIZE
-               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
-               = sizeof(struct page) / sizeof(pte_t)
-               = 64 / 8
-               = 8 (pages)
-
-Where n is how many pte entries which one page can contains. So the value of
-n is (PAGE_SIZE / sizeof(pte_t)).
-
-This optimization only supports 64-bit system, so the value of sizeof(pte_t)
-is 8. And this optimization also applicable only when the size of ``struct page``
-is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g.
-x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
-size of ``struct page`` structs of it is 8 page frames which size depends on the
-size of the base page.
-
-For the HugeTLB page of the pud level mapping, then::
-
-   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
-               = PAGE_SIZE / 8 * 8 (pages)
-               = PAGE_SIZE (pages)
-
-Where the struct_size(pmd) is the size of the ``struct page`` structs of a
-HugeTLB page of the pmd level mapping.
-
-E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
-HugeTLB page consists in 4096.
-
-Next, we take the pmd level mapping of the HugeTLB page as an example to
-show the internal implementation of this optimization. There are 8 pages
-``struct page`` structs associated with a HugeTLB page which is pmd mapped.
-
-Here is how things look before optimization::
-
-    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+  2MB Hugepage                  struct pages (8 pages)        page frame (8 pages)
  +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
  |           |                     |     0     | -------------> |     0     |
  |           |                     +-----------+                +-----------+
@@ -112,9 +38,9 @@ Here is how things look before optimization::
  |           |                     |     3     | -------------> |     3     |
  |           |                     +-----------+                +-----------+
  |           |                     |     4     | -------------> |     4     |
- |    PMD    |                     +-----------+                +-----------+
- |   level   |                     |     5     | -------------> |     5     |
- |  mapping  |                     +-----------+                +-----------+
+ |           |                     +-----------+                +-----------+
+ |           |                     |     5     | -------------> |     5     |
+ |           |                     +-----------+                +-----------+
  |           |                     |     6     | -------------> |     6     |
  |           |                     +-----------+                +-----------+
  |           |                     |     7     | -------------> |     7     |
@@ -124,34 +50,27 @@ Here is how things look before optimization::
  |           |
  +-----------+
 
-The first page of ``struct page`` (page 0) associated with the HugeTLB page
-contains the 4 ``struct page`` necessary to describe the HugeTLB. The remaining
-pages of ``struct page`` (page 1 to page 7) are tail pages.
-
-The optimization is only applied when the size of the struct page is a power
-of 2. In this case, all tail pages of the same order are identical. See
-compound_head(). This allows us to remap the tail pages of the vmemmap to a
-shared, read-only page. The head page is also remapped to a new page. This
-allows the original vmemmap pages to be freed.
+We remap the tail pages (page 1 to page 7) of the vmemmap to a shared, read-only
+page (per-zone).
 
 Here is how things look after remapping::
 
-    HugeTLB                  struct pages(8 pages)                 page frame (new)
+  2MB Hugepage                  struct pages(8 pages)           page frame (1 page)
  +-----------+ ---virt_to_page---> +-----------+   mapping to   +----------------+
  |           |                     |     0     | -------------> |       0        |
  |           |                     +-----------+                +----------------+
  |           |                     |     1     | ------┐
  |           |                     +-----------+       |
- |           |                     |     2     | ------┼        +----------------------------+
+ |           |                     |     2     | ------┼
+ |           |                     +-----------+       |
+ |           |                     |     3     | ------┼        +----------------------------+
  |           |                     +-----------+       |        | A single, per-zone page    |
- |           |                     |     3     | ------┼------> | frame shared among all     |
+ |           |                     |     4     | ------┼------> | frame shared among all     |
  |           |                     +-----------+       |        | hugepages of the same size |
- |           |                     |     4     | ------┼        +----------------------------+
+ |           |                     |     5     | ------┼        +----------------------------+
+ |           |                     +-----------+       |
+ |           |                     |     6     | ------┼
  |           |                     +-----------+       |
- |           |                     |     5     | ------┼
- |    PMD    |                     +-----------+       |
- |   level   |                     |     6     | ------┼
- |  mapping  |                     +-----------+       |
  |           |                     |     7     | ------┘
  |           |                     +-----------+
  |           |
@@ -159,65 +78,12 @@ Here is how things look after remapping::
  |           |
  +-----------+
 
-When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
-vmemmap pages and restore the previous mapping relationship.
-
-For the HugeTLB page of the pud level mapping. It is similar to the former.
-We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
-
-Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
-(e.g. aarch64) provides a contiguous bit in the translation table entries
-that hints to the MMU to indicate that it is one of a contiguous set of
-entries that can be cached in a single TLB entry.
-
-The contiguous bit is used to increase the mapping size at the pmd and pte
-(last) level. So this type of HugeTLB page can be optimized only when its
-size of the ``struct page`` structs is greater than **1** page.
-
-Device DAX
-==========
-
-The device-dax interface uses the same tail deduplication technique explained
-in the previous chapter, except when used with the vmemmap in
-the device (altmap).
-
-The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
-PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
-For powerpc equivalent details see Documentation/arch/powerpc/vmemmap_dedup.rst
-
-The differences with HugeTLB are relatively minor.
-
-It only use 3 ``struct page`` for storing all information as opposed
-to 4 on HugeTLB pages.
-
-There's no remapping of vmemmap given that device-dax memory is not part of
-System RAM ranges initialized at boot. Thus the tail page deduplication
-happens at a later stage when we populate the sections. HugeTLB reuses the
-the head vmemmap page representing, whereas device-dax reuses the tail
-vmemmap page. This results in only half of the savings compared to HugeTLB.
-
-Deduplicated tail pages are not mapped read-only.
+Therefore, HVO can be applied to any hugepage whose corresponding
+``struct page`` structures occupy at least two base pages. In the example
+above, the smallest hugepage eligible for HVO is 512 KB; its order
+corresponds to ``OPTIMIZABLE_FOLIO_MIN_ORDER``. In other words, any hugepage
+with an order greater than or equal to ``OPTIMIZABLE_FOLIO_MIN_ORDER`` can
+be optimized with HVO.
 
-Here's how things look like on device-dax after the sections are populated::
-
- +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- |           |                     |     0     | -------------> |     0     |
- |           |                     +-----------+                +-----------+
- |           |                     |     1     | -------------> |     1     |
- |           |                     +-----------+                +-----------+
- |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
- |           |                     +-----------+                   | | | | |
- |           |                     |     3     | ------------------+ | | | |
- |           |                     +-----------+                     | | | |
- |           |                     |     4     | --------------------+ | | |
- |    PMD    |                     +-----------+                       | | |
- |   level   |                     |     5     | ----------------------+ | |
- |  mapping  |                     +-----------+                         | |
- |           |                     |     6     | ------------------------+ |
- |           |                     +-----------+                           |
- |           |                     |     7     | --------------------------+
- |           |                     +-----------+
- |           |
- |           |
- |           |
- +-----------+
+Meanwhile, each optimized hugepage still has ``OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS``
+``struct page`` structures available.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 49/49] mm: consolidate struct page power-of-2 size checks for HVO
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (47 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 48/49] Documentation/mm: restructure vmemmap_dedup.rst to reflect generalized HVO Muchun Song
@ 2026-04-05 12:52 ` Muchun Song
  2026-04-05 13:34 ` [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Mike Rapoport
  49 siblings, 0 replies; 51+ messages in thread
From: Muchun Song @ 2026-04-05 12:52 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song

The Hugepage Vmemmap Optimization (HVO) requires that the size of
struct page is a power of two. This size is evaluated by the C compiler
and currently cannot be evaluated natively by Kconfig. Therefore, the
condition is_power_of_2(sizeof(struct page)) was scattered across
several macros and static inline functions.

Extract the check into a preprocessor macro
STRUCT_PAGE_SIZE_IS_POWER_OF_2 evaluated during the Kbuild process.

Define SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED as a master toggle that
is 1 only if both CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION and the
power-of-2 size check are true.

This allows us to completely remove all scattered sizeof(struct page)
checks, making the code much cleaner and eliminating redundant logic.

Additionally, mm/hugetlb_vmemmap.c and its corresponding header are now
guarded by SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED. This brings an added
benefit: when struct page size is not a power of 2, the compiler can
entirely optimize away the unused functions in mm/hugetlb_vmemmap.c,
reducing kernel image size.
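The shape of the master toggle can be modeled in userspace C. This is a
hedged sketch: STRUCT_PAGE_SIZE is hard-coded to 64 here, whereas in
the real build the size is exported as a plain number by the kernel's
bounds machinery (kernel/bounds.c) so the preprocessor can test it.

```c
#include <assert.h>

/* Stand-in for sizeof(struct page) on a typical 64-bit configuration. */
#define STRUCT_PAGE_SIZE		64

/*
 * Constant-expression power-of-2 test. Once the size is available as a
 * plain number (rather than a sizeof expression), the preprocessor can
 * evaluate this directly in #if conditionals.
 */
#define IS_POWER_OF_2(n)		((n) != 0 && (((n) & ((n) - 1)) == 0))
#define STRUCT_PAGE_SIZE_IS_POWER_OF_2	IS_POWER_OF_2(STRUCT_PAGE_SIZE)

/* Pretend Kconfig enabled the feature for this sketch. */
#define CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION 1

/* Master toggle: both the Kconfig option and the size check must hold. */
#if defined(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION) && STRUCT_PAGE_SIZE_IS_POWER_OF_2
#define SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED 1
#else
#define SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED 0
#endif
```

Because the toggle is a preprocessor constant, whole translation units
(such as mm/hugetlb_vmemmap.c) can be guarded by it and dropped when
the struct page size is not a power of two.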

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/mm_types.h      |  2 ++
 include/linux/mm_types_task.h |  4 ++++
 include/linux/mmzone.h        | 32 +++++++++++++++-----------------
 include/linux/page-flags.h    | 28 ++++------------------------
 kernel/bounds.c               |  2 ++
 mm/hugetlb_vmemmap.c          |  2 ++
 mm/hugetlb_vmemmap.h          |  4 +---
 mm/internal.h                 |  3 ---
 mm/sparse.c                   |  6 ++----
 mm/util.c                     |  2 +-
 10 files changed, 33 insertions(+), 52 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..6de6c0c20f8b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,7 +15,9 @@
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
 #include <linux/rcupdate.h>
+#ifndef __GENERATING_BOUNDS_H
 #include <linux/page-flags-layout.h>
+#endif
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
 #include <linux/percpu_counter.h>
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index 11bf319d78ec..09e5039fff97 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -17,7 +17,11 @@
 #include <asm/tlbbatch.h>
 #endif
 
+#ifndef __GENERATING_BOUNDS_H
 #define ALLOC_SPLIT_PTLOCKS	(SPINLOCK_SIZE > BITS_PER_LONG/8)
+#else
+#define ALLOC_SPLIT_PTLOCKS	0
+#endif
 
 /*
  * When updating this, please also update struct resident_page_types[] in
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a6900f585f9b..3a46cb0bfaaa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -96,27 +96,26 @@
 
 #define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
 
-/*
- * Hugepage Vmemmap Optimization (HVO) requires struct pages of the head page to
- * be naturally aligned with regard to the folio size.
- *
- * HVO which is only active if the size of struct page is a power of 2.
- */
-#define MAX_FOLIO_VMEMMAP_ALIGN					\
-	(IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION) &&	\
-	 is_power_of_2(sizeof(struct page)) ?			\
-	 MAX_FOLIO_NR_PAGES * sizeof(struct page) : 0)
-
 /* The number of vmemmap pages required by a vmemmap-optimized folio. */
 #define OPTIMIZED_FOLIO_VMEMMAP_PAGES		1
 #define OPTIMIZED_FOLIO_VMEMMAP_SIZE		(OPTIMIZED_FOLIO_VMEMMAP_PAGES * PAGE_SIZE)
 #define OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS	(OPTIMIZED_FOLIO_VMEMMAP_SIZE / sizeof(struct page))
 #define OPTIMIZABLE_FOLIO_MIN_ORDER		(ilog2(OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS) + 1)
 
+#if defined(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION) && STRUCT_PAGE_SIZE_IS_POWER_OF_2
+#define SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED	1
+/*
+ * Hugepage Vmemmap Optimization (HVO) requires struct pages of the head page to
+ * be naturally aligned with regard to the folio size.
+ */
+#define MAX_FOLIO_VMEMMAP_ALIGN			(MAX_FOLIO_NR_PAGES * sizeof(struct page))
 #define __NR_OPTIMIZABLE_FOLIO_SIZES		(MAX_FOLIO_ORDER - OPTIMIZABLE_FOLIO_MIN_ORDER + 1)
 #define NR_OPTIMIZABLE_FOLIO_SIZES		\
-	((__NR_OPTIMIZABLE_FOLIO_SIZES > 0 &&	\
-	  IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION)) ? __NR_OPTIMIZABLE_FOLIO_SIZES : 0)
+	(__NR_OPTIMIZABLE_FOLIO_SIZES > 0 ? __NR_OPTIMIZABLE_FOLIO_SIZES : 0)
+#else
+#define MAX_FOLIO_VMEMMAP_ALIGN			0
+#define NR_OPTIMIZABLE_FOLIO_SIZES		0
+#endif
 
 enum migratetype {
 	MIGRATE_UNMOVABLE,
@@ -2015,7 +2014,7 @@ struct mem_section {
 	 */
 	struct page_ext *page_ext;
 #endif
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
+#ifdef SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED
 	/*
 	 * The order of compound pages in this section. Typically, the section
 	 * holds compound pages of this order; a larger compound page will span
@@ -2208,7 +2207,7 @@ static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long
 }
 #endif
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
+#ifdef SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED
 static inline void section_set_order(struct mem_section *section, unsigned int order)
 {
 	VM_BUG_ON(section->order && order && section->order != order);
@@ -2267,8 +2266,7 @@ static inline void section_set_compound_range(unsigned long pfn,
 
 static inline bool section_vmemmap_optimizable(const struct mem_section *section)
 {
-	return is_power_of_2(sizeof(struct page)) &&
-	       section_order(section) >= OPTIMIZABLE_FOLIO_MIN_ORDER;
+	return section_order(section) >= OPTIMIZABLE_FOLIO_MIN_ORDER;
 }
 
 void sparse_init_early_section(int nid, struct page *map, unsigned long pnum,
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 12665b34586c..bea934d49750 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,32 +198,12 @@ enum pageflags {
 
 #ifndef __GENERATING_BOUNDS_H
 
-/*
- * For tail pages, if the size of struct page is power-of-2 ->compound_info
- * encodes the mask that converts the address of the tail page address to
- * the head page address.
- *
- * Otherwise, ->compound_info has direct pointer to head pages.
- */
-static __always_inline bool compound_info_has_mask(void)
-{
-	/*
-	 * The approach with mask would work in the wider set of conditions,
-	 * but it requires validating that struct pages are naturally aligned
-	 * for all orders up to the MAX_FOLIO_ORDER, which can be tricky.
-	 */
-	if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION))
-		return false;
-
-	return is_power_of_2(sizeof(struct page));
-}
-
 static __always_inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long info = READ_ONCE(page->compound_info);
 	unsigned long mask;
 
-	if (!compound_info_has_mask()) {
+	if (!IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED)) {
 		/* Bit 0 encodes PageTail() */
 		if (info & 1)
 			return info - 1;
@@ -232,8 +212,8 @@ static __always_inline unsigned long _compound_head(const struct page *page)
 	}
 
 	/*
-	 * If compound_info_has_mask() is true the rest of the info encodes
-	 * the mask that converts the address of the tail page to the head page.
+	 * If HVO is enabled the rest of the info encodes the mask that converts
+	 * the address of the tail page to the head page.
 	 *
 	 * No need to clear bit 0 in the mask as 'page' always has it clear.
 	 *
@@ -257,7 +237,7 @@ static __always_inline void set_compound_head(struct page *tail,
 	unsigned int shift;
 	unsigned long mask;
 
-	if (!compound_info_has_mask()) {
+	if (!IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED)) {
 		WRITE_ONCE(tail->compound_info, (unsigned long)head | 1);
 		return;
 	}
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 02b619eb6106..ff2ec3834d32 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -8,6 +8,7 @@
 #define __GENERATING_BOUNDS_H
 #define COMPILE_OFFSETS
 /* Include headers that define the enum constants of interest */
+#include <linux/mm_types.h>
 #include <linux/page-flags.h>
 #include <linux/mmzone.h>
 #include <linux/kbuild.h>
@@ -30,6 +31,7 @@ int main(void)
 	DEFINE(LRU_GEN_WIDTH, 0);
 	DEFINE(__LRU_REFS_WIDTH, 0);
 #endif
+	DEFINE(STRUCT_PAGE_SIZE_IS_POWER_OF_2, is_power_of_2(sizeof(struct page)));
 	/* End of constants */
 
 	return 0;
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index d595ef759bc2..0347341be156 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -21,6 +21,7 @@
 #include "hugetlb_vmemmap.h"
 #include "internal.h"
 
+#ifdef SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED
 /**
  * struct vmemmap_remap_walk - walk vmemmap page table
  *
@@ -693,3 +694,4 @@ static int __init hugetlb_vmemmap_init(void)
 	return 0;
 }
 late_initcall(hugetlb_vmemmap_init);
+#endif
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 0022f9c5a101..bd576ef41ee7 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -12,7 +12,7 @@
 #include <linux/io.h>
 #include <linux/memblock.h>
 
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+#if defined(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) && defined(SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED)
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);
 long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 					struct list_head *folio_list,
@@ -34,8 +34,6 @@ static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate
 {
 	int size = hugetlb_vmemmap_size(h) - OPTIMIZED_FOLIO_VMEMMAP_SIZE;
 
-	if (!is_power_of_2(sizeof(struct page)))
-		return 0;
 	return size > 0 ? size : 0;
 }
 #else
diff --git a/mm/internal.h b/mm/internal.h
index 02064f21bfe1..121c9076f09a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1026,9 +1026,6 @@ static inline bool vmemmap_page_optimizable(const struct page *page)
 	unsigned long pfn = page_to_pfn(page);
 	unsigned int order = section_order(__pfn_to_section(pfn));
 
-	if (!is_power_of_2(sizeof(struct page)))
-		return false;
-
 	return (pfn & ((1L << order) - 1)) >= OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS;
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 77bb0113bac5..7375f66a58d5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -404,10 +404,8 @@ void __init sparse_init(void)
 	unsigned long pnum_end, pnum_begin, map_count = 1;
 	int nid_begin;
 
-	if (compound_info_has_mask()) {
-		VM_WARN_ON_ONCE(!IS_ALIGNED((unsigned long) pfn_to_page(0),
-				    MAX_FOLIO_VMEMMAP_ALIGN));
-	}
+	VM_WARN_ON_ONCE(IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED) &&
+			!IS_ALIGNED((unsigned long)pfn_to_page(0), MAX_FOLIO_VMEMMAP_ALIGN));
 
 	pnum_begin = first_present_section_nr();
 	nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
diff --git a/mm/util.c b/mm/util.c
index f063fd4de1e8..783b2081ea74 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1348,7 +1348,7 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
 		foliop = (struct folio *)page;
 	} else {
 		/* See compound_head() */
-		if (compound_info_has_mask()) {
+		if (IS_ENABLED(SPARSEMEM_VMEMMAP_OPTIMIZATION_ENABLED)) {
 			unsigned long p = (unsigned long)page;
 
 			foliop = (struct folio *)(p & info);
-- 
2.20.1



* Re: [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB
  2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
                   ` (48 preceding siblings ...)
  2026-04-05 12:52 ` [PATCH 49/49] mm: consolidate struct page power-of-2 size checks for HVO Muchun Song
@ 2026-04-05 13:34 ` Mike Rapoport
  49 siblings, 0 replies; 51+ messages in thread
From: Mike Rapoport @ 2026-04-05 13:34 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Nicholas Piggin, Christophe Leroy, aneesh.kumar,
	joao.m.martins, linux-mm, linuxppc-dev, linux-kernel

Hi Muchun,

On Sun, Apr 05, 2026 at 08:51:51PM +0800, Muchun Song wrote:
> Overview:
> This patch series generalizes the HugeTLB Vmemmap Optimization (HVO)
> into a generic vmemmap optimization framework that can be used by both
> HugeTLB and DAX.
> 
> Background:
> Currently, the vmemmap optimization feature is highly coupled with
> HugeTLB. However, DAX also has similar requirements for optimizing vmemmap
> pages to save memory. The current implementation has separate vmemmap
> optimization paths for HugeTLB and DAX, leading to duplicated logic,
> complex initialization sequences, and architecture-specific flags.
> 
> Implementation:
> This series breaks down the optimization into a generic framework:
> - Patch 1-6: Fix bugs related to sparse vmemmap initialization and DAX.
> - Patch 7-13: Refactor the existing sparse vmemmap initialization.
> - Patch 14-26: Decouple the vmemmap optimization from HugeTLB and
>   introduce generic optimization macros and functions.
> - Patch 27-39: Switch HugeTLB and DAX to use the generic framework.
> - Patch 40-49: Clean up the old HVO-specific code and simplify it.
> 
> Benefit:
> - When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, all struct pages
>   utilizing HVO (HugeTLB Vmemmap Optimization) skip initialization in
>   memmap_init, significantly accelerating boot times.
> - All architectures supporting HVO benefit from the optimizations 
>   provided by SPARSEMEM_VMEMMAP_PREINIT without requiring 
>   architecture-specific adaptations.
> - Device DAX struct page savings are further improved, saving an 
>   additional 4KB of struct page memory for every 2MB huge page.
> - Vmemmap tail pages used for Device DAX shared mappings are changed 
>   from read-write to read-only, enhancing system security.
> - HugeTLB and Device DAX now share a unified vmemmap optimization 
>   framework, reducing long-term maintenance overhead.

This looks very interesting ...
 
>  32 files changed, 513 insertions(+), 1197 deletions(-)

... and nicely cleaning up things.

This series is high on my TODO list, but most likely I won't have time for
proper review until after -rc1.

-- 
Sincerely yours,
Mike.


end of thread, other threads:[~2026-04-05 13:35 UTC | newest]

Thread overview: 51+ messages
2026-04-05 12:51 [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Muchun Song
2026-04-05 12:51 ` [PATCH 01/49] mm/sparse: fix vmemmap accounting imbalance on memory hotplug error Muchun Song
2026-04-05 12:51 ` [PATCH 02/49] mm/sparse: add a @pgmap argument to memory deactivation paths Muchun Song
2026-04-05 12:51 ` [PATCH 03/49] mm/sparse: fix vmemmap page accounting for HVOed DAX Muchun Song
2026-04-05 12:51 ` [PATCH 04/49] mm/sparse: add a @pgmap parameter to arch vmemmap_populate() Muchun Song
2026-04-05 12:51 ` [PATCH 05/49] mm/sparse: fix missing architecture-specific page table sync for HVO DAX Muchun Song
2026-04-05 12:51 ` [PATCH 06/49] mm/mm_init: fix uninitialized pageblock migratetype for ZONE_DEVICE compound pages Muchun Song
2026-04-05 12:51 ` [PATCH 07/49] mm/mm_init: use pageblock_migratetype_init_range() in deferred_free_pages() Muchun Song
2026-04-05 12:51 ` [PATCH 08/49] mm: Convert vmemmap_p?d_populate() to static functions Muchun Song
2026-04-05 12:52 ` [PATCH 09/49] mm: panic on memory allocation failure in sparse_init_nid() Muchun Song
2026-04-05 12:52 ` [PATCH 10/49] mm: move subsection_map_init() into sparse_init() Muchun Song
2026-04-05 12:52 ` [PATCH 11/49] mm: defer sparse_init() until after zone initialization Muchun Song
2026-04-05 12:52 ` [PATCH 12/49] mm: make set_pageblock_order() static Muchun Song
2026-04-05 12:52 ` [PATCH 13/49] mm: integrate sparse_vmemmap_init_nid_late() into sparse_init_nid() Muchun Song
2026-04-05 12:52 ` [PATCH 14/49] mm/cma: validate hugetlb CMA range by zone at reserve time Muchun Song
2026-04-05 12:52 ` [PATCH 15/49] mm/hugetlb: free cross-zone bootmem gigantic pages after allocation Muchun Song
2026-04-05 12:52 ` [PATCH 16/49] mm/hugetlb: initialize vmemmap optimization in early stage Muchun Song
2026-04-05 12:52 ` [PATCH 17/49] mm: remove sparse_vmemmap_init_nid_late() Muchun Song
2026-04-05 12:52 ` [PATCH 18/49] mm/mm_init: make __init_page_from_nid() static Muchun Song
2026-04-05 12:52 ` [PATCH 19/49] mm/sparse-vmemmap: remove the VMEMMAP_POPULATE_PAGEREF flag Muchun Song
2026-04-05 12:52 ` [PATCH 20/49] mm: rename vmemmap optimization macros to generic names Muchun Song
2026-04-05 12:52 ` [PATCH 21/49] mm/sparse: drop power-of-2 size requirement for struct mem_section Muchun Song
2026-04-05 12:52 ` [PATCH 22/49] mm/sparse: introduce compound page order to mem_section Muchun Song
2026-04-05 12:52 ` [PATCH 23/49] mm/mm_init: skip initializing shared tail pages for compound pages Muchun Song
2026-04-05 12:52 ` [PATCH 24/49] mm/sparse-vmemmap: initialize shared tail vmemmap page upon allocation Muchun Song
2026-04-05 12:52 ` [PATCH 25/49] mm/sparse-vmemmap: support vmemmap-optimizable compound page population Muchun Song
2026-04-05 12:52 ` [PATCH 26/49] mm/hugetlb: use generic vmemmap optimization macros Muchun Song
2026-04-05 12:52 ` [PATCH 27/49] mm: call memblocks_present() before HugeTLB initialization Muchun Song
2026-04-05 12:52 ` [PATCH 28/49] mm/hugetlb: switch HugeTLB to use generic vmemmap optimization Muchun Song
2026-04-05 12:52 ` [PATCH 29/49] mm: extract pfn_to_zone() helper Muchun Song
2026-04-05 12:52 ` [PATCH 30/49] mm/sparse-vmemmap: remove unused SPARSEMEM_VMEMMAP_PREINIT feature Muchun Song
2026-04-05 12:52 ` [PATCH 31/49] mm/hugetlb: remove HUGE_BOOTMEM_HVO flag and simplify pre-HVO logic Muchun Song
2026-04-05 12:52 ` [PATCH 32/49] mm/sparse-vmemmap: consolidate shared tail page allocation Muchun Song
2026-04-05 12:52 ` [PATCH 33/49] mm: introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION Muchun Song
2026-04-05 12:52 ` [PATCH 34/49] mm/sparse-vmemmap: switch DAX to use generic vmemmap optimization Muchun Song
2026-04-05 12:52 ` [PATCH 35/49] mm/sparse-vmemmap: introduce section zone to struct mem_section Muchun Song
2026-04-05 12:52 ` [PATCH 36/49] powerpc/mm: use generic vmemmap_shared_tail_page() in compound vmemmap Muchun Song
2026-04-05 12:52 ` [PATCH 37/49] mm/sparse-vmemmap: unify DAX and HugeTLB vmemmap optimization Muchun Song
2026-04-05 12:52 ` [PATCH 38/49] mm/sparse-vmemmap: remap the shared tail pages as read-only Muchun Song
2026-04-05 12:52 ` [PATCH 39/49] mm/sparse-vmemmap: remove unused ptpfn argument Muchun Song
2026-04-05 12:52 ` [PATCH 40/49] mm/hugetlb_vmemmap: remove vmemmap_wrprotect_hvo() and related code Muchun Song
2026-04-05 12:52 ` [PATCH 41/49] mm/sparse: simplify section_vmemmap_pages() Muchun Song
2026-04-05 12:52 ` [PATCH 42/49] mm/sparse-vmemmap: introduce section_vmemmap_page_structs() Muchun Song
2026-04-05 12:52 ` [PATCH 43/49] powerpc/mm: rely on generic vmemmap_can_optimize() to simplify code Muchun Song
2026-04-05 12:52 ` [PATCH 44/49] mm/sparse-vmemmap: drop ARCH_WANT_OPTIMIZE_DAX_VMEMMAP and simplify checks Muchun Song
2026-04-05 12:52 ` [PATCH 45/49] mm/sparse-vmemmap: drop @pgmap parameter from vmemmap populate APIs Muchun Song
2026-04-05 12:52 ` [PATCH 46/49] mm/sparse: replace pgmap with order and zone in sparse_add_section() Muchun Song
2026-04-05 12:52 ` [PATCH 47/49] mm: redefine HVO as Hugepage Vmemmap Optimization Muchun Song
2026-04-05 12:52 ` [PATCH 48/49] Documentation/mm: restructure vmemmap_dedup.rst to reflect generalized HVO Muchun Song
2026-04-05 12:52 ` [PATCH 49/49] mm: consolidate struct page power-of-2 size checks for HVO Muchun Song
2026-04-05 13:34 ` [PATCH 00/49] mm: Generalize vmemmap optimization for DAX and HugeTLB Mike Rapoport
