public inbox for stable@vger.kernel.org
* [PATCH v7 1/6] mm/sparse-vmemmap: Fix vmemmap accounting underflow
       [not found] <20260426092640.375967-1-songmuchun@bytedance.com>
@ 2026-04-26  9:26 ` Muchun Song
  2026-04-26  9:26 ` [PATCH v7 2/6] mm/memory_hotplug: Fix incorrect altmap passing in error path Muchun Song
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Muchun Song @ 2026-04-26  9:26 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song, stable

In section_activate(), if populate_section_memmap() fails, the error
handling path calls section_deactivate() to roll back the state. This
causes a vmemmap accounting imbalance.

Since commit c3576889d87b ("mm: fix accounting of memmap pages"),
memmap pages are accounted for only after populate_section_memmap()
succeeds. However, the failure path unconditionally calls
section_deactivate(), which decreases the vmemmap count. Consequently,
a failure in populate_section_memmap() leads to an accounting underflow,
incorrectly reducing the system's tracked vmemmap usage.

Fix this more thoroughly by moving all accounting calls into the
lower-level functions that actually perform the vmemmap allocation and
freeing:

  - populate_section_memmap() accounts for newly allocated vmemmap pages
  - depopulate_section_memmap() subtracts them again when the vmemmap is
    freed
  - free_map_bootmem() likewise subtracts the boot memmap pages when an
    early section's memmap is freed

This ensures proper accounting in all code paths, including error
handling and early section cases.

Fixes: c3576889d87b ("mm: fix accounting of memmap pages")
Cc: stable@vger.kernel.org
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 mm/sparse-vmemmap.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 6eadb9d116e4..a7b11248b989 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -656,7 +656,12 @@ static struct page * __meminit populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
+	struct page *page = __populate_section_memmap(pfn, nr_pages, nid, altmap,
+						      pgmap);
+
+	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
+
+	return page;
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -665,13 +670,17 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
 
+	memmap_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE)));
 	vmemmap_free(start, end, altmap);
 }
+
 static void free_map_bootmem(struct page *memmap)
 {
 	unsigned long start = (unsigned long)memmap;
 	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
 
+	memmap_boot_pages_add(-1L * (DIV_ROUND_UP(PAGES_PER_SECTION * sizeof(struct page),
+						  PAGE_SIZE)));
 	vmemmap_free(start, end, NULL);
 }
 
@@ -774,14 +783,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 	 * The memmap of early sections is always fully populated. See
 	 * section_activate() and pfn_valid() .
 	 */
-	if (!section_is_early) {
-		memmap_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE)));
+	if (!section_is_early)
 		depopulate_section_memmap(pfn, nr_pages, altmap);
-	} else if (memmap) {
-		memmap_boot_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page),
-							  PAGE_SIZE)));
+	else if (memmap)
 		free_map_bootmem(memmap);
-	}
 
 	if (empty)
 		ms->section_mem_map = (unsigned long)NULL;
@@ -826,7 +831,6 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
 	}
-	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
 
 	return memmap;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v7 2/6] mm/memory_hotplug: Fix incorrect altmap passing in error path
       [not found] <20260426092640.375967-1-songmuchun@bytedance.com>
  2026-04-26  9:26 ` [PATCH v7 1/6] mm/sparse-vmemmap: Fix vmemmap accounting underflow Muchun Song
@ 2026-04-26  9:26 ` Muchun Song
  2026-04-26  9:26 ` [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization Muchun Song
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Muchun Song @ 2026-04-26  9:26 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song, stable

In create_altmaps_and_memory_blocks(), when arch_add_memory() succeeds
with memmap_on_memory enabled, the vmemmap pages are allocated from
params.altmap. If create_memory_block_devices() subsequently fails, the
error path calls arch_remove_memory() with a NULL altmap instead of
params.altmap.

Since altmap is NULL, vmemmap_free() falls back to freeing the vmemmap
pages into the system buddy allocator via free_pages() instead of
returning them to the altmap. arch_remove_memory() then immediately
destroys the physical linear mapping for this memory, so unowned pages
end up in the buddy allocator, causing machine checks or memory
corruption if the system later allocates and uses them.

Fix this by passing params.altmap to arch_remove_memory() in the error
path.

Fixes: 6b8f0798b85a ("mm/memory_hotplug: split memmap_on_memory requests across memblocks")
Cc: stable@vger.kernel.org
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 mm/memory_hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a943ec57c85..0bad2aed2bde 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1468,7 +1468,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
 		ret = create_memory_block_devices(cur_start, memblock_size, nid,
 						  params.altmap, group);
 		if (ret) {
-			arch_remove_memory(cur_start, memblock_size, NULL);
+			arch_remove_memory(cur_start, memblock_size, params.altmap);
 			kfree(params.altmap);
 			goto out;
 		}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization
       [not found] <20260426092640.375967-1-songmuchun@bytedance.com>
  2026-04-26  9:26 ` [PATCH v7 1/6] mm/sparse-vmemmap: Fix vmemmap accounting underflow Muchun Song
  2026-04-26  9:26 ` [PATCH v7 2/6] mm/memory_hotplug: Fix incorrect altmap passing in error path Muchun Song
@ 2026-04-26  9:26 ` Muchun Song
  2026-04-27 10:17   ` David Hildenbrand (Arm)
  2026-04-26  9:26 ` [PATCH v7 5/6] mm/mm_init: Fix pageblock migratetype for ZONE_DEVICE compound pages Muchun Song
  2026-04-26  9:26 ` [PATCH v7 6/6] mm/mm_init: Fix uninitialized struct pages for ZONE_DEVICE Muchun Song
  4 siblings, 1 reply; 9+ messages in thread
From: Muchun Song @ 2026-04-26  9:26 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song, stable

When vmemmap optimization is enabled for DAX, the nr_memmap_pages
counter in /proc/vmstat is incorrect. The current code always accounts
for the full, non-optimized vmemmap size, but vmemmap optimization
reduces the actual number of vmemmap pages by reusing tail pages. This
causes the system to overcount vmemmap usage, leading to inaccurate
page statistics in /proc/vmstat.

Fix this by introducing section_nr_vmemmap_pages(), which returns the exact
vmemmap page count for a given pfn range based on whether optimization
is in effect.
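
As a rough userspace sketch of the two formulas involved (the section
size, sizeof(struct page) and VMEMMAP_RESERVE_NR values below are
assumptions for illustration):

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define PAGES_PER_SECTION	32768UL	/* assumed 128 MiB section */
#define STRUCT_PAGE_SIZE	64UL	/* assumed sizeof(struct page) */
#define VMEMMAP_RESERVE_NR	2UL	/* assumed: vmemmap pages kept per compound page */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long nr_pages = PAGES_PER_SECTION;
	unsigned long pages_per_compound = 512;	/* 2 MiB DAX compound pages */
	unsigned long full = DIV_ROUND_UP(nr_pages * STRUCT_PAGE_SIZE, PAGE_SIZE);
	unsigned long optimized = VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;

	/*
	 * The old code always accounted "full"; with optimization in
	 * effect only "optimized" pages are actually allocated.
	 */
	printf("vmemmap pages per section: full=%lu optimized=%lu\n",
	       full, optimized);
	return 0;
}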

Fixes: 15995a352474 ("mm: report per-page metadata information")
Cc: stable@vger.kernel.org
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
---
v6 -> v7:
- Refine the alignment assertions in section_nr_vmemmap_pages().
---
 mm/sparse-vmemmap.c | 34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 3340f6d30b01..01f448607bad 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -652,6 +652,31 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 	}
 }
 
+static int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
+		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+{
+	const unsigned int order = pgmap ? pgmap->vmemmap_shift : 0;
+	const unsigned long pages_per_compound = 1UL << order;
+
+	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
+
+	if (!vmemmap_can_optimize(altmap, pgmap))
+		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
+
+	if (order < PFN_SECTION_SHIFT) {
+		VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
+		return VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;
+	}
+
+	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
+	VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
+
+	if (IS_ALIGNED(pfn, pages_per_compound))
+		return VMEMMAP_RESERVE_NR;
+
+	return 0;
+}
+
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap)
@@ -659,7 +684,7 @@ static struct page * __meminit populate_section_memmap(unsigned long pfn,
 	struct page *page = __populate_section_memmap(pfn, nr_pages, nid, altmap,
 						      pgmap);
 
-	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
+	memmap_pages_add(section_nr_vmemmap_pages(pfn, nr_pages, altmap, pgmap));
 
 	return page;
 }
@@ -670,7 +695,7 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
 
-	memmap_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE)));
+	memmap_pages_add(-section_nr_vmemmap_pages(pfn, nr_pages, altmap, pgmap));
 	vmemmap_free(start, end, altmap);
 }
 
@@ -678,9 +703,10 @@ static void free_map_bootmem(struct page *memmap)
 {
 	unsigned long start = (unsigned long)memmap;
 	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
+	unsigned long pfn = page_to_pfn(memmap);
 
-	memmap_boot_pages_add(-1L * (DIV_ROUND_UP(PAGES_PER_SECTION * sizeof(struct page),
-						  PAGE_SIZE)));
+	memmap_boot_pages_add(-section_nr_vmemmap_pages(pfn, PAGES_PER_SECTION,
+							NULL, NULL));
 	vmemmap_free(start, end, NULL);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v7 5/6] mm/mm_init: Fix pageblock migratetype for ZONE_DEVICE compound pages
       [not found] <20260426092640.375967-1-songmuchun@bytedance.com>
                   ` (2 preceding siblings ...)
  2026-04-26  9:26 ` [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization Muchun Song
@ 2026-04-26  9:26 ` Muchun Song
  2026-04-26  9:26 ` [PATCH v7 6/6] mm/mm_init: Fix uninitialized struct pages for ZONE_DEVICE Muchun Song
  4 siblings, 0 replies; 9+ messages in thread
From: Muchun Song @ 2026-04-26  9:26 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song, stable

The memmap_init_zone_device() function only initializes the
migratetype of the first pageblock of a compound page. If the compound
page size exceeds pageblock_nr_pages (e.g., 1GB hugepages with 2MB
pageblocks), the migratetype of the subsequent pageblocks in the
compound page is never set.

Move the migratetype initialization out of __init_zone_device_page()
and into a separate pageblock_migratetype_init_range() function, which
iterates over the entire PFN range being initialized and ensures that
every pageblock gets its migratetype set.

Also remove the stale, confusing comment about MEMINIT_HOTPLUG above
the migratetype setting; it is a relic of commit 966cf44f637e ("mm:
defer ZONE_DEVICE page initialization to the point where we init
pgmap") and no longer makes sense here.
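
As a rough userspace sketch of the coverage gap (pageblock and compound
page sizes below are assumptions for illustration):

#include <stdio.h>

#define PAGES_PER_PAGEBLOCK	512UL	/* assumed 2 MiB pageblocks, 4 KiB pages */

int main(void)
{
	unsigned long compound_pages = 262144;	/* assumed 1 GiB compound page */
	unsigned long old_init = 1;		/* only the head's pageblock was set */
	unsigned long new_init = 0, pfn;

	/* pageblock_migratetype_init_range() walks every pageblock in the range */
	for (pfn = 0; pfn < compound_pages; pfn += PAGES_PER_PAGEBLOCK)
		new_init++;

	printf("pageblocks with migratetype set: before=%lu after=%lu (of %lu)\n",
	       old_init, new_init, compound_pages / PAGES_PER_PAGEBLOCK);
	return 0;
}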

Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
Cc: stable@vger.kernel.org
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 mm/mm_init.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..cfc76953e249 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -674,6 +674,20 @@ static inline void fixup_hashdist(void)
 static inline void fixup_hashdist(void) {}
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_ZONE_DEVICE
+static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
+		unsigned long nr_pages, int migratetype)
+{
+	const unsigned long end = pfn + nr_pages;
+
+	for (pfn = pageblock_align(pfn); pfn < end; pfn += pageblock_nr_pages) {
+		init_pageblock_migratetype(pfn_to_page(pfn), migratetype, false);
+		if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
+			cond_resched();
+	}
+}
+#endif
+
 /*
  * Initialize a reserved page unconditionally, finding its zone first.
  */
@@ -1011,21 +1025,6 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	page_folio(page)->pgmap = pgmap;
 	page->zone_device_data = NULL;
 
-	/*
-	 * Mark the block movable so that blocks are reserved for
-	 * movable at startup. This will force kernel allocations
-	 * to reserve their blocks rather than leaking throughout
-	 * the address space during boot when many long-lived
-	 * kernel allocations are made.
-	 *
-	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-	 * because this is done early in section_activate()
-	 */
-	if (pageblock_aligned(pfn)) {
-		init_pageblock_migratetype(page, MIGRATE_MOVABLE, false);
-		cond_resched();
-	}
-
 	/*
 	 * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
 	 * directly to the driver page allocator which will set the page count
@@ -1122,6 +1121,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 
+		if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
+			cond_resched();
+
 		if (pfns_per_compound == 1)
 			continue;
 
@@ -1129,6 +1131,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 				     compound_nr_pages(altmap, pgmap));
 	}
 
+	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);
+
 	pr_debug("%s initialised %lu pages in %ums\n", __func__,
 		nr_pages, jiffies_to_msecs(jiffies - start));
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v7 6/6] mm/mm_init: Fix uninitialized struct pages for ZONE_DEVICE
       [not found] <20260426092640.375967-1-songmuchun@bytedance.com>
                   ` (3 preceding siblings ...)
  2026-04-26  9:26 ` [PATCH v7 5/6] mm/mm_init: Fix pageblock migratetype for ZONE_DEVICE compound pages Muchun Song
@ 2026-04-26  9:26 ` Muchun Song
  4 siblings, 0 replies; 9+ messages in thread
From: Muchun Song @ 2026-04-26  9:26 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, Muchun Song, stable

If DAX memory is hotplugged into an unoccupied subsection of an early
section, section_activate() reuses the unoptimized boot memmap.
However, compound_nr_pages() still assumes that vmemmap optimization
is in effect and returns the reduced struct page count, so only that
many struct pages are initialized. As a result, the remaining tail
struct pages are left uninitialized, which can later lead to unexpected
behavior or crashes.

Fix this by treating early sections as unoptimized when calculating how
many struct pages to initialize.
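
As a rough userspace sketch of the arithmetic (the page size,
sizeof(struct page) and VMEMMAP_RESERVE_NR values below are assumptions
for illustration):

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL	/* assumed sizeof(struct page) */
#define VMEMMAP_RESERVE_NR	2UL	/* assumed */

int main(void)
{
	/* a 2 MiB DAX compound page spans 512 base pages */
	unsigned long pgmap_vmemmap_nr = 512;
	/*
	 * struct pages initialized per compound page when the vmemmap is
	 * (wrongly) assumed to be optimized
	 */
	unsigned long optimized = VMEMMAP_RESERVE_NR * (PAGE_SIZE / STRUCT_PAGE_SIZE);

	printf("initialized: %lu, needed on an early section: %lu, left uninitialized: %lu\n",
	       optimized, pgmap_vmemmap_nr, pgmap_vmemmap_nr - optimized);
	return 0;
}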

Fixes: 6fd3620b3428 ("mm/page_alloc: reuse tail struct pages for compound devmaps")
Cc: stable@vger.kernel.org
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/mm_init.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index cfc76953e249..bd466a3c10c8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1055,10 +1055,17 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
  * of how the sparse_vmemmap internals handle compound pages in the lack
  * of an altmap. See vmemmap_populate_compound_pages().
  */
-static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
+static inline unsigned long compound_nr_pages(unsigned long pfn,
+					      struct vmem_altmap *altmap,
 					      struct dev_pagemap *pgmap)
 {
-	if (!vmemmap_can_optimize(altmap, pgmap))
+	/*
+	 * If DAX memory is hot-plugged into an unoccupied subsection
+	 * of an early section, the unoptimized boot memmap is reused.
+	 * See section_activate().
+	 */
+	if (early_section(__pfn_to_section(pfn)) ||
+	    !vmemmap_can_optimize(altmap, pgmap))
 		return pgmap_vmemmap_nr(pgmap);
 
 	return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
@@ -1128,7 +1135,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     compound_nr_pages(altmap, pgmap));
+				     compound_nr_pages(pfn, altmap, pgmap));
 	}
 
 	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization
  2026-04-26  9:26 ` [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization Muchun Song
@ 2026-04-27 10:17   ` David Hildenbrand (Arm)
  2026-04-28  2:21     ` Muchun Song
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-27 10:17 UTC (permalink / raw)
  To: Muchun Song, Andrew Morton, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, aneesh.kumar, joao.m.martins, linux-mm,
	linuxppc-dev, linux-kernel, stable

On 4/26/26 11:26, Muchun Song wrote:
> When vmemmap optimization is enabled for DAX, the nr_memmap_pages
> counter in /proc/vmstat is incorrect. The current code always accounts
> for the full, non-optimized vmemmap size, but vmemmap optimization
> reduces the actual number of vmemmap pages by reusing tail pages. This
> causes the system to overcount vmemmap usage, leading to inaccurate
> page statistics in /proc/vmstat.
> 
> Fix this by introducing section_nr_vmemmap_pages(), which returns the exact
> vmemmap page count for a given pfn range based on whether optimization
> is in effect.
> 
> Fixes: 15995a352474 ("mm: report per-page metadata information")
> Cc: stable@vger.kernel.org
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Acked-by: Oscar Salvador <osalvador@suse.de>
> ---
> v6 -> v7:
> - Refine the alignment assertions in section_nr_vmemmap_pages().
> ---
>  mm/sparse-vmemmap.c | 34 ++++++++++++++++++++++++++++++----
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 3340f6d30b01..01f448607bad 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -652,6 +652,31 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>  	}
>  }
>  
> +static int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
> +		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
> +{
> +	const unsigned int order = pgmap ? pgmap->vmemmap_shift : 0;
> +	const unsigned long pages_per_compound = 1UL << order;
> +
> +	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
> +
> +	if (!vmemmap_can_optimize(altmap, pgmap))
> +		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
> +
> +	if (order < PFN_SECTION_SHIFT) {
> +		VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
> +		return VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;
> +	}
> +
> +	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
> +	VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);

I would just have done that at the very top, as this check applies to all cases.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization
  2026-04-27 10:17   ` David Hildenbrand (Arm)
@ 2026-04-28  2:21     ` Muchun Song
  2026-04-28  7:00       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 9+ messages in thread
From: Muchun Song @ 2026-04-28  2:21 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Muchun Song, Andrew Morton, Oscar Salvador, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, aneesh.kumar, joao.m.martins,
	linux-mm, linuxppc-dev, linux-kernel, stable



> On Apr 27, 2026, at 18:17, David Hildenbrand (Arm) <david@kernel.org> wrote:
> 
> On 4/26/26 11:26, Muchun Song wrote:
>> When vmemmap optimization is enabled for DAX, the nr_memmap_pages
>> counter in /proc/vmstat is incorrect. The current code always accounts
>> for the full, non-optimized vmemmap size, but vmemmap optimization
>> reduces the actual number of vmemmap pages by reusing tail pages. This
>> causes the system to overcount vmemmap usage, leading to inaccurate
>> page statistics in /proc/vmstat.
>> 
>> Fix this by introducing section_nr_vmemmap_pages(), which returns the exact
>> vmemmap page count for a given pfn range based on whether optimization
>> is in effect.
>> 
>> Fixes: 15995a352474 ("mm: report per-page metadata information")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> Acked-by: Oscar Salvador <osalvador@suse.de>
>> ---
>> v6 -> v7:
>> - Refine the alignment assertions in section_nr_vmemmap_pages().
>> ---
>> mm/sparse-vmemmap.c | 34 ++++++++++++++++++++++++++++++----
>> 1 file changed, 30 insertions(+), 4 deletions(-)
>> 
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 3340f6d30b01..01f448607bad 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -652,6 +652,31 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>> }
>> }
>> 
>> +static int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
>> + 		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
>> +{
>> + 	const unsigned int order = pgmap ? pgmap->vmemmap_shift : 0;
>> + 	const unsigned long pages_per_compound = 1UL << order;
>> +
>> + 	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
>> +
>> + 	if (!vmemmap_can_optimize(altmap, pgmap))
>> + 		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
>> +
>> + 	if (order < PFN_SECTION_SHIFT) {
>> + 		VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
>> + 		return VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;
>> + 	}
>> +
>> + 	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
>> + 	VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
> 
> I would just have done that at the very top, as this check applies to all cases.

My initial reasoning was that the current formula holds for compound pages smaller
than the section size, and we only need to impose limits when the compound page size
exceeds it. While the current callers of section_nr_vmemmap_pages() don't pass ranges
larger than a section, this will change in the future (see [1]).

I might have been overthinking the future-proofing, which led to this specific
implementation. However, I’m inclined to keep it as is for now, as I'll be updating
that series [1] soon and it will involve further changes to section_nr_vmemmap_pages().
That said, I'd love to hear your thoughts before I proceed.

[1] https://lore.kernel.org/linux-mm/20260405125240.2558577-43-songmuchun@bytedance.com/


> 
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Thanks.

> 
> -- 
> Cheers,
> 
> David



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization
  2026-04-28  2:21     ` Muchun Song
@ 2026-04-28  7:00       ` David Hildenbrand (Arm)
  2026-04-28  7:24         ` Muchun Song
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-28  7:00 UTC (permalink / raw)
  To: Muchun Song
  Cc: Muchun Song, Andrew Morton, Oscar Salvador, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, aneesh.kumar, joao.m.martins,
	linux-mm, linuxppc-dev, linux-kernel, stable

On 4/28/26 04:21, Muchun Song wrote:
> 
> 
>> On Apr 27, 2026, at 18:17, David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>> On 4/26/26 11:26, Muchun Song wrote:
>>> When vmemmap optimization is enabled for DAX, the nr_memmap_pages
>>> counter in /proc/vmstat is incorrect. The current code always accounts
>>> for the full, non-optimized vmemmap size, but vmemmap optimization
>>> reduces the actual number of vmemmap pages by reusing tail pages. This
>>> causes the system to overcount vmemmap usage, leading to inaccurate
>>> page statistics in /proc/vmstat.
>>>
>>> Fix this by introducing section_nr_vmemmap_pages(), which returns the exact
>>> vmemmap page count for a given pfn range based on whether optimization
>>> is in effect.
>>>
>>> Fixes: 15995a352474 ("mm: report per-page metadata information")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> Acked-by: Oscar Salvador <osalvador@suse.de>
>>> ---
>>> v6 -> v7:
>>> - Refine the alignment assertions in section_nr_vmemmap_pages().
>>> ---
>>> mm/sparse-vmemmap.c | 34 ++++++++++++++++++++++++++++++----
>>> 1 file changed, 30 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>>> index 3340f6d30b01..01f448607bad 100644
>>> --- a/mm/sparse-vmemmap.c
>>> +++ b/mm/sparse-vmemmap.c
>>> @@ -652,6 +652,31 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>>> }
>>> }
>>>
>>> +static int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
>>> + 		struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
>>> +{
>>> + 	const unsigned int order = pgmap ? pgmap->vmemmap_shift : 0;
>>> + 	const unsigned long pages_per_compound = 1UL << order;
>>> +
>>> + 	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
>>> +
>>> + 	if (!vmemmap_can_optimize(altmap, pgmap))
>>> + 		return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
>>> +
>>> + 	if (order < PFN_SECTION_SHIFT) {
>>> + 		VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
>>> + 		return VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;
>>> + 	}
>>> +
>>> + 	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
>>> + 	VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
>>
>> I would just have done that at the very top, as this check applies to all cases.
> 
> My initial reasoning was that the current formula holds for compound pages smaller
> than the section size, and we only need to impose limits when the page size exceeds
> it. While the current callers of section_nr_vmemmap_pages() don't pass sizes larger
> than a section, this will change in the future (see [1]).

A function that is called *section_* will get a range that exceeds a section?

That sounds conceptually wrong, no?


-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization
  2026-04-28  7:00       ` David Hildenbrand (Arm)
@ 2026-04-28  7:24         ` Muchun Song
  0 siblings, 0 replies; 9+ messages in thread
From: Muchun Song @ 2026-04-28  7:24 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Muchun Song, Andrew Morton, Oscar Salvador, Michael Ellerman,
	Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Nicholas Piggin, Christophe Leroy, aneesh.kumar, joao.m.martins,
	linux-mm, linuxppc-dev, linux-kernel, stable



> On Apr 28, 2026, at 15:00, David Hildenbrand (Arm) <david@kernel.org> wrote:
> 
> On 4/28/26 04:21, Muchun Song wrote:
>> 
>> 
>>> On Apr 27, 2026, at 18:17, David Hildenbrand (Arm) <david@kernel.org> wrote:
>>> 
>>> On 4/26/26 11:26, Muchun Song wrote:
>>>> When vmemmap optimization is enabled for DAX, the nr_memmap_pages
>>>> counter in /proc/vmstat is incorrect. The current code always accounts
>>>> for the full, non-optimized vmemmap size, but vmemmap optimization
>>>> reduces the actual number of vmemmap pages by reusing tail pages. This
>>>> causes the system to overcount vmemmap usage, leading to inaccurate
>>>> page statistics in /proc/vmstat.
>>>> 
>>>> Fix this by introducing section_nr_vmemmap_pages(), which returns the exact
>>>> vmemmap page count for a given pfn range based on whether optimization
>>>> is in effect.
>>>> 
>>>> Fixes: 15995a352474 ("mm: report per-page metadata information")
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>>> Acked-by: Oscar Salvador <osalvador@suse.de>
>>>> ---
>>>> v6 -> v7:
>>>> - Refine the alignment assertions in section_nr_vmemmap_pages().
>>>> ---
>>>> mm/sparse-vmemmap.c | 34 ++++++++++++++++++++++++++++++----
>>>> 1 file changed, 30 insertions(+), 4 deletions(-)
>>>> 
>>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>>>> index 3340f6d30b01..01f448607bad 100644
>>>> --- a/mm/sparse-vmemmap.c
>>>> +++ b/mm/sparse-vmemmap.c
>>>> @@ -652,6 +652,31 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>>>> }
>>>> }
>>>> 
>>>> +static int __meminit section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages,
>>>> +  struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
>>>> +{
>>>> +  const unsigned int order = pgmap ? pgmap->vmemmap_shift : 0;
>>>> +  const unsigned long pages_per_compound = 1UL << order;
>>>> +
>>>> +  VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
>>>> +
>>>> +  if (!vmemmap_can_optimize(altmap, pgmap))
>>>> +  return DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
>>>> +
>>>> +  if (order < PFN_SECTION_SHIFT) {
>>>> +  VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
>>>> +  return VMEMMAP_RESERVE_NR * nr_pages / pages_per_compound;
>>>> +  }
>>>> +
>>>> +  VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
>>>> +  VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION);
>>> 
>>> I would just have done that at the very top, as this check applies to all cases.
>> 
>> My initial reasoning was that the current formula holds for compound pages smaller
>> than the section size, and we only need to impose limits when the page size exceeds
>> it. While the current callers of section_nr_vmemmap_pages() don't pass sizes larger
>> than a section, this will change in the future (see [1]).
> 
> A function that is called *section_* will get a range that exceeds a section?
> 
> That sounds conceptually wrong, no?

It does seem a bit ambiguous. I will rename it to something more appropriate if I
expand its functionality in the future. For this series, I will send a v8 that moves
VM_WARN_ON_ONCE(nr_pages > PAGES_PER_SECTION); to the top of this function.

Thanks.

> 
> 
> -- 
> Cheers,
> 
> David



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-04-28  7:25 UTC | newest]

Thread overview: 9+ messages
     [not found] <20260426092640.375967-1-songmuchun@bytedance.com>
2026-04-26  9:26 ` [PATCH v7 1/6] mm/sparse-vmemmap: Fix vmemmap accounting underflow Muchun Song
2026-04-26  9:26 ` [PATCH v7 2/6] mm/memory_hotplug: Fix incorrect altmap passing in error path Muchun Song
2026-04-26  9:26 ` [PATCH v7 4/6] mm/sparse-vmemmap: Fix DAX vmemmap accounting with optimization Muchun Song
2026-04-27 10:17   ` David Hildenbrand (Arm)
2026-04-28  2:21     ` Muchun Song
2026-04-28  7:00       ` David Hildenbrand (Arm)
2026-04-28  7:24         ` Muchun Song
2026-04-26  9:26 ` [PATCH v7 5/6] mm/mm_init: Fix pageblock migratetype for ZONE_DEVICE compound pages Muchun Song
2026-04-26  9:26 ` [PATCH v7 6/6] mm/mm_init: Fix uninitialized struct pages for ZONE_DEVICE Muchun Song
