public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages
@ 2026-04-28 10:29 David Hildenbrand (Arm)
  2026-04-28 10:34 ` David Hildenbrand (Arm)
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-28 10:29 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Mike Rapoport (Microsoft), Jason Gunthorpe, Lu Baolu,
	Andrew Morton
  Cc: linux-kernel, linux-mm, stable, David Hildenbrand (Arm)

In commit bf9e4e30f353 ("x86/mm: use pagetable_free()"), we switched
from freeing non-boot page tables through __free_pages() to
pagetable_free().

However, the function is also called to free vmemmap pages.

Given that vmemmap pages are not page tables, the page_ptdesc(page)
conversion is already wrong. But worse, pagetable_free() calls

	__free_pages(page, compound_order(page));

As vmemmap pages are not compound pages (see vmemmap_alloc_block()) --
except for HVO, which doesn't apply here -- we will only free the first
page when freeing a PMD-sized vmemmap page, leaking the other ones.

Fix it by properly decoupling pagetable and vmemmap freeing.
free_pagetable() no longer has to mess with SECTION_INFO, as only the
vmemmap is marked like that in register_page_bootmem_memmap().

While at it, just wire up the altmap parameter for remove_pte_table().
Also, the indentation in remove_pmd_table() is messed up, let's fix that
while touching it.

Note that we'll try to get rid of that bootmem info handling soon. For
now, we'll handle it similarly to free_pagetable(), just avoiding the
ifdef.

Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
Reproduced and tested with a simple VM with a virtio-mem device,
repeatedly adding and removing memory.

Found by code inspection while working on bootmem_info removal.
---
 arch/x86/mm/init_64.c | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index df2261fa4f98..8d03e44a7fb9 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1014,7 +1014,7 @@ static void __meminit free_pagetable(struct page *page, int order)
 #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 		enum bootmem_type type = bootmem_type(page);
 
-		if (type == SECTION_INFO || type == MIX_SECTION_INFO) {
+		if (type == MIX_SECTION_INFO) {
 			while (nr_pages--)
 				put_page_bootmem(page++);
 		} else {
@@ -1028,13 +1028,24 @@ static void __meminit free_pagetable(struct page *page, int order)
 	}
 }
 
-static void __meminit free_hugepage_table(struct page *page,
+static void __meminit free_vmemmap_pages(struct page *page, unsigned int order,
 		struct vmem_altmap *altmap)
 {
-	if (altmap)
-		vmem_altmap_free(altmap, PMD_SIZE / PAGE_SIZE);
-	else
-		free_pagetable(page, get_order(PMD_SIZE));
+	if (altmap) {
+		vmem_altmap_free(altmap, 1u << order);
+	} else if (PageReserved(page)) {
+		unsigned long nr_pages = 1 << order;
+
+		if (IS_ENABLED(CONFIG_HAVE_BOOTMEM_INFO_NODE) &&
+		    bootmem_type(page) == SECTION_INFO) {
+			while (nr_pages--)
+				put_page_bootmem(page++);
+		} else {
+			free_reserved_pages(page, nr_pages);
+		}
+	} else {
+		__free_pages(page, order);
+	}
 }
 
 static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
@@ -1093,7 +1104,7 @@ static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
 
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
-		 bool direct)
+		 bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long next, pages = 0;
 	pte_t *pte;
@@ -1118,7 +1129,7 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 			return;
 
 		if (!direct)
-			free_pagetable(pte_page(*pte), 0);
+			free_vmemmap_pages(pte_page(*pte), 0, altmap);
 
 		spin_lock(&init_mm.page_table_lock);
 		pte_clear(&init_mm, addr, pte);
@@ -1153,25 +1164,25 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 			if (IS_ALIGNED(addr, PMD_SIZE) &&
 			    IS_ALIGNED(next, PMD_SIZE)) {
 				if (!direct)
-					free_hugepage_table(pmd_page(*pmd),
-							    altmap);
+					free_vmemmap_pages(pmd_page(*pmd),
+							   PMD_ORDER, altmap);
 
 				spin_lock(&init_mm.page_table_lock);
 				pmd_clear(pmd);
 				spin_unlock(&init_mm.page_table_lock);
 				pages++;
 			} else if (vmemmap_pmd_is_unused(addr, next)) {
-					free_hugepage_table(pmd_page(*pmd),
-							    altmap);
-					spin_lock(&init_mm.page_table_lock);
-					pmd_clear(pmd);
-					spin_unlock(&init_mm.page_table_lock);
+				free_vmemmap_pages(pmd_page(*pmd), PMD_ORDER,
+						   altmap);
+				spin_lock(&init_mm.page_table_lock);
+				pmd_clear(pmd);
+				spin_unlock(&init_mm.page_table_lock);
 			}
 			continue;
 		}
 
 		pte_base = (pte_t *)pmd_page_vaddr(*pmd);
-		remove_pte_table(pte_base, addr, next, direct);
+		remove_pte_table(pte_base, addr, next, direct, altmap);
 		free_pte_table(pte_base, pmd);
 	}
 

---

base-commit: a2ddbfd1af0f54ea84bf17f0400088815d012e8d

change-id: 20260428-vmemmap-ab4b949aa727

--

Cheers,

David


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages
  2026-04-28 10:29 [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages David Hildenbrand (Arm)
@ 2026-04-28 10:34 ` David Hildenbrand (Arm)
  2026-04-28 13:20 ` Lance Yang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-28 10:34 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Mike Rapoport (Microsoft), Jason Gunthorpe, Lu Baolu,
	Andrew Morton
  Cc: linux-kernel, linux-mm, stable


>  
> -static void __meminit free_hugepage_table(struct page *page,
> +static void __meminit free_vmemmap_pages(struct page *page, unsigned int order,
>  		struct vmem_altmap *altmap)
>  {
> -	if (altmap)
> -		vmem_altmap_free(altmap, PMD_SIZE / PAGE_SIZE);
> -	else
> -		free_pagetable(page, get_order(PMD_SIZE));
> +	if (altmap) {
> +		vmem_altmap_free(altmap, 1u << order);
> +	} else if (PageReserved(page)) {
> +		unsigned long nr_pages = 1 << order;

Staring at this code again, I should move that up so I can reuse it in the
vmem_altmap_free() call.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages
  2026-04-28 10:29 [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages David Hildenbrand (Arm)
  2026-04-28 10:34 ` David Hildenbrand (Arm)
@ 2026-04-28 13:20 ` Lance Yang
  2026-04-28 19:36   ` David Hildenbrand (Arm)
  2026-04-28 20:47 ` Mike Rapoport
  2026-04-29  2:12 ` Lance Yang
  3 siblings, 1 reply; 8+ messages in thread
From: Lance Yang @ 2026-04-28 13:20 UTC (permalink / raw)
  To: david
  Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, hpa, rppt, jgg,
	baolu.lu, akpm, linux-kernel, linux-mm, stable, Lance Yang


On Tue, Apr 28, 2026 at 12:29:36PM +0200, David Hildenbrand (Arm) wrote:
>In commit bf9e4e30f353 ("x86/mm: use pagetable_free()"), we switched
>from freeing non-boot page tables through __free_pages() to
>pagetable_free().
>
>However, the function is also called to free vmemmap pages.
>
>Given that vmemmap pages are not page tables, already the page_ptdesc(page)
>is wrong. But worse, pagetable_free() calls
>
>	__free_pages(page, compound_order(page));
>
>As vmemmap pages are not compound pages (see vmemmap_alloc_block()) --
>except for HVO, which doesn't apply here -- we will only free the first
>page when freeing a PMD-sized vmemmap page, leaking the other ones.
>
>Fix it by properly decoupling pagetable and vmemmap freeing.
>free_pagetable() no longer has to mess with SECTION_INFO, as only the
>vmemmap is marked like that in register_page_bootmem_memmap().
>
>While at it, just wire up the altmap parameter for remove_pte_table().
>Also, the indentation in remove_pmd_table() is messed up, let's fix that
>while touching it.
>
>Note that we'll try to get rid of that bootmem info handling soon. For
>now, we'll handle it similar to free_pagetable(), just avoiding the
>ifdef.
>
>Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
>Cc: stable@vger.kernel.org
>Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>---
>Reproduced and tested with a simple VM with a virtio-mem device,
>repeatedly adding and removing memory.
>
>Found by code inspection while working on bootmem_info removal.
>---

Cool! I just reproduced the leak with QEMU pc-dimm memory hotplug as
well.

Without the fix, nr_free_pages kept dropping after the hotplugged memory
was removed again. With the fix applied, it stays stable over repeated
add/remove cycles :)

Tested-by: Lance Yang <lance.yang@linux.dev>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages
  2026-04-28 13:20 ` Lance Yang
@ 2026-04-28 19:36   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-28 19:36 UTC (permalink / raw)
  To: Lance Yang
  Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, hpa, rppt, jgg,
	baolu.lu, akpm, linux-kernel, linux-mm, stable

On 4/28/26 15:20, Lance Yang wrote:
> 
> On Tue, Apr 28, 2026 at 12:29:36PM +0200, David Hildenbrand (Arm) wrote:
>> In commit bf9e4e30f353 ("x86/mm: use pagetable_free()"), we switched
>> from freeing non-boot page tables through __free_pages() to
>> pagetable_free().
>>
>> However, the function is also called to free vmemmap pages.
>>
>> Given that vmemmap pages are not page tables, already the page_ptdesc(page)
>> is wrong. But worse, pagetable_free() calls
>>
>> 	__free_pages(page, compound_order(page));
>>
>> As vmemmap pages are not compound pages (see vmemmap_alloc_block()) --
>> except for HVO, which doesn't apply here -- we will only free the first
>> page when freeing a PMD-sized vmemmap page, leaking the other ones.
>>
>> Fix it by properly decoupling pagetable and vmemmap freeing.
>> free_pagetable() no longer has to mess with SECTION_INFO, as only the
>> vmemmap is marked like that in register_page_bootmem_memmap().
>>
>> While at it, just wire up the altmap parameter for remove_pte_table().
>> Also, the indentation in remove_pmd_table() is messed up, let's fix that
>> while touching it.
>>
>> Note that we'll try to get rid of that bootmem info handling soon. For
>> now, we'll handle it similar to free_pagetable(), just avoiding the
>> ifdef.
>>
>> Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>> ---
>> Reproduced and tested with a simple VM with a virtio-mem device,
>> repeatedly adding and removing memory.
>>
>> Found by code inspection while working on bootmem_info removal.
>> ---
> 
> Cool! I just reproduced the leak with QEMU pc-dimm memory hotplug as
> well.
> 
> Without the fix, nr_free_pages kept dropping after the hotplugged memory
> was removed again. With the fix applied, it stays stable over repeated
> add/remove cycles :)
> 
> Tested-by: Lance Yang <lance.yang@linux.dev>

Thanks Lance! I'll wait a bit more for feedback, and then resend v2 with the
"nr_pages" simplification.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages
  2026-04-28 10:29 [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages David Hildenbrand (Arm)
  2026-04-28 10:34 ` David Hildenbrand (Arm)
  2026-04-28 13:20 ` Lance Yang
@ 2026-04-28 20:47 ` Mike Rapoport
  2026-04-29  2:12 ` Lance Yang
  3 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2026-04-28 20:47 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Jason Gunthorpe, Lu Baolu, Andrew Morton, linux-kernel, linux-mm,
	stable

On Tue, Apr 28, 2026 at 12:29:36PM +0200, David Hildenbrand (Arm) wrote:
> In commit bf9e4e30f353 ("x86/mm: use pagetable_free()"), we switched
> from freeing non-boot page tables through __free_pages() to
> pagetable_free().
> 
> However, the function is also called to free vmemmap pages.
> 
> Given that vmemmap pages are not page tables, already the page_ptdesc(page)
> is wrong. But worse, pagetable_free() calls
> 
> 	__free_pages(page, compound_order(page));
> 
> As vmemmap pages are not compound pages (see vmemmap_alloc_block()) --
> except for HVO, which doesn't apply here -- we will only free the first
> page when freeing a PMD-sized vmemmap page, leaking the other ones.
> 
> Fix it by properly decoupling pagetable and vmemmap freeing.
> free_pagetable() no longer has to mess with SECTION_INFO, as only the
> vmemmap is marked like that in register_page_bootmem_memmap().
> 
> While at it, just wire up the altmap parameter for remove_pte_table().
> Also, the indentation in remove_pmd_table() is messed up, let's fix that
> while touching it.
> 
> Note that we'll try to get rid of that bootmem info handling soon. For
> now, we'll handle it similar to free_pagetable(), just avoiding the
> ifdef.
> 
> Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
> Cc: stable@vger.kernel.org
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
> Reproduced and tested with a simple VM with a virtio-mem device,
> repeatedly adding and removing memory.
> 
> Found by code inspection while working on bootmem_info removal.
> ---
>  arch/x86/mm/init_64.c | 43 +++++++++++++++++++++++++++----------------
>  1 file changed, 27 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index df2261fa4f98..8d03e44a7fb9 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1014,7 +1014,7 @@ static void __meminit free_pagetable(struct page *page, int order)
>  #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
>  		enum bootmem_type type = bootmem_type(page);
>  
> -		if (type == SECTION_INFO || type == MIX_SECTION_INFO) {
> +		if (type == MIX_SECTION_INFO) {
>  			while (nr_pages--)
>  				put_page_bootmem(page++);
>  		} else {
> @@ -1028,13 +1028,24 @@ static void __meminit free_pagetable(struct page *page, int order)
>  	}
>  }
>  
> -static void __meminit free_hugepage_table(struct page *page,
> +static void __meminit free_vmemmap_pages(struct page *page, unsigned int order,
>  		struct vmem_altmap *altmap)
>  {
> -	if (altmap)
> -		vmem_altmap_free(altmap, PMD_SIZE / PAGE_SIZE);
> -	else
> -		free_pagetable(page, get_order(PMD_SIZE));
> +	if (altmap) {
> +		vmem_altmap_free(altmap, 1u << order);
> +	} else if (PageReserved(page)) {
> +		unsigned long nr_pages = 1 << order;
> +
> +		if (IS_ENABLED(CONFIG_HAVE_BOOTMEM_INFO_NODE) &&
> +		    bootmem_type(page) == SECTION_INFO) {
> +			while (nr_pages--)
> +				put_page_bootmem(page++);
> +		} else {
> +			free_reserved_pages(page, nr_pages);
> +		}
> +	} else {
> +		__free_pages(page, order);
> +	}
>  }
>  
>  static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> @@ -1093,7 +1104,7 @@ static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
>  
>  static void __meminit
>  remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> -		 bool direct)
> +		 bool direct, struct vmem_altmap *altmap)
>  {
>  	unsigned long next, pages = 0;
>  	pte_t *pte;
> @@ -1118,7 +1129,7 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>  			return;
>  
>  		if (!direct)
> -			free_pagetable(pte_page(*pte), 0);
> +			free_vmemmap_pages(pte_page(*pte), 0, altmap);
>  
>  		spin_lock(&init_mm.page_table_lock);
>  		pte_clear(&init_mm, addr, pte);
> @@ -1153,25 +1164,25 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>  			if (IS_ALIGNED(addr, PMD_SIZE) &&
>  			    IS_ALIGNED(next, PMD_SIZE)) {
>  				if (!direct)
> -					free_hugepage_table(pmd_page(*pmd),
> -							    altmap);
> +					free_vmemmap_pages(pmd_page(*pmd),
> +							   PMD_ORDER, altmap);
>  
>  				spin_lock(&init_mm.page_table_lock);
>  				pmd_clear(pmd);
>  				spin_unlock(&init_mm.page_table_lock);
>  				pages++;
>  			} else if (vmemmap_pmd_is_unused(addr, next)) {
> -					free_hugepage_table(pmd_page(*pmd),
> -							    altmap);
> -					spin_lock(&init_mm.page_table_lock);
> -					pmd_clear(pmd);
> -					spin_unlock(&init_mm.page_table_lock);
> +				free_vmemmap_pages(pmd_page(*pmd), PMD_ORDER,
> +						   altmap);
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
>  			}
>  			continue;
>  		}
>  
>  		pte_base = (pte_t *)pmd_page_vaddr(*pmd);
> -		remove_pte_table(pte_base, addr, next, direct);
> +		remove_pte_table(pte_base, addr, next, direct, altmap);
>  		free_pte_table(pte_base, pmd);
>  	}
>  
> 
> ---
> 
> base-commit: a2ddbfd1af0f54ea84bf17f0400088815d012e8d
> 
> change-id: 20260428-vmemmap-ab4b949aa727
> 
> --
> 
> Cheers,
> 
> David
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages
  2026-04-28 10:29 [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages David Hildenbrand (Arm)
                   ` (2 preceding siblings ...)
  2026-04-28 20:47 ` Mike Rapoport
@ 2026-04-29  2:12 ` Lance Yang
  2026-04-29  2:30   ` Lance Yang
  2026-04-29  5:50   ` David Hildenbrand (Arm)
  3 siblings, 2 replies; 8+ messages in thread
From: Lance Yang @ 2026-04-29  2:12 UTC (permalink / raw)
  To: david
  Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, hpa, rppt, jgg,
	baolu.lu, akpm, linux-kernel, linux-mm, stable, Lance Yang


On Tue, Apr 28, 2026 at 12:29:36PM +0200, David Hildenbrand (Arm) wrote:
>In commit bf9e4e30f353 ("x86/mm: use pagetable_free()"), we switched
>from freeing non-boot page tables through __free_pages() to
>pagetable_free().
>
>However, the function is also called to free vmemmap pages.
>
>Given that vmemmap pages are not page tables, already the page_ptdesc(page)
>is wrong. But worse, pagetable_free() calls
>
>	__free_pages(page, compound_order(page));
>
>As vmemmap pages are not compound pages (see vmemmap_alloc_block()) --
>except for HVO, which doesn't apply here -- we will only free the first
>page when freeing a PMD-sized vmemmap page, leaking the other ones.
>
>Fix it by properly decoupling pagetable and vmemmap freeing.
>free_pagetable() no longer has to mess with SECTION_INFO, as only the
>vmemmap is marked like that in register_page_bootmem_memmap().
>
>While at it, just wire up the altmap parameter for remove_pte_table().
>Also, the indentation in remove_pmd_table() is messed up, let's fix that
>while touching it.

One thing I'm not sure about is passing altmap down into
remove_pte_table().

Do we actually know that a non-NULL altmap means that the vmemmap
backing page came from that altmap?

On x86 we still have in vmemmap_populate():

	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
		err = vmemmap_populate_basepages(start, end, node, NULL);

So for smaller-than-section vmemmap ranges, even if the caller has an
altmap, the backing pages are allocated from normal memory. But with
this fix the PTE removal path would now call vmem_altmap_free() just
because altmap is non-NULL, and would not free the actual backing page,
IIUC :)

Maybe free_vmemmap_pages() should first check that the backing page is
really inside the altmap range before using vmem_altmap_free()?

Hopefully I didn't miss anything :)

Thanks,
Lance

>Note that we'll try to get rid of that bootmem info handling soon. For
>now, we'll handle it similar to free_pagetable(), just avoiding the
>ifdef.
>
>Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
>Cc: stable@vger.kernel.org
>Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>---
>Reproduced and tested with a simple VM with a virtio-mem device,
>repeatedly adding and removing memory.
>
>Found by code inspection while working on bootmem_info removal.
>---
> arch/x86/mm/init_64.c | 43 +++++++++++++++++++++++++++----------------
> 1 file changed, 27 insertions(+), 16 deletions(-)
>
>diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>index df2261fa4f98..8d03e44a7fb9 100644
>--- a/arch/x86/mm/init_64.c
>+++ b/arch/x86/mm/init_64.c
>@@ -1014,7 +1014,7 @@ static void __meminit free_pagetable(struct page *page, int order)
> #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
> 		enum bootmem_type type = bootmem_type(page);
> 
>-		if (type == SECTION_INFO || type == MIX_SECTION_INFO) {
>+		if (type == MIX_SECTION_INFO) {
> 			while (nr_pages--)
> 				put_page_bootmem(page++);
> 		} else {
>@@ -1028,13 +1028,24 @@ static void __meminit free_pagetable(struct page *page, int order)
> 	}
> }
> 
>-static void __meminit free_hugepage_table(struct page *page,
>+static void __meminit free_vmemmap_pages(struct page *page, unsigned int order,
> 		struct vmem_altmap *altmap)
> {
>-	if (altmap)
>-		vmem_altmap_free(altmap, PMD_SIZE / PAGE_SIZE);
>-	else
>-		free_pagetable(page, get_order(PMD_SIZE));
>+	if (altmap) {
>+		vmem_altmap_free(altmap, 1u << order);
>+	} else if (PageReserved(page)) {
>+		unsigned long nr_pages = 1 << order;
>+
>+		if (IS_ENABLED(CONFIG_HAVE_BOOTMEM_INFO_NODE) &&
>+		    bootmem_type(page) == SECTION_INFO) {
>+			while (nr_pages--)
>+				put_page_bootmem(page++);
>+		} else {
>+			free_reserved_pages(page, nr_pages);
>+		}
>+	} else {
>+		__free_pages(page, order);
>+	}
> }
> 
> static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>@@ -1093,7 +1104,7 @@ static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
> 
> static void __meminit
> remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>-		 bool direct)
>+		 bool direct, struct vmem_altmap *altmap)
> {
> 	unsigned long next, pages = 0;
> 	pte_t *pte;
>@@ -1118,7 +1129,7 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> 			return;
> 
> 		if (!direct)
>-			free_pagetable(pte_page(*pte), 0);
>+			free_vmemmap_pages(pte_page(*pte), 0, altmap);
> 
> 		spin_lock(&init_mm.page_table_lock);
> 		pte_clear(&init_mm, addr, pte);
>@@ -1153,25 +1164,25 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> 			if (IS_ALIGNED(addr, PMD_SIZE) &&
> 			    IS_ALIGNED(next, PMD_SIZE)) {
> 				if (!direct)
>-					free_hugepage_table(pmd_page(*pmd),
>-							    altmap);
>+					free_vmemmap_pages(pmd_page(*pmd),
>+							   PMD_ORDER, altmap);
> 
> 				spin_lock(&init_mm.page_table_lock);
> 				pmd_clear(pmd);
> 				spin_unlock(&init_mm.page_table_lock);
> 				pages++;
> 			} else if (vmemmap_pmd_is_unused(addr, next)) {
>-					free_hugepage_table(pmd_page(*pmd),
>-							    altmap);
>-					spin_lock(&init_mm.page_table_lock);
>-					pmd_clear(pmd);
>-					spin_unlock(&init_mm.page_table_lock);
>+				free_vmemmap_pages(pmd_page(*pmd), PMD_ORDER,
>+						   altmap);
>+				spin_lock(&init_mm.page_table_lock);
>+				pmd_clear(pmd);
>+				spin_unlock(&init_mm.page_table_lock);
> 			}
> 			continue;
> 		}
> 
> 		pte_base = (pte_t *)pmd_page_vaddr(*pmd);
>-		remove_pte_table(pte_base, addr, next, direct);
>+		remove_pte_table(pte_base, addr, next, direct, altmap);
> 		free_pte_table(pte_base, pmd);
> 	}
> 
>
>---
>
>base-commit: a2ddbfd1af0f54ea84bf17f0400088815d012e8d
>
>change-id: 20260428-vmemmap-ab4b949aa727
>
>--
>
>Cheers,
>
>David
>
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages
  2026-04-29  2:12 ` Lance Yang
@ 2026-04-29  2:30   ` Lance Yang
  2026-04-29  5:50   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 8+ messages in thread
From: Lance Yang @ 2026-04-29  2:30 UTC (permalink / raw)
  To: david
  Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, hpa, rppt, jgg,
	baolu.lu, akpm, linux-kernel, linux-mm, stable, Lance Yang


On Wed, Apr 29, 2026 at 10:12:24AM +0800, Lance Yang wrote:
>
>On Tue, Apr 28, 2026 at 12:29:36PM +0200, David Hildenbrand (Arm) wrote:
>>In commit bf9e4e30f353 ("x86/mm: use pagetable_free()"), we switched
>>from freeing non-boot page tables through __free_pages() to
>>pagetable_free().
>>
>>However, the function is also called to free vmemmap pages.
>>
>>Given that vmemmap pages are not page tables, already the page_ptdesc(page)
>>is wrong. But worse, pagetable_free() calls
>>
>>	__free_pages(page, compound_order(page));
>>
>>As vmemmap pages are not compound pages (see vmemmap_alloc_block()) --
>>except for HVO, which doesn't apply here -- we will only free the first
>>page when freeing a PMD-sized vmemmap page, leaking the other ones.
>>
>>Fix it by properly decoupling pagetable and vmemmap freeing.
>>free_pagetable() no longer has to mess with SECTION_INFO, as only the
>>vmemmap is marked like that in register_page_bootmem_memmap().
>>
>>While at it, just wire up the altmap parameter for remove_pte_table().
>>Also, the indentation in remove_pmd_table() is messed up, let's fix that
>>while touching it.
>
>One thing I'm not sure about is passing altmap down into
>remove_pte_table().
>
>Do we actually know that a non-NULL altmap means that the vmemmap
>backing page came from that altmap?
>
>On x86 we still have in vmemmap_populate():
>
>	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
>		err = vmemmap_populate_basepages(start, end, node, NULL);
>
>So for smaller-than-section vmemmap ranges, even if the caller has an
>altmap, the backing pages are allocated from normal memory. But with
>this fix the PTE removal path would now call vmem_altmap_free() just
>because altmap is non-NULL, and would not free the actual backing page,
>IIUC :)
>
>Maybe free_vmemmap_pages() should first check that the backing page is
>really inside the altmap range before using vmem_altmap_free()?
>
>Hopefully I didn't miss anything :)

I played a bit with the following on top of this fix (untested):

---8<---
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8d03e44a7fb9..9a52f9424a07 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1028,14 +1028,33 @@ static void __meminit free_pagetable(struct page *page, int order)
 	}
 }

+static bool __meminit vmemmap_page_is_altmap(struct page *page,
+		unsigned long nr_pages, struct vmem_altmap *altmap)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+
+	if (!altmap)
+		return false;
+
+	start_pfn = altmap->base_pfn + altmap->reserve;
+	end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
+
+	if (pfn < start_pfn || pfn >= end_pfn)
+		return false;
+
+	return nr_pages <= end_pfn - pfn;
+}
+
 static void __meminit free_vmemmap_pages(struct page *page, unsigned int order,
 		struct vmem_altmap *altmap)
 {
-	if (altmap) {
-		vmem_altmap_free(altmap, 1u << order);
-	} else if (PageReserved(page)) {
-		unsigned long nr_pages = 1 << order;
+	unsigned long nr_pages = 1 << order;

+	if (vmemmap_page_is_altmap(page, nr_pages, altmap)) {
+		vmem_altmap_free(altmap, nr_pages);
+	} else if (PageReserved(page)) {
 		if (IS_ENABLED(CONFIG_HAVE_BOOTMEM_INFO_NODE) &&
 		    bootmem_type(page) == SECTION_INFO) {
 			while (nr_pages--)
--

Thanks,
Lance

>Thanks,
>Lance
>
>>Note that we'll try to get rid of that bootmem info handling soon. For
>>now, we'll handle it similar to free_pagetable(), just avoiding the
>>ifdef.
>>
>>Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
>>Cc: stable@vger.kernel.org
>>Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>>---
>>Reproduced and tested with a simple VM with a virtio-mem device,
>>repeatedly adding and removing memory.
>>
>>Found by code inspection while working on bootmem_info removal.
>>---
>> arch/x86/mm/init_64.c | 43 +++++++++++++++++++++++++++----------------
>> 1 file changed, 27 insertions(+), 16 deletions(-)
>>
>>diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>index df2261fa4f98..8d03e44a7fb9 100644
>>--- a/arch/x86/mm/init_64.c
>>+++ b/arch/x86/mm/init_64.c
>>@@ -1014,7 +1014,7 @@ static void __meminit free_pagetable(struct page *page, int order)
>> #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
>> 		enum bootmem_type type = bootmem_type(page);
>> 
>>-		if (type == SECTION_INFO || type == MIX_SECTION_INFO) {
>>+		if (type == MIX_SECTION_INFO) {
>> 			while (nr_pages--)
>> 				put_page_bootmem(page++);
>> 		} else {
>>@@ -1028,13 +1028,24 @@ static void __meminit free_pagetable(struct page *page, int order)
>> 	}
>> }
>> 
>>-static void __meminit free_hugepage_table(struct page *page,
>>+static void __meminit free_vmemmap_pages(struct page *page, unsigned int order,
>> 		struct vmem_altmap *altmap)
>> {
>>-	if (altmap)
>>-		vmem_altmap_free(altmap, PMD_SIZE / PAGE_SIZE);
>>-	else
>>-		free_pagetable(page, get_order(PMD_SIZE));
>>+	if (altmap) {
>>+		vmem_altmap_free(altmap, 1u << order);
>>+	} else if (PageReserved(page)) {
>>+		unsigned long nr_pages = 1 << order;
>>+
>>+		if (IS_ENABLED(CONFIG_HAVE_BOOTMEM_INFO_NODE) &&
>>+		    bootmem_type(page) == SECTION_INFO) {
>>+			while (nr_pages--)
>>+				put_page_bootmem(page++);
>>+		} else {
>>+			free_reserved_pages(page, nr_pages);
>>+		}
>>+	} else {
>>+		__free_pages(page, order);
>>+	}
>> }
>> 
>> static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>>@@ -1093,7 +1104,7 @@ static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
>> 
>> static void __meminit
>> remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>>-		 bool direct)
>>+		 bool direct, struct vmem_altmap *altmap)
>> {
>> 	unsigned long next, pages = 0;
>> 	pte_t *pte;
>>@@ -1118,7 +1129,7 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>> 			return;
>> 
>> 		if (!direct)
>>-			free_pagetable(pte_page(*pte), 0);
>>+			free_vmemmap_pages(pte_page(*pte), 0, altmap);
>> 
>> 		spin_lock(&init_mm.page_table_lock);
>> 		pte_clear(&init_mm, addr, pte);
>>@@ -1153,25 +1164,25 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>> 			if (IS_ALIGNED(addr, PMD_SIZE) &&
>> 			    IS_ALIGNED(next, PMD_SIZE)) {
>> 				if (!direct)
>>-					free_hugepage_table(pmd_page(*pmd),
>>-							    altmap);
>>+					free_vmemmap_pages(pmd_page(*pmd),
>>+							   PMD_ORDER, altmap);
>> 
>> 				spin_lock(&init_mm.page_table_lock);
>> 				pmd_clear(pmd);
>> 				spin_unlock(&init_mm.page_table_lock);
>> 				pages++;
>> 			} else if (vmemmap_pmd_is_unused(addr, next)) {
>>-					free_hugepage_table(pmd_page(*pmd),
>>-							    altmap);
>>-					spin_lock(&init_mm.page_table_lock);
>>-					pmd_clear(pmd);
>>-					spin_unlock(&init_mm.page_table_lock);
>>+				free_vmemmap_pages(pmd_page(*pmd), PMD_ORDER,
>>+						   altmap);
>>+				spin_lock(&init_mm.page_table_lock);
>>+				pmd_clear(pmd);
>>+				spin_unlock(&init_mm.page_table_lock);
>> 			}
>> 			continue;
>> 		}
>> 
>> 		pte_base = (pte_t *)pmd_page_vaddr(*pmd);
>>-		remove_pte_table(pte_base, addr, next, direct);
>>+		remove_pte_table(pte_base, addr, next, direct, altmap);
>> 		free_pte_table(pte_base, pmd);
>> 	}
>> 
>>
>>---
>>
>>base-commit: a2ddbfd1af0f54ea84bf17f0400088815d012e8d
>>
>>change-id: 20260428-vmemmap-ab4b949aa727
>>
>>--
>>
>>Cheers,
>>
>>David
>>
>>
>

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages
  2026-04-29  2:12 ` Lance Yang
  2026-04-29  2:30   ` Lance Yang
@ 2026-04-29  5:50   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-29  5:50 UTC (permalink / raw)
  To: Lance Yang
  Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, hpa, rppt, jgg,
	baolu.lu, akpm, linux-kernel, linux-mm, stable

On 4/29/26 04:12, Lance Yang wrote:
> 
> On Tue, Apr 28, 2026 at 12:29:36PM +0200, David Hildenbrand (Arm) wrote:
>> In commit bf9e4e30f353 ("x86/mm: use pagetable_free()"), we switched
>> from freeing non-boot page tables through __free_pages() to
>> pagetable_free().
>>
>> However, the function is also called to free vmemmap pages.
>>
>> Given that vmemmap pages are not page tables, already the page_ptdesc(page)
>> is wrong. But worse, pagetable_free() calls
>>
>> 	__free_pages(page, compound_order(page));
>>
>> As vmemmap pages are not compound pages (see vmemmap_alloc_block()) --
>> except for HVO, which doesn't apply here -- we will only free the first
>> page when freeing a PMD-sized vmemmap page, leaking the other ones.
>>
>> Fix it by properly decoupling pagetable and vmemmap freeing.
>> free_pagetable() no longer has to mess with SECTION_INFO, as only the
>> vmemmap is marked like that in register_page_bootmem_memmap().
>>
>> While at it, just wire up the altmap parameter for remove_pte_table().
>> Also, the indentation in remove_pmd_table() is messed up, let's fix that
>> while touching it.
> 
> One thing I'm not sure about is passing altmap down into
> remove_pte_table().
> 
> Do we actually know that a non-NULL altmap means that the vmemmap
> backing page came from that altmap?

I thought with an altmap we'd never get into the situation of allocating outside
the altmap.

But you're right that sub-section hotplug might actually trigger this.


> 
> On x86 we still have in vmemmap_populate():
> 
> 	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
> 		err = vmemmap_populate_basepages(start, end, node, NULL);
> 
> So for smaller-than-section vmemmap ranges, even if the caller has an
> altmap, the backing pages are allocated from normal memory. But with
> this fix the PTE removal path would now call vmem_altmap_free() just
> because altmap is non-NULL, and would not free the actual backing page,
> IIUC :)
> 
> Maybe free_vmemmap_pages() should first check that the backing page is
> really inside the altmap range before using vmem_altmap_free()?

Probably I'll just leave it as is for now, and simply pass "NULL" for the PTE
case like we effectively did before.

Thanks!


-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-04-29  5:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-28 10:29 [PATCH] x86/mm: fix freeing of PMD-sized vmemmap pages David Hildenbrand (Arm)
2026-04-28 10:34 ` David Hildenbrand (Arm)
2026-04-28 13:20 ` Lance Yang
2026-04-28 19:36   ` David Hildenbrand (Arm)
2026-04-28 20:47 ` Mike Rapoport
2026-04-29  2:12 ` Lance Yang
2026-04-29  2:30   ` Lance Yang
2026-04-29  5:50   ` David Hildenbrand (Arm)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox