Linux-mm Archive on lore.kernel.org
* [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
@ 2026-05-19 15:10 Juhyung Park
  2026-05-19 16:02 ` Dave Hansen
  2026-05-20  5:24 ` David Hildenbrand (Arm)
  0 siblings, 2 replies; 13+ messages in thread
From: Juhyung Park @ 2026-05-19 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Juhyung Park, stable, Lu Baolu, Jason Gunthorpe,
	David Hildenbrand, Mike Rapoport (Microsoft), Oscar Salvador,
	Andrew Morton, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dan Williams,
	Dave Jiang, Vishal Verma, linux-cxl, nvdimm

free_pagetable() is called via free_hugepage_table() with
get_order(PMD_SIZE) = 9 to free the 2 MB vmemmap PMD leaves that back
struct page arrays on x86_64. After commit bf9e4e30f353 ("x86/mm: use
pagetable_free()"), it goes through pagetable_free() instead of
__free_pages(), and pagetable_free() ultimately calls
__free_pages(page, compound_order()) which ignores the explicit order
argument and infers it from the page's compound metadata.

The vmemmap PMD chunks are allocated by vmemmap_alloc_block() using
alloc_pages_node() without __GFP_COMP, so PG_head is not set and
compound_order() returns 0. Only the first of 512 pages of each PMD
chunk is returned to the buddy allocator on hot-remove; the remaining
511 pages stay allocated and become unreachable. Generalized: roughly
16 MB leaked per GB of hot-removed memory per cycle.
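
The two paragraphs above can be checked with a small user-space model.
This is only a sketch, not kernel code: the toy_* names are made up for
illustration, and the 64-byte struct page and 4 KB base page are
assumptions typical of x86_64 rather than values taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative constants; assumptions, not read from a running kernel. */
#define TOY_PAGE_SIZE        4096UL
#define TOY_PMD_SIZE         (2UL << 20)  /* one 2 MB vmemmap chunk */
#define TOY_PMD_ORDER        9            /* get_order(PMD_SIZE) on x86_64 */
#define TOY_STRUCT_PAGE_SIZE 64UL         /* assumed sizeof(struct page) */

struct toy_page {
	bool head;          /* models PG_head, set only with __GFP_COMP */
	unsigned int order; /* models the order kept in compound metadata */
};

/* Models compound_order(): 0 unless the page is a compound head. */
static unsigned int toy_compound_order(const struct toy_page *page)
{
	return page->head ? page->order : 0;
}

/* Models a free that infers the order from the page; returns pages freed. */
static unsigned long toy_pagetable_free(const struct toy_page *page)
{
	return 1UL << toy_compound_order(page);
}

/* Models __free_pages(page, order): the caller supplies the order. */
static unsigned long toy_free_pages(const struct toy_page *page,
				    unsigned int order)
{
	(void)page;
	return 1UL << order;
}

/* Pages left behind when one non-compound 2 MB chunk is freed by inference. */
static unsigned long toy_leaked_pages_per_chunk(void)
{
	struct toy_page chunk = { .head = false, .order = 0 };

	return toy_free_pages(&chunk, TOY_PMD_ORDER) -
	       toy_pagetable_free(&chunk);
}

/* Bytes of vmemmap left behind per GiB of hot-removed memory. */
static unsigned long toy_leaked_bytes_per_gib(void)
{
	unsigned long pages_per_gib = (1UL << 30) / TOY_PAGE_SIZE;
	unsigned long vmemmap_bytes = pages_per_gib * TOY_STRUCT_PAGE_SIZE;
	unsigned long chunks = vmemmap_bytes / TOY_PMD_SIZE;

	return chunks * toy_leaked_pages_per_chunk() * TOY_PAGE_SIZE;
}
```

Under those assumptions, toy_leaked_pages_per_chunk() evaluates to 511
and toy_leaked_bytes_per_gib() to 16744448 bytes (16 MiB minus 32 KiB),
which matches the rough figure quoted above.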

The leak affects every memory hot-remove path on x86_64 when
memmap_on_memory=N (the default), including dax_kmem, virtio-mem,
balloon drivers, ACPI memory hotplug, and direct sysfs offline+remove.
memmap_on_memory=Y avoids it because free_hugepage_table() then takes
the altmap branch and does not call free_pagetable().

Reproduced with CXL memory toggled through DAX in a loop:

  daxctl reconfigure-device --mode=system-ram dax0.0 --force
  daxctl reconfigure-device --mode=devdax    dax0.0 --force

Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
Cc: stable@vger.kernel.org
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <djbw@kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: linux-cxl@vger.kernel.org
Cc: nvdimm@lists.linux.dev
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
---
 arch/x86/mm/init_64.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index df2261fa4f98..a2301bddb647 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1024,7 +1024,12 @@ static void __meminit free_pagetable(struct page *page, int order)
 		free_reserved_pages(page, nr_pages);
 #endif
 	} else {
-		pagetable_free(page_ptdesc(page));
+		/*
+		 * Use __free_pages() to honor @order: vmemmap PMD leaves
+		 * freed here are not compound pages, so pagetable_free()
+		 * would leak 511 of 512 pages per 2 MB chunk.
+		 */
+		__free_pages(page, order);
 	}
 }
 
-- 
2.54.0




* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-19 15:10 [PATCH] x86/mm: fix vmemmap leak on memory hot-remove Juhyung Park
@ 2026-05-19 16:02 ` Dave Hansen
  2026-05-19 16:27   ` Juhyung Park
  2026-05-20  5:24 ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2026-05-19 16:02 UTC (permalink / raw)
  To: Juhyung Park, linux-mm
  Cc: stable, Lu Baolu, Jason Gunthorpe, David Hildenbrand,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm, Matthew Wilcox

On 5/19/26 08:10, Juhyung Park wrote:
>  #endif
>  	} else {
> -		pagetable_free(page_ptdesc(page));
> +		/*
> +		 * Use __free_pages() to honor @order: vmemmap PMD leaves
> +		 * freed here are not compound pages, so pagetable_free()
> +		 * would leak 511 of 512 pages per 2 MB chunk.
> +		 */
> +		__free_pages(page, order);
>  	}
>  }

I find myself really wondering how much of this came from a human and
how much from the LLM. Could you share that with us?

We're trying to get _away_ from using the 'struct page' APIs on page
tables. This goes backwards. Worst case, do:

	/* vmemmap PMD leaves are not compound pages */
	for (i = 0; i < 1<<order; i++)
		pagetable_free(page_ptdesc(&page[i]));

Right?

Even better would be to *make* these compound pages.

Even better than that would be to use some 'struct ptdesc' space to
explicitly store the order, just like compound pages. But that's
probably not trivial and probably not great for a bug fix.
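
The last option can be pictured with a toy descriptor that records the
order at allocation time. Everything here is hypothetical: desc_ptdesc,
its order field, desc_alloc() and desc_free() are illustrative
user-space stand-ins and do not reflect the real 'struct ptdesc' layout.

```c
#include <assert.h>
#include <stdlib.h>

#define DESC_PAGE_SIZE 4096UL

/* Hypothetical descriptor; the 'order' field is the idea being illustrated. */
struct desc_ptdesc {
	unsigned int order; /* recorded once, at allocation time */
	void *mem;          /* stands in for the backing pages */
};

/* Allocate 2^order toy pages and remember the order in the descriptor. */
static struct desc_ptdesc *desc_alloc(unsigned int order)
{
	struct desc_ptdesc *pt = malloc(sizeof(*pt));

	if (!pt)
		return NULL;

	pt->mem = malloc(DESC_PAGE_SIZE << order);
	if (!pt->mem) {
		free(pt);
		return NULL;
	}

	pt->order = order;
	return pt;
}

/* Free the whole allocation; returns how many toy pages were released. */
static unsigned long desc_free(struct desc_ptdesc *pt)
{
	unsigned long nr_pages = 1UL << pt->order;

	free(pt->mem);
	free(pt);
	return nr_pages;
}
```

With the order stored in the descriptor, desc_free() on an order-9
allocation releases all 512 pages without the caller passing the order
again, and without relying on compound metadata.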



* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-19 16:02 ` Dave Hansen
@ 2026-05-19 16:27   ` Juhyung Park
  2026-05-19 16:41     ` Dave Hansen
  0 siblings, 1 reply; 13+ messages in thread
From: Juhyung Park @ 2026-05-19 16:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, stable, Lu Baolu, Jason Gunthorpe, David Hildenbrand,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm, Matthew Wilcox

Hi Dave,

On Wed, May 20, 2026 at 1:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/19/26 08:10, Juhyung Park wrote:
> >  #endif
> >       } else {
> > -             pagetable_free(page_ptdesc(page));
> > +             /*
> > +              * Use __free_pages() to honor @order: vmemmap PMD leaves
> > +              * freed here are not compound pages, so pagetable_free()
> > +              * would leak 511 of 512 pages per 2 MB chunk.
> > +              */
> > +             __free_pages(page, order);
> >       }
> >  }
>
> I find myself really wondering how much of this came from a human and
> how much from the LLM. Could you share that with us?

Not my first kernel contribution, just so you know. (first in mm tho)

I asked Claude to write both the commit body and comment and it was
too verbose. I manually trimmed it down.
Sorry if it still sounds too LLM-ish.

This was tested on a VM with a virtualized CXL device and toggling it
back and forth was visibly causing leaks. kmemleak was unable to catch
this (rightfully so), so I skeptically asked Claude to see if it can
figure it out while pwd was the kernel source the VM was running.
"Access the VM at "ssh -p2223 root@192.168.0.185". There's a memory
leak whenever CXL memory switches modes via: daxctl reconfigure-device
--mode=system-ram dax0.0 --force, daxctl reconfigure-device
--mode=devdax dax0.0 --force. Figure out why. If you need to reboot
the VM, do not do it yourself and ask me."

It did in 6 minutes and it basically told me to revert bf9e4e30f353. I
was very skeptical and reviewed manually (with my short knowledge of
mm) why this would be a correct fix.

>
> We're trying to get _away_ from using the 'struct page' APIs on page
> tables. This goes backwards. Worst case, do:
>
>         /* vmemmap PMD leaves are not compound pages */
>         for (i = 0; i < 1<<order; i++)
>                 pagetable_free(page_ptdesc(&page[i]));
>
> Right?

Shouldn't I worry about the loop overhead? With order == 9, that's 512
iterations. That's compounded to O(N) when the entire memory size is
in consideration.

>
> Even better would be to *make* these compound pages.
>
> Even better than that would be to use some 'struct ptdesc' space to
> explicitly store the order, just like compound pages. But that's
> probably not trivial and probably not great for a bug fix.



* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-19 16:27   ` Juhyung Park
@ 2026-05-19 16:41     ` Dave Hansen
  2026-05-19 16:59       ` Juhyung Park
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2026-05-19 16:41 UTC (permalink / raw)
  To: Juhyung Park
  Cc: linux-mm, stable, Lu Baolu, Jason Gunthorpe, David Hildenbrand,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm, Matthew Wilcox

On 5/19/26 09:27, Juhyung Park wrote:
> Hi Dave,
> 
> On Wed, May 20, 2026 at 1:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 5/19/26 08:10, Juhyung Park wrote:
>>>  #endif
>>>       } else {
>>> -             pagetable_free(page_ptdesc(page));
>>> +             /*
>>> +              * Use __free_pages() to honor @order: vmemmap PMD leaves
>>> +              * freed here are not compound pages, so pagetable_free()
>>> +              * would leak 511 of 512 pages per 2 MB chunk.
>>> +              */
>>> +             __free_pages(page, order);
>>>       }
>>>  }
>>
>> I find myself really wondering how much of this came from a human and
>> how much from the LLM. Could you share that with us?
> 
> Not my first kernel contribution, just so you know. (first in mm tho)
> 
> I asked Claude to write both the commit body and comment and it was
> too verbose. I manually trimmed it down.
> Sorry if it still sounds too LLM-ish.

Yeah, it still sounded really LLM-ish to me. Still rather chatty.

> This was tested on a VM with a virtualized CXL device and toggling it
> back and forth was visibly causing leaks. kmemleak was unable to catch
> this (rightfully so), so I skeptically asked Claude to see if it can
> figure it out while pwd was the kernel source the VM was running.
> "Access the VM at "ssh -p2223 root@192.168.0.185". There's a memory
> leak whenever CXL memory switches modes via: daxctl reconfigure-device
> --mode=system-ram dax0.0 --force, daxctl reconfigure-device
> --mode=devdax dax0.0 --force. Figure out why. If you need to reboot
> the VM, do not do it yourself and ask me."
> 
> It did in 6 minutes and it basically told me to revert bf9e4e30f353. I
> was very skeptical and reviewed manually (with my short knowledge of
> mm) why this would be a correct fix.

Neato.

>> We're trying to get _away_ from using the 'struct page' APIs on page
>> tables. This goes backwards. Worst case, do:
>>
>>         /* vmemmap PMD leaves are not compound pages */
>>         for (i = 0; i < 1<<order; i++)
>>                 pagetable_free(page_ptdesc(&page[i]));
>>
>> Right?
> 
> Shouldn't I worry about the loop overhead? With order == 9, that's 512
> iterations. That's compounded to O(N) when the entire memory size is
> in consideration.

Is it optimal? No.

Will anybody ever notice? Also no.

Will anybody ever care? No sir.

Can you measure the difference? I'd wager a beer: No again.

Even if someone manages to notice, then you have a clear path to fix it
*right*: fix the ptdesc data structure to represent high-order allocations.



* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-19 16:41     ` Dave Hansen
@ 2026-05-19 16:59       ` Juhyung Park
  2026-05-20  4:49         ` Mike Rapoport
  0 siblings, 1 reply; 13+ messages in thread
From: Juhyung Park @ 2026-05-19 16:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, stable, Lu Baolu, Jason Gunthorpe, David Hildenbrand,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm, Matthew Wilcox

On Wed, May 20, 2026 at 1:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/19/26 09:27, Juhyung Park wrote:
> > Hi Dave,
> >
> > On Wed, May 20, 2026 at 1:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >> On 5/19/26 08:10, Juhyung Park wrote:
> >>>  #endif
> >>>       } else {
> >>> -             pagetable_free(page_ptdesc(page));
> >>> +             /*
> >>> +              * Use __free_pages() to honor @order: vmemmap PMD leaves
> >>> +              * freed here are not compound pages, so pagetable_free()
> >>> +              * would leak 511 of 512 pages per 2 MB chunk.
> >>> +              */
> >>> +             __free_pages(page, order);
> >>>       }
> >>>  }
> >>
> >> I find myself really wondering how much of this came from a human and
> >> how much from the LLM. Could you share that with us?
> >
> > Not my first kernel contribution, just so you know. (first in mm tho)
> >
> > I asked Claude to write both the commit body and comment and it was
> > too verbose. I manually trimmed it down.
> > Sorry if it still sounds too LLM-ish.
>
> Yeah, it still sounded really LLM-ish to me. Still rather chatty.
>
> > This was tested on a VM with a virtualized CXL device and toggling it
> > back and forth was visibly causing leaks. kmemleak was unable to catch
> > this (rightfully so), so I skeptically asked Claude to see if it can
> > figure it out while pwd was the kernel source the VM was running.
> > "Access the VM at "ssh -p2223 root@192.168.0.185". There's a memory
> > leak whenever CXL memory switches modes via: daxctl reconfigure-device
> > --mode=system-ram dax0.0 --force, daxctl reconfigure-device
> > --mode=devdax dax0.0 --force. Figure out why. If you need to reboot
> > the VM, do not do it yourself and ask me."
> >
> > It did in 6 minutes and it basically told me to revert bf9e4e30f353. I
> > was very skeptical and reviewed manually (with my short knowledge of
> > mm) why this would be a correct fix.
>
> Neato.
>
> >> We're trying to get _away_ from using the 'struct page' APIs on page
> >> tables. This goes backwards. Worst case, do:
> >>
> >>         /* vmemmap PMD leaves are not compound pages */
> >>         for (i = 0; i < 1<<order; i++)
> >>                 pagetable_free(page_ptdesc(&page[i]));
> >>
> >> Right?
> >
> > Shouldn't I worry about the loop overhead? With order == 9, that's 512
> > iterations. That's compounded to O(N) when the entire memory size is
> > in consideration.
>
> Is it optimal? No.
>
> Will anybody ever notice? Also no.
>
> Will anybody ever care? No sir.

Just spun a test with that loop. It doesn't fix the leak.

I hate to be the guy that copy-pastas LLM but this is outside my
knowledge of mm. Claude suggests:
"Each pagetable_free() on the tails is a no-op: When
alloc_pages_node(node, gfp, order=9) returns without __GFP_COMP, the
buddy allocator only sets _refcount = 1 on the head page. The other
511 pages (page[1] … page[511]) have _refcount = 0. There's no
compound metadata, so they aren't "tails" in the folio sense either —
they're just contiguous pages whose refcounts the allocator never
touched."
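
The explanation quoted above can be modelled in user space as follows.
This is a sketch of the claim as stated, not of the kernel's actual free
path: the rc_* names are invented here, and rc_free_one() merely stands
in for an order-0 free that releases a page only when it drops the last
reference.

```c
#include <assert.h>
#include <stdbool.h>

#define RC_ORDER    9
#define RC_NR_PAGES (1 << RC_ORDER)

struct rc_page {
	int refcount;
	bool freed;
};

/*
 * Models a high-order allocation without __GFP_COMP as the quote
 * describes it: only the first page is handed out with a reference.
 */
static void rc_alloc_non_compound(struct rc_page *pages)
{
	for (int i = 0; i < RC_NR_PAGES; i++) {
		pages[i].refcount = (i == 0) ? 1 : 0;
		pages[i].freed = false;
	}
}

/* Drop one reference and report whether it was the last one. */
static bool rc_put_testzero(struct rc_page *page)
{
	return --page->refcount == 0;
}

/* Order-0 free gated on the refcount reaching zero. */
static void rc_free_one(struct rc_page *page)
{
	if (rc_put_testzero(page))
		page->freed = true;
}

/* The per-page loop; returns how many pages were actually released. */
static int rc_free_loop(struct rc_page *pages)
{
	int freed = 0;

	for (int i = 0; i < RC_NR_PAGES; i++) {
		rc_free_one(&pages[i]);
		if (pages[i].freed)
			freed++;
	}
	return freed;
}

/* Allocate one toy order-9 chunk, run the loop, report pages released. */
static int rc_run_loop_experiment(void)
{
	static struct rc_page pages[RC_NR_PAGES];

	rc_alloc_non_compound(pages);
	return rc_free_loop(pages);
}
```

In this model rc_run_loop_experiment() reports a single released page,
since only the first page ever holds a reference to drop. A real kernel
may also warn on such a decrement; the toy only captures the "not the
last reference, so nothing is freed" outcome.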

Any ideas?

Thanks.

>
> Can you measure the difference? I'd wager a beer: No again.
>
> Even if someone manages to notice, then you have a clear path to fix it
> *right*: fix the ptdesc data structure to represent high-order allocations.



* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-19 16:59       ` Juhyung Park
@ 2026-05-20  4:49         ` Mike Rapoport
  0 siblings, 0 replies; 13+ messages in thread
From: Mike Rapoport @ 2026-05-20  4:49 UTC (permalink / raw)
  To: Juhyung Park, Vishal Moola
  Cc: Dave Hansen, linux-mm, stable, Lu Baolu, Jason Gunthorpe,
	David Hildenbrand, Oscar Salvador, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dan Williams, Dave Jiang, Vishal Verma,
	linux-cxl, nvdimm, Matthew Wilcox

(adding Vishal)

-- 
Sincerely yours,
Mike.



* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-19 15:10 [PATCH] x86/mm: fix vmemmap leak on memory hot-remove Juhyung Park
  2026-05-19 16:02 ` Dave Hansen
@ 2026-05-20  5:24 ` David Hildenbrand (Arm)
  2026-05-20 10:23   ` Juhyung Park
  2026-05-20 10:33   ` Juhyung Park
  1 sibling, 2 replies; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20  5:24 UTC (permalink / raw)
  To: Juhyung Park, linux-mm
  Cc: stable, Lu Baolu, Jason Gunthorpe, Mike Rapoport (Microsoft),
	Oscar Salvador, Andrew Morton, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dan Williams, Dave Jiang, Vishal Verma, linux-cxl, nvdimm

On 5/19/26 17:10, Juhyung Park wrote:
> free_pagetable() is called via free_hugepage_table() with
> get_order(PMD_SIZE) = 9 to free the 2 MB vmemmap PMD leaves that back
> struct page arrays on x86_64. After commit bf9e4e30f353 ("x86/mm: use
> pagetable_free()"), it goes through pagetable_free() instead of
> __free_pages(), and pagetable_free() ultimately calls
> __free_pages(page, compound_order()) which ignores the explicit order
> argument and infers it from the page's compound metadata.
> 
> The vmemmap PMD chunks are allocated by vmemmap_alloc_block() using
> alloc_pages_node() without __GFP_COMP, so PG_head is not set and
> compound_order() returns 0. Only the first of 512 pages of each PMD
> chunk is returned to the buddy allocator on hot-remove; the remaining
> 511 pages stay allocated and become unreachable. Generalized: roughly
> 16 MB leaked per GB of hot-removed memory per cycle.
> 
> The leak affects every memory hot-remove path on x86_64 when
> memmap_on_memory=N (the default), including dax_kmem, virtio-mem,
> balloon drivers, ACPI memory hotplug, and direct sysfs offline+remove.
> memmap_on_memory=Y avoids it because free_hugepage_table() then takes
> the altmap branch and does not call free_pagetable().
> 
> Reproduced with CXL memory toggled through DAX in a loop:
> 
>   daxctl reconfigure-device --mode=system-ram dax0.0 --force
>   daxctl reconfigure-device --mode=devdax    dax0.0 --force
> 
> Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
> Cc: stable@vger.kernel.org
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dan Williams <djbw@kernel.org>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: linux-cxl@vger.kernel.org
> Cc: nvdimm@lists.linux.dev
> Assisted-by: Claude:claude-opus-4-7
> Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
> ---
>  arch/x86/mm/init_64.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index df2261fa4f98..a2301bddb647 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1024,7 +1024,12 @@ static void __meminit free_pagetable(struct page *page, int order)
>  		free_reserved_pages(page, nr_pages);
>  #endif
>  	} else {
> -		pagetable_free(page_ptdesc(page));
> +		/*
> +		 * Use __free_pages() to honor @order: vmemmap PMD leaves
> +		 * freed here are not compound pages, so pagetable_free()
> +		 * would leak 511 of 512 pages per 2 MB chunk.
> +		 */
> +		__free_pages(page, order);
>  	}
>  }
>  

I sent a proper fix for this already:

https://lore.kernel.org/all/20260429-vmemmap-v2-1-8dfcacffd877@kernel.org/

-- 
Cheers,

David



* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-20  5:24 ` David Hildenbrand (Arm)
@ 2026-05-20 10:23   ` Juhyung Park
  2026-05-20 10:33   ` Juhyung Park
  1 sibling, 0 replies; 13+ messages in thread
From: Juhyung Park @ 2026-05-20 10:23 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-mm, stable, Lu Baolu, Jason Gunthorpe,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm

Neat. Any sign of it getting merged?

Thanks.


* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-20  5:24 ` David Hildenbrand (Arm)
  2026-05-20 10:23   ` Juhyung Park
@ 2026-05-20 10:33   ` Juhyung Park
  2026-05-20 21:52     ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 13+ messages in thread
From: Juhyung Park @ 2026-05-20 10:33 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-mm, stable, Lu Baolu, Jason Gunthorpe,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm

Neat. Any sign of it getting merged?

Thanks.


On Wed, May 20, 2026 at 2:24 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 5/19/26 17:10, Juhyung Park wrote:
> > free_pagetable() is called via free_hugepage_table() with
> > get_order(PMD_SIZE) = 9 to free the 2 MB vmemmap PMD leaves that back
> > struct page arrays on x86_64. After commit bf9e4e30f353 ("x86/mm: use
> > pagetable_free()"), it goes through pagetable_free() instead of
> > __free_pages(), and pagetable_free() ultimately calls
> > __free_pages(page, compound_order()) which ignores the explicit order
> > argument and infers it from the page's compound metadata.
> >
> > The vmemmap PMD chunks are allocated by vmemmap_alloc_block() using
> > alloc_pages_node() without __GFP_COMP, so PG_head is not set and
> > compound_order() returns 0. Only the first of 512 pages of each PMD
> > chunk is returned to the buddy allocator on hot-remove; the remaining
> > 511 pages stay allocated and become unreachable. Generalized: roughly
> > 16 MB leaked per GB of hot-removed memory per cycle.
> >
> > The leak affects every memory hot-remove path on x86_64 when
> > memmap_on_memory=N (the default), including dax_kmem, virtio-mem,
> > balloon drivers, ACPI memory hotplug, and direct sysfs offline+remove.
> > memmap_on_memory=Y avoids it because free_hugepage_table() then takes
> > the altmap branch and does not call free_pagetable().
> >
> > Reproduced with CXL memory toggled through DAX in a loop:
> >
> >   daxctl reconfigure-device --mode=system-ram dax0.0 --force
> >   daxctl reconfigure-device --mode=devdax    dax0.0 --force
> >
> > Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
> > Cc: stable@vger.kernel.org
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Thomas Gleixner <tglx@kernel.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dan Williams <djbw@kernel.org>
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Vishal Verma <vishal.l.verma@intel.com>
> > Cc: linux-cxl@vger.kernel.org
> > Cc: nvdimm@lists.linux.dev
> > Assisted-by: Claude:claude-opus-4-7
> > Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
> > ---
> >  arch/x86/mm/init_64.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index df2261fa4f98..a2301bddb647 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1024,7 +1024,12 @@ static void __meminit free_pagetable(struct page *page, int order)
> >               free_reserved_pages(page, nr_pages);
> >  #endif
> >       } else {
> > -             pagetable_free(page_ptdesc(page));
> > +             /*
> > +              * Use __free_pages() to honor @order: vmemmap PMD leaves
> > +              * freed here are not compound pages, so pagetable_free()
> > +              * would leak 511 of 512 pages per 2 MB chunk.
> > +              */
> > +             __free_pages(page, order);
> >       }
> >  }
> >
>
> I sent a proper fix for this already:
>
> https://lore.kernel.org/all/20260429-vmemmap-v2-1-8dfcacffd877@kernel.org/
>
> --
> Cheers,
>
> David
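
The "511 of 512 pages per 2 MB chunk" figure in the quoted comment, and the
"roughly 16 MB leaked per GB" estimate from the original posting, can be
checked with a small user-space sketch. This is not kernel code: the
sketch_* names are hypothetical, and the 4 KB page size and 64-byte
struct page are assumptions (typical for x86_64) rather than values stated
in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration only, not taken from the patch. */
#define SKETCH_PAGE_SIZE        4096ULL      /* 4 KB base page */
#define SKETCH_PMD_SIZE         (2ULL << 20) /* 2 MB vmemmap PMD leaf */
#define SKETCH_STRUCT_PAGE_SIZE 64ULL        /* assumed sizeof(struct page) */

/* Pages in one PMD-sized vmemmap chunk: 2 MB / 4 KB = 512 (order 9). */
static uint64_t sketch_pages_per_chunk(void)
{
	return SKETCH_PMD_SIZE / SKETCH_PAGE_SIZE;
}

/* Bytes of vmemmap needed to describe mem_bytes of memory. */
static uint64_t sketch_vmemmap_bytes(uint64_t mem_bytes)
{
	assert(mem_bytes % SKETCH_PAGE_SIZE == 0);
	return (mem_bytes / SKETCH_PAGE_SIZE) * SKETCH_STRUCT_PAGE_SIZE;
}

/*
 * Bytes leaked if only the first page of each PMD-sized chunk is
 * returned to the allocator, as happens when the order is taken to be 0.
 */
static uint64_t sketch_leaked_bytes(uint64_t mem_bytes)
{
	uint64_t chunks = sketch_vmemmap_bytes(mem_bytes) / SKETCH_PMD_SIZE;

	return chunks * (sketch_pages_per_chunk() - 1) * SKETCH_PAGE_SIZE;
}
```

Under these assumptions, 1 GB of hot-removed memory has 16 MB of vmemmap in
8 PMD-sized chunks, and sketch_leaked_bytes(1ULL << 30) comes to 16744448
bytes, just under 16 MB.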


* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-20 10:33   ` Juhyung Park
@ 2026-05-20 21:52     ` David Hildenbrand (Arm)
  2026-05-20 21:54       ` Dave Hansen
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 21:52 UTC (permalink / raw)
  To: Juhyung Park
  Cc: linux-mm, stable, Lu Baolu, Jason Gunthorpe,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm

On 5/20/26 12:33, Juhyung Park wrote:
> Neat. Any sign of it getting merged?

I hope it will catch the attention of more x86 maintainers soon :)

-- 
Cheers,

David


* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-20 21:52     ` David Hildenbrand (Arm)
@ 2026-05-20 21:54       ` Dave Hansen
  2026-05-20 21:59         ` David Hildenbrand (Arm)
  2026-05-22  0:37         ` Andrew Morton
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Hansen @ 2026-05-20 21:54 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Juhyung Park
  Cc: linux-mm, stable, Lu Baolu, Jason Gunthorpe,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm

On 5/20/26 14:52, David Hildenbrand (Arm) wrote:
> On 5/20/26 12:33, Juhyung Park wrote:
>> Neat. Any sign of it getting merged?
> I hope it will catch the attention of more x86 maintainers soon 🙂

David, thanks a ton for that patch. It's in the queue behind a couple of
other things, but I'll definitely take a look.

Attention caught, I promise! ;)


* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-20 21:54       ` Dave Hansen
@ 2026-05-20 21:59         ` David Hildenbrand (Arm)
  2026-05-22  0:37         ` Andrew Morton
  1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 21:59 UTC (permalink / raw)
  To: Dave Hansen, Juhyung Park
  Cc: linux-mm, stable, Lu Baolu, Jason Gunthorpe,
	Mike Rapoport (Microsoft), Oscar Salvador, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm

On 5/20/26 23:54, Dave Hansen wrote:
> On 5/20/26 14:52, David Hildenbrand (Arm) wrote:
>> On 5/20/26 12:33, Juhyung Park wrote:
>>> Neat. Any sign of it getting merged?
>> I hope it will catch the attention of more x86 maintainers soon 🙂
> 
> David, thanks a ton for that patch. It's in the queue behind a couple of
> other things, but I'll definitely take a look.
> 
> Attention caught, I promise! ;)

No need to apologize ... my inbox is overflowing because of conferences and travel ...

Note that I have more cleanups waiting on that fix (including removing the
order parameter from free_pagetable() and removing
CONFIG_HAVE_BOOTMEM_INFO_NODE), so expect free_pagetable() and
free_vmemmap_pages() to get cleaned up properly next.

-- 
Cheers,

David


* Re: [PATCH] x86/mm: fix vmemmap leak on memory hot-remove
  2026-05-20 21:54       ` Dave Hansen
  2026-05-20 21:59         ` David Hildenbrand (Arm)
@ 2026-05-22  0:37         ` Andrew Morton
  1 sibling, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2026-05-22  0:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Hildenbrand (Arm), Juhyung Park, linux-mm, stable, Lu Baolu,
	Jason Gunthorpe, Mike Rapoport (Microsoft), Oscar Salvador,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dan Williams, Dave Jiang,
	Vishal Verma, linux-cxl, nvdimm

On Wed, 20 May 2026 14:54:30 -0700 Dave Hansen <dave.hansen@intel.com> wrote:

> On 5/20/26 14:52, David Hildenbrand (Arm) wrote:
> > On 5/20/26 12:33, Juhyung Park wrote:
> >> Neat. Any sign of it getting merged?
> > I hope it will catch the attention of more x86 maintainers soon 🙂
> 
> David, thanks a ton for that patch. It's in the queue behind a couple of
> other things, but I'll definitely take a look.
> 
> Attention caught, I promise! ;)

Oh, there it is.  I'll add a note to my copy reminding myself that
it'll be upstreamed via the x86 tree.


Thread overview: 13+ messages
2026-05-19 15:10 [PATCH] x86/mm: fix vmemmap leak on memory hot-remove Juhyung Park
2026-05-19 16:02 ` Dave Hansen
2026-05-19 16:27   ` Juhyung Park
2026-05-19 16:41     ` Dave Hansen
2026-05-19 16:59       ` Juhyung Park
2026-05-20  4:49         ` Mike Rapoport
2026-05-20  5:24 ` David Hildenbrand (Arm)
2026-05-20 10:23   ` Juhyung Park
2026-05-20 10:33   ` Juhyung Park
2026-05-20 21:52     ` David Hildenbrand (Arm)
2026-05-20 21:54       ` Dave Hansen
2026-05-20 21:59         ` David Hildenbrand (Arm)
2026-05-22  0:37         ` Andrew Morton
