Linux CXL
* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
       [not found]   ` <83798495-915b-4a5d-9638-f5b3de913b71@kernel.org>
@ 2026-01-15 11:57     ` Jonathan Cameron
  2026-01-15 17:08       ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Cameron @ 2026-01-15 11:57 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Li Zhe, akpm, ankur.a.arora, fvdl, joao.m.martins, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	linux-cxl, Davidlohr Bueso, Gregory Price, Dan Williams, zhanjie9,
	wangzhou1

On Thu, 15 Jan 2026 12:08:03 +0100
"David Hildenbrand (Red Hat)" <david@kernel.org> wrote:

> On 1/15/26 10:36, Li Zhe wrote:
> > On Wed, 14 Jan 2026 18:21:08 +0100, david@kernel.org wrote:
> >      
> >>>> But again, I think the main motivation here is "increase application
> >>>> startup", not optimize that the zeroing happens at specific points in
> >>>> time during system operation (e.g., when idle etc).
> >>>>  
> >>>
> >>> Framing this as "increase application startup" and merely shifting the
> >>> overhead to shutdown seems like gaming the problem statement to me.
> >>> The real problem is total real time spent on it while pages are
> >>> needed.
> >>>
> >>> Support for background zeroing can give you more usable pages provided
> >>> it has the cpu + ram to do it. If it does not, you are in the worst
> >>> case in the same spot as with zeroing on free.
> >>>
> >>> Let's take a look at some examples.
> >>>
> >>> Say there are no free huge pages and you kill a vm + start a new one.
> >>> On top of that all CPUs are pegged as is. In this case total time is
> >>> the same for "zero on free" as it is for background zeroing.  
> >>
> >> Right. If the pages get freed to immediately get allocated again, it
> >> doesn't really matter who does the freeing. There might be some details,
> >> of course.
> >>  
> >>>
> >>> Say the system is freshly booted and you start up a vm. There are no
> >>> pre-zeroed pages available so it suffers at start time no matter what.
> >>> However, with some support for background zeroing, the machinery could
> >>> respond to demand and do it in parallel in some capacity, shortening
> >>> the real time needed.  
> >>
> >> Just like for init_on_free, I would start with zeroing these pages
> >> during boot.
> >>
> >> init_on_free assures that all pages in the buddy were zeroed out. Which
> >> greatly simplifies the implementation, because there is no need to track
> >> what was initialized and what was not.
> >>
> >> It's a good question whether that initialization should be done in
> >> parallel, possibly asynchronously, during boot. Reminds me a bit of
> >> deferred page initialization during boot. But that is rather an
> >> extension that could be added somewhat transparently on top later.
> >>
> >> If ever required we could dynamically enable this setting for a running
> >> system. Whoever would enable it (flips the magic toggle) would zero out
> >> all hugetlb pages that are already in the hugetlb allocator as free, but
> >> not initialized yet.
> >>
> >> But again, these are extensions on top of the basic design of having all
> >> free hugetlb folios be zeroed.
> >>  
> >>>
> >>> Say a little bit of real time passes and you start another vm. With
> >>> merely zeroing on free there are still no pre-zeroed pages available
> >>> so it again suffers the overhead. With background zeroing some of
> >>> that memory would already be sorted out, speeding up said startup.  
> >>
> >> The moment they end up in the hugetlb allocator as free folios they
> >> would have to get initialized.
> >>
> >> Now, I am sure there are downsides to this approach (how to speed up
> >> process exit by parallelizing zeroing, if ever required)? But it sounds
> >> like being a bit ... simpler without user space changes required. In
> >> theory :)  
> > 
> > I strongly agree that the init_on_free strategy effectively eliminates the
> > latency incurred during VM creation. However, it appears to introduce
> > two new issues.
> > 
> > First, the process that later allocates a page may not be the one that
> > freed it, raising the question of which process should bear the cost
> > of zeroing.  
> 
> Right now the cost is paid by the process that allocates a page. If you
> shift that to the freeing path, it's still the same process, just at a
> different point in time.
> 
> Of course, there are exceptions to that: if you have a hugetlb file that
> is shared by multiple processes (-> process that essentially truncates
> the file). Or if someone (GUP-pin) holds a reference to a file even after
> it was truncated (not common but possible).
> 
> With CoW it would be the process that last unmaps the folio. CoW with
> hugetlb is fortunately something that is rare (and rather shaky :) ).
> 
> > 
> > Second, put_page() is executed atomically, making it inappropriate to
> > invoke clear_page() within that context; off-loading the zeroing to a
> > workqueue merely reopens the same accounting problem.  
> 
> I thought about this as well. For init_on_free we always invoke it for
> up to 4MiB folios during put_page() on x86-64.
> 
> See __folio_put()->__free_frozen_pages()->free_pages_prepare()
> 
> Where we call kernel_init_pages(page, 1 << order);
> 
> So surely, for 2 MiB folios (hugetlb) this is not a problem.
> 
> ... but then, on arm64 with 64k base pages we have 512 MiB folios
> (managed by the buddy!) where this is apparently not a problem? Or is
> it and should be fixed?
> 
> So I would expect once we go up to 1 GiB, we might only reveal more
> areas where we should have optimized in the first case by dropping
> the reference outside the spin lock ... and these optimizations would
> obviously (unless in hugetlb specific code ...) benefit init_on_free
> setups as well (and page poisoning).

FWIW I'd be interested in seeing if we can do the zeroing async and allow
for hardware offloading. If it happens to be in CXL (and someone
built the fancy bits) we can ask the device to zero ranges of memory
for us.  If they built the HDM-DB stuff it's coherent too (came up
in Davidlohr's LPC Device-mem talk on HDM-DB + back invalidate
support)
+CC linux-cxl and Davidlohr + a few others.

More locally this sounds like fun for DMA engines, though they are going
to rapidly eat bandwidth up and so we'll need QoS stuff in place
to stop them perturbing other workloads.

Give me a list of 1Gig pages and this stuff becomes much more efficient
than anything the CPU can do.

Jonathan

> 
> 
> Looking at __unmap_hugepage_range(), for example, we already make sure
> to not drop the reference while holding the PTL (spinlock).
> 
> In general, I think when using MMU gather we drop folio references out
> of the PTL, because we know that it can hurt performance badly.
> 
> I documented some of the nasty things that can happen with MMU gather in
> 
> commit e61abd4490684de379b4a2ef1be2dbde39ac1ced
> Author: David Hildenbrand <david@kernel.org>
> Date:   Wed Feb 14 21:44:34 2024 +0100
> 
>      mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing
>      
>      In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now
>      up to 256 folio fragments that span more than one page, before we
>      conditionally reschedule.
>      
>      It's a pain that we have to handle cond_resched() in
>      tlb_batch_pages_flush() manually and cannot simply handle it in
>      release_pages() -- release_pages() can be called from atomic context.
>      Well, in a perfect world we wouldn't have to make our code more
>      complicated at all.
>      
>      With page poisoning and init_on_free, we might now run into soft lockups
>      when we free a lot of rather large folio fragments, because page freeing
>      time then depends on the actual memory size we are freeing instead of on
>      the number of folios that are involved.
>      
>      In the absolute (unlikely) worst case, on arm64 with 64k we will be able
>      to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
>      GiB does sound like it might take a while.  But instead of ignoring this
>      unlikely case, let's just handle it.
> 
> 
> But more general, when dealing with the PTL we try to put folio references outside
> the lock (there are some cases in mm/memory.c where we apparently don't do it yet),
> because freeing memory can take a while.
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-15 11:57     ` [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism Jonathan Cameron
@ 2026-01-15 17:08       ` David Hildenbrand (Red Hat)
  2026-01-15 20:16         ` dan.j.williams
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-15 17:08 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Li Zhe, akpm, ankur.a.arora, fvdl, joao.m.martins, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	linux-cxl, Davidlohr Bueso, Gregory Price, Dan Williams, zhanjie9,
	wangzhou1

On 1/15/26 12:57, Jonathan Cameron wrote:
> On Thu, 15 Jan 2026 12:08:03 +0100
> "David Hildenbrand (Red Hat)" <david@kernel.org> wrote:
> 
>> On 1/15/26 10:36, Li Zhe wrote:
>>> On Wed, 14 Jan 2026 18:21:08 +0100, david@kernel.org wrote:
>>>       
>>>>>> But again, I think the main motivation here is "increase application
>>>>>> startup", not optimize that the zeroing happens at specific points in
>>>>>> time during system operation (e.g., when idle etc).
>>>>>>   
>>>>>
>>>>> Framing this as "increase application startup" and merely shifting the
>>>>> overhead to shutdown seems like gaming the problem statement to me.
>>>>> The real problem is total real time spent on it while pages are
>>>>> needed.
>>>>>
>>>>> Support for background zeroing can give you more usable pages provided
>>>>> it has the cpu + ram to do it. If it does not, you are in the worst
>>>>> case in the same spot as with zeroing on free.
>>>>>
>>>>> Let's take a look at some examples.
>>>>>
>>>>> Say there are no free huge pages and you kill a vm + start a new one.
>>>>> On top of that all CPUs are pegged as is. In this case total time is
>>>>> the same for "zero on free" as it is for background zeroing.
>>>>
>>>> Right. If the pages get freed to immediately get allocated again, it
>>>> doesn't really matter who does the freeing. There might be some details,
>>>> of course.
>>>>   
>>>>>
>>>>> Say the system is freshly booted and you start up a vm. There are no
>>>>> pre-zeroed pages available so it suffers at start time no matter what.
>>>>> However, with some support for background zeroing, the machinery could
>>>>> respond to demand and do it in parallel in some capacity, shortening
>>>>> the real time needed.
>>>>
>>>> Just like for init_on_free, I would start with zeroing these pages
>>>> during boot.
>>>>
>>>> init_on_free assures that all pages in the buddy were zeroed out. Which
>>>> greatly simplifies the implementation, because there is no need to track
>>>> what was initialized and what was not.
>>>>
>>>> It's a good question whether that initialization should be done in
>>>> parallel, possibly asynchronously, during boot. Reminds me a bit of
>>>> deferred page initialization during boot. But that is rather an
>>>> extension that could be added somewhat transparently on top later.
>>>>
>>>> If ever required we could dynamically enable this setting for a running
>>>> system. Whoever would enable it (flips the magic toggle) would zero out
>>>> all hugetlb pages that are already in the hugetlb allocator as free, but
>>>> not initialized yet.
>>>>
>>>> But again, these are extensions on top of the basic design of having all
>>>> free hugetlb folios be zeroed.
>>>>   
>>>>>
>>>>> Say a little bit of real time passes and you start another vm. With
>>>>> merely zeroing on free there are still no pre-zeroed pages available
>>>>> so it again suffers the overhead. With background zeroing some of
>>>>> that memory would already be sorted out, speeding up said startup.
>>>>
>>>> The moment they end up in the hugetlb allocator as free folios they
>>>> would have to get initialized.
>>>>
>>>> Now, I am sure there are downsides to this approach (how to speed up
>>>> process exit by parallelizing zeroing, if ever required)? But it sounds
>>>> like being a bit ... simpler without user space changes required. In
>>>> theory :)
>>>
>>> I strongly agree that the init_on_free strategy effectively eliminates the
>>> latency incurred during VM creation. However, it appears to introduce
>>> two new issues.
>>>
>>> First, the process that later allocates a page may not be the one that
>>> freed it, raising the question of which process should bear the cost
>>> of zeroing.
>>
>> Right now the cost is paid by the process that allocates a page. If you
>> shift that to the freeing path, it's still the same process, just at a
>> different point in time.
>>
>> Of course, there are exceptions to that: if you have a hugetlb file that
>> is shared by multiple processes (-> process that essentially truncates
>> the file). Or if someone (GUP-pin) holds a reference to a file even after
>> it was truncated (not common but possible).
>>
>> With CoW it would be the process that last unmaps the folio. CoW with
>> hugetlb is fortunately something that is rare (and rather shaky :) ).
>>
>>>
>>> Second, put_page() is executed atomically, making it inappropriate to
>>> invoke clear_page() within that context; off-loading the zeroing to a
>>> workqueue merely reopens the same accounting problem.
>>
>> I thought about this as well. For init_on_free we always invoke it for
>> up to 4MiB folios during put_page() on x86-64.
>>
>> See __folio_put()->__free_frozen_pages()->free_pages_prepare()
>>
>> Where we call kernel_init_pages(page, 1 << order);
>>
>> So surely, for 2 MiB folios (hugetlb) this is not a problem.
>>
>> ... but then, on arm64 with 64k base pages we have 512 MiB folios
>> (managed by the buddy!) where this is apparently not a problem? Or is
>> it and should be fixed?
>>
>> So I would expect once we go up to 1 GiB, we might only reveal more
>> areas where we should have optimized in the first case by dropping
>> the reference outside the spin lock ... and these optimizations would
>> obviously (unless in hugetlb specific code ...) benefit init_on_free
>> setups as well (and page poisoning).
> 
> FWIW I'd be interested in seeing if we can do the zeroing async and allow
> for hardware offloading. If it happens to be in CXL (and someone
> built the fancy bits) we can ask the device to zero ranges of memory
> for us.  If they built the HDM-DB stuff it's coherent too (came up
> in Davidlohr's LPC Device-mem talk on HDM-DB + back invalidate
> support)
> +CC linux-cxl and Davidlohr + a few others.
> 
> More locally this sounds like fun for DMA engines, though they are going
> to rapidly eat bandwidth up and so we'll need QoS stuff in place
> to stop them perturbing other workloads.
> 
> Give me a list of 1Gig pages and this stuff becomes much more efficient
> than anything the CPU can do.

Right, and ideally we'd implement any such mechanisms in a way that more 
parts of the kernel can benefit, and not just an unloved in-memory 
file-system that most people just want to get rid of as soon as we can :)

-- 
Cheers

David


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-15 17:08       ` David Hildenbrand (Red Hat)
@ 2026-01-15 20:16         ` dan.j.williams
  2026-01-15 20:22           ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 15+ messages in thread
From: dan.j.williams @ 2026-01-15 20:16 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), Jonathan Cameron
  Cc: Li Zhe, akpm, ankur.a.arora, fvdl, joao.m.martins, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	linux-cxl, Davidlohr Bueso, Gregory Price, Dan Williams, zhanjie9,
	wangzhou1

David Hildenbrand (Red Hat) wrote:
[..]
> > Give me a list of 1Gig pages and this stuff becomes much more efficient
> > than anything the CPU can do.
> 
> Right, and ideally we'd implement any such mechanisms in a way that more 
> parts of the kernel can benefit, and not just an unloved in-memory 
> file-system that most people just want to get rid of as soon as we can :)

CPUs have tended to eat the value of simple DMA offload operations like
copy/zero over time.

In the case of this patch there is no async-offload benefit because
userspace is already charged with spawning more threads if it wants more
parallelism.

For sync-offload one engine may be able to beat a single CPU, but now
you have created bandwidth contention problems with a component that is
less responsive to the scheduler.

Call me skeptical.

Signed, someone with async_tx and dmaengine battle scars.


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-15 20:16         ` dan.j.williams
@ 2026-01-15 20:22           ` David Hildenbrand (Red Hat)
  2026-01-15 22:30             ` Ankur Arora
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-15 20:22 UTC (permalink / raw)
  To: dan.j.williams, Jonathan Cameron
  Cc: Li Zhe, akpm, ankur.a.arora, fvdl, joao.m.martins, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	linux-cxl, Davidlohr Bueso, Gregory Price, zhanjie9, wangzhou1

On 1/15/26 21:16, dan.j.williams@intel.com wrote:
> David Hildenbrand (Red Hat) wrote:
> [..]
>>> Give me a list of 1Gig pages and this stuff becomes much more efficient
>>> than anything the CPU can do.
>>
>> Right, and ideally we'd implement any such mechanisms in a way that more
>> parts of the kernel can benefit, and not just an unloved in-memory
>> file-system that most people just want to get rid of as soon as we can :)
> 
> CPUs have tended to eat the value of simple DMA offload operations like
> copy/zero over time.
> 
> In the case of this patch there is no async-offload benefit because
> userspace is already charged with spawning more threads if it wants more
> parallelism.

In this subthread we're discussing handling that in the kernel like 
init_on_free. So when user space frees a hugetlb folio (or in the 
future, other similarly gigantic folios from another allocator), we'd be 
zeroing it.

If it were freeing multiple such folios, we could pack them and send 
them to a DMA engine to zero them for us (concurrently? asynchronously? 
I don't know :) )

-- 
Cheers

David


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-15 20:22           ` David Hildenbrand (Red Hat)
@ 2026-01-15 22:30             ` Ankur Arora
  2026-01-20  6:27               ` Li Zhe
  0 siblings, 1 reply; 15+ messages in thread
From: Ankur Arora @ 2026-01-15 22:30 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: dan.j.williams, Jonathan Cameron, Li Zhe, akpm, ankur.a.arora,
	fvdl, joao.m.martins, linux-kernel, linux-mm, mhocko, mjguzik,
	muchun.song, osalvador, raghavendra.kt, linux-cxl,
	Davidlohr Bueso, Gregory Price, zhanjie9, wangzhou1


David Hildenbrand (Red Hat) <david@kernel.org> writes:

> On 1/15/26 21:16, dan.j.williams@intel.com wrote:
>> David Hildenbrand (Red Hat) wrote:
>> [..]
>>>> Give me a list of 1Gig pages and this stuff becomes much more efficient
>>>> than anything the CPU can do.
>>>
>>> Right, and ideally we'd implement any such mechanisms in a way that more
>>> parts of the kernel can benefit, and not just an unloved in-memory
>>> file-system that most people just want to get rid of as soon as we can :)
>> CPUs have tended to eat the value of simple DMA offload operations like
>> copy/zero over time.
>> In the case of this patch there is no async-offload benefit because
>> userspace is already charged with spawning more threads if it wants more
>> parallelism.
>
> In this subthread we're discussing handling that in the kernel like
> init_on_free. So when user space frees a hugetlb folio (or in the 
> future, other similarly gigantic folios from another allocator), we'd be zeroing
> it.
>
> If it were freeing multiple such folios, we could pack them and send them to
> a DMA engine to zero them for us (concurrently? asynchronously? I don't know :)
> )

I've been thinking about using non-temporal instructions (movnt/clzero)
for zeroing in that path.

Both the DMA engine and non-temporal zeroing would also improve things
because we won't be bringing free buffers into the cache while zeroing.

-- 
ankur


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-15 22:30             ` Ankur Arora
@ 2026-01-20  6:27               ` Li Zhe
  2026-01-20  9:47                 ` David Laight
  0 siblings, 1 reply; 15+ messages in thread
From: Li Zhe @ 2026-01-20  6:27 UTC (permalink / raw)
  To: david
  Cc: akpm, dan.j.williams, dave, ankur.a.arora, fvdl, gourry,
	joao.m.martins, jonathan.cameron, linux-cxl, linux-kernel,
	linux-mm, lizhe.67, mhocko, mjguzik, muchun.song, osalvador,
	raghavendra.kt, wangzhou1, zhanjie9

In light of the preceding discussion, we appear to have reached the
following understanding:

(1) At present we prefer to mitigate slow application startup (e.g.,
VM creation) by zeroing huge pages at the moment they are freed
(init_on_free). The principal benefit is that user space gains the
performance improvement without deploying any additional user space
daemon.

(2) Deferring the zeroing from allocation to release may occasionally
cause the thread that frees the page to differ from the one that
originally allocated it, so the clearing cost is not charged to the
allocating thread. Because this situation is rare and the existing
init_on_free mechanism in the kernel already exhibits the same
behavior, we deem the consequence acceptable.

(3) The function __unmap_hugepage_range() employs the MMU-gather
mechanism, which refrains from dropping the page reference while
holding the PTL (spinlock). This allows huge-page zeroing to be
performed in a non-atomic context.

(4) Given that, in the vast majority of cases, the same thread that
allocates a huge page also frees it, and that the exceptions highlighted
by David are genuinely rare[1], we can achieve faster application
startup by implementing an init_on_free-style mechanism.

(5) Going forward we can further optimize the zeroing process by
leveraging a DMA engine.

If the foregoing is accurate, I propose we add a new hugetlbfs mount
option to achieve the init_on_free behavior.
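For illustration only, usage of such an option might look like the
following (the option name "zero_on_free" is invented here; no such
hugetlbfs mount option exists today):

```shell
# Hypothetical sketch: "zero_on_free" is an invented option name.
mount -t hugetlbfs -o pagesize=2M,zero_on_free none /mnt/huge
```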

Thanks,
Zhe

[1]: https://lore.kernel.org/all/83798495-915b-4a5d-9638-f5b3de913b71@kernel.org/#t


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20  6:27               ` Li Zhe
@ 2026-01-20  9:47                 ` David Laight
  2026-01-20 10:39                   ` Li Zhe
  2026-01-21 12:32                   ` David Hildenbrand (Red Hat)
  0 siblings, 2 replies; 15+ messages in thread
From: David Laight @ 2026-01-20  9:47 UTC (permalink / raw)
  To: Li Zhe
  Cc: david, akpm, dan.j.williams, dave, ankur.a.arora, fvdl, gourry,
	joao.m.martins, jonathan.cameron, linux-cxl, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	wangzhou1, zhanjie9

On Tue, 20 Jan 2026 14:27:06 +0800
"Li Zhe" <lizhe.67@bytedance.com> wrote:

> In light of the preceding discussion, we appear to have reached the
> following understanding:
> 
> (1) At present we prefer to mitigate slow application startup (e.g.,
> VM creation) by zeroing huge pages at the moment they are freed
> (init_on_free). The principal benefit is that user space gains the
> performance improvement without deploying any additional user space
> daemon.

Am I missing something?
If userspace does:
$ program_a; program_b
and pages used by program_a are zeroed when it exits, you get the delay
for zeroing all the pages it used before program_b starts.
OTOH if the zeroing is deferred program_b only needs to zero the pages
it needs to start (and there may be some lurking).

The only real gain has to come from zeroing pages when the system is idle.
That will give plenty of zeroed pages needed for starting a web browser
from the desktop and also speed up single-threaded things like 'make -j1'.

	David


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20  9:47                 ` David Laight
@ 2026-01-20 10:39                   ` Li Zhe
  2026-01-20 18:18                     ` Gregory Price
  2026-01-21 12:32                   ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 15+ messages in thread
From: Li Zhe @ 2026-01-20 10:39 UTC (permalink / raw)
  To: david.laight.linux
  Cc: akpm, ankur.a.arora, dan.j.williams, dave, david, fvdl, gourry,
	joao.m.martins, jonathan.cameron, linux-cxl, linux-kernel,
	linux-mm, lizhe.67, mhocko, mjguzik, muchun.song, osalvador,
	raghavendra.kt, wangzhou1, zhanjie9

On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:

> On Tue, 20 Jan 2026 14:27:06 +0800
> "Li Zhe" <lizhe.67@bytedance.com> wrote:
> 
> > In light of the preceding discussion, we appear to have reached the
> > following understanding:
> > 
> > (1) At present we prefer to mitigate slow application startup (e.g.,
> > VM creation) by zeroing huge pages at the moment they are freed
> > (init_on_free). The principal benefit is that user space gains the
> > performance improvement without deploying any additional user space
> > daemon.
> 
> Am I missing something?
> If userspace does:
> $ program_a; program_b
> and pages used by program_a are zeroed when it exits, you get the delay
> for zeroing all the pages it used before program_b starts.
> OTOH if the zeroing is deferred program_b only needs to zero the pages
> it needs to start (and there may be some lurking).

Under the init_on_free approach, improving the speed of zeroing may
indeed prove necessary.

However, I believe we should first reach consensus on adopting
“init_on_free” as the solution to slow application startup before
turning to performance tuning.

Thanks,
Zhe


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20 10:39                   ` Li Zhe
@ 2026-01-20 18:18                     ` Gregory Price
  2026-01-20 18:38                       ` Gregory Price
                                         ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Gregory Price @ 2026-01-20 18:18 UTC (permalink / raw)
  To: Li Zhe
  Cc: david.laight.linux, akpm, ankur.a.arora, dan.j.williams, dave,
	david, fvdl, joao.m.martins, jonathan.cameron, linux-cxl,
	linux-kernel, linux-mm, mhocko, mjguzik, muchun.song, osalvador,
	raghavendra.kt, wangzhou1, zhanjie9

On Tue, Jan 20, 2026 at 06:39:48PM +0800, Li Zhe wrote:
> On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:
> 
> > On Tue, 20 Jan 2026 14:27:06 +0800
> > "Li Zhe" <lizhe.67@bytedance.com> wrote:
> > 
> > > In light of the preceding discussion, we appear to have reached the
> > > following understanding:
> > > 
> > > (1) At present we prefer to mitigate slow application startup (e.g.,
> > > VM creation) by zeroing huge pages at the moment they are freed
> > > (init_on_free). The principal benefit is that user space gains the
> > > performance improvement without deploying any additional user space
> > > daemon.
> > 
> > Am I missing something?
> > If userspace does:
> > $ program_a; program_b
> > and pages used by program_a are zeroed when it exits, you get the delay
> > for zeroing all the pages it used before program_b starts.
> > OTOH if the zeroing is deferred program_b only needs to zero the pages
> > it needs to start (and there may be some lurking).
> 
> Under the init_on-free approach, improving the speed of zeroing may
> indeed prove necessary.
> 
> However, I believe we should first reach consensus on adopting
> “init_on_free” as the solution to slow application startup before
> turning to performance tuning.
> 

His point was that init_on_free may not actually reduce any delays for
serial applications, and can even introduce additional ones.

Example
-------
program_a:  alloc_hugepages(10);
            exit();

program_b:  alloc_hugepages(5);
	    exit();

/* Run programs in serial */
sh:  program_a && program_b

in zero_on_alloc():
	program_a eats zero(10) cost on startup
	program_b eats zero(5) cost on startup
	Overall zero(15) cost to start program_b

in zero_on_free()
	program_a eats zero(10) cost on startup
	program_a eats zero(10) cost on exit
	program_b eats zero(0) cost on startup
	Overall zero(20) cost to start program_b

zero_on_free is worse by zero(5)
-------

This is a trivial example, but it's unclear zero_on_free actually
provides a benefit.  You have to know ahead of time what the runtime
behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
be to determine whether there's an actual reduction in startup time.

But trivially, starting from the base case of no pages being
zeroed, you're just injecting an additional zero(X) cost if program_a()
consumes more hugepages than program_b().

Long way of saying the shift from alloc to free seems heuristic-y and
you need stronger analysis / better data to show this change is actually
beneficial in the general case.

~Gregory


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20 18:18                     ` Gregory Price
@ 2026-01-20 18:38                       ` Gregory Price
  2026-01-20 19:30                       ` David Laight
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2026-01-20 18:38 UTC (permalink / raw)
  To: Li Zhe
  Cc: david.laight.linux, akpm, ankur.a.arora, dan.j.williams, dave,
	david, fvdl, joao.m.martins, jonathan.cameron, linux-cxl,
	linux-kernel, linux-mm, mhocko, mjguzik, muchun.song, osalvador,
	raghavendra.kt, wangzhou1, zhanjie9

On Tue, Jan 20, 2026 at 01:18:19PM -0500, Gregory Price wrote:
> This is a trivial example, but it's unclear zero_on_free actually
> provides a benefit.  You have to know ahead of time what the runtime
> behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
> be to determine whether there's an actual reduction in startup time.
> 
> But just trivially, starting from the base case of no pages being
> zeroed, you're just injecting an additional zero(X) cost if program_a()
> consumes more hugepages than program_b().
> 
> Long way of saying the shift from alloc to free seems heuristic-y and
> you need stronger analysis / better data to show this change is actually
> beneficial in the general case.
> 

As an addendum to this:  Maybe this is an indication that a global
switch (per-node sysfs entry) is not the best decision, and that
there's a better way to accomplish this with a reduced scope.

hugetlb-only sysfs knob
	- same issue as current proposal, but better placed
	  why would you only apply this on one node?

prctl thingy
	- limits effects to just those opting into zero-on-free
	- probably still needs hugetlb-internal zeroed-pages tracking
	  but doesn't require the rest of the machinery

do it entirely in userland
	- modify the software to zero before exit
	- use MAP_UNINITIALIZED
	- useful and simple if your hugetlb use case is homogeneous

there are probably more options

~Gregory


* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20 18:18                     ` Gregory Price
  2026-01-20 18:38                       ` Gregory Price
@ 2026-01-20 19:30                       ` David Laight
  2026-01-20 19:52                         ` Gregory Price
  2026-01-21  8:03                       ` Li Zhe
  2026-01-21 12:41                       ` David Hildenbrand (Red Hat)
  3 siblings, 1 reply; 15+ messages in thread
From: David Laight @ 2026-01-20 19:30 UTC (permalink / raw)
  To: Gregory Price
  Cc: Li Zhe, akpm, ankur.a.arora, dan.j.williams, dave, david, fvdl,
	joao.m.martins, jonathan.cameron, linux-cxl, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	wangzhou1, zhanjie9

On Tue, 20 Jan 2026 13:18:19 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Tue, Jan 20, 2026 at 06:39:48PM +0800, Li Zhe wrote:
> > On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:
> >   
> > > On Tue, 20 Jan 2026 14:27:06 +0800
> > > "Li Zhe" <lizhe.67@bytedance.com> wrote:
> > >   
> > > > In light of the preceding discussion, we appear to have reached the
> > > > following understanding:
> > > > 
> > > > (1) At present we prefer to mitigate slow application startup (e.g.,
> > > > VM creation) by zeroing huge pages at the moment they are freed
> > > > (init_on_free). The principal benefit is that user space gains the
> > > > performance improvement without deploying any additional user space
> > > > daemon.  
> > > 
> > > Am I missing something?
> > > If userspace does:
> > > $ program_a; program_b
> > > and pages used by program_a are zeroed when it exits you get the delay
> > > for zeroing all the pages it used before program_b starts.
> > > OTOH if the zeroing is deferred program_b only needs to zero the pages
> > > it needs to start (and there may be some lurking).  
> > 
> > Under the init_on_free approach, improving the speed of zeroing may
> > indeed prove necessary.
> > 
> > However, I believe we should first reach consensus on adopting
> > “init_on_free” as the solution to slow application startup before
> > turning to performance tuning.
> >   
> 
> His point was init_on_free may not actually reduce any delays on serial
> applications, and can actually introduce additional delays.
> 
> Example
> -------
> program_a:  alloc_hugepages(10);
>             exit();
> 
> program_b:  alloc_hugepages(5);
> 	    exit();
> 
> /* Run programs in serial */
> sh:  program_a && program_b
> 
> in zero_on_alloc():
> 	program_a eats zero(10) cost on startup
> 	program_b eats zero(5) cost on startup
> 	Overall zero(15) cost to start program_b
> 
> in zero_on_free()
> 	program_a eats zero(10) cost on startup

Do you get that cost? - won't all the unused memory be zeros?

> 	program_a eats zero(10) cost on exit
> 	program_b eats zero(0) cost on startup
> 	Overall zero(20) cost to start program_b
> 
> zero_on_free is worse by zero(5)
> -------
> 
> This is a trivial example, but it's unclear zero_on_free actually
> provides a benefit.  You have to know ahead of time what the runtime
> behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
> be to determine whether there's an actual reduction in startup time.
> 
> But just trivially, starting from the base case of no pages being
> zeroed, you're just injecting an additional zero(X) cost if program_a()
> consumes more hugepages than program_b().

I'd consider a different test:
	for c in $(seq 1 1000); do program_a; done

Regardless of whether you zero on alloc or free, all the zeroing is in line.
Move it to a low priority thread (that uses a non-aggressive loop) and
there will be a reasonable chance of there being pre-zeroed pages available.
(Most DMA is far too aggressive...)

If you zero on free it might also be a waste of time.
Maybe the memory is next used to read data from a disk file.

	David

> 
> Long way of saying the shift from alloc to free seems heuristic-y and
> you need stronger analysis / better data to show this change is actually
> beneficial in the general case.
> 
> ~Gregory


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20 19:30                       ` David Laight
@ 2026-01-20 19:52                         ` Gregory Price
  0 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2026-01-20 19:52 UTC (permalink / raw)
  To: David Laight
  Cc: Li Zhe, akpm, ankur.a.arora, dan.j.williams, dave, david, fvdl,
	joao.m.martins, jonathan.cameron, linux-cxl, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	wangzhou1, zhanjie9

On Tue, Jan 20, 2026 at 07:30:27PM +0000, David Laight wrote:
> > /* Run programs in serial */
> > sh:  program_a && program_b
> > 
> > in zero_on_alloc():
> > 	program_a eats zero(10) cost on startup
> > 	program_b eats zero(5) cost on startup
> > 	Overall zero(15) cost to start program_b
> > 
> > in zero_on_free()
> > 	program_a eats zero(10) cost on startup
> 
> Do you get that cost? - won't all the unused memory be zeros?
> 

If program_a was the first to access, wouldn't it have had to zero it?

> > But just trivially, starting from the base case of no pages being
> > zeroed, you're just injecting an additional zero(X) cost if program_a()
> > consumes more hugepages than program_b().
> 
> I'd consider a different test:
> 	for c in $(seq 1 1000); do program_a; done
> 
> Regardless of whether you zero on alloc or free, all the zeroing is in line.
> Move it to a low priority thread (that uses a non-aggressive loop) and
> there will be a reasonable chance of there being pre-zeroed pages available.
> (Most DMA is far too aggressive...)
> 
> If you zero on free it might also be a waste of time.
> Maybe the memory is next used to read data from a disk file.
> 

Right, both points here being that it's heuristic-y: it only applies in
certain scenarios, and trying to optimize for one probably hurts another.

~Gregory

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20 18:18                     ` Gregory Price
  2026-01-20 18:38                       ` Gregory Price
  2026-01-20 19:30                       ` David Laight
@ 2026-01-21  8:03                       ` Li Zhe
  2026-01-21 12:41                       ` David Hildenbrand (Red Hat)
  3 siblings, 0 replies; 15+ messages in thread
From: Li Zhe @ 2026-01-21  8:03 UTC (permalink / raw)
  To: gourry
  Cc: akpm, ankur.a.arora, dan.j.williams, dave, david.laight.linux,
	david, fvdl, joao.m.martins, jonathan.cameron, linux-cxl,
	linux-kernel, linux-mm, lizhe.67, mhocko, mjguzik, muchun.song,
	osalvador, raghavendra.kt, wangzhou1, zhanjie9

On Tue, 20 Jan 2026 13:18:19 -0500, gourry@gourry.net wrote:

> On Tue, Jan 20, 2026 at 06:39:48PM +0800, Li Zhe wrote:
> > On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:
> > 
> > > On Tue, 20 Jan 2026 14:27:06 +0800
> > > "Li Zhe" <lizhe.67@bytedance.com> wrote:
> > > 
> > > > In light of the preceding discussion, we appear to have reached the
> > > > following understanding:
> > > > 
> > > > (1) At present we prefer to mitigate slow application startup (e.g.,
> > > > VM creation) by zeroing huge pages at the moment they are freed
> > > > (init_on_free). The principal benefit is that user space gains the
> > > > performance improvement without deploying any additional user space
> > > > daemon.
> > > 
> > > Am I missing something?
> > > If userspace does:
> > > $ program_a; program_b
> > > and pages used by program_a are zeroed when it exits you get the delay
> > > for zeroing all the pages it used before program_b starts.
> > > OTOH if the zeroing is deferred program_b only needs to zero the pages
> > > it needs to start (and there may be some lurking).
> > 
> > Under the init_on_free approach, improving the speed of zeroing may
> > indeed prove necessary.
> > 
> > However, I believe we should first reach consensus on adopting
> > "init_on_free" as the solution to slow application startup before
> > turning to performance tuning.
> > 
> 
> His point was init_on_free may not actually reduce any delays on serial
> applications, and can actually introduce additional delays.
> 
> Example
> -------
> program_a:  alloc_hugepages(10);
>             exit();
> 
> program_b:  alloc_hugepages(5);
> 	    exit();
> 
> /* Run programs in serial */
> sh:  program_a && program_b
> 
> in zero_on_alloc():
> 	program_a eats zero(10) cost on startup
> 	program_b eats zero(5) cost on startup
> 	Overall zero(15) cost to start program_b
> 
> in zero_on_free()
> 	program_a eats zero(10) cost on startup
> 	program_a eats zero(10) cost on exit
> 	program_b eats zero(0) cost on startup
> 	Overall zero(20) cost to start program_b
> 
> zero_on_free is worse by zero(5)
> -------
> 
> This is a trivial example, but it's unclear zero_on_free actually
> provides a benefit.  You have to know ahead of time what the runtime
> behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
> be to determine whether there's an actual reduction in startup time.
> 
> But just trivially, starting from the base case of no pages being
> zeroed, you're just injecting an additional zero(X) cost if program_a()
> consumes more hugepages than program_b().
> 
> Long way of saying the shift from alloc to free seems heuristic-y and
> you need stronger analysis / better data to show this change is actually
> beneficial in the general case.

I understand your concern. At some point some process must pay the
cost of zeroing, and the optimal strategy is inevitably
workload-dependent.

Our "zero-on-free for huge pages" draws on the existing kernel
init_on_free mechanism. Of course, it may prove sub-optimal in certain
scenarios.

Consistent with "provide tools, not policy", perhaps the decision is
better left to user space. And that is exactly what this patchset
does. Requiring a userspace daemon to decide when to zero pages
certainly adds complexity, but it also gives administrators a single,
flexible knob that can be tuned for any workload.

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20  9:47                 ` David Laight
  2026-01-20 10:39                   ` Li Zhe
@ 2026-01-21 12:32                   ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 15+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-21 12:32 UTC (permalink / raw)
  To: David Laight, Li Zhe
  Cc: akpm, dan.j.williams, dave, ankur.a.arora, fvdl, gourry,
	joao.m.martins, jonathan.cameron, linux-cxl, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	wangzhou1, zhanjie9

On 1/20/26 10:47, David Laight wrote:
> On Tue, 20 Jan 2026 14:27:06 +0800
> "Li Zhe" <lizhe.67@bytedance.com> wrote:
> 
>> In light of the preceding discussion, we appear to have reached the
>> following understanding:
>>
>> (1) At present we prefer to mitigate slow application startup (e.g.,
>> VM creation) by zeroing huge pages at the moment they are freed
>> (init_on_free). The principal benefit is that user space gains the
>> performance improvement without deploying any additional user space
>> daemon.
> 
> Am I missing something?
> If userspace does:
> $ program_a; program_b
> and pages used by program_a are zeroed when it exits you get the delay
> for zeroing all the pages it used before program_b starts.
> OTOH if the zeroing is deferred program_b only needs to zero the pages
> it needs to start (and there may be some lurking).

Can you point me to where that was spelled out as a requirement?

> 
> The only real gain has to come from zeroing pages when the system is idle.
> That will give plenty of zeroed pages needed for starting a web browser
> from the desktop and also speed up single-threaded things like 'make -j1'.

I am strictly against over-engineering any features that only hugetlb
benefits from.

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
  2026-01-20 18:18                     ` Gregory Price
                                         ` (2 preceding siblings ...)
  2026-01-21  8:03                       ` Li Zhe
@ 2026-01-21 12:41                       ` David Hildenbrand (Red Hat)
  3 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-21 12:41 UTC (permalink / raw)
  To: Gregory Price, Li Zhe
  Cc: david.laight.linux, akpm, ankur.a.arora, dan.j.williams, dave,
	fvdl, joao.m.martins, jonathan.cameron, linux-cxl, linux-kernel,
	linux-mm, mhocko, mjguzik, muchun.song, osalvador, raghavendra.kt,
	wangzhou1, zhanjie9

On 1/20/26 19:18, Gregory Price wrote:
> On Tue, Jan 20, 2026 at 06:39:48PM +0800, Li Zhe wrote:
>> On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:
>>
>>> On Tue, 20 Jan 2026 14:27:06 +0800
>>> "Li Zhe" <lizhe.67@bytedance.com> wrote:
>>>
>>>
>>> Am I missing something?
>>> If userspace does:
>>> $ program_a; program_b
>>> and pages used by program_a are zeroed when it exits you get the delay
>>> for zeroing all the pages it used before program_b starts.
>>> OTOH if the zeroing is deferred program_b only needs to zero the pages
>>> it needs to start (and there may be some lurking).
>>
>> Under the init_on_free approach, improving the speed of zeroing may
>> indeed prove necessary.
>>
>> However, I believe we should first reach consensus on adopting
>> “init_on_free” as the solution to slow application startup before
>> turning to performance tuning.
>>
> 
> His point was init_on_free may not actually reduce any delays on serial
> applications, and can actually introduce additional delays.
> 
> Example
> -------
> program_a:  alloc_hugepages(10);
>              exit();
> 
> program_b:  alloc_hugepages(5);
> 	    exit();
> 
> /* Run programs in serial */
> sh:  program_a && program_b
> 
> in zero_on_alloc():
> 	program_a eats zero(10) cost on startup
> 	program_b eats zero(5) cost on startup
> 	Overall zero(15) cost to start program_b
> 
> in zero_on_free()
> 	program_a eats zero(10) cost on startup
> 	program_a eats zero(10) cost on exit
> 	program_b eats zero(0) cost on startup
> 	Overall zero(20) cost to start program_b
> 
> zero_on_free is worse by zero(5)
> -------
> 
> This is a trivial example, but it's unclear zero_on_free actually
> provides a benefit.  You have to know ahead of time what the runtime
> behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
> be to determine whether there's an actual reduction in startup time.

For VMs with hugetlb, people usually have some spare pages lying around.
VM startup time is more important for cloud providers than VM shutdown time.

I'm sure there are examples where it is the other way around, but having 
mixed workloads on the system is likely not the highest priority right now.

> 
> But just trivially, starting from the base case of no pages being
> zeroed, you're just injecting an additional zero(X) cost if program_a()
> consumes more hugepages than program_b().


And whatever you do,

program_a()
program_b()

will have to zero the pages.

No asynchronous mechanism will really help.

> 
> Long way of saying the shift from alloc to free seems heuristic-y and
> you need stronger analysis / better data to show this change is actually
> beneficial in the general case.

I think the principle of "the allocator already contains zeroed pages" 
is quite universal and simple.

Whether you actually zero the pages when the last reference is
gone (like we do in the buddy), or have that happen from some
asynchronous context, is rather an internal optimization.

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2026-01-21 12:41 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <9daa39e6-9653-45cc-8c00-abf5f3bae974@kernel.org>
     [not found] ` <20260115093641.44404-1-lizhe.67@bytedance.com>
     [not found]   ` <83798495-915b-4a5d-9638-f5b3de913b71@kernel.org>
2026-01-15 11:57     ` [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism Jonathan Cameron
2026-01-15 17:08       ` David Hildenbrand (Red Hat)
2026-01-15 20:16         ` dan.j.williams
2026-01-15 20:22           ` David Hildenbrand (Red Hat)
2026-01-15 22:30             ` Ankur Arora
2026-01-20  6:27               ` Li Zhe
2026-01-20  9:47                 ` David Laight
2026-01-20 10:39                   ` Li Zhe
2026-01-20 18:18                     ` Gregory Price
2026-01-20 18:38                       ` Gregory Price
2026-01-20 19:30                       ` David Laight
2026-01-20 19:52                         ` Gregory Price
2026-01-21  8:03                       ` Li Zhe
2026-01-21 12:41                       ` David Hildenbrand (Red Hat)
2026-01-21 12:32                   ` David Hildenbrand (Red Hat)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox