Re: [PATCH 12/12] mm/hugetlb: make bootmem allocation work with KHO

From: Pratyush Yadav <pratyush@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>,
	 Pasha Tatashin <pasha.tatashin@soleen.com>,
	 Alexander Graf <graf@amazon.com>,
	 Muchun Song <muchun.song@linux.dev>,
	 Oscar Salvador <osalvador@suse.de>,
	 David Hildenbrand <david@kernel.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Jason Miu <jasonmiu@google.com>,
	kexec@lists.infradead.org,  linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 12/12] mm/hugetlb: make bootmem allocation work with KHO
Date: Mon, 25 May 2026 17:24:09 +0200
Message-ID: <2vxzo6i37bs6.fsf@kernel.org>
In-Reply-To: <agmS5yZRflzN1M8U@kernel.org> (Mike Rapoport's message of "Sun, 17 May 2026 13:05:27 +0300")

Hi Mike,

On Sun, May 17 2026, Mike Rapoport wrote:

> Hi Pratyush,
>
> On Wed, Apr 29, 2026 at 03:39:14PM +0200, Pratyush Yadav wrote:
>> From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
>> 
>> Gigantic page allocation is somewhat broken currently when KHO is used.
>> 
>> Firstly, they break KHO scratch size accounting. RSRV_KERN is used to
>> track how much memory is reserved for use by the kernel. Since
>> alloc_bootmem() calls the memblock_alloc*() APIs, the hugepages
>> allocated also get marked as RSRV_KERN.
>
> The intended semantics of RSRV_KERN was to distinguish memory reserved
> because it's in use by firmware from the memory reserved or allocated by
> the kernel. It does not have to be directly used by the kernel, some of
> PG_Reserved memory is mapped to user-space and it's not only hugetlb. There
> are zero pages, VDSO and maybe something else, I don't remember.
>
>> Allocations marked RSRV_KERN are used by KHO to calculate how much
>> scratch space it should reserve to make sure the next kernel has enough
>> memory to boot when it is in scratch-only phase. Counting hugepages in
>
> The size of RSRV_KERN allocations is also used in
> memblock_estimated_nr_free_pages() that's currently used by
> set_max_threads() to set several rlimits and by s390 to allocate colored
> zero pages. Excluding hugetlb from that will skew the calculations there.

Oh, I didn't know that. I thought it was only for KHO.

>
>> that blows up scratch size, and can lead to the scratch allocation
>> failing, making KHO unusable. This will show up when huge pages make up
>> more than 50% of the system, which is a fairly common use case.
>> 
>> Secondly, while not supported right now, huge pages are user memory and
>> can be preserved via KHO. The scratch spaces should not have any
>> preserved memory. Allocating hugepages from scratch (on a KHO boot) can
>> lead to them being un-preservable.
>> 
>> Introduce memblock_alloc_nid_user(). This does two things: first, it
>> instructs __memblock_alloc_range_nid() to not use scratch areas to
>> fulfill allocation. If KHO is in scratch-only mode, allocations will
>> only be made from extended scratch areas. Second, it removes RSRV_KERN
>> from the allocation to make sure it doesn't mess up scratch size
>> accounting.
>> 
>> To reduce duplication, introduce __memblock_alloc_range_nid() which does
>> exactly what memblock_alloc_range_nid() used to do, but takes the flags
>> from its caller. Then make memblock_alloc_range_nid() a wrapper to it.
>> This lets memblock_alloc_nid_user() re-use most of the logic without
>> causing churn to update all callers of memblock_alloc_range_nid() and
>> adding yet another argument to it.
>
> That's neat :)
>
> But I'm not too fond of the memblock_alloc_nid_user() as a concept. That
> early at boot everything is still kernel, even though hugetlb pages might
> become user afterwards if they are actually consumed.
> Another thing, is that adding such global API to memblock could be abused
> and suddenly some early code will clear RSRV_KERN for random pieces of
> memory.
>
> If we'd still need a special memblock function for gigantic pages
> allocation I'd rather make it explicit that it's for hugetlb and keep it in
> mm/internal.h

Sure, that would work I think.

>
> I was thinking about possible alternative solutions and here's what I came
> up with

Thanks for the ideas, much appreciated!

>
> 1. SCRATCH_EXT nicely increases the memory pool available to memblock, but
>    the decision which memory can be used for which allocations becomes more
>    complex. It should make sure that SCRATCH is not used for hugetlb, but
>    OTOH it's preferred for other early allocations. With implicit
>    interaction between choose_memblock_flags() and should_skip_region()
>    this seems to me quite a headache.

Yep, I agree. I can try to refactor some of the logic to make this
easier to understand, but I am not sure how successful I would be.

>    memblock_reserved_clear_kern() should be changed to something else to
>    keep set_max_threads() working. And, IMHO, memblock_alloc_nid_user()
>    should be turned into memblock_alloc_gigantic_hugetlb() and only exposed
>    to MM code. It's even possible that it will duplicate some of
>    memblock_alloc_range_nid() rather than use it because hugetlb is always a
>    special case.

That's fine by me I think. Perhaps we can also hide the complexity of
the allocation strategy here? For example, we won't have to expose the
semantics of SCRATCH_EXT-only allocations to the rest of the memblock
allocation machinery.
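
As a rough illustration of keeping the strategy inside one
hugetlb-specific helper, here is a user-space toy model, not kernel code.
The enum, struct, and function names are all hypothetical and exist only
for this sketch. The only things taken from the series are the two rules:
gigantic pages never come from scratch, and in scratch-only mode they may
only come from extended scratch. Alignment and NUMA nodes are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region kinds; they only mirror the ideas in this thread. */
enum toy_kind { TOY_NORMAL, TOY_SCRATCH, TOY_SCRATCH_EXT };

struct toy_region {
	uint64_t base;
	uint64_t size;
	enum toy_kind kind;
};

/* May a gigantic page be placed in this region? */
bool toy_hugetlb_may_use(const struct toy_region *r, bool scratch_only)
{
	if (r->kind == TOY_SCRATCH)
		return false;	/* never allocate hugepages from scratch */
	if (scratch_only)
		return r->kind == TOY_SCRATCH_EXT;
	return true;
}

/* Return the index of the first usable region that fits, or -1. */
int toy_hugetlb_pick(const struct toy_region *regs, size_t nr,
		     uint64_t size, bool scratch_only)
{
	assert(regs != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (toy_hugetlb_may_use(&regs[i], scratch_only) &&
		    regs[i].size >= size)
			return (int)i;
	}
	return -1;
}
```

In this model the callers only ever see toy_hugetlb_pick(); the rule
about which region kinds are acceptable lives in one place and does not
leak into the generic allocation path.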

>
> 2. Split memblock_reserve() part from kho_mem_retrieve() to run before
>    hugetlb allocations. With this we won't need new types and APIs, we can
>    ensure that hugetlb allocations do not use SCRATCH by reserving scratch
>    areas before hugetlb allocations and releasing them afterwards.
>    Obviously this is the slowest option as it will slow down all memblock
>    allocations from the point we memblock_reserve() preserved memory. Still
>    realistically I wouldn't expect large impact on performance because the
>    heaviest part there is reserving of the preserved memory that has to
>    happen anyway. Also, I don't think that a system that uses a lot of
>    gigantic pages will have a lot of preserved chunks scattered around.

This option would make my life easier (in the short term) because I
think it is the easiest to implement. But I don't think this is good for
KHO long term.

First, I think the memblock_reserve() calls are a band-aid and we should
get rid of them entirely. They scale terribly. With even a few gigabytes
of preserved memory they take on the order of seconds. I ran a quick
experiment, and with around 4 GiB of 4k preservations, it takes around
650 ms on my test machine.
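
To put those figures in perspective, here is a back-of-the-envelope
calculation as a small C sketch. The helper names are hypothetical, and
it assumes each 4k preservation is handled as its own entry and that the
650 ms is spread evenly across them, which is only an average and says
nothing about how the cost grows.

```c
#include <assert.h>
#include <stdint.h>

/* Number of page-sized preservations needed to cover total_bytes. */
uint64_t nr_preservations(uint64_t total_bytes, uint64_t page_size)
{
	assert(page_size != 0);
	return total_bytes / page_size;
}

/* Average cost per preservation in ns (integer division, rounds down). */
uint64_t avg_cost_ns(uint64_t total_ms, uint64_t nr)
{
	assert(nr != 0);
	return total_ms * 1000000ULL / nr;
}
```

With the numbers above, nr_preservations(4ULL << 30, 4096) gives 1048576
entries, and avg_cost_ns(650, 1048576) gives 619, so roughly 620 ns per
4k preservation on that machine.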

And this isn't a theoretical use case. We are looking to preserve
similar sized memfds on our machines. This would be on top of having a
few terabytes in 1G huge pages. I think we should integrate KHO more
deeply into the memblock -> buddy handover sequence.

Second, while you pay the same cost on doing the memblock_reserve(), the
reservations need to happen much sooner. Right now, the reservations
happen right before memblock_free_all(). With this we would need to move
the reservations to mm_core_init_early() which happens right after
setup_arch(). So every allocation after mm_core_init_early() will take a
hit. While I haven't measured the performance impact with this, I
imagine it won't be trivial.

>
> 3. Move the complexity into hugetlb and make it preserve all the gigantic
>    pages with KHO. This means, though, that we won't be able to increase
>    the number of gigantic pages after the first boot (although decreasing
>    it seems easy) and that we need to let scratch auto scaling understand
>    what were the "normal" memblock allocations and what were the
>    allocations of the gigantic pages.

This is a viable option. There is of course the limitation of not being
able to add more gigantic pages but I suppose most workloads should be
able to live with that. The obvious downside is that you don't have a
fallback if scratch runs out so KHO will be a little less resilient.

The complexity with this comes from correctly preserving all the
gigantic pages. Hugepages can be removed at runtime, so we would need to
plug into the removal mechanism. There's also demotion and other things
that need to be taken care of. We need to track this because a freed
hugepage might later contain preserved memory from something else.

>
> 4. Invert SCRATCH_EXT logic and instead of freeing large chunks around the
>    preserved memory to SCRATCH_EXT, reserve memory surrounding the
>    preserved areas and release scratch_only before hugetlb allocations.
>    We'd still need to somehow prevent hugetlb allocation spilling into
>    scratch and there's a nasty piece of releasing the memory around the
>    preserved chunks. On the bright side, I think it's feasible to defer the
>    release of those regions and free them when we are already
>    multithreaded. This is probably the most involved alternative but it also
>    could help with the bottleneck of kho_mem_retrieve() creating too many
>    memblock.reserved regions.

I think we should fix kho_mem_retrieve() by making memblock_free_all()
and deferred_free_pages() aware of KHO. I feel that splitting the
freeing of pages into two phases is going to make the early boot more
complicated than it needs to be. And on top of all this we still need to do
the special case thing for hugetlb anyway. So I am not sure if this is a
good idea...

>
> Thoughts?

I personally like option 1 the most (maybe because it is very similar to
my patchset). My point is, if you think the "extended scratch" is fine
as a concept and we only need to clean up the neighbouring stuff like
memblock allocation strategy, then I would like to pursue this since I
think this has the most potential.

I think option 3 is also viable. While I personally am not a huge fan, I
don't hate it either. So if you really dislike 1 then I can try to make
3 work.

Option 4 can kind of work, but it just gives me a bad feeling that it
will make early boot memory management more complex and might result in
flaky boot failures.

Option 2 is the worst of the lot I think. It is the easiest to
implement, but I think it just adds to the tech debt that
memblock_reserve() is.

So, in summary, I would like to pursue option 1 and try to make it more
appetizing. But I would like to at least know if you hate the "extended
scratch" (ignore the name) as a concept or only the code it results in.

-- 
Regards,
Pratyush Yadav
