From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Qi Zheng <qi.zheng@linux.dev>,
akpm@linux-foundation.org, ljs@kernel.org, ziy@nvidia.com,
baolin.wang@linux.alibaba.com, liam@infradead.org,
npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
baohua@kernel.org, lance.yang@linux.dev, muchun.song@linux.dev,
osalvador@suse.de, chrisl@kernel.org, kasong@tencent.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com,
baoquan.he@linux.dev, youngjun.park@lge.com, peterx@redhat.com,
usama.arif@linux.dev, willy@infradead.org, vbabka@kernel.org,
surenb@google.com, mhocko@suse.com, jackmanb@google.com,
hannes@cmpxchg.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [RFC PATCH 0/8] Introducte Reserved THP
Date: Mon, 29 Jun 2026 14:20:28 +0200
Message-ID: <e2bd33d5-4de5-49cd-970f-9e80eec91a3b@kernel.org>
In-Reply-To: <cover.1782538002.git.zhengqi.arch@bytedance.com>
On 6/27/26 09:21, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> Hi all,
>
Hi,
> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
> to open up a discussion on how to use this as a stepping stone toward unifying
> HugeTLB and THP (Transparent Huge Page).
>
> 1. Background
> =============
>
> Currently, two huge page solutions co-exist in the kernel:
>
> 1. HugeTLB: Supports reservation, guaranteeing successful allocation within the
> reserved pool. However, it does not support features like swap. And
> it is a relatively independent subsystem.
> 2. THP: Does not support reservation and may fail to allocate and fall back to
> small pages when system memory is fragmented, but it is more tightly
> integrated with mm core and supports features like swap.
>
> Both have their pros and cons. However, in one of our internal scenarios, it
> seems we need to combine the features of both to meet the requirements.
>
> In our internal scenario, a user process needs to reserve double the amount
> of Hugetlb memory due to hot-upgrade requirements. For example, if the
> process needs 16GB of Hugetlb, an additional 16GB is required during the
> hot-upgrade to satisfy memory allocations. After the upgrade, the old
> process exits and releases the 16GB of HugeTLB. Therefore, in most cases,
> the extra 16GB of HugeTLB is wasted.
>
> A straightforward idea is to use the Hugetlb CMA feature, reserving a total
> of 32GB of hugetlb_cma. During normal operation, 16GB is consumed, and the
> remaining 16GB can be used by other processes. During hot-upgrade, we could
> try to migrate the memory used by other processes to allocate the required
> extra 16GB of Hugetlb. This might work, but it still requires reserving 32GB
> of memory.
>
> We also found that during the hot upgrade, about 10GB of the old process's
> hugetlb is actually cold memory, which could theoretically be reclaimed. In
> extreme cases, we could reserve only 22GB of memory and reclaim the
> remaining 10GB during the hot upgrade. But unfortunately, hugetlb currently
> does not support swap, and supporting it seems quite difficult.
>
> Therefore, we are wondering if we can introduce "reserved THP", which is THP
> that can be reserved. It can be consumed through methods like madvise(), while
> normal memory allocation cannot consume it.
madvise(). Gah. No :)
> This can achieve an effect similar
> to hugetlb. And because it is THP, it can relatively easily support swap
> features, which perfectly solves the above problem.
No, this is the wrong approach. We really shouldn't make the same mistake
hugetlb did by supporting reservation of non-file-backed (IOW, anonymous) memory.
And even for files, the hugetlb mechanism is an absolute trainwreck, because
it's not NUMA-aware.
This really needs some proper thought.
>
> Additionally, in 2024 (or possibly earlier), there have been discussions about
> the possibility of unifying Hugetlb and THP:
>
> Link: https://lwn.net/Articles/974491/
>
> After all, hugetlb's management is relatively independent and requires too
> much special handling in mm core. The introduction of reserved THP might be
> an opportunity. In the future, reserved THP could be enhanced to support
> various hugetlb features, such as acting as a backend for hugetlbfs. When
> reserved THP can completely replace HugeTLB, HugeTLB could be entirely
> removed, and reserved THP would just become a feature of THP.
>
> 2. Implementation
> =================
>
> In 2024, Yu Zhao proposed a similar idea:
>
> Link: https://lore.kernel.org/all/20240229183436.4110845-2-yuzhao@google.com/
>
> The idea was to introduce two virt zones: ZONE_NOSPLIT and ZONE_NOMERGE to
> guarantee the allocation success rate of THP, achieving an effect similar to
> reservation. However, it seems there was no further progress, perhaps because of
> reluctance to introduce more virt zones like ZONE_MOVABLE.
>
> This RFC wants to discuss another implementation:
>
> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
> and `thp_reserved_nr`. When set, the required memory is marked as
> MIGRATE_RESERVED_THP and put back into the buddy allocator.
I'm all for some mechanism to make runtime allocation of large chunks of memory
easier, by adding a pool from where multiple consumers (THP, guest_memfd,
hugetlb, whatever) can allocate memory.
Call me very skeptical of getting the page allocator involved like this. (I hate it)
> 3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
> MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
> Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.
Definitely no.
>
> This can achieve a reservation effect similar to HugeTLB and guarantee
> allocation success.
>
> 3. Future Plans
> ===============
>
> 3.1 Enhance swap-out and swap-in for large folios
> -------------------------------------------------
>
> Currently, for swap-out, THP_SWAP is supported, but it only tries to swap out
> the THP folio as a whole. It is still possible to be forced to split in some
> situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
> swap-in, it is almost impossible to directly swap in the THP folio as a whole.
>
> But for reserved THP, splitting is not allowed. We need to ensure that it
> remains a whole huge page during swap-out and swap-in, to achieve a function
> similar to hugetlb swap.
>
>
> 3.2 Integrate reserved THP into the common reclaim path
> -------------------------------------------------------
>
> Once swap-in and swap-out of huge pages can be supported without splitting,
> reserved THP can be integrated into the common reclaim path as a normal LRU
> folio for memory reclamation. This fills the gap of the hugetlb swap function.
>
> 3.3 Use reserved THP as a backend for shmem/tmpfs
> -------------------------------------------------
>
> This would allow shared or file-like usage to utilize reserved THP.
>
Really, any kind of reservation should be file-centric and have some level of
control.
And soon the question would pop up "but how can we control this inside memcgs".
This all needs some thought.
> 3.4 Use reserved THP as a backend for hugetlbfs
> -----------------------------------------------
>
> This would allow existing hugetlb users or applications to seamlessly switch to
> reserved THP.
You are really talking about a memory pool that can be used by different consumers.
I raised that in the past in the context of guest_memfd, whereby the short-term
plan is to take pages from hugetlb's pool, when really there should be a global
pool that can be consumed by various consumers.
A lot of questions around that.
>
> 3.5 Add 1GB page support to reserved THP
> ----------------------------------------
>
> Historically, there have been several attempts to add 1GB huge page support to
> THP:
>
> 1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@gmail.com/
> 2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
>
> Adding 1GB huge page support for reserved THP would be relatively simpler
> compared to regular THP.
And that's what I told Usama: start with 1 GiB THP support for shmem/tmpfs, and
make it configurable.
How we would add a reservation mechanism is a good question, because hugetlb
reservation is a broken concept. And anything that's not NUMA- or memcg-aware
will be a broken concept, I'm afraid.
>
> 3.6 Remove Hugetlb
> ------------------
>
> Once reserved THP can completely replace the existing functions of hugetlb, we
> can gradually remove Hugetlb, leaving only one huge page management system in
> the kernel.
I'm sorry, but there is no way this will work in any reasonable timeframe unless
you mimic the exact user-facing ABI -- and I don't think we'll gain a lot that way.
I know, we all like to dream, but this just isn't feasible.
--
Cheers,
David