From: Qi Zheng <qi.zheng@linux.dev>
To: Zi Yan <ziy@nvidia.com>,
"David Hildenbrand (Arm)" <david@kernel.org>,
akpm@linux-foundation.org, ljs@kernel.org,
baolin.wang@linux.alibaba.com, liam@infradead.org,
npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
baohua@kernel.org, lance.yang@linux.dev, muchun.song@linux.dev,
osalvador@suse.de, chrisl@kernel.org, kasong@tencent.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com,
baoquan.he@linux.dev, youngjun.park@lge.com, peterx@redhat.com,
usama.arif@linux.dev, willy@infradead.org, vbabka@kernel.org,
surenb@google.com, mhocko@suse.com, jackmanb@google.com,
hannes@cmpxchg.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [RFC PATCH 0/8] Introducte Reserved THP
Date: Thu, 2 Jul 2026 10:53:21 +0800
Message-ID: <ba68b5c0-b419-4191-8c04-2fc9a2e04f0c@linux.dev>
In-Reply-To: <DJMS84Y4KYDY.26GNQPOZ6G4L7@nvidia.com>

On 7/1/26 7:45 AM, Zi Yan wrote:
> On Mon Jun 29, 2026 at 8:20 AM EDT, David Hildenbrand (Arm) wrote:
>> On 6/27/26 09:21, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>
>>> Hi all,
>>>
>>
>> Hi,
>>
>>> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
>>> to open up a discussion on how to use this as a stepping stone toward unifying
>>> HugeTLB and THP (Transparent Huge Page).
>>>
>>> 1. Background
>>> =============
>>>
>>> Currently, two huge page solutions co-exist in the kernel:
>>>
>>> 1. HugeTLB: Supports reservation, guaranteeing successful allocation within the
>>>    reserved pool. However, it does not support features like swap, and
>>>    it is a relatively independent subsystem.
>>> 2. THP: Does not support reservation and may fail to allocate and fall back to
>>>    small pages when system memory is fragmented, but it is more tightly
>>>    integrated with mm core and supports features like swap.
>>>
>>> Both have their pros and cons. However, in one of our internal scenarios, it
>>> seems we need to combine the features of both to meet the requirements.
>>>
>>> In our internal scenario, a user process needs to reserve double the amount
>>> of Hugetlb memory due to hot-upgrade requirements. For example, if the
>>> process needs 16GB of Hugetlb, an additional 16GB is required during the
>>> hot-upgrade to satisfy memory allocations. After the upgrade, the old
>>> process exits and releases the 16GB of HugeTLB. Therefore, in most cases,
>>> the extra 16GB of HugeTLB is wasted.
>>>
>>> A straightforward idea is to use the Hugetlb CMA feature, reserving a total
>>> of 32GB of hugetlb_cma. During normal operation, 16GB is consumed, and the
>>> remaining 16GB can be used by other processes. During hot-upgrade, we could
>>> try to migrate the memory used by other processes to allocate the required
>>> extra 16GB of Hugetlb. This might work, but it still requires reserving 32GB
>>> of memory.
>>>
>>> We also found that during the hot upgrade, about 10GB of the old process's
>>> hugetlb is actually cold memory, which could theoretically be reclaimed. In
>>> extreme cases, we could reserve only 22GB of memory and reclaim the
>>> remaining 10GB during the hot upgrade. But unfortunately, hugetlb currently
>>> does not support swap, and supporting it seems quite difficult.
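
The sizing in the scenario above can be written out as a small sketch. The
struct and function names are hypothetical and only illustrate the arithmetic
(16GB working set, 10GB cold during the upgrade); they are not part of the
patchset.

```c
#include <stdint.h>

/* Sizes in GB, taken from the hot-upgrade scenario quoted above. */
struct upgrade_plan {
	uint64_t working_set_gb; /* huge page memory one process instance needs */
	uint64_t cold_gb;        /* cold part of the old instance during upgrade */
};

/* Without reclaim, the old and the new instance must both fit in the pool. */
static uint64_t reserve_without_reclaim(const struct upgrade_plan *p)
{
	return 2 * p->working_set_gb;
}

/*
 * If the cold memory of the old instance could be swapped out, it would
 * not need pool backing during the upgrade. The cold size is clamped to
 * the working set so the result never drops below one working set.
 */
static uint64_t reserve_with_reclaim(const struct upgrade_plan *p)
{
	uint64_t cold = p->cold_gb < p->working_set_gb ?
			p->cold_gb : p->working_set_gb;

	return 2 * p->working_set_gb - cold;
}
```

With working_set_gb = 16 and cold_gb = 10, the first helper gives the 32GB
reservation and the second gives the 22GB extreme case.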
>>>
>>> Therefore, we are wondering if we can introduce "reserved THP", which is THP
>>> that can be reserved. It can be consumed through methods like madvise(), while
>>> normal memory allocation cannot consume it.
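
For comparison, a minimal sketch of how a process opts a range into THP today,
using only the existing MADV_HUGEPAGE hint. The helper name map_thp_hint() is
hypothetical. The advice value this RFC proposes (MADV_RESERVED_THP) is not in
mainline and is deliberately not used here; the point is that the existing hint
gives no allocation guarantee.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS and madvise() under strict -std */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of anonymous memory and ask for THP backing.
 * MADV_HUGEPAGE is only a hint: the kernel may still fall back to
 * small pages when memory is fragmented.
 * Returns the mapping, or NULL if mmap() fails.
 */
static void *map_thp_hint(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
#ifdef MADV_HUGEPAGE
	/* Best effort: ignore failure (e.g. THP not built in). */
	(void)madvise(p, len, MADV_HUGEPAGE);
#endif
	return p;
}
```

The caller releases the range with munmap(p, len) as usual.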
>>
>> madvise(). Gah. No :)
>>
>>> This can achieve an effect similar
>>> to hugetlb. And because it is THP, it can relatively easily support swap
>>> features, which perfectly solves the above problem.
>>
>> No, this is the wrong approach. We really shouldn't be making the same mistake
>> hugetlb did and support reserving of non-filebacked memory (IOW anonymous memory).
>>
>> And even for files, the hugetlb mechanism is an absolute trainwreck, because
>> it's not NUMA aware.
>>
>> This really needs some proper thought.
>
> You mean the reservation should be done via some file handle, like
> memfd, so that it is easy to apply memory policies to determine where
> reserved memory locates?
>
> For existing hugetlb reservation, there is no fine control, like NUMA,
> or cgroup, of the reserved free memory.
>
> Is that what you mean above?
>
>>
>>>
>>> Additionally, in 2024 (or possibly earlier), there have been discussions about
>>> the possibility of unifying Hugetlb and THP:
>>>
>>> Link: https://lwn.net/Articles/974491/
>>>
>>> After all, hugetlb's management is relatively independent and requires too
>>> much special handling in mm core. The introduction of reserved THP might be
>>> an opportunity. In the future, reserved THP could be enhanced to support
>>> various hugetlb features, such as acting as a backend for hugetlbfs. When
>>> reserved THP can completely replace HugeTLB, HugeTLB could be entirely
>>> removed, and reserved THP would just become a feature of THP.
>>>
>>> 2. Implementation
>>> =================
>>>
>>> In 2024, Yu Zhao proposed a similar idea:
>>>
>>> Link: https://lore.kernel.org/all/20240229183436.4110845-2-yuzhao@google.com/
>>>
>>> The idea was to introduce two virt zones: ZONE_NOSPLIT and ZONE_NOMERGE to
>>> guarantee the allocation success rate of THP, achieving an effect similar to
>>> reservation. However, it seems there was no further progress, perhaps because of
>>> reluctance to introduce more virt zones like ZONE_MOVABLE.
>>>
>>> This RFC wants to discuss another implementation:
>>>
>>> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
>>> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
>>>    and `thp_reserved_nr`. When set, the required memory is marked as
>>>    MIGRATE_RESERVED_THP and put back into the buddy allocator.
>>
>> I'm all for some mechanism to make runtime allocation of large chunks of memory
>> easier, by adding a pool from where multiple consumers (THP, guest_memfd,
>> hugetlb, whatever) can allocate memory.
>
> I agree with this one. We do not want to invent different free memory
> reservation mechanisms for each possible consumer. A shared reservation
> mechanism with different reservation and allocation policies is better.

Hi all, thanks a lot for the feedback! It seems that introducing a new
reservation approach isn't the best way to go. Is the consensus to
address the problem by doing the following?

1. Make THP allocation more reliable.
   (Pointed out by Gregory. I think this correspondingly requires
   swap-in to support bringing in the THP folio as a whole, which is
   also the issue Matthew mentioned that the swap subsystem needs
   to address.)
2. Design a shared memory reservation mechanism.
   (Suggested by David and Zi.)
3. Minimize memory fragmentation as much as possible.
   (As Barry suggested, we could introduce something at the pageblock
   level to record memory order preferences.)

Thanks,
Qi
>