Re: [RFC PATCH 0/8] Introducte Reserved THP

Linux-mm Archive on lore.kernel.org

From: "Zi Yan" <ziy@nvidia.com>
To: "David Hildenbrand (Arm)" <david@kernel.org>,
	"Qi Zheng" <qi.zheng@linux.dev>, <akpm@linux-foundation.org>,
	<ljs@kernel.org>, <baolin.wang@linux.alibaba.com>,
	<liam@infradead.org>, <npache@redhat.com>, <ryan.roberts@arm.com>,
	<dev.jain@arm.com>, <baohua@kernel.org>, <lance.yang@linux.dev>,
	<muchun.song@linux.dev>, <osalvador@suse.de>, <chrisl@kernel.org>,
	<kasong@tencent.com>, <shikemeng@huaweicloud.com>,
	<nphamcs@gmail.com>, <baoquan.he@linux.dev>,
	<youngjun.park@lge.com>, <peterx@redhat.com>,
	<usama.arif@linux.dev>, <willy@infradead.org>,
	<vbabka@kernel.org>, <surenb@google.com>, <mhocko@suse.com>,
	<jackmanb@google.com>, <hannes@cmpxchg.org>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	"Qi Zheng" <zhengqi.arch@bytedance.com>
Subject: Re: [RFC PATCH 0/8] Introducte Reserved THP
Date: Tue, 30 Jun 2026 19:45:08 -0400
Message-ID: <DJMS84Y4KYDY.26GNQPOZ6G4L7@nvidia.com>
In-Reply-To: <e2bd33d5-4de5-49cd-970f-9e80eec91a3b@kernel.org>

On Mon Jun 29, 2026 at 8:20 AM EDT, David Hildenbrand (Arm) wrote:
> On 6/27/26 09:21, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>> 
>> Hi all,
>> 
>
> Hi,
>
>> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
>> to open up a discussion on how to use this as a stepping stone toward unifying
>> HugeTLB and THP (Transparent Huge Page).
>> 
>> 1. Background
>> =============
>> 
>> Currently, two huge page solutions co-exist in the kernel:
>> 
>> 1. HugeTLB: Supports reservation, guaranteeing successful allocation within the
>>             reserved pool. However, it does not support features like swap. And
>>             it is a relatively independent subsystem.
>> 2. THP: Does not support reservation and may fail to allocate and fall back to
>>         small pages when system memory is fragmented, but it is more tightly
>>         integrated with mm core and supports features like swap.
>> 
>> Both have their pros and cons. However, in one of our internal scenarios, it
>> seems we need to combine the features of both to meet the requirements.
>> 
>> In our internal scenario, a user process needs to reserve double the amount
>> of Hugetlb memory due to hot-upgrade requirements. For example, if the
>> process needs 16GB of Hugetlb, an additional 16GB is required during the
>> hot-upgrade to satisfy memory allocations. After the upgrade, the old
>> process exits and releases the 16GB of HugeTLB. Therefore, in most cases,
>> the extra 16GB of HugeTLB is wasted.
>> 
>> A straightforward idea is to use the Hugetlb CMA feature, reserving a total
>> of 32GB of hugetlb_cma. During normal operation, 16GB is consumed, and the
>> remaining 16GB can be used by other processes. During hot-upgrade, we could
>> try to migrate the memory used by other processes to allocate the required
>> extra 16GB of Hugetlb. This might work, but it still requires reserving 32GB
>> of memory.
>> 
>> We also found that during the hot upgrade, about 10GB of the old process's
>> hugetlb is actually cold memory, which could theoretically be reclaimed. In
>> extreme cases, we could reserve only 22GB of memory and reclaim the
>> remaining 10GB during the hot upgrade. But unfortunately, hugetlb currently
>> does not support swap, and supporting it seems quite difficult.
>> 
>> Therefore, we are wondering if we can introduce "reserved THP", which is THP
>> that can be reserved. It can be consumed through methods like madvise(), while
>> normal memory allocation cannot consume it.
>
> madvise(). Gah. No :)
>
>> This can achieve an effect similar
>> to hugetlb. And because it is THP, it can relatively easily support swap
>> features, which perfectly solves the above problem.
>
> No, this is the wrong approach. We really shouldn't be making the same mistake
> hugetlb did and support reserving of non-filebacked memory (IOW anonymous memory).
>
> And even for files, the hugetlb mechanism is an absolute trainwreck, because
> it's not NUMA aware.
>
> This really needs some proper thought.

Do you mean the reservation should be done via a file handle, such as a
memfd, so that memory policies can easily be applied to determine where
the reserved memory is located?
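
For illustration only, here is a minimal sketch of how a NUMA memory
policy can already be attached to a memfd-backed mapping with existing
interfaces. It reserves nothing; it only shows the kind of per-file
placement control the question refers to. The helper name
map_memfd_on_node() and the memfd name are hypothetical, and raw
syscalls are used on the assumption that libnuma headers may not be
installed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Values from <linux/mempolicy.h> and <linux/memfd.h>, repeated here so
 * the sketch builds without libnuma or extra feature-test macros.
 */
#ifndef MPOL_BIND
#define MPOL_BIND 2
#endif
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif

/*
 * Hypothetical helper: create a memfd of `len` bytes, map it shared, and
 * ask the kernel to bind the mapping to NUMA node `node`.
 *
 * Returns the mapping, or NULL if the memfd or the mapping could not be
 * set up. *policy_err is set to 0 if mbind() succeeded, otherwise to the
 * errno it reported (e.g. ENOSYS without CONFIG_NUMA, EPERM in a sandbox).
 */
static void *map_memfd_on_node(size_t len, unsigned int node, int *policy_err)
{
	unsigned long nodemask;
	void *addr;
	int fd;

	assert(policy_err != NULL);

	/*
	 * mbind() treats maxnode as one more than the number of usable
	 * bits, so with a single-word mask keep node below 63.
	 */
	if (node >= 8 * sizeof(nodemask) - 1)
		return NULL;
	nodemask = 1UL << node;

	fd = (int)syscall(SYS_memfd_create, "policy-demo", (unsigned long)MFD_CLOEXEC);
	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return NULL;
	}

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping keeps the memfd alive */
	if (addr == MAP_FAILED)
		return NULL;

	/* Attach the policy before the first touch so it affects allocation. */
	*policy_err = 0;
	if (syscall(SYS_mbind, addr, (unsigned long)len, (long)MPOL_BIND,
		    &nodemask, (unsigned long)(8 * sizeof(nodemask)), 0UL) < 0)
		*policy_err = errno;

	return addr;
}
```

A caller would check *policy_err before relying on placement, since the
mapping is still usable when mbind() is refused. How a reservation would
hook into such a file handle is exactly the open question here.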

For the existing hugetlb reservation, there is no fine-grained control,
such as per NUMA node or per cgroup, over the reserved free memory.

Is that what you mean above?

>
>> 
>> Additionally, in 2024 (or possibly earlier), there have been discussions about
>> the possibility of unifying Hugetlb and THP:
>> 
>> Link: https://lwn.net/Articles/974491/
>> 
>> After all, hugetlb's management is relatively independent and requires too
>> much special handling in mm core. The introduction of reserved THP might be
>> an opportunity. In the future, reserved THP could be enhanced to support
>> various hugetlb features, such as acting as a backend for hugetlbfs. When
>> reserved THP can completely replace HugeTLB, HugeTLB could be entirely
>> removed, and reserved THP would just become a feature of THP.
>> 
>> 2. Implementation
>> =================
>> 
>> In 2024, Yu Zhao proposed a similar idea:
>> 
>> Link: https://lore.kernel.org/all/20240229183436.4110845-2-yuzhao@google.com/
>> 
>> The idea was to introduce two virt zones: ZONE_NOSPLIT and ZONE_NOMERGE to
>> guarantee the allocation success rate of THP, achieving an effect similar to
>> reservation. However, it seems there was no further progress, perhaps because of
>> reluctance to introduce more virt zones like ZONE_MOVABLE.
>> 
>> This RFC wants to discuss another implementation:
>> 
>> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
>> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
>>    and `thp_reserved_nr`. When set, the required memory is marked as
>>    MIGRATE_RESERVED_THP and put back into the buddy allocator.
>
> I'm all for some mechanism to make runtime allocation of large chunks of memory
> easier, by adding a pool from where multiple consumers (THP, guest_memfd,
> hugetlb, whatever) can allocate memory.

I agree with this one. We do not want to invent different free memory
reservation mechanisms for each possible consumer. A shared reservation
mechanism with different reservation and allocation policies is better.

>
> Call me very skeptical of getting the page allocator involved like this. (I hate it)
>
>> 3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
>>    MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
>>    Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.
>
> Definitely no.
>
>> 
>> This can achieve a reservation effect similar to HugeTLB and guarantee
>> allocation success.
>> 
>> 3. Future Plans
>> ===============
>> 
>> 3.1 Enhance swap-out and swap-in for large folios
>> -------------------------------------------------
>> 
>> Currently, for swap-out, THP_SWAP is supported, but it only tries to swap out
>> the THP folio as a whole. It is still possible to be forced to split in some
>> situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
>> swap-in, it is almost impossible to directly swap in the THP folio as a whole.
>> 
>> But for reserved THP, splitting is not allowed. We need to ensure that it
>> remains a whole huge page during swap-out and swap-in, to achieve a function
>> similar to hugetlb swap.
>> 
>> 
>> 3.2 Integrate reserved THP into the common reclaim path
>> -------------------------------------------------------
>> 
>> Once swap-in and swap-out of huge pages can be supported without splitting,
>> reserved THP can be integrated into the common reclaim path as a normal LRU
>> folio for memory reclamation. This fills the gap of the hugetlb swap function.
>> 
>> 3.3 Use reserved THP as a backend for shmem/tmpfs
>> -------------------------------------------------
>> 
>> This would allow shared or file-like usage to utilize reserved THP.
>> 
>
> Really, any kind of reservation should be file-centric and have some level of
> control.
>
> And soon the question would pop up "but how can we control this inside memcgs".
>
> This all needs some thought.
>
>
>> 3.4 Use reserved THP as a backend for hugetlbfs
>> -----------------------------------------------
>> 
>> This would allow existing hugetlb users or applications to seamlessly switch to
>> reserved THP.
>
> You are really talking about a memory pool that can be used by different consumers.
>
> I raised that in the past in the context of guest_memfd, whereby the short-term
> plan is to take pages from hugetlb's pool, when really there should be a global
> pool that can be consumed by various consumers.
>
> A lot of questions around that.
>
>> 
>> 3.5 Add 1GB page support to reserved THP
>> ----------------------------------------
>> 
>> Historically, there have been several attempts to add 1GB huge page support to
>> THP:
>> 
>> 1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@gmail.com/
>> 2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
>> 
>> Adding 1GB huge page support for reserved THP would be relatively simpler
>> compared to regular THP.
>
> And that's what I told Usama: start with 1 GiB THP support for shmem/tmpfs, and
> make it configurable.
>
> How we would add a reservation mechanism is a good question. Because hugetlb
> reservation is a broken concept. And anything that's not NUMA or memcg aware
> will be a broken concept I'm afraid.
>
>> 
>> 3.6 Remove Hugetlb
>> ------------------
>> 
>> Once reserved THP can completely replace the existing functions of hugetlb, we
>> can gradually remove Hugetlb, leaving only one huge page management system in
>> the kernel.
>
> I'm sorry, but no way this will work in any reasonable timeframe unless you
> mimic the exact user facing ABI -- and I don't think we'll gain a lot that way.
>
> I know, we all like to dream, but this just isn't feasible.

Based on my understanding, the key takeaway is that we want more control
over reserved memory: where the free memory comes from, who gets how
much of the reserved memory, and more.


-- 
Best Regards,
Yan, Zi
