Linux-mm Archive on lore.kernel.org
From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Qi Zheng <qi.zheng@linux.dev>,
	akpm@linux-foundation.org, ljs@kernel.org, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, liam@infradead.org,
	npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
	baohua@kernel.org, lance.yang@linux.dev, muchun.song@linux.dev,
	osalvador@suse.de, chrisl@kernel.org, kasong@tencent.com,
	shikemeng@huaweicloud.com, nphamcs@gmail.com,
	baoquan.he@linux.dev, youngjun.park@lge.com, peterx@redhat.com,
	usama.arif@linux.dev, willy@infradead.org, vbabka@kernel.org,
	surenb@google.com, mhocko@suse.com, jackmanb@google.com,
	hannes@cmpxchg.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [RFC PATCH 0/8] Introducte Reserved THP
Date: Mon, 29 Jun 2026 14:20:28 +0200
Message-ID: <e2bd33d5-4de5-49cd-970f-9e80eec91a3b@kernel.org>
In-Reply-To: <cover.1782538002.git.zhengqi.arch@bytedance.com>

On 6/27/26 09:21, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Hi all,
> 

Hi,

> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
> to open up a discussion on how to use this as a stepping stone toward unifying
> HugeTLB and THP (Transparent Huge Page).
> 
> 1. Background
> =============
> 
> Currently, two huge page solutions co-exist in the kernel:
> 
> 1. HugeTLB: Supports reservation, guaranteeing successful allocation within the
>             reserved pool. However, it does not support features like swap. And
>             it is a relatively independent subsystem.
> 2. THP: Does not support reservation and may fail to allocate, falling back to
>         small pages when system memory is fragmented, but it is more tightly
>         integrated with mm core and supports features like swap.
> 
> Both have their pros and cons. However, in one of our internal scenarios, it
> seems we need to combine the features of both to meet the requirements.
> 
> In our internal scenario, a user process needs to reserve double the amount
> of Hugetlb memory due to hot-upgrade requirements. For example, if the
> process needs 16GB of Hugetlb, an additional 16GB is required during the
> hot-upgrade to satisfy memory allocations. After the upgrade, the old
> process exits and releases the 16GB of HugeTLB. Therefore, in most cases,
> the extra 16GB of HugeTLB is wasted.
> 
> A straightforward idea is to use the Hugetlb CMA feature, reserving a total
> of 32GB of hugetlb_cma. During normal operation, 16GB is consumed, and the
> remaining 16GB can be used by other processes. During hot-upgrade, we could
> try to migrate the memory used by other processes to allocate the required
> extra 16GB of Hugetlb. This might work, but it still requires reserving 32GB
> of memory.
> 
> We also found that during the hot upgrade, about 10GB of the old process's
> hugetlb is actually cold memory, which could theoretically be reclaimed. In
> extreme cases, we could reserve only 22GB of memory and reclaim the
> remaining 10GB during the hot upgrade. But unfortunately, hugetlb currently
> does not support swap, and supporting it seems quite difficult.
> 
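
The sizing in the quoted scenario can be checked with a few lines of C. This
is only a sketch of the arithmetic: the helper names are hypothetical, and the
2 MiB huge page size is an assumption (PMD-sized pages with a 4 KiB base page
size), not something stated in the cover letter.

```c
#include <assert.h>
#include <stdint.h>

#define GIB          (1ULL << 30)
#define PMD_THP_SIZE (2ULL << 20) /* assumption: 2 MiB huge pages */

/* Peak demand during the hot upgrade: old and new process both resident. */
static uint64_t peak_demand(uint64_t working_set)
{
    return 2 * working_set;
}

/* Reservation needed if `cold` bytes of the old process can be swapped out. */
static uint64_t reservation_with_reclaim(uint64_t working_set, uint64_t cold)
{
    assert(cold <= working_set);
    return peak_demand(working_set) - cold;
}

/* Number of huge pages needed to back a reservation of `bytes`. */
static uint64_t nr_huge_pages(uint64_t bytes)
{
    return bytes / PMD_THP_SIZE;
}
```

With a 16 GiB working set and 10 GiB of cold memory, the reservation drops
from 32 GiB (16384 huge pages) to 22 GiB (11264 huge pages), which matches the
numbers quoted above.
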
> Therefore, we are wondering if we can introduce "reserved THP", which is THP
> that can be reserved. It can be consumed through methods like madvise(), while
> normal memory allocation cannot consume it.

madvise(). Gah. No :)
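
For context, this is roughly what the existing madvise()-based THP hint looks
like from userspace. The sketch uses MADV_HUGEPAGE, which exists today; the
MADV_RESERVED_THP flag from the RFC is a proposal and is deliberately not used.
The helper name and the 2 MiB alignment are assumptions for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20) /* assumption: 2 MiB PMD-sized THP */

/*
 * Map `len` bytes of anonymous memory aligned to HPAGE_SIZE and hint that the
 * range should be backed by THP. `len` must be a multiple of HPAGE_SIZE.
 * Returns the aligned address, or NULL on failure. *hint_err receives 0 or
 * the errno reported by madvise() (e.g. EINVAL on kernels built without THP).
 */
static void *map_thp_hinted(size_t len, int *hint_err)
{
    if (len == 0 || len % HPAGE_SIZE != 0)
        return NULL;

    /* Over-allocate so that an aligned window of `len` bytes always fits. */
    size_t span = len + HPAGE_SIZE;
    char *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t up = ((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    char *aligned = (char *)up;
    char *end = aligned + len;
    char *raw_end = raw + span;

    /* Trim the unaligned head and the unused tail of the mapping. */
    if (aligned > raw)
        munmap(raw, (size_t)(aligned - raw));
    if (raw_end > end)
        munmap(end, (size_t)(raw_end - end));

    /* The hint is advisory: the kernel may still use small pages. */
    *hint_err = madvise(aligned, len, MADV_HUGEPAGE) == 0 ? 0 : errno;
    return aligned;
}
```

Note that the hint is purely per-range and per-process: nothing here ties the
request to a file, a NUMA node or a memcg.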

> This can achieve an effect similar
> to hugetlb. And because it is THP, it can relatively easily support swap
> features, which perfectly solves the above problem.

No, this is the wrong approach. We really shouldn't repeat hugetlb's mistake
of supporting reservation of non-file-backed (IOW, anonymous) memory.

And even for files, the hugetlb mechanism is an absolute trainwreck, because
it's not NUMA aware.

This really needs some proper thought.

> 
> Additionally, in 2024 (or possibly earlier), there have been discussions about
> the possibility of unifying Hugetlb and THP:
> 
> Link: https://lwn.net/Articles/974491/
> 
> After all, hugetlb's management is relatively independent and requires too
> much special handling in mm core. The introduction of reserved THP might be
> an opportunity. In the future, reserved THP could be enhanced to support
> various hugetlb features, such as acting as a backend for hugetlbfs. When
> reserved THP can completely replace HugeTLB, HugeTLB could be entirely
> removed, and reserved THP would just become a feature of THP.
> 
> 2. Implementation
> =================
> 
> In 2024, Yu Zhao proposed a similar idea:
> 
> Link: https://lore.kernel.org/all/20240229183436.4110845-2-yuzhao@google.com/
> 
> The idea was to introduce two virt zones: ZONE_NOSPLIT and ZONE_NOMERGE to
> guarantee the allocation success rate of THP, achieving an effect similar to
> reservation. However, it seems there was no further progress, perhaps because of
> reluctance to introduce more virt zones like ZONE_MOVABLE.
> 
> This RFC wants to discuss another implementation:
> 
> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
>    and `thp_reserved_nr`. When set, the required memory is marked as
>    MIGRATE_RESERVED_THP and put back into the buddy allocator.

I'm all for some mechanism to make runtime allocation of large chunks of memory
easier, by adding a pool from which multiple consumers (THP, guest_memfd,
hugetlb, whatever) can allocate memory.

Call me very skeptical of getting the page allocator involved like this. (I hate it)

> 3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
>    MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
>    Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.

Definitely no.

> 
> This can achieve a reservation effect similar to HugeTLB and guarantee
> allocation success.
> 
> 3. Future Plans
> ===============
> 
> 3.1 Enhance swap-out and swap-in for large folios
> -------------------------------------------------
> 
> Currently, for swap-out, THP_SWAP is supported, but it only tries to swap out
> the THP folio as a whole. It is still possible to be forced to split in some
> situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
> swap-in, it is almost impossible to directly swap in the THP folio as a whole.
> 
> But for reserved THP, splitting is not allowed. We need to ensure that it
> remains a whole huge page during swap-out and swap-in, to achieve a function
> similar to hugetlb swap.
> 
> 
> 3.2 Integrate reserved THP into the common reclaim path
> -------------------------------------------------------
> 
> Once swap-in and swap-out of huge pages can be supported without splitting,
> reserved THP can be integrated into the common reclaim path as a normal LRU
> folio for memory reclamation. This fills the gap of the hugetlb swap function.
> 
> 3.3 Use reserved THP as a backend for shmem/tmpfs
> -------------------------------------------------
> 
> This would allow shared or file-like usage to utilize reserved THP.
> 

Really, any kind of reservation should be file-centric and have some level of
control.

And soon the question would pop up "but how can we control this inside memcgs".

This all needs some thought.


> 3.4 Use reserved THP as a backend for hugetlbfs
> -----------------------------------------------
> 
> This would allow existing hugetlb users or applications to seamlessly switch to
> reserved THP.

You are really talking about a memory pool that can be used by different consumers.

I raised that in the past in the context of guest_memfd, whereby the short-term
plan is to take pages from hugetlb's pool, when really there should be a global
pool that can be consumed by various consumers.

A lot of questions around that.
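
To make the "one pool, several consumers" idea concrete, here is a toy
userspace model. It is not kernel code and not a proposed design: every name
in it (struct huge_pool, pool_take(), the consumer enum, MAX_NODES) is
hypothetical. It only shows per-node accounting with all-or-nothing
allocation; memcg control is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4 /* hypothetical: number of NUMA nodes modelled */

enum pool_consumer {
    POOL_THP,
    POOL_GUEST_MEMFD,
    POOL_HUGETLB,
    POOL_NR_CONSUMERS
};

struct huge_pool {
    size_t free[MAX_NODES];                    /* free huge pages per node */
    size_t used[MAX_NODES][POOL_NR_CONSUMERS]; /* pages held per consumer */
};

static void pool_init(struct huge_pool *p, const size_t per_node[MAX_NODES])
{
    for (int n = 0; n < MAX_NODES; n++) {
        p->free[n] = per_node[n];
        for (int c = 0; c < POOL_NR_CONSUMERS; c++)
            p->used[n][c] = 0;
    }
}

/* Take `nr` pages from one node: all or nothing, no fallback to other nodes. */
static bool pool_take(struct huge_pool *p, int node,
                      enum pool_consumer who, size_t nr)
{
    assert(node >= 0 && node < MAX_NODES);
    if (p->free[node] < nr)
        return false;
    p->free[node] -= nr;
    p->used[node][who] += nr;
    return true;
}

/* Return `nr` pages previously taken by `who` on `node`. */
static void pool_give_back(struct huge_pool *p, int node,
                           enum pool_consumer who, size_t nr)
{
    assert(node >= 0 && node < MAX_NODES);
    assert(p->used[node][who] >= nr);
    p->used[node][who] -= nr;
    p->free[node] += nr;
}
```

Even this toy version raises the open questions: whether a request may fall
back to another node, who is allowed to take how much, and how any of it would
be charged.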

> 
> 3.5 Add 1GB page support to reserved THP
> ----------------------------------------
> 
> Historically, there have been several attempts to add 1GB huge page support to
> THP:
> 
> 1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@gmail.com/
> 2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
> 
> Adding 1GB huge page support for reserved THP would be relatively simpler
> compared to regular THP.

And that's what I told Usama: start with 1 GiB THP support for shmem/tmpfs, and
make it configurable.

How we would add a reservation mechanism is a good question, because hugetlb
reservation is a broken concept. And anything that's not NUMA or memcg aware
will be a broken concept as well, I'm afraid.

> 
> 3.6 Remove Hugetlb
> ------------------
> 
> Once reserved THP can completely replace the existing functions of hugetlb, we
> can gradually remove Hugetlb, leaving only one huge page management system in
> the kernel.

I'm sorry, but there is no way this will work in any reasonable timeframe unless
you mimic the exact user-facing ABI -- and I don't think we'll gain a lot that way.

I know, we all like to dream, but this just isn't feasible.

-- 
Cheers,

David


