From: Qi Zheng <qi.zheng@linux.dev>
To: Matthew Wilcox <willy@infradead.org>
Cc: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	ziy@nvidia.com, baolin.wang@linux.alibaba.com,
	liam@infradead.org, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, baohua@kernel.org, lance.yang@linux.dev,
	muchun.song@linux.dev, osalvador@suse.de, chrisl@kernel.org,
	kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	baoquan.he@linux.dev, youngjun.park@lge.com, peterx@redhat.com,
	usama.arif@linux.dev, vbabka@kernel.org, surenb@google.com,
	mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [RFC PATCH 0/8] Introducte Reserved THP
Date: Mon, 29 Jun 2026 18:13:22 +0800
Message-ID: <ae67cdd2-b20b-42cb-836e-ef3bf35a1ad0@linux.dev>
In-Reply-To: <akHqebup1xqlgC0E@casper.infradead.org>

Hi Matthew,

Thanks a lot for your feedback!

On 6/29/26 11:46 AM, Matthew Wilcox wrote:
> On Sat, Jun 27, 2026 at 03:21:48PM +0800, Qi Zheng wrote:
>> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
>> to open up a discussion on how to use this as a stepping stone toward unifying
>> HugeTLB and THP (Transparent Huge Page).
> 
> I'm really happy you're looking into this.  I'm not terribly familiar
> with the page allocator code, so I don't have any comments on the
> patches themselves, but I do have a few on your approach.

This is also what I am hoping for. The current version of the code is
just a proof-of-concept (PoC) to facilitate discussion. The real goal is
to use reserved THP as a stepping stone to discuss the challenges of
unifying HugeTLB and THP, and the overall evolution path.

Of course, swap support is a key part too. ;)

> 
>> Therefore, we are wondering if we can introduce "reserved THP", which is THP
>> that can be reserved. It can be consumed through methods like madvise(), while
>> normal memory allocation cannot consume it. This can achieve an effect similar
>> to hugetlb. And because it is THP, it can relatively easily support swap
>> features, which perfectly solves the above problem.
> 
> As I understand it, hugetlbfs reserves on mmap().

Exactly, hugetlbfs reserves HugeTLB pages at mmap() time:

hugetlbfs_file_mmap
--> hugetlb_reserve_pages

and it's the same when no hugetlbfs mount is used explicitly (e.g.
mmap() with MAP_HUGETLB):

hugetlb_file_setup
--> hugetlb_reserve_pages
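
From userspace, the mmap()-time reservation can be observed with a sketch
like the one below. It assumes Linux, a default huge page size that divides
`len` (2 MiB on x86-64), and a helper name, try_hugetlb_mmap(), that is made
up for illustration. No page is ever touched, so a failure here comes from
mmap()-time setup rather than from a page fault.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Try to map `len` bytes of anonymous HugeTLB memory without touching it.
 * Returns 0 on success, or a negative errno if mmap() itself fails
 * (typically -ENOMEM when the HugeTLB pool cannot cover the reservation).
 * `len` is assumed to be a multiple of the default huge page size.
 */
int try_hugetlb_mmap(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        return -errno;
    munmap(p, len);
    return 0;
}
```

On a system with an empty HugeTLB pool, try_hugetlb_mmap(2UL << 20) is
expected to fail immediately instead of failing later at fault time.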

Using madvise() as the example is based on the following considerations:

1. It closely aligns with the existing usage patterns of THP madvise
    mode.
2. To properly support swap, we actually need to allow overcommit before
    actual page faults occur. This allows us to perform memory reclaim
    during the page fault, swapping out cold reserved THP to satisfy the
    memory demands of the new process. So we can't directly pre-reserve the
    reserved THP at mmap/madvise time.
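
For point 1, the existing THP madvise-mode pattern looks roughly like the
sketch below. MADV_RESERVED_THP only exists in this RFC, so the sketch uses
the upstream MADV_HUGEPAGE instead; the helper name map_with_thp_hint() is
made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map anonymous memory, then opt the range in with MADV_HUGEPAGE.
 * madvise() only marks the range; it does not reserve huge pages.
 * Returns the mapping on success, or NULL with errno set by mmap().
 */
void *map_with_thp_hint(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /*
     * Assumption: the RFC's MADV_RESERVED_THP would be passed here instead.
     * The hint can fail (e.g. EINVAL when THP is not built in); the mapping
     * is still usable with base pages, so the error is ignored on purpose.
     */
    (void)madvise(p, len, MADV_HUGEPAGE);
    return p;
}
```

The caller owns the mapping and releases it with munmap(p, len).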

The second point seems to be a challenge that HugeTLB would also face if
it were to support swap. Perhaps reserved THP could be designed with two
modes:

1. with swap support: using the current madvise method.
2. without swap support: in this mode, we can directly let hugetlbfs
    reserve the reserved THP at mmap() time. The behavior remains the
    same, purely switching the underlying backend.

But this might muddy the semantics a bit...
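
To make the difference between the two modes concrete, here is a toy
accounting model in plain C. It is not kernel code: struct rthp_pool and both
helpers are hypothetical names that do not come from the patches, and each
mode is assumed to work on its own pool.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy accounting model only; all names are hypothetical. */
struct rthp_pool {
    long total;     /* huge pages set aside at boot */
    long reserved;  /* promised at mmap() time (no-swap mode) */
    long resident;  /* currently backing a mapping (swap mode) */
    long cold;      /* resident pages that reclaim may swap out */
};

/* No-swap mode: hugetlb-like, fail up front if the promise can't be kept. */
bool rthp_reserve_at_mmap(struct rthp_pool *p, long nr)
{
    assert(nr > 0);
    if (p->reserved + nr > p->total)
        return false;
    p->reserved += nr;
    return true;
}

/*
 * Swap mode: nothing is promised at madvise() time, so the pool may be
 * overcommitted. The fault path takes a free page, or swaps out one cold
 * page to make room, and fails only if neither is possible.
 */
bool rthp_fault_one(struct rthp_pool *p)
{
    assert(p->resident <= p->total);
    if (p->resident == p->total) {
        if (p->cold == 0)
            return false;
        p->cold--;      /* swap out one cold huge page */
        p->resident--;
    }
    p->resident++;
    return true;
}
```

In the no-swap mode the failure is reported at mmap() time; in the swap mode
it can only show up at fault time, which is where the semantics diverge.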

> 
>> This RFC wants to discuss another implementation:
>>
>> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
>> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
>>     and `thp_reserved_nr`. When set, the required memory is marked as
>>     MIGRATE_RESERVED_THP and put back into the buddy allocator.
>> 3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
>>     MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
>>     Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.
>>
>> This can achieve a reservation effect similar to HugeTLB and guarantee
>> allocation success.
> 
> I think this is an interesting approach.  I don't think it should be too
> hard to migrate existing hugetlbfs users to it.

That is also what I hope to see.
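
For anyone skimming the thread, the gating rule in the quoted description
boils down to something like the toy model below. It is not the page
allocator: the toy_* names are hypothetical, and the lack of any fallback
between the two lists is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the gating rule; names are hypothetical. */
enum toy_migratetype { TOY_MOVABLE, TOY_RESERVED_THP, TOY_NR_TYPES };

struct toy_zone {
    long nr_free[TOY_NR_TYPES];  /* free pageblocks per migratetype */
};

/*
 * Take one pageblock. Only a caller passing want_reserved (standing in for
 * a fault in a MADV_RESERVED_THP range) may use the reserved list; normal
 * allocations never touch it.
 */
bool toy_alloc_block(struct toy_zone *z, bool want_reserved)
{
    enum toy_migratetype mt = want_reserved ? TOY_RESERVED_THP : TOY_MOVABLE;

    assert(z->nr_free[mt] >= 0);
    if (z->nr_free[mt] == 0)
        return false;
    z->nr_free[mt]--;
    return true;
}
```

A normal allocation fails once the movable list is empty, even while the
reserved list still has free blocks; that is the reservation effect.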

> 
>> 3. Future Plans
>> ===============
>>
>> 3.1 Enhance swap-out and swap-in for large folios
>> -------------------------------------------------
>>
>> Currently, For swap-out, THP_SWAP is supported, but it only tries to swap out
>> the THP folio as a whole. It is still possible to be forced to split in some
>> situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
>> swap-in, it is almost impossible to directly swap in the THP folio as a whole.
>>
>> But for reserved THP, splitting is not allowed. We need to ensure that it
>> remains a whole huge page during swap-out and swap-in, to achieve a function
>> similar to hugetlb swap.
> 
> So I think the current restriction is something that needs to be fixed
> anyway.  It doesn't actually make sense that a folio must be written out
> contiguously; filesystems do not have this restriction.  I understand

Hopefully, there won't be too much pushback.

> why swap currently has this limitation, but I'm hoping it gets removed
> at some point.  I'm not sure if the people working on swap right now
> intend to fix this.  They're already on the cc, so I hope they chime in.

+1.

Hi SWAP folks, how hard would it be to get this implemented? Are there
any current plans for this? ;)
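
To illustrate what dropping the contiguity requirement could mean, here is a
toy slot allocator. It is only a sketch under my own assumptions, has nothing
to do with the real swap allocator, and toy_swap_alloc() is a hypothetical
name: it prefers a contiguous run but accepts scattered slots instead of
forcing a split.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy swap map: used[i] is true when slot i is taken. Finds nr free slots
 * for one folio and records them in slots[]. A contiguous run is preferred;
 * otherwise any nr free slots are accepted. Returns false, without marking
 * anything as used, if fewer than nr slots are free.
 */
bool toy_swap_alloc(bool *used, size_t nr_slots, size_t nr, size_t *slots)
{
    size_t i, k, run = 0, found = 0;

    assert(used && slots);
    if (nr == 0 || nr > nr_slots)
        return false;

    /* Pass 1: look for nr contiguous free slots. */
    for (i = 0; i < nr_slots; i++) {
        run = used[i] ? 0 : run + 1;
        if (run == nr) {
            size_t start = i + 1 - nr;

            for (k = 0; k < nr; k++) {
                used[start + k] = true;
                slots[k] = start + k;
            }
            return true;
        }
    }

    /* Pass 2: fall back to scattered free slots. */
    for (i = 0; i < nr_slots && found < nr; i++)
        if (!used[i])
            slots[found++] = i;
    if (found < nr)
        return false;
    for (i = 0; i < nr; i++)
        used[slots[i]] = true;
    return true;
}
```

The cost of the fallback is that swap-in has to gather the folio from several
places, which is presumably where the real work would be.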

> 
>> 3.2 Integrate reserved THP into the common reclaim path
>> -------------------------------------------------------
>>
>> Once swap-in and swap-out of huge pages can be supported without splitting,
>> reserved THP can be integrated into the common reclaim path as a normal LRU
>> folio for memory reclamation. This fills the gap of the hugetlb swap function.
> 
> Hm.  Then what does "reserved THP" mean if they can be swapped out?

Indeed, it is a bit weird.

In this version, what's actually reserved is essentially a memory pool.
After a reserved THP page is swapped out, the space in the pool might be
consumed by someone else. So, there's no guarantee that this reserved
THP page can be successfully swapped back in.

But if we don't want it swapped out, it can be guaranteed via mlock or
GUP.

> 
>> 3.4 Use reserved THP as a backend for hugetlbfs
>> -----------------------------------------------
>>
>> This would allow existing hugetlb users or applications to seamlessly switch to
>> reserved THP.
> 
> If this is the end goal, then I think introducing new command line
> options is probably the wrong approach right now.  Instead, "reserved
> THPs" should be allocated from the same pool as hugetlb reserve.  That
> way we're not jerking sysadmins around.

Do you mean reusing the existing HugeTLB boot parameters instead of
introducing new ones? That seems quite difficult to implement during the
transition. My idea is that we can eventually drop the HugeTLB boot
parameters entirely, so the system will still end up with only one set
of parameters. ;)

> 
>> 3.5 Add 1GB page support to reserved THP
>> ----------------------------------------
>>
>> Historically, there have been several attempts to add 1GB huge page support to
>> THP:
>>
>> 1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@gmail.com/
>> 2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
>>
>> Adding 1GB huge page support for reserved THP would be relatively simpler
>> compared to regular THP.
> 
> Well.  Maybe?  What happens if we mmap() 16GiB,

At least the side effects are limited strictly to reserved THPs, and
reserved THP is pre-reserved, ensuring a higher allocation success rate.

> madvise(USE_RESERVED_THPS) and then munmap() the first 4KiB of it?

Since splitting is not allowed for reserved THPs, the entire huge page
will be freed at munmap time.
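
As a worked example of that rule, the arithmetic can be sketched as below.
It assumes a huge-page-aligned mapping and a power-of-two huge page size;
struct range and freed_without_split() are hypothetical names used only for
this illustration.

```c
#include <assert.h>
#include <stdint.h>

struct range { uint64_t start, end; };  /* [start, end), in bytes */

/*
 * Given an munmap() range and a huge page size, return the range whose
 * backing huge pages are freed when a huge page can never be split:
 * the start rounds down and the end rounds up to a huge page boundary.
 */
struct range freed_without_split(struct range unmapped, uint64_t hpage_size)
{
    struct range r;
    uint64_t mask = hpage_size - 1;

    assert(hpage_size != 0 && (hpage_size & mask) == 0);
    assert(unmapped.start < unmapped.end);
    r.start = unmapped.start & ~mask;
    r.end = (unmapped.end + mask) & ~mask;
    return r;
}
```

With 1 GiB pages, unmapping the first 4 KiB of the 16 GiB mapping gives
[0, 1 GiB): the whole first huge page goes away, and the other 15 stay.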

> 
>> 3.6 Remove Hugetlb
>> ------------------
>>
>> Once reserved THP can completely replace the existing functions of hugetlb, we
>> can gradually remove Hugetlb, leaving only one huge page management system in
>> the kernel.
> 
> We also need mshare to land ... but yes, eventually removing hugetlbfs

mshare? Do you mean CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING?

> is my hope.

+1.

Thanks,
Qi




