From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from out-188.mta0.migadu.com (out-188.mta0.migadu.com [91.218.175.188])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4C1372D6E58
	for <linux-kernel@vger.kernel.org>; Thu,  2 Jul 2026 02:53:51 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.188
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1782960832; cv=none; b=jNA//W7axq4AY7szkMfMnuACT9bc0BmsjlvpMnUyWsEdz1sMMOvM5bLVdZr32HShwTEpCbbn5oZF4aS9Gu4/wHui0LL436BX5IFqxu0hOjA1424UOsQ5ydRhvjX1z1TUwnrg76F4F+J9h2UHrdQDtjTFXsVEL4U0jIKLnINsVPk=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1782960832; c=relaxed/simple;
	bh=Dz7ETbYj3P2zoyzUk/TqicLozW2wIjLxuFShqsfl8VQ=;
	h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From:
	 In-Reply-To:Content-Type; b=IWLFo+HmZpVjjBzoDWAnMkVrAu8pFYVhPz62MN6U8Gg1RoEYa3vuv1tTOB4KGn3lO5BbDIWeBzsGxPcdw/bNHaYZpYg8HMNVoy7u5SKTUN1bGB+zh/LcMJSwXT361AXDS4RDGO5NFyMgCFgAdTA3csPycUo3mF8pKpXzNgwmrgU=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=CH47tm4x; arc=none smtp.client-ip=91.218.175.188
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="CH47tm4x"
Message-ID: <ba68b5c0-b419-4191-8c04-2fc9a2e04f0c@linux.dev>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1782960828;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=uPKgXy0ALP+JGFYDqT6Pv+Hgz926t2VlMz/v9HIyC9Y=;
	b=CH47tm4xPwdAwM/h3+VPVgyo4aYzFbjJ1sjWHclAcxOCzncykkE7htFk1ltMT9FTBG5HcH
	XPsGYDEEdB0b8ZMLmaKuO2yYD3tWf8W4udz4iUYs+AUEbn/5o9luSJ2xkho8woj9xCE2Rk
	nooDSCb+Fm7p27NDgjues9ydZTILVu4=
Date: Thu, 2 Jul 2026 10:53:21 +0800
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Subject: Re: [RFC PATCH 0/8] Introducte Reserved THP
To: Zi Yan <ziy@nvidia.com>, "David Hildenbrand (Arm)" <david@kernel.org>,
 akpm@linux-foundation.org, ljs@kernel.org, baolin.wang@linux.alibaba.com,
 liam@infradead.org, npache@redhat.com, ryan.roberts@arm.com,
 dev.jain@arm.com, baohua@kernel.org, lance.yang@linux.dev,
 muchun.song@linux.dev, osalvador@suse.de, chrisl@kernel.org,
 kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
 baoquan.he@linux.dev, youngjun.park@lge.com, peterx@redhat.com,
 usama.arif@linux.dev, willy@infradead.org, vbabka@kernel.org,
 surenb@google.com, mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Qi Zheng <zhengqi.arch@bytedance.com>
References: <cover.1782538002.git.zhengqi.arch@bytedance.com>
 <e2bd33d5-4de5-49cd-970f-9e80eec91a3b@kernel.org>
 <DJMS84Y4KYDY.26GNQPOZ6G4L7@nvidia.com>
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Qi Zheng <qi.zheng@linux.dev>
In-Reply-To: <DJMS84Y4KYDY.26GNQPOZ6G4L7@nvidia.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Migadu-Flow: FLOW_OUT


On 7/1/26 7:45 AM, Zi Yan wrote:
> On Mon Jun 29, 2026 at 8:20 AM EDT, David Hildenbrand (Arm) wrote:
>> On 6/27/26 09:21, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>
>>> Hi all,
>>>
>>
>> Hi,
>>
>>> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
>>> to open up a discussion on how to use this as a stepping stone toward unifying
>>> HugeTLB and THP (Transparent Huge Page).
>>>
>>> 1. Background
>>> =============
>>>
>>> Currently, two huge page solutions co-exist in the kernel:
>>>
>>> 1. HugeTLB: Supports reservation, guaranteeing successful allocation within the
>>>              reserved pool. However, it does not support features like swap. And
>>>              it is a relatively independent subsystem.
>>> 2. THP: Does not support reservation and may fail to allocate and fallback to
>>>          small pages when system memory is fragmented, but it is more tightly
>>>          integrated with mm core and supports features like swap.
>>>
>>> Both have their pros and cons. However, in one of our internal scenarios, it
>>> seems we need to combine the features of both to meet the requirements.
>>>
>>> In our internal scenario, a user process needs to reserve double the amount
>>> of Hugetlb memory due to hot-upgrade requirements. For example, if the
>>> process needs 16GB of Hugetlb, an additional 16GB is required during the
>>> hot-upgrade to satisfy memory allocations. After the upgrade, the old
>>> process exits and releases the 16GB of HugeTLB. Therefore, in most cases,
>>> the extra 16GB of HugeTLB is wasted.
>>>
>>> A straightforward idea is to use the Hugetlb CMA feature, reserving a total
>>> of 32GB of hugetlb_cma. During normal operation, 16GB is consumed, and the
>>> remaining 16GB can be used by other processes. During hot-upgrade, we could
>>> try to migrate the memory used by other processes to allocate the required
>>> extra 16GB of Hugetlb. This might work, but it still requires reserving 32GB
>>> of memory.
>>>
>>> We also found that during the hot upgrade, about 10GB of the old process's
>>> hugetlb is actually cold memory, which could theoretically be reclaimed. In
>>> extreme cases, we could reserve only 22GB of memory and reclaim the
>>> remaining 10GB during the hot upgrade. But unfortunately, hugetlb currently
>>> does not support swap, and supporting it seems quite difficult.
>>>
>>> Therefore, we are wondering if we can introduce "reserved THP", which is THP
>>> that can be reserved. It can be consumed through methods like madvise(), while
>>> normal memory allocation cannot consume it.
>>
>> madvise(). Gah. No :)
>>
>>> This can achieve an effect similar
>>> to hugetlb. And because it is THP, it can relatively easily support swap
>>> features, which perfectly solves the above problem.
>>
>> No, this is the wrong approach. We really shouldn't be making the same mistake
>> hugetlb did and support reserving of non-filebacked memory (IOW anonymous memory).
>>
>> And even for files, the hugetlb mechanism is an absolute trainwreck, because
>> it's not NUMA aware.
>>
>> This really needs some proper thought.
> 
> You mean the reservation should be done via some file handle, like
> memfd, so that it is easy to apply memory policies to determine where
> reserved memory locates?
> 
> For existing hugetlb reservation, there is no fine control, like NUMA,
> or cgroup, of the reserved free memory.
> 
> Is that what you mean above?
> 
>>
>>>
>>> Additionally, in 2024 (or possibly earlier), there have been discussions about
>>> the possibility of unifying Hugetlb and THP:
>>>
>>> Link: https://lwn.net/Articles/974491/
>>>
>>> After all, hugetlb's management is relatively independent and requires too
>>> much special handling in mm core. The introduction of reserved THP might be
>>> an opportunity. In the future, reserved THP could be enhanced to support
>>> various hugetlb features, such as acting as a backend for hugetlbfs. When
>>> reserved THP can completely replace HugeTLB, HugeTLB could be entirely
>>> removed, and reserved THP would just become a feature of THP.
>>>
>>> 2. Implementation
>>> =================
>>>
>>> In 2024, Yu Zhao proposed a similar idea:
>>>
>>> Link: https://lore.kernel.org/all/20240229183436.4110845-2-yuzhao@google.com/
>>>
>>> The idea was to introduce two virt zones: ZONE_NOSPLIT and ZONE_NOMERGE to
>>> guarantee the allocation success rate of THP, achieving an effect similar to
>>> reservation. However, it seems there was no further progress, perhaps because of
>>> reluctance to introduce more virt zones like ZONE_MOVABLE.
>>>
>>> This RFC wants to discuss another implementation:
>>>
>>> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
>>> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
>>>     and `thp_reserved_nr`. When set, the required memory is marked as
>>>     MIGRATE_RESERVED_THP and put back into the buddy allocator.
>>
>> I'm all for some mechanism to make runtime allocation of large chunks of memory
>> easier, by adding a pool from where multiple consumers (THP, guest_memfd,
>> hugetlb, whatever) can allocate memory.
> 
> I agree with this one. We do not want to invent different free memory
> reservation mechanisms for each possible consumer. A shared reservation
> mechanism with different reservation and allocation policies is better.

Hi all, thanks a lot for the feedback! It seems that introducing a new
THP-specific reservation mechanism isn't the best way to go. Is the
consensus to address the problem by doing the following?

1. Make THP allocation more reliable.
    (Pointed out by Gregory. I think this in turn requires swap-in to
     support bringing in a THP folio as a whole, which is also the issue
     Matthew mentioned that the swap subsystem needs to address.)

2. Design a shared memory reservation mechanism.
    (Suggested by David and Zi.)

3. Minimize memory fragmentation as much as possible.
    (As Barry suggested, we could introduce something at the pageblock
     level to record allocation order preferences.)

Thanks,
Qi
