[RFC PATCH 0/8] Introduce Reserved THP

* [RFC PATCH 0/8] Introduce Reserved THP
@ 2026-06-27  7:21 Qi Zheng
  2026-06-27  7:21 ` [RFC PATCH 1/8] mm: page_alloc: add reserved THP pageblock type Qi Zheng
                   ` (9 more replies)
  0 siblings, 10 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:21 UTC
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Hi all,

This RFC patchset introduces a new feature called "Reserved THP", and I'd like
to open up a discussion on how to use this as a stepping stone toward unifying
HugeTLB and THP (Transparent Huge Page).

1. Background
=============

Currently, two huge page solutions co-exist in the kernel:

1. HugeTLB: Supports reservation, guaranteeing successful allocation within the
            reserved pool. However, it does not support features such as swap,
            and it is a relatively independent subsystem.
2. THP: Does not support reservation, so allocation may fail and fall back to
        small pages when system memory is fragmented. However, it is more
        tightly integrated with the mm core and supports features such as swap.

Both have their pros and cons. However, in one of our internal scenarios, it
seems we need to combine the features of both to meet the requirements.

In our internal scenario, a user process needs to reserve double the amount
of HugeTLB memory due to hot-upgrade requirements. For example, if the
process needs 16GB of HugeTLB, an additional 16GB is required during the
hot-upgrade to satisfy memory allocations. After the upgrade, the old
process exits and releases its 16GB of HugeTLB. Therefore, in most cases,
the extra 16GB of HugeTLB sits idle and is wasted.

A straightforward idea is to use the HugeTLB CMA feature, reserving a total
of 32GB of hugetlb_cma. During normal operation, 16GB is consumed, and the
remaining 16GB can be used by other processes. During hot-upgrade, we could
try to migrate away the memory used by other processes in order to allocate
the required extra 16GB of HugeTLB. This might work, but it still requires
reserving 32GB of memory.

We also found that during the hot upgrade, about 10GB of the old process's
hugetlb is actually cold memory, which could theoretically be reclaimed. In
extreme cases, we could reserve only 22GB of memory and reclaim the
remaining 10GB during the hot upgrade. But unfortunately, hugetlb currently
does not support swap, and supporting it seems quite difficult.

Therefore, we are wondering if we can introduce "reserved THP", which is THP
that can be reserved. It can be consumed through methods like madvise(), while
normal memory allocation cannot consume it. This can achieve an effect similar
to HugeTLB. And because it is THP, it can support swap relatively easily,
which would solve the problem described above.

Additionally, in 2024 (or possibly earlier), there were discussions about
the possibility of unifying HugeTLB and THP:

Link: https://lwn.net/Articles/974491/

After all, hugetlb's management is relatively independent and requires too
much special handling in mm core. The introduction of reserved THP might be
an opportunity. In the future, reserved THP could be enhanced to support
various hugetlb features, such as acting as a backend for hugetlbfs. When
reserved THP can completely replace HugeTLB, HugeTLB could be entirely
removed, and reserved THP would just become a feature of THP.

2. Implementation
=================

In 2024, Yu Zhao proposed a similar idea:

Link: https://lore.kernel.org/all/20240229183436.4110845-2-yuzhao@google.com/

The idea was to introduce two virtual zones, ZONE_NOSPLIT and ZONE_NOMERGE, to
guarantee the allocation success rate of THP, achieving an effect similar to
reservation. However, it seems there was no further progress, perhaps because of
reluctance to introduce more virtual zones like ZONE_MOVABLE.

This RFC wants to discuss another implementation:

1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
   and `thp_reserved_nr`. When set, the required memory is marked as
   MIGRATE_RESERVED_THP and put back into the buddy allocator.
3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
   MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
   Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.

This can achieve a reservation effect similar to HugeTLB and guarantee
allocation success.

3. Future Plans
===============

3.1 Enhance swap-out and swap-in for large folios
-------------------------------------------------

Currently, for swap-out, THP_SWAP is supported, but it only tries to swap out
the THP folio as a whole. A split can still be forced in some situations
(e.g., fragmented swap space, memory.swap.max limits, etc.). For swap-in, it
is almost impossible to directly swap in the THP folio as a whole.

But for reserved THP, splitting is not allowed. We need to ensure that it
remains a whole huge page during swap-out and swap-in, to achieve a function
similar to hugetlb swap.

3.2 Integrate reserved THP into the common reclaim path
-------------------------------------------------------

Once swap-in and swap-out of huge pages can be supported without splitting,
reserved THP can be integrated into the common reclaim path as a normal LRU
folio for memory reclamation. This fills the gap of the hugetlb swap function.

3.3 Use reserved THP as a backend for shmem/tmpfs
-------------------------------------------------

This would allow shared or file-like usage to utilize reserved THP.

3.4 Use reserved THP as a backend for hugetlbfs
-----------------------------------------------

This would allow existing hugetlb users or applications to seamlessly switch to
reserved THP.

3.5 Add 1GB page support to reserved THP
----------------------------------------

Historically, there have been several attempts to add 1GB huge page support to
THP:

1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@gmail.com/
2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/

Adding 1GB huge page support to reserved THP would be simpler than adding it
to regular THP.

3.6 Remove Hugetlb
------------------

Once reserved THP can completely replace the existing functionality of HugeTLB,
we can gradually remove HugeTLB, leaving only one huge page management system
in the kernel.

This series is based on next-20260623.

Comments and feedback are welcome!

Thanks,
Qi

Qi Zheng (8):
  mm: page_alloc: add reserved THP pageblock type
  mm: add boot-time reserved THP pageblock capacity
  mm: page_alloc: add a reserved THP allocation primitive
  mm: add reserved THP quota helpers
  mm: add reserved THP vma flag
  mm: maintain reserved THP quota across VMA changes
  mm: support reserved THP VMAs in anonymous faults
  mm: add MADV_RESERVED_THP range policy

 arch/alpha/include/uapi/asm/mman.h     |   2 +
 arch/mips/include/uapi/asm/mman.h      |   2 +
 arch/parisc/include/uapi/asm/mman.h    |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   2 +
 fs/proc/task_mmu.c                     |   3 +
 include/linux/gfp.h                    |   3 +
 include/linux/gfp_types.h              |   8 +-
 include/linux/huge_mm.h                |   4 +-
 include/linux/mm.h                     |   7 ++
 include/linux/mmzone.h                 |  11 +-
 include/trace/events/mmflags.h         |   4 +-
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/Makefile                            |   2 +-
 mm/huge_memory.c                       |  18 +++-
 mm/internal.h                          |   6 ++
 mm/khugepaged.c                        |   8 ++
 mm/madvise.c                           |  83 ++++++++++++++-
 mm/memory.c                            |   3 +
 mm/mmap.c                              |  18 ++++
 mm/mremap.c                            | 121 ++++++++++++++++------
 mm/page_alloc.c                        |  73 +++++++++++++-
 mm/reserved_thp.c                      | 133 +++++++++++++++++++++++++
 mm/show_mem.c                          |   5 +
 mm/vma.c                               |  23 +++++
 mm/vma.h                               |   1 +
 tools/include/linux/gfp_types.h        |   4 +-
 tools/perf/builtin-kmem.c              |   1 +
 tools/testing/vma/include/dup.h        |   1 +
 28 files changed, 499 insertions(+), 51 deletions(-)
 create mode 100644 mm/reserved_thp.c

-- 
2.54.0


* [RFC PATCH 1/8] mm: page_alloc: add reserved THP pageblock type
  2026-06-27  7:21 [RFC PATCH 0/8] Introduce Reserved THP Qi Zheng
@ 2026-06-27  7:21 ` Qi Zheng
  2026-06-27  7:21 ` [RFC PATCH 2/8] mm: add boot-time reserved THP pageblock capacity Qi Zheng
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:21 UTC
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Introduce a new migratetype MIGRATE_RESERVED_THP in preparation for the
upcoming Reserved THP feature.

Add specific handling in the page freeing paths (free_unref_folios() and
__free_frozen_pages()) so that pages in reserved pageblocks are never placed
on the per-CPU (PCP) lists, and count free reserved pages as unusable in the
watermark check so that other allocation contexts cannot rely on them. Also,
add per-zone counters (nr_reserved_thp and nr_free_reserved_thp) and report
them in the show_free_areas() output.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mmzone.h | 11 ++++++++---
 mm/page_alloc.c        | 16 +++++++++++++++-
 mm/show_mem.c          |  5 +++++
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca27121871475..4418d0c9accdc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -133,10 +133,9 @@ enum migratetype {
 	 * __free_pageblock_cma() function.
 	 */
 	MIGRATE_CMA,
-	__MIGRATE_TYPE_END = MIGRATE_CMA,
-#else
-	__MIGRATE_TYPE_END = MIGRATE_HIGHATOMIC,
 #endif
+	MIGRATE_RESERVED_THP,
+	__MIGRATE_TYPE_END = MIGRATE_RESERVED_THP,
 #ifdef CONFIG_MEMORY_ISOLATION
 	MIGRATE_ISOLATE,	/* can't allocate from here */
 #endif
@@ -161,6 +160,9 @@ extern const char * const migratetype_names[MIGRATE_TYPES];
 #  define is_migrate_cma_folio(folio, pfn) false
 #endif
 
+#define is_migrate_reserved_thp(migratetype) \
+	unlikely((migratetype) == MIGRATE_RESERVED_THP)
+
 static inline bool is_migrate_movable(int mt)
 {
 	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
@@ -975,6 +977,9 @@ struct zone {
 	unsigned long nr_reserved_highatomic;
 	unsigned long nr_free_highatomic;
 
+	unsigned long nr_reserved_thp;
+	unsigned long nr_free_reserved_thp;
+
 	/*
 	 * We don't know if the memory that we're going to allocate will be
 	 * freeable or/and it will be released eventually, so to avoid totally
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee902a468c2f5..613a711305072 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -263,6 +263,7 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
 #ifdef CONFIG_CMA
 	"CMA",
 #endif
+	"ReserveTHP",
 #ifdef CONFIG_MEMORY_ISOLATION
 	"Isolate",
 #endif
@@ -784,6 +785,9 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
 	else if (migratetype == MIGRATE_HIGHATOMIC)
 		WRITE_ONCE(zone->nr_free_highatomic,
 			   zone->nr_free_highatomic + nr_pages);
+	else if (migratetype == MIGRATE_RESERVED_THP)
+		WRITE_ONCE(zone->nr_free_reserved_thp,
+			   zone->nr_free_reserved_thp + nr_pages);
 }
 
 /* Used for pages not on another list */
@@ -2960,7 +2964,8 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	zone = page_zone(page);
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
-		if (unlikely(is_migrate_isolate(migratetype))) {
+		if (unlikely(is_migrate_reserved_thp(migratetype) ||
+			     is_migrate_isolate(migratetype))) {
 			free_one_page(zone, page, pfn, order, fpi_flags);
 			return;
 		}
@@ -3038,6 +3043,7 @@ void free_unref_folios(struct folio_batch *folios)
 
 		/* Different zone requires a different pcp lock */
 		if (zone != locked_zone ||
+		    is_migrate_reserved_thp(migratetype) ||
 		    is_migrate_isolate(migratetype)) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
@@ -3045,6 +3051,12 @@ void free_unref_folios(struct folio_batch *folios)
 				pcp = NULL;
 			}
 
+			if (is_migrate_reserved_thp(migratetype)) {
+				free_one_page(zone, &folio->page, pfn,
+					      order, FPI_NONE);
+				continue;
+			}
+
 			/*
 			 * Free isolated pages directly to the
 			 * allocator, see comment in free_frozen_pages.
@@ -3568,6 +3580,8 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	if (likely(!(alloc_flags & ALLOC_RESERVES)))
 		unusable_free += READ_ONCE(z->nr_free_highatomic);
 
+	unusable_free += READ_ONCE(z->nr_free_reserved_thp);
+
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
 	if (!(alloc_flags & ALLOC_CMA))
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 43aca5a2ac990..e9381afca4aca 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -142,6 +142,7 @@ static void show_migration_types(unsigned char type)
 #ifdef CONFIG_CMA
 		[MIGRATE_CMA]		= 'C',
 #endif
+		[MIGRATE_RESERVED_THP]	= 'T',
 #ifdef CONFIG_MEMORY_ISOLATION
 		[MIGRATE_ISOLATE]	= 'I',
 #endif
@@ -308,6 +309,8 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			" high:%lukB"
 			" reserved_highatomic:%luKB"
 			" free_highatomic:%luKB"
+			" reserved_thp:%luKB"
+			" free_reserved_thp:%luKB"
 			" active_anon:%lukB"
 			" inactive_anon:%lukB"
 			" active_file:%lukB"
@@ -331,6 +334,8 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			K(high_wmark_pages(zone)),
 			K(zone->nr_reserved_highatomic),
 			K(zone->nr_free_highatomic),
+			K(zone->nr_reserved_thp),
+			K(zone->nr_free_reserved_thp),
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
-- 
2.54.0




* [RFC PATCH 2/8] mm: add boot-time reserved THP pageblock capacity
  2026-06-27  7:21 [RFC PATCH 0/8] Introduce Reserved THP Qi Zheng
  2026-06-27  7:21 ` [RFC PATCH 1/8] mm: page_alloc: add reserved THP pageblock type Qi Zheng
@ 2026-06-27  7:21 ` Qi Zheng
  2026-06-27  7:21 ` [RFC PATCH 3/8] mm: page_alloc: add a reserved THP allocation primitive Qi Zheng
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:21 UTC
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Add kernel boot parameters "thp_reserved_size" and "thp_reserved_nr" to
allow reserving a specified number of THP pageblocks during system boot.
These reserved pageblocks are marked as MIGRATE_RESERVED_THP.

Additionally, expose the "total_hpages", "free_hpages", and "used_hpages"
attributes in sysfs (/sys/kernel/mm/reserved_thp/) to allow userspace to
monitor the usage of the reserved capacity.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/Makefile       |  2 +-
 mm/internal.h     |  2 +
 mm/page_alloc.c   | 29 ++++++++++++++
 mm/reserved_thp.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 130 insertions(+), 1 deletion(-)
 create mode 100644 mm/reserved_thp.c

diff --git a/mm/Makefile b/mm/Makefile
index eff9f9e7e061c..fd74a7392e346 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -98,7 +98,7 @@ obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
-obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o reserved_thp.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_LIVEUPDATE_MEMFD) += memfd_luo.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a20..a76a1fad2a7fd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1951,4 +1951,6 @@ static inline int get_sysctl_max_map_count(void)
 bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags,
 		   unsigned long npages);
 
+unsigned long reserved_thp_pageblocks(unsigned long nr_hpages);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 613a711305072..23dbbef444f18 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3568,6 +3568,35 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 	return false;
 }
 
+unsigned long reserved_thp_pageblocks(unsigned long nr_hpages)
+{
+	unsigned int order = max_t(unsigned int, HPAGE_PMD_ORDER,
+				   pageblock_order);
+	unsigned long hpages_per_block = 1UL << (order - HPAGE_PMD_ORDER);
+	unsigned long reserved = 0;
+	gfp_t gfp = (GFP_HIGHUSER | __GFP_COMP | __GFP_NOMEMALLOC |
+		     __GFP_NOWARN | __GFP_NORETRY);
+
+	while (reserved < nr_hpages) {
+		struct page *page;
+		struct zone *zone;
+		unsigned long flags;
+
+		page = alloc_pages(gfp, order);
+		if (!page)
+			break;
+
+		zone = page_zone(page);
+		spin_lock_irqsave(&zone->lock, flags);
+		change_pageblock_range(page, order, MIGRATE_RESERVED_THP);
+		zone->nr_reserved_thp += 1UL << order;
+		spin_unlock_irqrestore(&zone->lock, flags);
+		__free_pages(page, order);
+		reserved += hpages_per_block;
+	}
+	return reserved;
+}
+
 static inline long __zone_watermark_unusable_free(struct zone *z,
 				unsigned int order, unsigned int alloc_flags)
 {
diff --git a/mm/reserved_thp.c b/mm/reserved_thp.c
new file mode 100644
index 0000000000000..1eee4f39b9d69
--- /dev/null
+++ b/mm/reserved_thp.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include "internal.h"
+
+static DEFINE_SPINLOCK(reserved_thp_lock);
+
+static unsigned long reserved_thp_cmdline_size __initdata = HPAGE_PMD_SIZE;
+static bool reserved_thp_cmdline_size_valid __initdata = true;
+static unsigned long reserved_thp_requested __initdata;
+static unsigned long reserved_thp_total;
+static unsigned long reserved_thp_used;
+
+static int __init setup_reserved_thp_size(char *str)
+{
+	unsigned long size;
+	size = memparse(str, NULL);
+	if (size != HPAGE_PMD_SIZE) {
+		pr_warn("unsupported thp_reserved_size=%s, only %lu is supported\n",
+			str, HPAGE_PMD_SIZE);
+		reserved_thp_cmdline_size_valid = false;
+		return -EINVAL;
+	}
+	reserved_thp_cmdline_size = size;
+	reserved_thp_cmdline_size_valid = true;
+	return 0;
+}
+early_param("thp_reserved_size", setup_reserved_thp_size);
+static int __init setup_reserved_thp_nr(char *str)
+{
+	int count;
+	if (sscanf(str, "%lu%n", &reserved_thp_requested, &count) != 1 ||
+	    str[count]) {
+		pr_warn("invalid thp_reserved_nr=%s\n", str);
+		reserved_thp_requested = 0;
+		return -EINVAL;
+	}
+	return 0;
+}
+early_param("thp_reserved_nr", setup_reserved_thp_nr);
+
+static ssize_t total_hpages_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%lu\n", READ_ONCE(reserved_thp_total));
+}
+static ssize_t free_hpages_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	unsigned long free_hpages;
+
+	spin_lock(&reserved_thp_lock);
+	free_hpages = reserved_thp_total - reserved_thp_used;
+	spin_unlock(&reserved_thp_lock);
+
+	return sysfs_emit(buf, "%lu\n", free_hpages);
+}
+static ssize_t used_hpages_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%lu\n", READ_ONCE(reserved_thp_used));
+}
+
+static struct kobj_attribute total_hpages_attr = __ATTR_RO(total_hpages);
+static struct kobj_attribute free_hpages_attr = __ATTR_RO(free_hpages);
+static struct kobj_attribute used_hpages_attr = __ATTR_RO(used_hpages);
+
+static struct attribute *reserved_thp_attrs[] = {
+	&total_hpages_attr.attr,
+	&free_hpages_attr.attr,
+	&used_hpages_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group reserved_thp_attr_group = {
+	.attrs = reserved_thp_attrs,
+};
+
+static int __init reserved_thp_init(void)
+{
+	struct kobject *kobj;
+	int ret;
+
+	if (reserved_thp_requested && reserved_thp_cmdline_size_valid) {
+		reserved_thp_total = reserved_thp_pageblocks(reserved_thp_requested);
+		pr_info("reserved %lu/%lu PMD THP pageblocks (%lu bytes each)\n",
+			reserved_thp_total, reserved_thp_requested,
+			reserved_thp_cmdline_size);
+	}
+	kobj = kobject_create_and_add("reserved_thp", mm_kobj);
+	if (!kobj)
+		return -ENOMEM;
+	ret = sysfs_create_group(kobj, &reserved_thp_attr_group);
+	if (ret)
+		kobject_put(kobj);
+	return ret;
+}
+subsys_initcall(reserved_thp_init);
\ No newline at end of file
-- 
2.54.0




* [RFC PATCH 3/8] mm: page_alloc: add a reserved THP allocation primitive
  2026-06-27  7:21 [RFC PATCH 0/8] Introduce Reserved THP Qi Zheng
  2026-06-27  7:21 ` [RFC PATCH 1/8] mm: page_alloc: add reserved THP pageblock type Qi Zheng
  2026-06-27  7:21 ` [RFC PATCH 2/8] mm: add boot-time reserved THP pageblock capacity Qi Zheng
@ 2026-06-27  7:21 ` Qi Zheng
  2026-06-27  7:21 ` [RFC PATCH 4/8] mm: add reserved THP quota helpers Qi Zheng
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:21 UTC
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Introduce the __GFP_RESERVED_THP and ALLOC_RESERVED_THP flags to implement
allocation primitives specifically for reserved THP.

Enforce strict isolation in the buddy allocator: allocation requests with
__GFP_RESERVED_THP can only be satisfied from MIGRATE_RESERVED_THP pageblocks.
Conversely, normal allocation requests are prohibited from taking memory
from this migratetype, even in OOM or non-blocking fallback paths, ensuring
the reserved capacity is strictly dedicated to its target use cases.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/gfp.h             |  3 +++
 include/linux/gfp_types.h       |  8 ++++++--
 include/trace/events/mmflags.h  |  4 ++--
 mm/internal.h                   |  1 +
 mm/page_alloc.c                 | 30 +++++++++++++++++++++++++++---
 tools/include/linux/gfp_types.h |  4 ++--
 tools/perf/builtin-kmem.c       |  1 +
 7 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index cdf95a9f0b87c..2d05929fd8c72 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -30,6 +30,9 @@ static inline int gfp_migratetype(const gfp_t gfp_flags)
 	BUILD_BUG_ON(((___GFP_MOVABLE | ___GFP_RECLAIMABLE) >>
 		      GFP_MOVABLE_SHIFT) != MIGRATE_HIGHATOMIC);
 
+	if (unlikely(gfp_flags & __GFP_RESERVED_THP))
+		return MIGRATE_RESERVED_THP;
+
 	if (unlikely(page_group_by_mobility_disabled))
 		return MIGRATE_UNMOVABLE;
 
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 54ca0c88bab6e..1f82a9491d357 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -33,7 +33,7 @@ enum {
 	___GFP_IO_BIT,
 	___GFP_FS_BIT,
 	___GFP_ZERO_BIT,
-	___GFP_UNUSED_BIT,	/* 0x200u unused */
+	___GFP_RESERVED_THP_BIT,
 	___GFP_DIRECT_RECLAIM_BIT,
 	___GFP_KSWAPD_RECLAIM_BIT,
 	___GFP_WRITE_BIT,
@@ -69,7 +69,7 @@ enum {
 #define ___GFP_IO		BIT(___GFP_IO_BIT)
 #define ___GFP_FS		BIT(___GFP_FS_BIT)
 #define ___GFP_ZERO		BIT(___GFP_ZERO_BIT)
-/* 0x200u unused */
+#define ___GFP_RESERVED_THP	BIT(___GFP_RESERVED_THP_BIT)
 #define ___GFP_DIRECT_RECLAIM	BIT(___GFP_DIRECT_RECLAIM_BIT)
 #define ___GFP_KSWAPD_RECLAIM	BIT(___GFP_KSWAPD_RECLAIM_BIT)
 #define ___GFP_WRITE		BIT(___GFP_WRITE_BIT)
@@ -141,6 +141,9 @@ enum {
  * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
  * mark_obj_codetag_empty() should be called upon freeing for objects allocated
  * with this flag to indicate that their NULL tags are expected and normal.
+ *
+ * %__GFP_RESERVED_THP is an internal flag for reserved THP faults. It restricts
+ * the allocation to %MIGRATE_RESERVED_THP pageblocks.
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
@@ -148,6 +151,7 @@ enum {
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
 #define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)
 #define __GFP_NO_OBJ_EXT   ((__force gfp_t)___GFP_NO_OBJ_EXT)
+#define __GFP_RESERVED_THP ((__force gfp_t)___GFP_RESERVED_THP)
 
 /**
  * DOC: Watermark modifiers
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a6e5a44c9b429..3db40ebd7060b 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -24,6 +24,7 @@
 	TRACE_GFP_EM(IO)			\
 	TRACE_GFP_EM(FS)			\
 	TRACE_GFP_EM(ZERO)			\
+	TRACE_GFP_EM(RESERVED_THP)		\
 	TRACE_GFP_EM(DIRECT_RECLAIM)		\
 	TRACE_GFP_EM(KSWAPD_RECLAIM)		\
 	TRACE_GFP_EM(WRITE)			\
@@ -72,8 +73,7 @@
 
 TRACE_GFP_FLAGS
 
-/* Just in case these are ever used */
-TRACE_DEFINE_ENUM(___GFP_UNUSED_BIT);
+/* Just in case this is ever used */
 TRACE_DEFINE_ENUM(___GFP_LAST_BIT);
 
 #define gfpflag_string(flag) {(__force unsigned long)flag, #flag}
diff --git a/mm/internal.h b/mm/internal.h
index a76a1fad2a7fd..3826c88b3804c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1477,6 +1477,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
 #define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#define ALLOC_RESERVED_THP	0x1000 /* Allows access to reserved THP pageblocks */
 
 /* Flags that allow allocations below the min watermark. */
 #define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23dbbef444f18..660e501bf676b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2460,6 +2460,9 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 {
 	struct page *page;
 
+	if (alloc_flags & ALLOC_RESERVED_THP)
+		return __rmqueue_smallest(zone, order, MIGRATE_RESERVED_THP);
+
 	if (IS_ENABLED(CONFIG_CMA)) {
 		/*
 		 * Balance movable allocations between regular and CMA areas by
@@ -3247,7 +3250,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			 * reserves as failing now is worse than failing a
 			 * high-order atomic allocation in the future.
 			 */
-			if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
+			if (!page && !(alloc_flags & ALLOC_RESERVED_THP) &&
+			    (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
 				page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 
 			if (!page) {
@@ -3417,7 +3421,8 @@ struct page *rmqueue(struct zone *preferred_zone,
 {
 	struct page *page;
 
-	if (likely(pcp_allowed_order(order))) {
+	if (likely(pcp_allowed_order(order)) &&
+	    !(alloc_flags & ALLOC_RESERVED_THP)) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				       migratetype, alloc_flags);
 		if (likely(page))
@@ -3609,7 +3614,8 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	if (likely(!(alloc_flags & ALLOC_RESERVES)))
 		unusable_free += READ_ONCE(z->nr_free_highatomic);
 
-	unusable_free += READ_ONCE(z->nr_free_reserved_thp);
+	if (!(alloc_flags & ALLOC_RESERVED_THP))
+		unusable_free += READ_ONCE(z->nr_free_reserved_thp);
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
@@ -3685,6 +3691,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		if (!area->nr_free)
 			continue;
 
+		if (alloc_flags & ALLOC_RESERVED_THP) {
+			if (!free_area_empty(area, MIGRATE_RESERVED_THP))
+				return true;
+			continue;
+		}
+
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
 			if (!free_area_empty(area, mt))
 				return true;
@@ -3919,6 +3931,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 		cond_accept_memory(zone, order, alloc_flags);
 
+		if (alloc_flags & ALLOC_RESERVED_THP)
+			goto try_this_zone;
+
 		/*
 		 * Detect whether the number of free pages is below high
 		 * watermark.  If so, we will decrease pcp->high and free
@@ -5076,6 +5091,15 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	ac->nodemask = nodemask;
 	ac->migratetype = gfp_migratetype(gfp_mask);
 
+	if (gfp_mask & __GFP_RESERVED_THP) {
+		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) ||
+		    WARN_ON_ONCE_GFP(order != HPAGE_PMD_ORDER, gfp_mask))
+			return false;
+
+		ac->migratetype = MIGRATE_RESERVED_THP;
+		*alloc_flags |= ALLOC_RESERVED_THP;
+	}
+
 	if (cpusets_enabled()) {
 		*alloc_gfp |= __GFP_HARDWALL;
 		/*
diff --git a/tools/include/linux/gfp_types.h b/tools/include/linux/gfp_types.h
index 6c75df30a281d..53a1d22fcf957 100644
--- a/tools/include/linux/gfp_types.h
+++ b/tools/include/linux/gfp_types.h
@@ -33,7 +33,7 @@ enum {
 	___GFP_IO_BIT,
 	___GFP_FS_BIT,
 	___GFP_ZERO_BIT,
-	___GFP_UNUSED_BIT,	/* 0x200u unused */
+	___GFP_RESERVED_THP_BIT,
 	___GFP_DIRECT_RECLAIM_BIT,
 	___GFP_KSWAPD_RECLAIM_BIT,
 	___GFP_WRITE_BIT,
@@ -69,7 +69,7 @@ enum {
 #define ___GFP_IO		BIT(___GFP_IO_BIT)
 #define ___GFP_FS		BIT(___GFP_FS_BIT)
 #define ___GFP_ZERO		BIT(___GFP_ZERO_BIT)
-/* 0x200u unused */
+#define ___GFP_RESERVED_THP	BIT(___GFP_RESERVED_THP_BIT)
 #define ___GFP_DIRECT_RECLAIM	BIT(___GFP_DIRECT_RECLAIM_BIT)
 #define ___GFP_KSWAPD_RECLAIM	BIT(___GFP_KSWAPD_RECLAIM_BIT)
 #define ___GFP_WRITE		BIT(___GFP_WRITE_BIT)
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index e1b2f5bc1ba8d..45732aaf1a525 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -672,6 +672,7 @@ static const struct {
 	{ "__GFP_NORETRY",		"NR" },
 	{ "__GFP_COMP",			"C" },
 	{ "__GFP_ZERO",			"Z" },
+	{ "__GFP_RESERVED_THP",		"RTHP" },
 	{ "__GFP_NOMEMALLOC",		"NMA" },
 	{ "__GFP_MEMALLOC",		"MA" },
 	{ "__GFP_HARDWALL",		"HW" },
-- 
2.54.0




* [RFC PATCH 4/8] mm: add reserved THP quota helpers
  2026-06-27  7:21 [RFC PATCH 0/8] Introduce Reserved THP Qi Zheng
                   ` (2 preceding siblings ...)
  2026-06-27  7:21 ` [RFC PATCH 3/8] mm: page_alloc: add a reserved THP allocation primitive Qi Zheng
@ 2026-06-27  7:21 ` Qi Zheng
  2026-06-27  7:21 ` [RFC PATCH 5/8] mm: add reserved THP vma flag Qi Zheng
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:21 UTC
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

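Add the quota management helpers reserved_thp_charge() and
reserved_thp_uncharge(), plus reserved_thp_hpage_nr(), which converts an
address range into a count of PMD-sized huge pages.

The charge and uncharge helpers update the reserved_thp_used counter
under reserved_thp_lock. reserved_thp_charge() returns -ENOMEM when the
request would exceed reserved_thp_total, and reserved_thp_uncharge()
warns once and clamps the counter to zero on underflow. Subsequent
patches use them to enforce the quota as reserved THP ranges are created
and released.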

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/internal.h     |  3 +++
 mm/reserved_thp.c | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index 3826c88b3804c..4b2a13d353772 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1953,5 +1953,8 @@ bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags,
 		   unsigned long npages);
 
 unsigned long reserved_thp_pageblocks(unsigned long nr_hpages);
+unsigned long reserved_thp_hpage_nr(unsigned long start, unsigned long end);
+int reserved_thp_charge(unsigned long nr_hpages);
+void reserved_thp_uncharge(unsigned long nr_hpages);
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/reserved_thp.c b/mm/reserved_thp.c
index 1eee4f39b9d69..931c539c15a70 100644
--- a/mm/reserved_thp.c
+++ b/mm/reserved_thp.c
@@ -39,6 +39,41 @@ static int __init setup_reserved_thp_nr(char *str)
 }
 early_param("thp_reserved_nr", setup_reserved_thp_nr);
 
+unsigned long reserved_thp_hpage_nr(unsigned long start, unsigned long end)
+{
+	return (end - start) >> HPAGE_PMD_SHIFT;
+}
+
+int reserved_thp_charge(unsigned long nr_hpages)
+{
+	int ret = 0;
+
+	if (!nr_hpages)
+		return 0;
+
+	spin_lock(&reserved_thp_lock);
+	if (nr_hpages > reserved_thp_total - reserved_thp_used)
+		ret = -ENOMEM;
+	else
+		reserved_thp_used += nr_hpages;
+	spin_unlock(&reserved_thp_lock);
+
+	return ret;
+}
+
+void reserved_thp_uncharge(unsigned long nr_hpages)
+{
+	if (!nr_hpages)
+		return;
+
+	spin_lock(&reserved_thp_lock);
+	if (WARN_ON_ONCE(nr_hpages > reserved_thp_used))
+		reserved_thp_used = 0;
+	else
+		reserved_thp_used -= nr_hpages;
+	spin_unlock(&reserved_thp_lock);
+}
+
 static ssize_t total_hpages_show(struct kobject *kobj,
 				 struct kobj_attribute *attr, char *buf)
 {
-- 
2.54.0
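
For readers who want to experiment with the accounting outside the
kernel, the quota helpers in this patch can be modeled in userspace
roughly as follows. This is only a sketch under stated assumptions:
locking is omitted (the kernel versions run under reserved_thp_lock),
MODEL_HPAGE_PMD_SHIFT is assumed to be 21 (2 MiB PMD pages, as on x86-64
with 4 KiB base pages), and the struct and all model_* names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: 2 MiB PMD-sized huge pages (x86-64, 4 KiB base pages). */
#define MODEL_HPAGE_PMD_SHIFT 21

/* Hypothetical userspace stand-in for the kernel's quota state. */
struct model_quota {
	unsigned long total;	/* stands in for reserved_thp_total */
	unsigned long used;	/* stands in for reserved_thp_used */
};

void model_quota_init(struct model_quota *q, unsigned long total)
{
	q->total = total;
	q->used = 0;
}

/* Number of whole PMD-sized pages in [start, end). */
unsigned long model_hpage_nr(unsigned long start, unsigned long end)
{
	assert(end >= start);
	return (end - start) >> MODEL_HPAGE_PMD_SHIFT;
}

/* Charge nr_hpages; returns 0 on success or -ENOMEM if over quota. */
int model_charge(struct model_quota *q, unsigned long nr_hpages)
{
	if (!nr_hpages)
		return 0;

	/* Compare against the remaining headroom so used + nr cannot wrap. */
	if (nr_hpages > q->total - q->used)
		return -ENOMEM;

	q->used += nr_hpages;
	return 0;
}

/* Uncharge nr_hpages; clamp to zero on underflow (the kernel also warns). */
void model_uncharge(struct model_quota *q, unsigned long nr_hpages)
{
	if (!nr_hpages)
		return;

	if (nr_hpages > q->used)
		q->used = 0;
	else
		q->used -= nr_hpages;
}
```

A failed charge leaves the counter untouched, so a caller that hits
-ENOMEM has nothing to roll back for that request.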




* [RFC PATCH 5/8] mm: add reserved THP vma flag
  2026-06-27  7:21 [RFC PATCH 0/8] Introducte Reserved THP Qi Zheng
                   ` (3 preceding siblings ...)
  2026-06-27  7:21 ` [RFC PATCH 4/8] mm: add reserved THP quota helpers Qi Zheng
@ 2026-06-27  7:21 ` Qi Zheng
  2026-06-27  7:26 ` [RFC PATCH 6/8] mm: maintain reserved THP quota across VMA changes Qi Zheng
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:21 UTC (permalink / raw)
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Introduce the VM_RESERVED_THP VMA flag to mark Virtual Memory Areas (VMAs)
that have the reserved THP policy enabled.

Also, add the "rt" (reserved THP) flag indicator to the output of
/proc/<pid>/smaps to facilitate monitoring and debugging the status of
the corresponding VMAs from userspace.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/proc/task_mmu.c              | 3 +++
 include/linux/mm.h              | 7 +++++++
 tools/testing/vma/include/dup.h | 1 +
 3 files changed, 11 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d32408f7cd5ed..65c4b2a61aeac 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1245,6 +1245,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif
 #ifdef CONFIG_64BIT
 		[ilog2(VM_SEALED)] = "sl",
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		[ilog2(VM_RESERVED_THP)] = "rt",
 #endif
 	};
 	size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 485df9c2dbddb..278cf4bfd4ec5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -353,6 +353,7 @@ enum {
 #endif
 	DECLARE_VMA_BIT(UFFD_MINOR, 41),
 	DECLARE_VMA_BIT(SEALED, 42),
+	DECLARE_VMA_BIT(RESERVED_THP, 43),
 	/* Flags that reuse flags above. */
 	DECLARE_VMA_BIT_ALIAS(PKEY_BIT0, HIGH_ARCH_0),
 	DECLARE_VMA_BIT_ALIAS(PKEY_BIT1, HIGH_ARCH_1),
@@ -526,6 +527,12 @@ enum {
 #define VMA_DROPPABLE		EMPTY_VMA_FLAGS
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define VM_RESERVED_THP		INIT_VM_FLAG(RESERVED_THP)
+#else
+#define VM_RESERVED_THP		VM_NONE
+#endif
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY)
 
diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
index cf73bcd9bb9d5..022b7a56b6a9f 100644
--- a/tools/testing/vma/include/dup.h
+++ b/tools/testing/vma/include/dup.h
@@ -160,6 +160,7 @@ enum {
 #endif
 	DECLARE_VMA_BIT(UFFD_MINOR, 41),
 	DECLARE_VMA_BIT(SEALED, 42),
+	DECLARE_VMA_BIT(RESERVED_THP, 43),
 	/* Flags that reuse flags above. */
 	DECLARE_VMA_BIT_ALIAS(PKEY_BIT0, HIGH_ARCH_0),
 	DECLARE_VMA_BIT_ALIAS(PKEY_BIT1, HIGH_ARCH_1),
-- 
2.54.0
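
As an illustration of how userspace might consume the new "rt"
indicator, the helper below checks a single "VmFlags:" line from
/proc/<pid>/smaps for a given two-letter flag. It is a sketch, not part
of the series: the function name is hypothetical, and it only assumes
the existing smaps convention of space-separated two-letter tokens.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if a "VmFlags:" line from /proc/<pid>/smaps contains the
 * given flag as a whole whitespace-separated token.
 */
bool vmflags_line_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	size_t flen;
	const char *p;

	assert(line && flag);
	flen = strlen(flag);

	/* Only VmFlags lines are of interest; reject everything else. */
	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;

	p = line + sizeof(prefix) - 1;
	while (*p) {
		size_t tlen;

		/* Skip separators, then measure the next token. */
		p += strspn(p, " \t\n");
		tlen = strcspn(p, " \t\n");
		if (tlen > 0 && tlen == flen && strncmp(p, flag, flen) == 0)
			return true;
		p += tlen;
	}
	return false;
}
```

A monitoring tool could read /proc/self/smaps line by line with fgets()
and pass each line to this helper with "rt" as the flag.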




* [RFC PATCH 6/8] mm: maintain reserved THP quota across VMA changes
  2026-06-27  7:21 [RFC PATCH 0/8] Introducte Reserved THP Qi Zheng
                   ` (4 preceding siblings ...)
  2026-06-27  7:21 ` [RFC PATCH 5/8] mm: add reserved THP vma flag Qi Zheng
@ 2026-06-27  7:26 ` Qi Zheng
  2026-06-27  7:26 ` [RFC PATCH 7/8] mm: support reserved THP VMAs in anonymous faults Qi Zheng
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:26 UTC (permalink / raw)
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Since reserved THP represents a PMD-sized VMA reservation, VMA operations
must maintain the alignment of reserved ranges and keep the quota balanced.

Handle the lifecycle of VM_RESERVED_THP VMAs by:
- Rejecting non-PMD-aligned splits.
- Uncharging reserved ranges during munmap and teardown.
- Charging copied ranges during fork.
- Updating mremap accounting when a reserved VMA grows or moves.

This prepares the necessary lifecycle management before exposing the
user-visible MADV_RESERVED_THP entry point.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/mmap.c   |  18 ++++++++
 mm/mremap.c | 121 ++++++++++++++++++++++++++++++++++++++--------------
 mm/vma.c    |  23 ++++++++++
 mm/vma.h    |   1 +
 4 files changed, 132 insertions(+), 31 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 2311ae7c2ff45..4818b14ec0ff6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1251,6 +1251,7 @@ unsigned long tear_down_vmas(struct mm_struct *mm, struct vma_iterator *vmi,
 		struct vm_area_struct *vma, unsigned long end)
 {
 	unsigned long nr_accounted = 0;
+	unsigned long nr_reserved_thp = 0;
 	int count = 0;
 
 	mmap_assert_write_locked(mm);
@@ -1258,6 +1259,10 @@ unsigned long tear_down_vmas(struct mm_struct *mm, struct vma_iterator *vmi,
 	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
+		if (vma->vm_flags & VM_RESERVED_THP)
+			nr_reserved_thp +=
+				reserved_thp_hpage_nr(vma->vm_start,
+						      vma->vm_end);
 		vma_mark_detached(vma);
 		remove_vma(vma);
 		count++;
@@ -1266,6 +1271,7 @@ unsigned long tear_down_vmas(struct mm_struct *mm, struct vma_iterator *vmi,
 	} while (vma && vma->vm_end <= end);
 
 	VM_WARN_ON_ONCE(count != mm->map_count);
+	reserved_thp_uncharge(nr_reserved_thp);
 	return nr_accounted;
 }
 
@@ -1733,6 +1739,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	struct vm_area_struct *mpnt, *tmp;
 	int retval;
 	unsigned long charge = 0;
+	unsigned long reserved_charge = 0;
 	LIST_HEAD(uf);
 	VMA_ITERATOR(vmi, mm, 0);
 
@@ -1775,6 +1782,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			continue;
 		}
 		charge = 0;
+		reserved_charge = 0;
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
 
@@ -1782,6 +1790,15 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 				goto fail_nomem;
 			charge = len;
 		}
+		if (mpnt->vm_flags & VM_RESERVED_THP) {
+			unsigned long len;
+
+			len = reserved_thp_hpage_nr(mpnt->vm_start,
+						    mpnt->vm_end);
+			if (reserved_thp_charge(len))
+				goto fail_nomem;
+			reserved_charge = len;
+		}
 
 		tmp = vm_area_dup(mpnt);
 		if (!tmp)
@@ -1916,6 +1933,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	vm_area_free(tmp);
 fail_nomem:
 	retval = -ENOMEM;
+	reserved_thp_uncharge(reserved_charge);
 	vm_unacct_memory(charge);
 	goto loop_out;
 }
diff --git a/mm/mremap.c b/mm/mremap.c
index e9c8b1d05832b..ae37e0b3ce788 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -24,6 +24,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/uaccess.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/huge_mm.h>
 #include <linux/mempolicy.h>
 #include <linux/pgalloc.h>
 
@@ -69,6 +70,7 @@ struct vma_remap_struct {
 	enum mremap_type remap_type;	/* expand, shrink, etc. */
 	bool mmap_locked;		/* Is mm currently write-locked? */
 	unsigned long charged;		/* If VM_ACCOUNT, # pages to account. */
+	unsigned long reserved_thp_charged; /* If VM_RESERVED_THP, # hpages. */
 	bool vmi_needs_invalidate;	/* Is the VMA iterator invalidated? */
 };
 
@@ -962,6 +964,9 @@ static unsigned long vrm_set_new_addr(struct vma_remap_struct *vrm)
 				map_flags);
 	if (IS_ERR_VALUE(res))
 		return res;
+	if ((vma->vm_flags & VM_RESERVED_THP) &&
+	    !IS_ALIGNED(res, HPAGE_PMD_SIZE))
+		return -ENOMEM;
 
 	vrm->new_addr = res;
 	return 0;
@@ -977,24 +982,44 @@ static bool vrm_calc_charge(struct vma_remap_struct *vrm)
 {
 	unsigned long charged;
 
-	if (!(vrm->vma->vm_flags & VM_ACCOUNT))
-		return true;
+	vrm->charged = 0;
+	vrm->reserved_thp_charged = 0;
 
-	/*
-	 * If we don't unmap the old mapping, then we account the entirety of
-	 * the length of the new one. Otherwise it's just the delta in size.
-	 */
-	if (vrm->flags & MREMAP_DONTUNMAP)
-		charged = vrm->new_len >> PAGE_SHIFT;
-	else
-		charged = vrm->delta >> PAGE_SHIFT;
+	if (vrm->vma->vm_flags & VM_ACCOUNT) {
+		/*
+		 * If we don't unmap the old mapping, then we account the
+		 * entirety of the length of the new one. Otherwise it's just
+		 * the delta in size.
+		 */
+		if (vrm->flags & MREMAP_DONTUNMAP)
+			charged = vrm->new_len >> PAGE_SHIFT;
+		else
+			charged = vrm->delta >> PAGE_SHIFT;
 
 
-	/* This accounts 'charged' pages of memory. */
-	if (security_vm_enough_memory_mm(current->mm, charged))
-		return false;
+		/* This accounts 'charged' pages of memory. */
+		if (security_vm_enough_memory_mm(current->mm, charged))
+			return false;
 
-	vrm->charged = charged;
+		vrm->charged = charged;
+	}
+
+	if (vrm->vma->vm_flags & VM_RESERVED_THP) {
+		unsigned long hpages;
+
+		if (vrm->flags & MREMAP_DONTUNMAP)
+			hpages = reserved_thp_hpage_nr(0, vrm->new_len);
+		else
+			hpages = reserved_thp_hpage_nr(0, vrm->delta);
+
+		if (reserved_thp_charge(hpages)) {
+			vm_unacct_memory(vrm->charged);
+			vrm->charged = 0;
+			return false;
+		}
+
+		vrm->reserved_thp_charged = hpages;
+	}
 	return true;
 }
 
@@ -1004,11 +1029,10 @@ static bool vrm_calc_charge(struct vma_remap_struct *vrm)
  */
 static void vrm_uncharge(struct vma_remap_struct *vrm)
 {
-	if (!(vrm->vma->vm_flags & VM_ACCOUNT))
-		return;
-
 	vm_unacct_memory(vrm->charged);
 	vrm->charged = 0;
+	reserved_thp_uncharge(vrm->reserved_thp_charged);
+	vrm->reserved_thp_charged = 0;
 }
 
 /*
@@ -1157,8 +1181,8 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
 	struct vm_area_struct *vma = vrm->vma;
 	VMA_ITERATOR(vmi, mm, addr);
 	int err;
-	unsigned long vm_start;
-	unsigned long vm_end;
+	unsigned long vm_start = 0;
+	unsigned long vm_end = 0;
 	/*
 	 * It might seem odd that we check for MREMAP_DONTUNMAP here, given this
 	 * function implies that we unmap the original VMA, which seems
@@ -1170,6 +1194,8 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
 	 */
 	bool accountable_move = (vma->vm_flags & VM_ACCOUNT) &&
 		!(vrm->flags & MREMAP_DONTUNMAP);
+	bool reserved_thp_move = (vma->vm_flags & VM_RESERVED_THP) &&
+		!(vrm->flags & MREMAP_DONTUNMAP);
 
 	/*
 	 * So we perform a trick here to prevent incorrect accounting. Any merge
@@ -1192,6 +1218,13 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
 		vm_start = vma->vm_start;
 		vm_end = vma->vm_end;
 	}
+	if (reserved_thp_move) {
+		vm_flags_clear(vma, VM_RESERVED_THP);
+		if (!accountable_move) {
+			vm_start = vma->vm_start;
+			vm_end = vma->vm_end;
+		}
+	}
 
 	err = do_vmi_munmap(&vmi, mm, addr, len, vrm->uf_unmap, /* unlock= */false);
 	vrm->vma = NULL; /* Invalidated. */
@@ -1227,19 +1260,27 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
 	 *
 	 * do_vmi_munmap() will have restored the VMI back to addr.
 	 */
-	if (accountable_move) {
+	if (accountable_move || reserved_thp_move) {
 		unsigned long end = addr + len;
-
-		if (vm_start < addr) {
-			struct vm_area_struct *prev = vma_prev(&vmi);
-
-			vm_flags_set(prev, VM_ACCOUNT); /* Acquires VMA lock. */
+		struct vm_area_struct *prev = NULL;
+		struct vm_area_struct *next = NULL;
+
+		if (vm_start < addr)
+			prev = vma_prev(&vmi);
+		if (vm_end > end)
+			next = vma_next(&vmi);
+
+		if (accountable_move) {
+			if (prev)
+				vm_flags_set(prev, VM_ACCOUNT); /* Acquires VMA lock. */
+			if (next)
+				vm_flags_set(next, VM_ACCOUNT); /* Acquires VMA lock. */
 		}
-
-		if (vm_end > end) {
-			struct vm_area_struct *next = vma_next(&vmi);
-
-			vm_flags_set(next, VM_ACCOUNT); /* Acquires VMA lock. */
+		if (reserved_thp_move) {
+			if (prev)
+				vm_flags_set(prev, VM_RESERVED_THP);
+			if (next)
+				vm_flags_set(next, VM_RESERVED_THP);
 		}
 	}
 }
@@ -1309,7 +1350,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
 	*new_vma_ptr = new_vma;
 	return err;
 }
-
 /*
  * Perform final tasks for MADV_DONTUNMAP operation, clearing mlock() flag on
  * remaining VMA by convention (it cannot be mlock()'d any longer, as pages in
@@ -1576,6 +1616,23 @@ static bool align_hugetlb(struct vma_remap_struct *vrm)
 	return true;
 }
 
+static bool check_reserved_thp_alignment(struct vma_remap_struct *vrm)
+{
+	if (!(vrm->vma->vm_flags & VM_RESERVED_THP))
+		return true;
+
+	if (!IS_ALIGNED(vrm->addr, HPAGE_PMD_SIZE) ||
+	    !IS_ALIGNED(vrm->old_len, HPAGE_PMD_SIZE) ||
+	    !IS_ALIGNED(vrm->new_len, HPAGE_PMD_SIZE))
+		return false;
+
+	if ((vrm->remap_type == MREMAP_EXPAND || vrm_implies_new_addr(vrm)) &&
+	    !IS_ALIGNED(vrm->new_addr, HPAGE_PMD_SIZE))
+		return false;
+
+	return true;
+}
+
 /*
  * We are mremap()'ing without specifying a fixed address to move to, but are
  * requesting that the VMA's size be increased.
@@ -1745,6 +1802,8 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
 	/* For convenience, we set new_addr even if VMA won't move. */
 	if (!vrm_implies_new_addr(vrm))
 		vrm->new_addr = addr;
+	if (!check_reserved_thp_alignment(vrm))
+		return -EINVAL;
 
 	/* Below only meaningful if we expand or move a VMA. */
 	if (!vrm_will_map_new(vrm))
diff --git a/mm/vma.c b/mm/vma.c
index 9eea2850818a8..8c4cd7c97a984 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,6 +7,8 @@
 #include "vma_internal.h"
 #include "vma.h"
 
+#include <linux/huge_mm.h>
+
 struct mmap_state {
 	struct mm_struct *mm;
 	struct vma_iterator *vmi;
@@ -507,6 +509,10 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	WARN_ON(vma->vm_start >= addr);
 	WARN_ON(vma->vm_end <= addr);
 
+	if ((vma->vm_flags & VM_RESERVED_THP) &&
+	    !IS_ALIGNED(addr, HPAGE_PMD_SIZE))
+		return -EINVAL;
+
 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		err = vma->vm_ops->may_split(vma, addr);
 		if (err)
@@ -1361,6 +1367,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		remove_vma(vma);
 
 	vm_unacct_memory(vms->nr_accounted);
+	reserved_thp_uncharge(vms->nr_reserved_thp);
 	validate_mm(mm);
 	if (vms->unlock)
 		mmap_read_unlock(mm);
@@ -1423,6 +1430,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 			error = -EPERM;
 			goto start_split_failed;
 		}
+		if ((vms->vma->vm_flags & VM_RESERVED_THP) &&
+		    !IS_ALIGNED(vms->start, HPAGE_PMD_SIZE)) {
+			error = -EINVAL;
+			goto start_split_failed;
+		}
 
 		error = __split_vma(vms->vmi, vms->vma, vms->start, 1);
 		if (error)
@@ -1445,6 +1457,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		}
 		/* Does it split the end? */
 		if (next->vm_end > vms->end) {
+			if ((next->vm_flags & VM_RESERVED_THP) &&
+			    !IS_ALIGNED(vms->end, HPAGE_PMD_SIZE)) {
+				error = -EINVAL;
+				goto end_split_failed;
+			}
 			error = __split_vma(vms->vmi, next, vms->end, 0);
 			if (error)
 				goto end_split_failed;
@@ -1465,6 +1482,11 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		if (vma_test(next, VMA_ACCOUNT_BIT))
 			vms->nr_accounted += nrpages;
 
+		if (next->vm_flags & VM_RESERVED_THP)
+			vms->nr_reserved_thp +=
+				reserved_thp_hpage_nr(next->vm_start,
+						      next->vm_end);
+
 		if (is_exec_mapping(next->vm_flags))
 			vms->exec_vm += nrpages;
 		else if (is_stack_mapping(next->vm_flags))
@@ -1560,6 +1582,7 @@ static void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->uf = uf;
 	vms->vma_count = 0;
 	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+	vms->nr_reserved_thp = 0;
 	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
 	vms->unmap_start = FIRST_USER_ADDRESS;
 	vms->unmap_end = USER_PGTABLES_CEILING;
diff --git a/mm/vma.h b/mm/vma.h
index 8e4b61a7304c6..68e44adee5c89 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -48,6 +48,7 @@ struct vma_munmap_struct {
 	unsigned long nr_pages;         /* Number of pages being removed */
 	unsigned long locked_vm;        /* Number of locked pages */
 	unsigned long nr_accounted;     /* Number of VM_ACCOUNT pages */
+	unsigned long nr_reserved_thp;  /* Number of reserved PMD THP slots */
 	unsigned long exec_vm;
 	unsigned long stack_vm;
 	unsigned long data_vm;
-- 
2.54.0
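
The two rules this patch applies most often, "splits of a reserved VMA
must land on a PMD boundary" and "mremap charges either the whole new
length or only the size delta", can be sketched in userspace as below.
This is a simplified model: MODEL_HPAGE_PMD_SIZE is assumed to be 2 MiB,
the model_* names are hypothetical, and delta is passed in by the caller
rather than derived the way the kernel derives vrm->delta.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 2 MiB PMD-sized huge pages. */
#define MODEL_HPAGE_PMD_SHIFT 21
#define MODEL_HPAGE_PMD_SIZE  (1UL << MODEL_HPAGE_PMD_SHIFT)

bool model_pmd_aligned(unsigned long x)
{
	return (x & (MODEL_HPAGE_PMD_SIZE - 1)) == 0;
}

/*
 * Mirrors the check added to __split_vma(): a VMA carrying the reserved
 * THP flag may only be split at a PMD-aligned address.
 */
bool model_split_allowed(bool reserved, unsigned long addr)
{
	return !reserved || model_pmd_aligned(addr);
}

/*
 * Mirrors the reserved THP branch of vrm_calc_charge(): with
 * MREMAP_DONTUNMAP the old mapping stays, so the whole new length is
 * charged; otherwise only the growth (delta) is charged.
 */
unsigned long model_mremap_hpages(unsigned long new_len, unsigned long delta,
				  bool dontunmap)
{
	/* Earlier alignment checks are expected to guarantee this. */
	assert(model_pmd_aligned(new_len) && model_pmd_aligned(delta));

	return (dontunmap ? new_len : delta) >> MODEL_HPAGE_PMD_SHIFT;
}
```

For example, growing a reserved mapping from 4 MiB to 8 MiB charges two
huge pages in the plain case and four with MREMAP_DONTUNMAP.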




* [RFC PATCH 7/8] mm: support reserved THP VMAs in anonymous faults
  2026-06-27  7:21 [RFC PATCH 0/8] Introducte Reserved THP Qi Zheng
                   ` (5 preceding siblings ...)
  2026-06-27  7:26 ` [RFC PATCH 6/8] mm: maintain reserved THP quota across VMA changes Qi Zheng
@ 2026-06-27  7:26 ` Qi Zheng
  2026-06-27  7:26 ` [RFC PATCH 8/8] mm: add MADV_RESERVED_THP range policy Qi Zheng
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:26 UTC (permalink / raw)
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Wire VM_RESERVED_THP into the anonymous PMD fault path.

For reserved THP VMAs, the faulting folio is requested with the
__GFP_RESERVED_THP flag, restricting the allocation to reserved THP
pageblocks. The resulting folio remains a normal anonymous THP, using
the existing reclaim, swap, and buddy paths.

Additionally, enforce that a reserved THP fault either installs a
PMD-sized folio or fails with VM_FAULT_OOM. Falling back to the huge
zero page or to small anonymous pages is never allowed for these VMAs.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/huge_mm.h |  4 +++-
 mm/huge_memory.c        | 18 +++++++++++++-----
 mm/memory.c             |  3 +++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad20f7f8c1794..4fe9651cd86b5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -330,6 +330,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 
 		if (vm_flags & VM_HUGEPAGE)
 			mask |= READ_ONCE(huge_anon_orders_madvise);
+		if (vm_flags & VM_RESERVED_THP)
+			mask |= BIT(PMD_ORDER);
 		if (hugepage_global_always() ||
 		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
 			mask |= READ_ONCE(huge_anon_orders_inherit);
@@ -371,7 +373,7 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
 	 * Are THPs disabled only for VMAs where we didn't get an explicit
 	 * advise to use them?
 	 */
-	if (vm_flags & VM_HUGEPAGE)
+	if (vm_flags & (VM_HUGEPAGE | VM_RESERVED_THP))
 		return false;
 	/*
 	 * Forcing a collapse (e.g., madv_collapse), is a clear advice to
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2bccb0a53a0a6..66d85a2fa855f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,6 +1267,9 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 	const int order = HPAGE_PMD_ORDER;
 	struct folio *folio;
 
+	if (vma->vm_flags & VM_RESERVED_THP)
+		gfp |= __GFP_RESERVED_THP;
+
 	folio = vma_alloc_folio(gfp, order, vma, addr & HPAGE_PMD_MASK);
 
 	if (unlikely(!folio)) {
@@ -1344,8 +1347,11 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	vm_fault_t ret = 0;
 
 	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
-	if (unlikely(!folio))
+	if (unlikely(!folio)) {
+		if (vma->vm_flags & VM_RESERVED_THP)
+			return VM_FAULT_OOM;
 		return VM_FAULT_FALLBACK;
+	}
 
 	pgtable = pte_alloc_one(vma->vm_mm);
 	if (unlikely(!pgtable)) {
@@ -1480,15 +1486,17 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	vm_fault_t ret;
 
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
-		return VM_FAULT_FALLBACK;
+		return (vma->vm_flags & VM_RESERVED_THP) ? VM_FAULT_OOM :
+							   VM_FAULT_FALLBACK;
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
 	khugepaged_enter_vma(vma, vma->vm_flags);
 
-	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
-			!mm_forbids_zeropage(vma->vm_mm) &&
-			transparent_hugepage_use_zero_page()) {
+	if (!(vma->vm_flags & VM_RESERVED_THP) &&
+	    !(vmf->flags & FAULT_FLAG_WRITE) &&
+	    !mm_forbids_zeropage(vma->vm_mm) &&
+	    transparent_hugepage_use_zero_page()) {
 		pgtable_t pgtable;
 		struct folio *zero_folio;
 		vm_fault_t ret;
diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe923..225fc1ae22386 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5297,6 +5297,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_SHARED)
 		return VM_FAULT_SIGBUS;
 
+	if (unlikely(vma->vm_flags & VM_RESERVED_THP))
+		return VM_FAULT_OOM;
+
 	/*
 	 * Use pte_alloc() instead of pte_alloc_map(), so that OOM can
 	 * be distinguished from a transient failure of pte_offset_map().
-- 
2.54.0




* [RFC PATCH 8/8] mm: add MADV_RESERVED_THP range policy
  2026-06-27  7:21 [RFC PATCH 0/8] Introducte Reserved THP Qi Zheng
                   ` (6 preceding siblings ...)
  2026-06-27  7:26 ` [RFC PATCH 7/8] mm: support reserved THP VMAs in anonymous faults Qi Zheng
@ 2026-06-27  7:26 ` Qi Zheng
  2026-06-29  3:46 ` [RFC PATCH 0/8] Introducte Reserved THP Matthew Wilcox
  2026-06-29 12:20 ` David Hildenbrand (Arm)
  9 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-27  7:26 UTC (permalink / raw)
  To: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Introduce MADV_RESERVED_THP as a new madvise() policy to enable PMD-sized
reserved THP allocations for anonymous VMAs.

The policy enforces the following rules:
- Limited to private anonymous VMAs with PMD-aligned ranges, on 64-bit
  kernels only.
- Mutually exclusive with other hugepage VMA flags (handled via
  hugepage_madvise()).
- Charges the reserved THP quota upon success, and rolls back the charge
  if madvise_update_vma() fails.
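- Rejects MADV_COLD, MADV_PAGEOUT, MADV_FREE, MADV_DONTNEED and
  MADV_DONTNEED_LOCKED on reserved THP VMAs unless the range is
  PMD-aligned, since a partial operation could split PMD THPs.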
- Rejects MADV_COLLAPSE, as khugepaged cannot currently allocate from
  the reserved capacity.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  2 +
 arch/mips/include/uapi/asm/mman.h      |  2 +
 arch/parisc/include/uapi/asm/mman.h    |  2 +
 arch/xtensa/include/uapi/asm/mman.h    |  2 +
 include/uapi/asm-generic/mman-common.h |  2 +
 mm/khugepaged.c                        |  8 +++
 mm/madvise.c                           | 83 +++++++++++++++++++++++++-
 7 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 1e700468a6858..672a2fc343861 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_RESERVED_THP 26		/* Use reserved transparent hugepages */
+
 #define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
 #define MADV_GUARD_REMOVE 103		/* unguard range */
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index b700dae28c482..a94bf74dee21c 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_RESERVED_THP 26		/* Use reserved transparent hugepages */
+
 #define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
 #define MADV_GUARD_REMOVE 103		/* unguard range */
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index b6a709506987e..fe2fddefb6c5d 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -72,6 +72,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_RESERVED_THP 26		/* Use reserved transparent hugepages */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 99d4ccee7f6e8..bb603530ba799 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_RESERVED_THP 26		/* Use reserved transparent hugepages */
+
 #define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
 #define MADV_GUARD_REMOVE 103		/* unguard range */
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ef1c27fa3c570..b3d1448935ead 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_RESERVED_THP 26		/* Use reserved transparent hugepages */
+
 #define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
 #define MADV_GUARD_REMOVE 103		/* unguard range */
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76db49b..80293e8c1e4e7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -451,6 +451,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
 	switch (advice) {
 	case MADV_HUGEPAGE:
 		*vm_flags &= ~VM_NOHUGEPAGE;
+		*vm_flags &= ~VM_RESERVED_THP;
 		*vm_flags |= VM_HUGEPAGE;
 		/*
 		 * If the vma become good for khugepaged to scan,
@@ -461,6 +462,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		break;
 	case MADV_NOHUGEPAGE:
 		*vm_flags &= ~VM_HUGEPAGE;
+		*vm_flags &= ~VM_RESERVED_THP;
 		*vm_flags |= VM_NOHUGEPAGE;
 		/*
 		 * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
@@ -468,6 +470,12 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 * it got registered before VM_NOHUGEPAGE was set.
 		 */
 		break;
+	case MADV_RESERVED_THP:
+		*vm_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
+		*vm_flags |= VM_RESERVED_THP;
+		break;
+	default:
+		return -EINVAL;
 	}
 
 	return 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index cd9bb077072cc..dd91105db68c7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -13,6 +13,7 @@
 #include <linux/page-isolation.h>
 #include <linux/page_idle.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/huge_mm.h>
 #include <linux/hugetlb.h>
 #include <linux/falloc.h>
 #include <linux/fadvise.h>
@@ -1331,6 +1332,65 @@ static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
 }
 #endif
 
+static bool reserved_thp_madvise_aligned(struct vm_area_struct *vma,
+					 struct madvise_behavior_range *range)
+{
+	if (!(vma->vm_flags & VM_RESERVED_THP))
+		return true;
+
+	return IS_ALIGNED(range->start, HPAGE_PMD_SIZE) &&
+	       IS_ALIGNED(range->end, HPAGE_PMD_SIZE);
+}
+
+static int madvise_hugepage_policy(struct madvise_behavior *madv_behavior,
+				   vm_flags_t *new_flags,
+				   unsigned long *reserved_hpages,
+				   bool *charge_reserved_thp,
+				   bool *uncharge_reserved_thp)
+{
+	struct vm_area_struct *vma = madv_behavior->vma;
+	struct madvise_behavior_range *range = &madv_behavior->range;
+	unsigned long hpages;
+	int behavior = madv_behavior->behavior;
+	int error;
+
+	switch (behavior) {
+	case MADV_HUGEPAGE:
+	case MADV_NOHUGEPAGE:
+		error = hugepage_madvise(vma, new_flags, behavior);
+		if (error)
+			return error;
+		*uncharge_reserved_thp = (vma->vm_flags & VM_RESERVED_THP) &&
+					 !(*new_flags & VM_RESERVED_THP);
+		return 0;
+	case MADV_RESERVED_THP:
+		if (!IS_ENABLED(CONFIG_64BIT))
+			return -EINVAL;
+		if (!vma_is_anonymous(vma) || (*new_flags & VM_SHARED) ||
+		    (*new_flags & VM_SPECIAL))
+			return -EINVAL;
+		if (!IS_ALIGNED(range->start, HPAGE_PMD_SIZE) ||
+		    !IS_ALIGNED(range->end, HPAGE_PMD_SIZE))
+			return -EINVAL;
+
+		error = hugepage_madvise(vma, new_flags, behavior);
+		if (error)
+			return error;
+
+		if (!(vma->vm_flags & VM_RESERVED_THP)) {
+			hpages = reserved_thp_hpage_nr(range->start, range->end);
+			error = reserved_thp_charge(hpages);
+			if (error)
+				return error;
+			*reserved_hpages = hpages;
+			*charge_reserved_thp = true;
+		}
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1342,6 +1402,9 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 	struct vm_area_struct *vma = madv_behavior->vma;
 	vm_flags_t new_flags = vma->vm_flags;
 	struct madvise_behavior_range *range = &madv_behavior->range;
+	unsigned long reserved_hpages = 0;
+	bool charge_reserved_thp = false;
+	bool uncharge_reserved_thp = false;
 	int error;
 
 	if (unlikely(!can_madvise_modify(madv_behavior)))
@@ -1353,14 +1416,22 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 	case MADV_WILLNEED:
 		return madvise_willneed(madv_behavior);
 	case MADV_COLD:
+		if (!reserved_thp_madvise_aligned(vma, range))
+			return -EINVAL;
 		return madvise_cold(madv_behavior);
 	case MADV_PAGEOUT:
+		if (!reserved_thp_madvise_aligned(vma, range))
+			return -EINVAL;
 		return madvise_pageout(madv_behavior);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 	case MADV_DONTNEED_LOCKED:
+		if (!reserved_thp_madvise_aligned(vma, range))
+			return -EINVAL;
 		return madvise_dontneed_free(madv_behavior);
 	case MADV_COLLAPSE:
+		if (vma->vm_flags & VM_RESERVED_THP)
+			return -EINVAL;
 		return madvise_collapse(vma, range->start, range->end,
 			&madv_behavior->lock_dropped);
 	case MADV_GUARD_INSTALL:
@@ -1416,7 +1487,11 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 		break;
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
-		error = hugepage_madvise(vma, &new_flags, behavior);
+	case MADV_RESERVED_THP:
+		error = madvise_hugepage_policy(madv_behavior, &new_flags,
+						&reserved_hpages,
+						&charge_reserved_thp,
+						&uncharge_reserved_thp);
 		if (error)
 			goto out;
 		break;
@@ -1431,6 +1506,11 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 	VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
 
 	error = madvise_update_vma(new_flags, madv_behavior);
+	if (error && charge_reserved_thp)
+		reserved_thp_uncharge(reserved_hpages);
+	else if (!error && uncharge_reserved_thp)
+		reserved_thp_uncharge(reserved_thp_hpage_nr(range->start,
+							    range->end));
 out:
 	/*
 	 * madvise() returns EAGAIN if kernel resources, such as
@@ -1541,6 +1621,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 	case MADV_COLLAPSE:
+	case MADV_RESERVED_THP:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
-- 
2.54.0




* Re: [RFC PATCH 0/8] Introducte Reserved THP
  2026-06-27  7:21 [RFC PATCH 0/8] Introducte Reserved THP Qi Zheng
                   ` (7 preceding siblings ...)
  2026-06-27  7:26 ` [RFC PATCH 8/8] mm: add MADV_RESERVED_THP range policy Qi Zheng
@ 2026-06-29  3:46 ` Matthew Wilcox
  2026-06-29 10:13   ` Qi Zheng
  2026-06-29 12:20 ` David Hildenbrand (Arm)
  9 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2026-06-29  3:46 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, vbabka, surenb, mhocko, jackmanb, hannes, linux-mm,
	linux-kernel, Qi Zheng

On Sat, Jun 27, 2026 at 03:21:48PM +0800, Qi Zheng wrote:
> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
> to open up a discussion on how to use this as a stepping stone toward unifying
> HugeTLB and THP (Transparent Huge Page).

I'm really happy you're looking into this.  I'm not terribly familiar
with the page allocator code, so I don't have any comments on the
patches themselves, but I do have a few on your approach.

> Therefore, we are wondering if we can introduce "reserved THP", which is THP
> that can be reserved. It can be consumed through methods like madvise(), while
> normal memory allocation cannot consume it. This can achieve an effect similar
> to hugetlb. And because it is THP, it can relatively easily support swap
> features, which perfectly solves the above problem.

As I understand it, hugetlbfs reserves on mmap().

> This RFC wants to discuss another implementation:
> 
> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
>    and `thp_reserved_nr`. When set, the required memory is marked as
>    MIGRATE_RESERVED_THP and put back into the buddy allocator.
> 3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
>    MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
>    Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.
> 
> This can achieve a reservation effect similar to HugeTLB and guarantee
> allocation success.

I think this is an interesting approach.  I don't think it should be too
hard to migrate existing hugetlbfs users to it.

> 3. Future Plans
> ===============
> 
> 3.1 Enhance swap-out and swap-in for large folios
> -------------------------------------------------
> 
> Currently, For swap-out, THP_SWAP is supported, but it only tries to swap out
> the THP folio as a whole. It is still possible to be forced to split in some
> situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
> swap-in, it is almost impossible to directly swap in the THP folio as a whole.
> 
> But for reserved THP, splitting is not allowed. We need to ensure that it
> remains a whole huge page during swap-out and swap-in, to achieve a function
> similar to hugetlb swap.

So I think the current restriction is something that needs to be fixed
anyway.  It doesn't actually make sense that a folio must be written out
contiguously; filesystems do not have this restriction.  I understand
why swap currently has this limitation, but I'm hoping it gets removed
at some point.  I'm not sure if the people working on swap right now
intend to fix this.  They're already on the cc, so I hope they chime in.

> 3.2 Integrate reserved THP into the common reclaim path
> -------------------------------------------------------
> 
> Once swap-in and swap-out of huge pages can be supported without splitting,
> reserved THP can be integrated into the common reclaim path as a normal LRU
> folio for memory reclamation. This fills the gap of the hugetlb swap function.

Hm.  Then what does "reserved THP" mean if they can be swapped out?

> 3.4 Use reserved THP as a backend for hugetlbfs
> -----------------------------------------------
> 
> This would allow existing hugetlb users or applications to seamlessly switch to
> reserved THP.

If this is the end goal, then I think introducing new command line
options is probably the wrong approach right now.  Instead, "reserved
THPs" should be allocated from the same pool as hugetlb reserve.  That
way we're not jerking sysadmins around.

> 3.5 Add 1GB page support to reserved THP
> ----------------------------------------
> 
> Historically, there have been several attempts to add 1GB huge page support to
> THP:
> 
> 1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@gmail.com/
> 2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
> 
> Adding 1GB huge page support for reserved THP would be relatively simpler
> compared to regular THP.

Well.  Maybe?  What happens if we mmap() 16GiB,
madvise(USE_RESERVED_THPS) and then munmap() the first 4KiB of it?

> 3.6 Remove Hugetlb
> ------------------
> 
> Once reserved THP can completely replace the existing functions of hugetlb, we
> can gradually remove Hugetlb, leaving only one huge page management system in
> the kernel.

We also need mshare to land ... but yes, eventually removing hugetlbfs
is my hope.



* Re: [RFC PATCH 0/8] Introducte Reserved THP
  2026-06-29  3:46 ` [RFC PATCH 0/8] Introducte Reserved THP Matthew Wilcox
@ 2026-06-29 10:13   ` Qi Zheng
  0 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2026-06-29 10:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, david, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, vbabka, surenb, mhocko, jackmanb, hannes, linux-mm,
	linux-kernel, Qi Zheng

Hi Matthew,

Thanks a lot for your feedback!

On 6/29/26 11:46 AM, Matthew Wilcox wrote:
> On Sat, Jun 27, 2026 at 03:21:48PM +0800, Qi Zheng wrote:
>> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
>> to open up a discussion on how to use this as a stepping stone toward unifying
>> HugeTLB and THP (Transparent Huge Page).
> 
> I'm really happy you're looking into this.  I'm not terribly familiar
> with the page allocator code, so I don't have any comments on the
> patches themselves, but I do have a few on your approach.

This is also what I am hoping for. The current version of the code is
just a proof-of-concept (PoC) to facilitate discussion. The real goal is
to use reserved THP as a stepping stone to discuss the challenges of
unifying HugeTLB and THP, and the overall evolution path.

Of course, swap support is a key part too. ;)

> 
>> Therefore, we are wondering if we can introduce "reserved THP", which is THP
>> that can be reserved. It can be consumed through methods like madvise(), while
>> normal memory allocation cannot consume it. This can achieve an effect similar
>> to hugetlb. And because it is THP, it can relatively easily support swap
>> features, which perfectly solves the above problem.
> 
> As I understand it, hugetlbfs reserves on mmap().

Exactly, hugetlbfs reserves HugeTLB pages at mmap() time:

hugetlbfs_file_mmap
--> hugetlb_reserve_pages

and it's the same without using hugetlbfs:

hugetlb_file_setup
--> hugetlb_reserve_pages

Using madvise() as the example is based on the following considerations:

1. It closely aligns with the existing usage patterns of THP madvise
    mode.
2. To properly support swap, we actually need to allow overcommit before
    actual page faults occur. This allows us to perform memory reclaim
    during the page fault, swapping out cold reserved THP to satisfy the
    memory demands of a new process. So we can't directly pre-reserve the
    reserved THP at mmap/madvise time.

The second point seems to be a challenge that HugeTLB would also face if
it were to support swap. Perhaps reserved THP could be designed with two
modes:

1. with swap support: using the current madvise method.
2. without swap support: in this mode, we can directly let hugetlbfs
    reserve the reserved THP at mmap() time. The behavior remains the
    same, purely switching the underlying backend.

But this might muddy the semantics a bit...
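
For reference, the existing THP madvise usage pattern mentioned in point 1
looks roughly like the userspace sketch below. It is only an illustration
under assumptions: MADV_HUGEPAGE stands in for the proposed
MADV_RESERVED_THP (which is not in mainline headers), the 2 MiB huge page
size is assumed (x86-64 with 4 KiB base pages), and struct thp_region,
thp_region_map() and thp_region_unmap() are hypothetical names. The start
and length are aligned because patch 8/8 rejects ranges whose start or end
is not HPAGE_PMD_SIZE aligned.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2 MiB PMD-sized huge pages (x86-64, 4 KiB base pages). */
#define ASSUMED_HPAGE_PMD_SIZE (2UL << 20)

/* Hypothetical helper type, not part of the patchset. */
struct thp_region {
	void *raw;       /* start of the whole mapping, needed for munmap() */
	size_t raw_len;  /* length of the whole mapping */
	void *aligned;   /* huge-page-aligned start inside the mapping */
	size_t len;      /* usable length, a multiple of the huge page size */
	int madvise_ret; /* result of the madvise() hint: 0 or -1 */
};

/*
 * Map nr_hpages huge pages worth of private anonymous memory and apply
 * the huge page hint to a huge-page-aligned sub-range. One extra huge
 * page is mapped so that an aligned start always fits in the mapping.
 * Returns 0 on success and -1 if mmap() fails. A failing madvise() is
 * only recorded, since the hint may be unsupported on some kernels.
 */
int thp_region_map(struct thp_region *r, size_t nr_hpages)
{
	size_t len = nr_hpages * ASSUMED_HPAGE_PMD_SIZE;
	size_t raw_len = len + ASSUMED_HPAGE_PMD_SIZE;
	uintptr_t start;
	void *raw;

	raw = mmap(NULL, raw_len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	/* Round the start address up to the next huge page boundary. */
	start = ((uintptr_t)raw + ASSUMED_HPAGE_PMD_SIZE - 1) &
		~(uintptr_t)(ASSUMED_HPAGE_PMD_SIZE - 1);

	r->raw = raw;
	r->raw_len = raw_len;
	r->aligned = (void *)start;
	r->len = len;
	/* Stand-in for madvise(..., MADV_RESERVED_THP) from this RFC. */
	r->madvise_ret = madvise(r->aligned, len, MADV_HUGEPAGE);
	return 0;
}

/* Release the whole mapping, including the alignment slack. */
void thp_region_unmap(struct thp_region *r)
{
	munmap(r->raw, r->raw_len);
}
```

With the RFC applied, the madvise() call is the step where the reserved THP
quota would be charged for the VMA, while the pages themselves would still
be allocated at fault time.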

> 
>> This RFC wants to discuss another implementation:
>>
>> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
>> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
>>     and `thp_reserved_nr`. When set, the required memory is marked as
>>     MIGRATE_RESERVED_THP and put back into the buddy allocator.
>> 3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
>>     MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
>>     Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.
>>
>> This can achieve a reservation effect similar to HugeTLB and guarantee
>> allocation success.
> 
> I think this is an interesting approach.  I don't think it should be too
> hard to migrate existing hugetlbfs users to it.

That is also what I hope to see.

> 
>> 3. Future Plans
>> ===============
>>
>> 3.1 Enhance swap-out and swap-in for large folios
>> -------------------------------------------------
>>
>> Currently, For swap-out, THP_SWAP is supported, but it only tries to swap out
>> the THP folio as a whole. It is still possible to be forced to split in some
>> situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
>> swap-in, it is almost impossible to directly swap in the THP folio as a whole.
>>
>> But for reserved THP, splitting is not allowed. We need to ensure that it
>> remains a whole huge page during swap-out and swap-in, to achieve a function
>> similar to hugetlb swap.
> 
> So I think the current restriction is something that needs to be fixed
> anyway.  It doesn't actually make sense that a folio must be written out
> contiguously; filesystems do not have this restriction.  I understand

Hopefully, there won't be too much pushback.

> why swap currently has this limitation, but I'm hoping it gets removed
> at some point.  I'm not sure if the people working on swap right now
> intend to fix this.  They're already on the cc, so I hope they chime in.

+1.

Hi SWAP folks, how hard would it be to get this implemented? Are there
any current plans for this? ;)

> 
>> 3.2 Integrate reserved THP into the common reclaim path
>> -------------------------------------------------------
>>
>> Once swap-in and swap-out of huge pages can be supported without splitting,
>> reserved THP can be integrated into the common reclaim path as a normal LRU
>> folio for memory reclamation. This fills the gap of the hugetlb swap function.
> 
> Hm.  Then what does "reserved THP" mean if they can be swapped out?

Indeed, it is a bit weird.

In this version, what's actually reserved is essentially a memory pool.
After a reserved THP page is swapped out, the space in the pool might be
consumed by someone else. So, there's no guarantee that this reserved
THP page can be successfully swapped back in.

But if we don't want it swapped out, it can be guaranteed via mlock or
GUP.

> 
>> 3.4 Use reserved THP as a backend for hugetlbfs
>> -----------------------------------------------
>>
>> This would allow existing hugetlb users or applications to seamlessly switch to
>> reserved THP.
> 
> If this is the end goal, then I think introducing new command line
> options is probably the wrong approach right now.  Instead, "reserved
> THPs" should be allocated from the same pool as hugetlb reserve.  That
> way we're not jerking sysadmins around.

Do you mean reusing the existing HugeTLB boot parameters instead of
introducing new ones? That seems quite difficult to implement during the
transition. My idea is that we can eventually drop the HugeTLB boot
parameters entirely, so the system will still end up with only one set
of parameters. ;)

> 
>> 3.5 Add 1GB page support to reserved THP
>> ----------------------------------------
>>
>> Historically, there have been several attempts to add 1GB huge page support to
>> THP:
>>
>> 1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@gmail.com/
>> 2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
>>
>> Adding 1GB huge page support for reserved THP would be relatively simpler
>> compared to regular THP.
> 
> Well.  Maybe?  What happens if we mmap() 16GiB,

At least the side effects are limited strictly to reserved THPs, and
reserved THP is pre-reserved, ensuring a higher allocation success rate.

> madvise(USE_RESERVED_THPS) and then munmap() the first 4KiB of it?

Since splitting is not allowed for reserved THPs, the entire huge page
will be freed at munmap time.
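
To make the arithmetic of that example concrete, here is a small
standalone sketch. It is only one reading of "the entire huge page will be
freed": the unmapped range is rounded outward to huge page boundaries.
struct addr_range and the three helpers are hypothetical names, the huge
page size is assumed to be a power of two, and the alignment test mirrors
the IS_ALIGNED() checks on range->start and range->end in patch 8/8.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K (4ULL << 10)
#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

/* Half-open address range [start, end). Hypothetical helper type. */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/* True if both ends sit on a huge page boundary (size is a power of two). */
bool range_is_hpage_aligned(struct addr_range r, uint64_t hpage_size)
{
	return (r.start & (hpage_size - 1)) == 0 &&
	       (r.end & (hpage_size - 1)) == 0;
}

/* Round the range outward so that it covers only whole huge pages. */
struct addr_range range_expand_to_hpages(struct addr_range r,
					 uint64_t hpage_size)
{
	struct addr_range out;

	out.start = r.start & ~(hpage_size - 1);
	out.end = (r.end + hpage_size - 1) & ~(hpage_size - 1);
	return out;
}

/* Number of huge pages in an already aligned range. */
uint64_t range_hpage_nr(struct addr_range r, uint64_t hpage_size)
{
	return (r.end - r.start) / hpage_size;
}
```

For a 16GiB mapping backed by 1GiB pages, munmap() of the first 4KiB is not
huge page aligned, and rounding it outward covers the first 1GiB, i.e. one
of the 16 huge pages. With 2MiB pages the same call would cover one of 8192.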

> 
>> 3.6 Remove Hugetlb
>> ------------------
>>
>> Once reserved THP can completely replace the existing functions of hugetlb, we
>> can gradually remove Hugetlb, leaving only one huge page management system in
>> the kernel.
> 
> We also need mshare to land ... but yes, eventually removing hugetlbfs

mshare? Do you mean CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING?

> is my hope.

+1.

Thanks,
Qi





* Re: [RFC PATCH 0/8] Introducte Reserved THP
  2026-06-27  7:21 [RFC PATCH 0/8] Introducte Reserved THP Qi Zheng
                   ` (8 preceding siblings ...)
  2026-06-29  3:46 ` [RFC PATCH 0/8] Introducte Reserved THP Matthew Wilcox
@ 2026-06-29 12:20 ` David Hildenbrand (Arm)
  9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 12:20 UTC (permalink / raw)
  To: Qi Zheng, akpm, ljs, ziy, baolin.wang, liam, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, osalvador, chrisl,
	kasong, shikemeng, nphamcs, baoquan.he, youngjun.park, peterx,
	usama.arif, willy, vbabka, surenb, mhocko, jackmanb, hannes
  Cc: linux-mm, linux-kernel, Qi Zheng

On 6/27/26 09:21, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Hi all,
> 

Hi,

> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
> to open up a discussion on how to use this as a stepping stone toward unifying
> HugeTLB and THP (Transparent Huge Page).
> 
> 1. Background
> =============
> 
> Currently, two huge page solutions co-exist in the kernel:
> 
> 1. HugeTLB: Supports reservation, guaranteeing successful allocation within the
>             reserved pool. However, it does not support features like swap. And
>             it is a relatively independent subsystem.
> 2. THP: Does not support reservation and may fail to allocate and fallback to
>         small pages when system memory is fragmented, but it is more tightly
>         integrated with mm core and supports features like swap.
> 
> Both have their pros and cons. However, in one of our internal scenarios, it
> seems we need to combine the features of both to meet the requirements.
> 
> In our internal scenario, a user process needs to reserve double the amount
> of Hugetlb memory due to hot-upgrade requirements. For example, if the
> process needs 16GB of Hugetlb, an additional 16GB is required during the
> hot-upgrade to satisfy memory allocations. After the upgrade, the old
> process exits and releases the 16GB of HugeTLB. Therefore, in most cases,
> the extra 16GB of HugeTLB is wasted.
> 
> A straightforward idea is to use the Hugetlb CMA feature, reserving a total
> of 32GB of hugetlb_cma. During normal operation, 16GB is consumed, and the
> remaining 16GB can be used by other processes. During hot-upgrade, we could
> try to migrate the memory used by other processes to allocate the required
> extra 16GB of Hugetlb. This might work, but it still requires reserving 32GB
> of memory.
> 
> We also found that during the hot upgrade, about 10GB of the old process's
> hugetlb is actually cold memory, which could theoretically be reclaimed. In
> extreme cases, we could reserve only 22GB of memory and reclaim the
> remaining 10GB during the hot upgrade. But unfortunately, hugetlb currently
> does not support swap, and supporting it seems quite difficult.
> 
> Therefore, we are wondering if we can introduce "reserved THP", which is THP
> that can be reserved. It can be consumed through methods like madvise(), while
> normal memory allocation cannot consume it.

madvise(). Gah. No :)

> This can achieve an effect similar
> to hugetlb. And because it is THP, it can relatively easily support swap
> features, which perfectly solves the above problem.

No, this is the wrong approach. We really shouldn't be making the same mistake
hugetlb did by supporting reservation of non-file-backed memory (IOW anonymous
memory).

And even for files, the hugetlb mechanism is an absolute trainwreck, because
it's not NUMA aware.

This really needs some proper thought.

> 
> Additionally, in 2024 (or possibly earlier), there have been discussions about
> the possibility of unifying Hugetlb and THP:
> 
> Link: https://lwn.net/Articles/974491/
> 
> After all, hugetlb's management is relatively independent and requires too
> much special handling in mm core. The introduction of reserved THP might be
> an opportunity. In the future, reserved THP could be enhanced to support
> various hugetlb features, such as acting as a backend for hugetlbfs. When
> reserved THP can completely replace HugeTLB, HugeTLB could be entirely
> removed, and reserved THP would just become a feature of THP.
> 
> 2. Implementation
> =================
> 
> In 2024, Yu Zhao proposed a similar idea:
> 
> Link: https://lore.kernel.org/all/20240229183436.4110845-2-yuzhao@google.com/
> 
> The idea was to introduce two virt zones: ZONE_NOSPLIT and ZONE_NOMERGE to
> guarantee the allocation success rate of THP, achieving an effect similar to
> reservation. However, it seems there was no further progress, perhaps because of
> reluctance to introduce more virt zones like ZONE_MOVABLE.
> 
> This RFC wants to discuss another implementation:
> 
> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
>    and `thp_reserved_nr`. When set, the required memory is marked as
>    MIGRATE_RESERVED_THP and put back into the buddy allocator.

I'm all for some mechanism to make runtime allocation of large chunks of memory
easier, by adding a pool from where multiple consumers (THP, guest_memfd,
hugetlb, whatever) can allocate memory.

Call me very skeptical of getting the page allocator involved like this. (I hate it)

> 3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
>    MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
>    Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.

Definitely no.

> 
> This can achieve a reservation effect similar to HugeTLB and guarantee
> allocation success.
> 
> 3. Future Plans
> ===============
> 
> 3.1 Enhance swap-out and swap-in for large folios
> -------------------------------------------------
> 
> Currently, For swap-out, THP_SWAP is supported, but it only tries to swap out
> the THP folio as a whole. It is still possible to be forced to split in some
> situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
> swap-in, it is almost impossible to directly swap in the THP folio as a whole.
> 
> But for reserved THP, splitting is not allowed. We need to ensure that it
> remains a whole huge page during swap-out and swap-in, to achieve a function
> similar to hugetlb swap.
> 
> 
> 3.2 Integrate reserved THP into the common reclaim path
> -------------------------------------------------------
> 
> Once swap-in and swap-out of huge pages can be supported without splitting,
> reserved THP can be integrated into the common reclaim path as a normal LRU
> folio for memory reclamation. This fills the gap of the hugetlb swap function.
> 
> 3.3 Use reserved THP as a backend for shmem/tmpfs
> -------------------------------------------------
> 
> This would allow shared or file-like usage to utilize reserved THP.
> 

Really, any kind of reservation should be file-centric and have some level of
control.

And soon the question would pop up "but how can we control this inside memcgs".

This all needs some thought.


> 3.4 Use reserved THP as a backend for hugetlbfs
> -----------------------------------------------
> 
> This would allow existing hugetlb users or applications to seamlessly switch to
> reserved THP.

You are really talking about a memory pool that can be used by different consumers.

I raised that in the past in the context of guest_memfd, whereby the short-term
plan is to take pages from hugetlb's pool, when really there should be a global
pool that can be consumed by various consumers.

A lot of questions around that.

> 
> 3.5 Add 1GB page support to reserved THP
> ----------------------------------------
> 
> Historically, there have been several attempts to add 1GB huge page support to
> THP:
> 
> 1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@gmail.com/
> 2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
> 
> Adding 1GB huge page support for reserved THP would be relatively simpler
> compared to regular THP.

And that's what I told Usama: start with 1 GiB THP support for shmem/tmpfs, and
make it configurable.

How we would add a reservation mechanism is a good question. Because hugetlb
reservation is a broken concept. And anything that's not NUMA or memcg aware
will be a broken concept I'm afraid.

> 
> 3.6 Remove Hugetlb
> ------------------
> 
> Once reserved THP can completely replace the existing functions of hugetlb, we
> can gradually remove Hugetlb, leaving only one huge page management system in
> the kernel.

I'm sorry, but no way this will work in any reasonable timeframe unless you
mimic the exact user facing ABI -- and I don't think we'll gain a lot that way.

I know, we all like to dream, but this just isn't feasible.

-- 
Cheers,

David



end of thread

Thread overview: 12+ messages
2026-06-27  7:21 [RFC PATCH 0/8] Introducte Reserved THP Qi Zheng
2026-06-27  7:21 ` [RFC PATCH 1/8] mm: page_alloc: add reserved THP pageblock type Qi Zheng
2026-06-27  7:21 ` [RFC PATCH 2/8] mm: add boot-time reserved THP pageblock capacity Qi Zheng
2026-06-27  7:21 ` [RFC PATCH 3/8] mm: page_alloc: add a reserved THP allocation primitive Qi Zheng
2026-06-27  7:21 ` [RFC PATCH 4/8] mm: add reserved THP quota helpers Qi Zheng
2026-06-27  7:21 ` [RFC PATCH 5/8] mm: add reserved THP vma flag Qi Zheng
2026-06-27  7:26 ` [RFC PATCH 6/8] mm: maintain reserved THP quota across VMA changes Qi Zheng
2026-06-27  7:26 ` [RFC PATCH 7/8] mm: support reserved THP VMAs in anonymous faults Qi Zheng
2026-06-27  7:26 ` [RFC PATCH 8/8] mm: add MADV_RESERVED_THP range policy Qi Zheng
2026-06-29  3:46 ` [RFC PATCH 0/8] Introducte Reserved THP Matthew Wilcox
2026-06-29 10:13   ` Qi Zheng
2026-06-29 12:20 ` David Hildenbrand (Arm)
