From: Wei Yang <richard.weiyang@gmail.com>
To: Nico Pache <npache@redhat.com>
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
aarcange@redhat.com, akpm@linux-foundation.org,
anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org,
baolin.wang@linux.alibaba.com, byungchul@sk.com,
catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net,
dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com,
gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com,
jack@suse.cz, jackmanb@google.com, jannh@google.com,
jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org,
lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org,
mathieu.desnoyers@efficios.com, matthew.brost@intel.com,
mhiramat@kernel.org, mhocko@suse.com, peterx@redhat.com,
pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com,
rdunlap@infradead.org, richard.weiyang@gmail.com,
rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org,
ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com,
surenb@google.com, thomas.hellstrom@linux.intel.com,
tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz,
vishal.moola@gmail.com, wangkefeng.wang@huawei.com,
will@kernel.org, willy@infradead.org,
yang@os.amperecomputing.com, ying.huang@linux.alibaba.com,
ziy@nvidia.com, zokeefe@google.com
Subject: Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
Date: Tue, 12 May 2026 15:44:31 +0000
Message-ID: <20260512154431.jxcs632mqqatqtsw@master>
In-Reply-To: <20260511185817.686831-12-npache@redhat.com>
On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>Enable khugepaged to collapse to mTHP orders. This patch implements the
>main scanning logic using a bitmap to track occupied pages and a stack
>structure that allows us to find optimal collapse sizes.
>
>Prior to this patch, PMD collapse had 3 main phases: a lightweight
>scanning phase (mmap_read_lock) that determines a potential PMD
>collapse, an allocation phase (mmap unlocked), and finally a heavier
>collapse phase (mmap_write_lock).
>
>To enable mTHP collapse we make the following changes:
>
>During PMD scan phase, track occupied pages in a bitmap. When mTHP
>orders are enabled, we remove the restriction of max_ptes_none during the
>scan phase to avoid missing potential mTHP collapse candidates. Once we
>have scanned the full PMD range and updated the bitmap to track occupied
>pages, we use the bitmap to find the optimal mTHP size.
>
>Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>and determine the best eligible order for the collapse. A stack structure
>is used instead of traditional recursion to manage the search, since deep
>recursion on the limited kernel stack must be avoided. The algorithm
>splits the bitmap into smaller chunks to find the highest-order mTHPs that
>satisfy the collapse criteria. We start by attempting the PMD order, then
>move on to consecutively lower orders (mTHP collapse). The stack maintains
>pairs of (offset, order) variables, indicating the number of PTEs from the
>start of the PMD and the order of the potential collapse candidate.
>
>The algorithm for consuming the bitmap works as follows:
> 1) push (0, HPAGE_PMD_ORDER) onto the stack
> 2) pop the stack
> 3) check whether the number of set bits in that (offset, order) pair
> satisfies the max_ptes_none threshold for that order
> 4) if yes, attempt collapse
> 5) if no (or collapse fails), push two new stack items representing
> the left and right halves of the current bitmap range, at the
> next lower order
> 6) repeat at step (2) until stack is empty.
>
>Below is a diagram representing the algorithm and stack items:
>
> offset mid_offset
> | |
> | |
> v v
> ____________________________________
> | PTE Page Table |
> --------------------------------------
> <-------><------->
> order-1 order-1
>
>mTHP collapse rejects regions containing swapped-out or shared pages,
>because collapsing such a region can introduce new none pages, and these
>may lead to constant promotion into ever higher mTHP orders. A similar
>issue occurs with "max_ptes_none > HPAGE_PMD_NR/2": a collapse then at
>least doubles the number of present pages, so the region will satisfy
>the promotion condition again on a future scan. This is prevented by the
>collapse_max_ptes_none() function, which imposes the max_ptes_none
>restrictions described above.
>
>We currently only support mTHP collapse for max_ptes_none values of 0
>and HPAGE_PMD_NR - 1, resulting in the following behavior:
>
> - max_ptes_none=0: Never introduce new empty pages during collapse
> - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> available mTHP order
>
>Any other max_ptes_none value will emit a warning and skip mTHP collapse
>attempts. There should be no behavior change for PMD collapse.
>
>Once we determine which mTHP size fits best in that PMD range, a collapse
>is attempted. A minimum collapse order of 2 is used, as this is the lowest
>order supported for anon memory, as defined by THP_ORDERS_ALL_ANON.
>
>Currently madv_collapse is not supported and will only attempt PMD
>collapse.
>
>We can also remove the check for is_khugepaged inside the PMD scan as
>the collapse_max_ptes_none() function handles this logic now.
>
>Signed-off-by: Nico Pache <npache@redhat.com>
[...]
>+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>+ int referenced, int unmapped, struct collapse_control *cc,
>+ unsigned long enabled_orders)
>+{
>+ unsigned int nr_occupied_ptes, nr_ptes;
>+ int max_ptes_none, collapsed = 0, stack_size = 0;
>+ unsigned long collapse_address;
>+ struct mthp_range range;
>+ u16 offset;
>+ u8 order;
>+
>+ collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>+
>+ while (stack_size) {
>+ range = collapse_mthp_stack_pop(cc, &stack_size);
>+ order = range.order;
>+ offset = range.offset;
>+ nr_ptes = 1UL << order;
>+
>+ if (!test_bit(order, &enabled_orders))
>+ goto next_order;
>+
>+ max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
I am wondering whether there is a behavioral change for userfaultfd_armed(vma).
collapse_single_pmd()
  collapse_scan_pmd()
    max_ptes_none = collapse_max_ptes_none(cc, vma)
      max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT         --- (1)
    mthp_collapse()
      max_ptes_none = collapse_max_ptes_none(cc, NULL)  --- (2)
      collapse_huge_page(mm)
        hugepage_vma_revalidate(&vma)
        __collapse_huge_page_isolate(vma)
          max_ptes_none = collapse_max_ptes_none(cc, vma)
Before mthp_collapse() was introduced, a userfaultfd_armed(vma) VMA was
skipped as soon as collapse_scan_pmd() found any pte_none_or_zero() entry.
But now, max_ptes_none may be set to KHUGEPAGED_MAX_PTES_LIMIT at (1) so
that we can scan all the PTEs to build the bitmap. This means the scan of
a userfaultfd_armed(vma) VMA can continue even with pte_none_or_zero()
entries.
Then in mthp_collapse(), collapse_max_ptes_none() at (2) ignores
userfaultfd_armed(vma), so we go on to collapse a userfaultfd-armed VMA
even though it contains pte_none_or_zero() entries.
The good news is that we will still stop in __collapse_huge_page_isolate(),
where collapse_max_ptes_none() is called with the vma. But by then we have
already done a lot of work.
Not sure if I missed something.
>+
>+ if (max_ptes_none < 0)
>+ return collapsed;
>+
>+ nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>+ nr_ptes);
>+
>+ if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>+ int ret;
>+
>+ collapse_address = address + offset * PAGE_SIZE;
>+ ret = collapse_huge_page(mm, collapse_address, referenced,
>+ unmapped, cc, order);
>+ if (ret == SCAN_SUCCEED) {
>+ collapsed += nr_ptes;
>+ continue;
>+ }
>+ }
>+
>+next_order:
>+ if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>+ const u8 next_order = order - 1;
>+ const u16 mid_offset = offset + (nr_ptes / 2);
>+
>+ collapse_mthp_stack_push(cc, &stack_size, mid_offset,
>+ next_order);
>+ collapse_mthp_stack_push(cc, &stack_size, offset,
>+ next_order);
>+ }
>+ }
>+ return collapsed;
>+}
>+
> static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start_addr,
> bool *lock_dropped, struct collapse_control *cc)
> {
>- const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>+ int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>+ enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> pmd_t *pmd;
>- pte_t *pte, *_pte;
>- int none_or_zero = 0, shared = 0, referenced = 0;
>+ pte_t *pte, *_pte, pteval;
>+ int i;
>+ int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr;
>+ unsigned long enabled_orders;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
>
>@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> goto out;
> }
>
>+ bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
>+
>+ enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
Could enabled_orders be 0 at this point?
>+
>+ /*
>+ * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>+ * scan all pages to populate the bitmap for mTHP collapse.
>+ */
>+ if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>+ max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>+
> pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> if (!pte) {
> cc->progress++;
>@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> goto out;
> }
>
>- for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>- _pte++, addr += PAGE_SIZE) {
>+ for (i = 0; i < HPAGE_PMD_NR; i++) {
>+ _pte = pte + i;
>+ addr = start_addr + i * PAGE_SIZE;
>+ pteval = ptep_get(_pte);
>+
> cc->progress++;
>
>- pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> if (++none_or_zero > max_ptes_none) {
> result = SCAN_EXCEED_NONE_PTE;
>@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> }
> }
>
>+ /* Set bit for occupied pages */
>+ __set_bit(i, cc->mthp_bitmap);
> /*
> * Record which node the original page is from and save this
> * information to cc->node_load[].
>@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> if (result == SCAN_SUCCEED) {
> /* collapse_huge_page expects the lock to be dropped before calling */
> mmap_read_unlock(mm);
>- result = collapse_huge_page(mm, start_addr, referenced,
>- unmapped, cc, HPAGE_PMD_ORDER);
>+ nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
>+ cc, enabled_orders);
> /* collapse_huge_page will return with the mmap_lock released */
collapse_huge_page() returns with the mmap_lock released, but mthp_collapse()
may not?
> *lock_dropped = true;
>+ result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
>--
>2.54.0
--
Wei Yang
Help you, Help me