All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lorenzo Stoakes <ljs@kernel.org>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Nico Pache <npache@redhat.com>,
	linux-doc@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	 aarcange@redhat.com, akpm@linux-foundation.org,
	anshuman.khandual@arm.com,  apopple@nvidia.com,
	baohua@kernel.org, baolin.wang@linux.alibaba.com,
	 byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org,
	corbet@lwn.net,  dave.hansen@linux.intel.com, dev.jain@arm.com,
	gourry@gourry.net, hannes@cmpxchg.org,  hughd@google.com,
	jack@suse.cz, jackmanb@google.com, jannh@google.com,
	 jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org,
	lance.yang@linux.dev,  liam@infradead.org,
	mathieu.desnoyers@efficios.com, matthew.brost@intel.com,
	 mhiramat@kernel.org, mhocko@suse.com, peterx@redhat.com,
	pfalcato@suse.de,  rakie.kim@sk.com, raquini@redhat.com,
	rdunlap@infradead.org,  richard.weiyang@gmail.com,
	rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org,
	 ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com,
	surenb@google.com,  thomas.hellstrom@linux.intel.com,
	tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz,
	 vishal.moola@gmail.com, wangkefeng.wang@huawei.com,
	will@kernel.org, willy@infradead.org,
	 yang@os.amperecomputing.com, ying.huang@linux.alibaba.com,
	ziy@nvidia.com, zokeefe@google.com
Subject: Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
Date: Thu, 4 Jun 2026 14:53:25 +0100	[thread overview]
Message-ID: <aiF25dvH4qd_C4aj@lucifer> (raw)
In-Reply-To: <b8380eb3-096a-49f1-9ace-99c1e75888b4@kernel.org>

(Checking the algorithm here)

On Mon, Jun 01, 2026 at 10:11:24AM +0200, David Hildenbrand (Arm) wrote:
> On 5/22/26 17:00, Nico Pache wrote:
>
> Finally time for the core piece :)
>
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > phase (mmap_write_lock).
> >
> > To enabled mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack structure
> > is used instead of traditional recursion to manage the search. This also
> > prevents a traditional recursive approach when the kernel stack struct is
> > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > find the highest order mTHPs that satisfy the collapse criteria. We start
> > by attempting the PMD order, then moved on the consecutively lower orders
> > (mTHP collapse). The stack maintains a pair of variables (offset, order),

This is inaccurate, it's only consecutively smaller until you hit smallest then
it starts bumping around 2 -> 3 -> 2 -> 3 -> 2 -> .. -> 4 -> 3 -> 2 -> 3 -> 2 -> 4 -> etc.

More like consecutively smaller, then always trying for the smallest possible
fit?

Would be good to describe why we do this, presumably to get a best _fit_?

> > indicating the number of PTEs from the start of the PMD, and the order of
> > the potential collapse candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> >     1) push (0, HPAGE_PMD_ORDER) onto the stack
> >     2) pop the stack
> >     3) check if the number of set bits in that (offset,order) pair
> >        statisfy the max_ptes_none threshold for that order
> >     4) if yes, attempt collapse
> >     5) if no (or collapse fails), push two new stack items representing
> >        the left and right halves of the current bitmap range, at the
> >        next lower order

I notice the ordering is wrong here, you actualy push the mid_offset first then
the offset (e.g. 'right', then 'left'):

			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
						 next_order);
			collapse_mthp_stack_push(cc, &stack_size, offset,
						 next_order);

So that way you are popping the 'left' first then the 'right'.

So seems you'll get:

stack={0, 9}

Pop (0, order=9):

	|----------------------------------------|
	|########################################|
	|----------------------------------------|

stack={256, 8}, {0, 8}

Pop (0, order=8):

	|--------------------|-------------------|
	|####################|                   |
	|--------------------|-------------------|


stack={256, 8}, {128, 7}, {0, 7}

Pop (0, order=7):

	|----------|-----------------------------|
	|##########|                             |
	|----------|-----------------------------|

stack={256, 8}, {128, 7}, {64, 6}, {0, 6}

Pop (0, order=6):

	|----|-----------------------------------|
	|####|                                   |
	|----|-----------------------------------|

...

stack={256, 8}, ..., { 8, 3 }, {0, 2}

Pop (0, order=2):

	|-|--------------------------------------|
	|#|                                      |
	|-|--------------------------------------|

Then finally :) we get the offsets :)

stack={256, 8}, ..., {8, 3}, {4, 2}

Pop (4, order=2):

	|-|-|------------------------------------|
	| |#|                                    |
	|-|-|------------------------------------|

stack={256, 8}, ..., { 12, 2 }, {8, 3}

Pop (8, order=3):

	|---|--|---------------------------------|
	|   |##|                                 |
	|---|--|---------------------------------|

stack={256, 8}, ..., { 12, 2 }, {12, 2}, {8, 2}

Pop (8, order=2):

	|---|-|----------------------------------|
	|   |#|                                  |
	|---|-|----------------------------------|

etc.


It seems to me that you're going to keep iterating down until you match an mTHP
when a larger mTHP could have been had?

So we're going:

order 9 -> 8 -> 7 -> 6 -> ... -> 2 -> 3 -> 2 -> 4 -> 3 -> 2

I guess the point is to avoid only getting the largest possible




I guess if we did try to get the largest then we'd only get 2 of the largest
possible then exhaust the whole PMD, should a PMD-sized entry not be possble.

> >     6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> >                             offset   mid_offset
> >                             |        |
> >                             |        |
> >                             v        v
> >           ____________________________________
> >          |          PTE Page Table            |
> >          --------------------------------------
> > 			    <-------><------->
> >                              order-1  order-1
>
>
> Reading this, it is unclear why exactly do we need the stack.
>
> Why can't you work with offset + cur_order?
>
> Initially,
>
> 	offset = 0;
> 	cur_order = HPAGE_PMD_ORDER;
>
> If collapse succeeded, advance to next range.
> If collapse failed, try next smaller order, keeping offset unchanged.
>
> 	if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
> 		/* Try next smaller order. */
> 		cur_order = cur_order - 1;

OK this matches the stack for the 0 offset entries...

> 	} else {
> 		/* Skip to next chunk. */
> 		offset += 1 << cur_order;
> 		cur_order = max_order_from_offset(offset);

Then 1 << 2 -> 4 so go to offset=4.

max_order_from_offset(4) = 2. so (4, offset=2) same as above.

Then we'd loop back here and go to offset = 8, and max_order_from_offset(8) = 3

And, yeah this seems equivalent.

> 	}

>
> Of course, handling disabled orders. max_order_from_offset() is rather trivial
> (natural buddy order, capped at HPAGE_PMD_ORDER).

Something like?

static unsigned long max_order_from_offset(unsigned long offset)
{
	if (!offset)
	   return HPAGE_PMD_ORDER;

	return ilog2(offset);
}

>
> What's the benefit of the stack?

Yeah it seems equivalent. Good idea!

Thanks, Lorenzo

  parent reply	other threads:[~2026-06-04 13:53 UTC|newest]

Thread overview: 144+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-22 14:59 [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
2026-05-22 14:59 ` [PATCH mm-unstable v18 01/14] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2026-05-22 14:59 ` [PATCH mm-unstable v18 02/14] mm/khugepaged: generalize alloc_charge_folio() Nico Pache
2026-05-22 14:59 ` [PATCH mm-unstable v18 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions Nico Pache
2026-05-22 21:16   ` David Hildenbrand (Arm)
2026-06-01 13:26   ` Lorenzo Stoakes
2026-06-05 16:04   ` Zi Yan
2026-05-22 14:59 ` [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2026-05-22 21:24   ` David Hildenbrand (Arm)
2026-05-26 14:39   ` Nico Pache
2026-06-01 14:04   ` Lorenzo Stoakes
2026-05-22 15:00 ` [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped Nico Pache
2026-06-01 14:07   ` Lorenzo Stoakes
2026-06-02 10:26     ` Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
2026-05-22 21:47   ` David Hildenbrand (Arm)
2026-05-26 14:42   ` Nico Pache
2026-05-31  9:39   ` Lance Yang
2026-05-31 20:00     ` David Hildenbrand (Arm)
2026-06-01  3:28       ` Lance Yang
2026-06-01  6:54         ` David Hildenbrand (Arm)
2026-06-01  7:49           ` Lance Yang
2026-06-01  8:15             ` David Hildenbrand (Arm)
2026-06-01  8:44               ` Lance Yang
2026-06-01 10:09                 ` David Hildenbrand (Arm)
2026-06-01  9:08           ` Lance Yang
2026-06-01 10:23             ` David Hildenbrand (Arm)
2026-06-01 10:47               ` Lance Yang
2026-06-01 11:13                 ` David Hildenbrand (Arm)
2026-06-01 15:00                   ` Nico Pache
2026-06-01 15:05                     ` David Hildenbrand (Arm)
2026-06-01 16:07                       ` Lance Yang
2026-06-04 17:04                     ` Nico Pache
2026-06-04 18:12                       ` Lorenzo Stoakes
2026-06-05  7:18                       ` David Hildenbrand (Arm)
2026-06-05  8:07                         ` Lorenzo Stoakes
2026-06-05  8:59                           ` Lance Yang
2026-06-02 15:30                 ` Nico Pache
2026-06-02 16:34                   ` Lance Yang
2026-06-04 12:33           ` Lorenzo Stoakes
2026-06-04 10:21   ` Lorenzo Stoakes
2026-06-04 10:32     ` Nico Pache
2026-06-04 11:38   ` Lorenzo Stoakes
2026-06-04 12:39     ` Lorenzo Stoakes
2026-06-04 12:45       ` Nico Pache
2026-06-04 12:55         ` Lorenzo Stoakes
2026-06-04 16:28           ` Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2026-05-22 21:51   ` David Hildenbrand (Arm)
2026-05-22 15:00 ` [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics Nico Pache
2026-05-31 20:09   ` David Hildenbrand (Arm)
2026-06-01 14:13   ` Lorenzo Stoakes
2026-05-22 15:00 ` [PATCH mm-unstable v18 09/14] mm/khugepaged: improve tracepoints for mTHP orders Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function Nico Pache
2026-05-31 20:18   ` David Hildenbrand (Arm)
2026-06-01 14:35     ` Lorenzo Stoakes
2026-06-01 14:40       ` David Hildenbrand (Arm)
2026-05-22 15:00 ` [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Nico Pache
2026-05-25 14:15   ` Nico Pache
2026-05-25 19:10     ` Andrew Morton
2026-05-26  6:57       ` Wei Yang
2026-05-26 12:07         ` Nico Pache
2026-05-28  8:42           ` Wei Yang
2026-05-28 17:11             ` Nico Pache
2026-05-31  7:18   ` Lance Yang
2026-05-31  8:48     ` Lance Yang
2026-06-01 12:01       ` Nico Pache
2026-06-01 12:06         ` David Hildenbrand (Arm)
2026-06-02 10:58     ` Nico Pache
2026-06-02 15:44       ` Lance Yang
2026-06-03  8:05         ` David Hildenbrand (Arm)
2026-06-04 14:40           ` Lorenzo Stoakes
2026-06-01  8:11   ` David Hildenbrand (Arm)
2026-06-01 12:40     ` Nico Pache
2026-06-01 13:15       ` David Hildenbrand (Arm)
2026-06-02 17:23         ` Nico Pache
2026-06-02 17:26           ` Nico Pache
2026-06-03  9:55           ` David Hildenbrand (Arm)
2026-06-03 10:00           ` David Hildenbrand (Arm)
2026-06-03 12:16             ` Nico Pache
2026-06-03 12:27               ` David Hildenbrand (Arm)
2026-06-04 14:14           ` Lorenzo Stoakes
2026-06-04 14:19             ` Lorenzo Stoakes
2026-06-04 13:53     ` Lorenzo Stoakes [this message]
2026-06-04 13:59       ` Lorenzo Stoakes
2026-06-04 14:45   ` Lorenzo Stoakes
2026-06-05 11:07     ` Nico Pache
2026-06-05 11:08       ` Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2026-05-31  7:31   ` Lance Yang
2026-05-31 20:02     ` David Hildenbrand (Arm)
2026-06-01  1:53       ` Lance Yang
2026-05-22 15:00 ` [PATCH mm-unstable v18 13/14] mm/khugepaged: run khugepaged for all orders Nico Pache
2026-05-22 15:00 ` [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2026-05-22 21:58   ` David Hildenbrand (Arm)
2026-05-26 12:00     ` Nico Pache
2026-05-26 14:45   ` Nico Pache
2026-05-22 15:07 ` [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support Nico Pache
2026-05-22 15:13   ` Vlastimil Babka (SUSE)
2026-05-22 16:11     ` Nico Pache
2026-05-22 21:13       ` David Hildenbrand (Arm)
2026-05-26  8:33         ` Process (was Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP) " Lorenzo Stoakes
2026-05-26 19:09           ` Andrew Morton
2026-05-26 20:42           ` Vlastimil Babka (SUSE)
2026-05-31 19:49             ` David Hildenbrand (Arm)
2026-06-01 15:41               ` Lorenzo Stoakes
2026-06-01 15:45                 ` David Hildenbrand (Arm)
2026-06-01 16:16                   ` Lorenzo Stoakes
2026-06-02 11:20                     ` David Hildenbrand (Arm)
2026-06-02 11:31                       ` David Hildenbrand (Arm)
2026-06-02 12:47                         ` Lorenzo Stoakes
2026-06-02 12:55                         ` Vlastimil Babka (SUSE)
2026-06-02 13:01                           ` David Hildenbrand (Arm)
2026-06-02 17:31                           ` Mike Rapoport
2026-06-03  6:48                             ` Lorenzo Stoakes
2026-06-03  8:39                               ` Mike Rapoport
2026-06-03  9:57                                 ` Mark Brown
2026-06-03 10:51                                   ` Mike Rapoport
2026-06-03  9:03                               ` Mark Brown
2026-06-02 12:40                       ` Lorenzo Stoakes
2026-06-02 12:49                         ` David Hildenbrand (Arm)
2026-06-02 12:47                       ` Vlastimil Babka (SUSE)
2026-06-02 12:58                         ` David Hildenbrand (Arm)
2026-06-02 13:08                           ` Vlastimil Babka (SUSE)
2026-06-02 13:16                             ` David Hildenbrand (Arm)
2026-06-03  1:48                               ` SeongJae Park
2026-06-05 15:24                                 ` David Hildenbrand (Arm)
2026-06-01 15:37             ` Lorenzo Stoakes
2026-06-01 15:43               ` David Hildenbrand (Arm)
2026-06-01 15:47                 ` Lorenzo Stoakes
2026-06-01 16:00                   ` David Hildenbrand (Arm)
2026-05-22 15:16   ` [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP " Lorenzo Stoakes
2026-05-22 16:08     ` Nico Pache
2026-05-22 16:19       ` Lorenzo Stoakes
2026-05-22 16:31         ` Nico Pache
2026-05-22 17:12           ` Lorenzo Stoakes
2026-05-26  8:14             ` Lorenzo Stoakes
2026-05-22 15:13 ` Lorenzo Stoakes
2026-05-22 20:47 ` Andrew Morton
2026-06-01 15:58   ` Alexander Gordeev
2026-06-01 17:05     ` Nico Pache
2026-06-01 17:08     ` Lorenzo Stoakes
2026-06-02  1:53       ` Lance Yang
2026-06-04 10:10 ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aiF25dvH4qd_C4aj@lucifer \
    --to=ljs@kernel.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=byungchul@sk.com \
    --cc=catalin.marinas@arm.com \
    --cc=cl@gentwo.org \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gourry@gourry.net \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jackmanb@google.com \
    --cc=jannh@google.com \
    --cc=jglisse@google.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kas@kernel.org \
    --cc=lance.yang@linux.dev \
    --cc=liam@infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=matthew.brost@intel.com \
    --cc=mhiramat@kernel.org \
    --cc=mhocko@suse.com \
    --cc=npache@redhat.com \
    --cc=peterx@redhat.com \
    --cc=pfalcato@suse.de \
    --cc=rakie.kim@sk.com \
    --cc=raquini@redhat.com \
    --cc=rdunlap@infradead.org \
    --cc=richard.weiyang@gmail.com \
    --cc=rientjes@google.com \
    --cc=rostedt@goodmis.org \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shivankg@amd.com \
    --cc=sunnanyong@huawei.com \
    --cc=surenb@google.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=tiwai@suse.de \
    --cc=usamaarif642@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=vishal.moola@gmail.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yang@os.amperecomputing.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=ziy@nvidia.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.