From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 7B9F2CD6E4A for ; Thu, 4 Jun 2026 14:19:55 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E1FEF6B0005; Thu, 4 Jun 2026 10:19:54 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id DD1046B0088; Thu, 4 Jun 2026 10:19:54 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CBF356B0092; Thu, 4 Jun 2026 10:19:54 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id B9CD56B0005 for ; Thu, 4 Jun 2026 10:19:54 -0400 (EDT) Received: from smtpin05.hostedemail.com (lb01a-stub [10.200.18.249]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 80E311A078F for ; Thu, 4 Jun 2026 14:19:54 +0000 (UTC) X-FDA: 84842439108.05.D6C02BA Received: from tor.source.kernel.org (tor.source.kernel.org [172.105.4.254]) by imf30.hostedemail.com (Postfix) with ESMTP id 41A388001B for ; Thu, 4 Jun 2026 14:19:45 +0000 (UTC) ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1780582785; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=kz+wMMmvmIne9m547qA6qN3203xltR1Hvn5aJqU4oto=; b=0Cxc/PVTEX/Q64H6oDvrnQErYYdQKsVGu9Ed62EZy4F4TX46Si0PaIYZu4Sb5h/CWvQA0o 8cTAhTGTSsNXX2rOs/ACvI51lyi3hsdokPdZkSZQF2HmUHkQ5TOgAihsIw5Ey3H6Pp88Ir tPMr6eNru+hOZhCi//Q0klqUMTX8Cyk= ARC-Authentication-Results: i=1; imf30.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20260515 header.b=REp6kxpn; spf=pass (imf30.hostedemail.com: domain of ljs@kernel.org designates 172.105.4.254 as permitted sender) smtp.mailfrom=ljs@kernel.org; dmarc=pass (policy=quarantine) header.from=kernel.org ARC-Seal: i=1; a=rsa-sha256; d=hostedemail.com; s=arc-20220608; cv=none; t=1780582785; b=W6YeljFkAvXOwjEXmD4kl8TNuzGf0FWEG5scDDwXgklnroMZoZrN8p2/vNdrbhUgXlzCfa xJ5m1cSxAJJl9BbMR94zFuX83kSHSHNHBhE5QAbyfZUgBtukVnMzCQr4+OnPVlaY5jiV5Q VpuWrnS/tNFEocaz7xwfnXIT4Ngxx5w= Received: from smtp.kernel.org (quasi.space.kernel.org [100.103.45.18]) by tor.source.kernel.org (Postfix) with ESMTP id A2C54600B0; Thu, 4 Jun 2026 14:19:44 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 39AF81F00893; Thu, 4 Jun 2026 14:19:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1780582784; bh=kz+wMMmvmIne9m547qA6qN3203xltR1Hvn5aJqU4oto=; h=Date:From:To:Cc:Subject:References:In-Reply-To; b=REp6kxpnKHKF7Ism8aaIytnDkU7xRoucVEkuISs54uR5MXWWepe3UnWL/tQoD8DXr VUXgdw29fqcarvKqqAUhAfxgU1MGck9nAUlr7fVfW2RffXNIj8D/6TuNwGVl2dWVZT Zf9Zg9gJ35eAy7WnpeM+ZzFuCbRxdCkyD9uUFSB+XcdyYCW1xWDpW4ucPPEx2eIevr cwE9PEV7FeSl/VVWYrqzuAewSpshIWfPMdPsahSI4VfMtSK5e84fhSUxzR3hT6gkfV nug6Q55EELfwUz7/oRX3xmuRJS0FUim1SCpgcOF8uEqodtUCRgz2gZxCzC0d3WxeKx 9iMrHB639LFxg== Date: Thu, 4 Jun 2026 15:19:29 +0100 From: Lorenzo Stoakes To: Nico Pache Cc: "David Hildenbrand (Arm)" , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Usama Arif , usamaarif642@gmail.com Subject: Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Message-ID: References: <20260522150009.121603-1-npache@redhat.com> <20260522150009.121603-12-npache@redhat.com> <19639b08-5bf1-4974-9635-c458d512fa38@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-HE-Tag: 1780582785-998296 X-HE-Meta: U2FsdGVkX1+aVpzdf3erKFotgbtfzETsOq2tqW+KlxsBNWkG9XWQTfmw/COwKWbFJCZE93NTYS5z96Xzfmkkm6mkaaS2bPXjUTr+N6pBTfJL9UNrTOQHkYpZDNa54wW/RaJIJJBFHlsIDCFN7i2NIjMUbRYoWX4aYBwMpR94ogxWofLMxBYRfGlrC7lBuZUrSOuHv33oqdU0OPAE7+bLbabnjVGIhStqtmC8JHVvuVCbJE9jN7gU/bT+fi1KOeI4Xlm0MqU+KOmkkOyTXfm89g/vPyYa+k0iOHBNHbFtDhx/5OqH75+KK4yvDwnvtnYdKX42T43J+Vlnj0xCr+r4K9Zwz27aXykEDKiEukMMdztHyzZTxkELJpZEI0Y2r+NApDCwNZFXDNq1ql6qVKoO5caciEhkHibecxpvWiaEcFZhFCHfEDYVjo8FP8FH7l0HlM2lw0+E/XJM2keoP9+sYmxFtedjqca/XQTOmaLe2v5oSX0vXI1qojryynxtWwRaxGdKuRsNxFs= Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Jun 04, 2026 at 03:14:59PM +0100, Lorenzo Stoakes wrote: > On Tue, Jun 02, 2026 at 11:23:35AM -0600, Nico Pache wrote: > > > > > > On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote: > > >>> > > >>> Reading this, it is unclear why exactly do we need the stack. > > >> > > >> So I looked into your items below. It seems logical, and I think it > > >> works the same way; however, your method seems slightly harder to > > >> understand due to all the edge cases and more error-prone to future > > >> changes (the stack holds implicit knowledge of the offset/order that > > >> must now be tracked in the edge cases). > > >> > > >> Given the stack is 24 bytes, I'm not sure if the extra complexity is > > >> worth saving that small amount of memory. Although we would also be > > >> getting rid of (3?) functions, so both approaches have pros and cons. > > > > > > I consider a simple forward loop over the offset ... less complexity compared to > > > a stack structure :) > > > > > >> > > >> I will implement a patch comparing your solution against mine and send > > >> it here, then we can decide which approach is better. > > > > > > Right, throw it over the fence and I'll see how to improve it further. > > > > Ok heres what the diff looks like on top of my V19. > > > > you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review. > > > > So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow. > > > > > > From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001 > > From: Nico Pache > > Date: Tue, 2 Jun 2026 10:26:18 -0600 > > Subject: [PATCH] convert from stack to forward loop > > > > Signed-off-by: Nico Pache > > --- > > mm/khugepaged.c | 96 ++++++++----------------------------------------- > > 1 file changed, 15 insertions(+), 81 deletions(-) > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index 498eba009751..6de935e76ceb 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > > static struct kmem_cache *mm_slot_cache __ro_after_init; > > > > #define KHUGEPAGED_MIN_MTHP_ORDER 2 > > -/* > > - * mthp_collapse() does an iterative DFS over a binary tree, from > > - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack > > - * size needed for a DFS on a binary tree is height + 1, where > > - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER. > > - * > > - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures > > - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time. > > - */ > > -#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1) > > - > > -/* > > - * Defines a range of PTE entries in a PTE page table which are being > > - * considered for mTHP collapse. > > - * > > - * @offset: the offset of the first PTE entry in a PMD range. > > - * @order: the order of the PTE entries being considered for collapse. > > - */ > > -struct mthp_range { > > - u16 offset; > > - u8 order; > > -}; > > > > struct collapse_control { > > bool is_khugepaged; > > @@ -137,7 +115,6 @@ struct collapse_control { > > > > /* Each bit represents a single occupied (!none/zero) page. */ > > DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE); > > - struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE]; > > }; > > > > /** > > @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s > > return result; > > } > > > > -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size, > > - u16 offset, u8 order) > > -{ > > - const int size = *stack_size; > > - struct mthp_range *stack = &cc->mthp_bitmap_stack[size]; > > - > > - VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE); > > - stack->order = order; > > - stack->offset = offset; > > - (*stack_size)++; > > -} > > - > > -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc, > > - int *stack_size) > > -{ > > - const int size = *stack_size; > > - > > - VM_WARN_ON_ONCE(size <= 0); > > - (*stack_size)--; > > - return cc->mthp_bitmap_stack[size - 1]; > > -} > > - > > /* > > * mthp_collapse() consumes the bitmap that is generated during > > * collapse_scan_pmd() to determine what regions and mTHP orders fit best. > > * > > * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero) > > - * page. A stack structure cc->mthp_bitmap_stack is used to check different > > - * regions of the bitmap for collapse eligibility. The stack maintains a pair > > - * of variables (offset, order), indicating the number of PTEs from the start > > - * of the PMD, and the order of the potential collapse candidate respectively. > > - * We start at the PMD order and check if it is eligible for collapse; if not, > > - * we add two entries to the stack at a lower order to represent the left and > > - * right halves of the PTE page table we are examining. > > - * > > - * offset mid_offset > > - * | | > > - * | | > > - * v v > > - * -------------------------------------- > > - * | cc->mthp_present_ptes | > > - * -------------------------------------- > > - * <-------><-------> > > - * order-1 order-1 > > + * page. We start at the PMD order and check if it is eligible for collapse; > > + * if not, we check the left and right halves of the PTE page table we are > > + * examining at a lower order. > > Yeah this is not good enough sorry, before there was some kind of explanation of > the algortihm, just because you can explain the _code_ more simply, that's not > very useful. > > I had to sit down and spend quite a bit of time to figure out how the actual > output looks so I think that should be explained. > > > * > > * For each of these, we determine how many PTE entries are occupied in the > > * range of PTE entries we propose to collapse, then we compare this to a > > @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm, > > { > > unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none; > > enum scan_result last_result = SCAN_FAIL; > > - int collapsed = 0, stack_size = 0; > > + int collapsed = 0; > > bool alloc_failed = false; > > unsigned long collapse_address; > > - struct mthp_range range; > > - u16 offset; > > - u8 order; > > + unsigned int offset = 0; > > + unsigned int order = HPAGE_PMD_ORDER; > > > > - collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER); > > > > - while (stack_size) { > > - range = collapse_mthp_stack_pop(cc, &stack_size); > > - order = range.order; > > - offset = range.offset; > > + while (offset < HPAGE_PMD_NR) { > > nr_ptes = 1UL << order; > > > > if (!test_bit(order, &enabled_orders)) > > goto next_order; > > > > max_ptes_none = collapse_max_ptes_none(cc, NULL, order); > > - > > nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset, > > offset + nr_ptes); > > > > @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm, > > collapsed += nr_ptes; > > fallthrough; > > case SCAN_PTE_MAPPED_HUGEPAGE: > > - continue; > > + goto next_offset; > > /* Cases where lower orders might still succeed */ > > case SCAN_ALLOC_HUGE_PAGE_FAIL: > > alloc_failed = true; > > @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm, > > } > > > > This obviously needs some comments describing what you're doing here. I think > David said so too. > > > next_order: > > - if ((BIT(order) - 1) & enabled_orders) { > > - const u8 next_order = order - 1; > > - const u16 mid_offset = offset + (nr_ptes / 2); > > - > > - collapse_mthp_stack_push(cc, &stack_size, mid_offset, > > - next_order); > > - collapse_mthp_stack_push(cc, &stack_size, offset, > > - next_order); > > + if (order > KHUGEPAGED_MIN_MTHP_ORDER && > > + (BIT(order) - 1) & enabled_orders) { > > Wait, if I disable an order this changes the way we get mTHP doesn't it? > > Let's say I disable order-4 but retain order-3 and order-2 for offset 0 we get: > > 9->8->7->6->5->5->6->5->5->7 > > And we simply can't get order-3 no? > > This seems broken doesn't it? Maybe I'm missing something? OK it's the way this is written, very confusing. I do not know why you are writing this code in this 'compressed' way. (1 << order) - 1) & enabled_orders is to see if there's _any others_ to check. > > > > + order = order - 1; > > order--? > > > + continue; > > } > > +next_offset: > > + offset += nr_ptes; > > + order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER); > > Also wouldn't, in the case where an enabled order check above skips an order--, > we could have offset=0 here and end up just looping around checking from (0, > HPAGE_PMD_ORDER) again? That also seems broken? Yeah sorry the offset += nr_ptes fixes that anyway. And the fact it's a mask check above makes this OK. So the logic seems probably fine but it needs to be clearer. > > Also, what's __ffs(0)? Isn't it undefined? We shouldn't be relying on undefined > behaviour no? > > https://elixir.bootlin.com/linux/v7.0.10/source/include/asm-generic/bitops/builtin-__ffs.h#L5 > Says as much? > > I guess we're assuming we're not going to get to 0 here, but that could do with > a comment or a VM_WARN_ON_ONCE() at least. > > Also why aren't we making this a function? > > static inline unsigned int max_order_from_offset(unsigned int offset) > { > if (!offset) > return HPAGE_PMD_ORDER; > > return __ffs(offset); > } > > Though __ffs() works on unsigned long... probably... OK? > > > } > > done: > > if (collapsed) > > -- > > 2.54.0 > > > > > > > > > > > > [...] > > > > > >>>> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE); > > >>>> memset(cc->node_load, 0, sizeof(cc->node_load)); > > >>>> nodes_clear(cc->alloc_nmask); > > >>>> + > > >>>> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags); > > >>>> + > > >>>> + /* > > >>>> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise > > >>>> + * scan all pages to populate the bitmap for mTHP collapse. > > >>>> + */ > > >>> > > >>> You should note here, that we re-verify in mthp_collapse(). > > >>> > > >>> But the question is, whether we should relocate the check completely into > > >>> mthp_collapse(), instead of conditionally duplicating it. > > >>> > > >>> What speaks against always populating the bitmap and making the decision in > > >>> mthp_collapse()? > > >>> > > >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and > > >>> I am not sure if scanning some more page table entries is really that critical here. > > >> > > >> Someone asked me to preserve the legacy behavior (PMD only). Although > > >> rather trivial, if you set max_ptes_none=0 for example, we'd still > > >> have to do 511 iterations for no reason if PMD collapse is the only > > >> enabled order rather than bailing immediately. > > >> > > >> I'm ok with dropping it, but I think its the correct approach (despite > > >> the extra complexity). @Usama Arif brought up this point here > > >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/ > > > > > > We talk about regressions, but I am not sure if we care about scanning speed > > > within a page table that much? > > > > > > After all, we locked it and already read some entries. > > > > > > Having the same check at two places to optimize for PMD order might right now > > > feel like a good optimization, but likely an irrelevant one in a near future? > > > > > > Anyhow, won't push back, as long as we document why we are special casing things > > > here. > > > > > > > Thanks, Lorenzo