From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4DE8247F2C1; Thu, 4 Jun 2026 14:14:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780582500; cv=none; b=gTKM8c2Oz0XjZL2QUUvZVuMZKEICb8oa+oUWH5mEkIYcC5PWRF7LGTzbzlV4NeVnPMWl+vFizxE5vmEUvr9V1OLfX4dkQBX0WAvqSnfP+0lbN7Z9lTbZ/xoTXqiAc6ZhPll9U0gtgUrM8qW4Hbff2lIw23998GnvsQ9JTCGKxEA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780582500; c=relaxed/simple; bh=lOP6l4Tef/B1TGmRNJIXB67bry7V+pPAEH1W6iZjXQc=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=dAYtFNpMFB55MHg894QxsPmiZcamQ1LRkHgXuEyzf2vC//iuE49qJK3vHnz4qjfpy01cwIyONi2lCESyJRpK+a9sYBZIi8nQ4XMmMLcqte/EM7RY8uAJrOCtVEbe3GEF4hPFJq7s4eOl7Bf3uoHzsdsIt69QZor0g5xrC83KWYI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=RfkfMw8R; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="RfkfMw8R" Received: by smtp.kernel.org (Postfix) with ESMTPSA id CE9551F00893; Thu, 4 Jun 2026 14:14:46 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1780582499; bh=6TeCSwFcY1TICi/h5cWubzeJu2VOClK7cfXnsPuqg2g=; h=Date:From:To:Cc:Subject:References:In-Reply-To; b=RfkfMw8RJYbykieKNqD4yxk+UT80GwdmNIJWlMckRjjROvNZQ73OO8q2Qnipt0mU5 h+kAHecvkhkPphRCrgGEbxd/JIDRPky1zSeh9g2zMcP7GfXJjvODXgPU18/6pZlLUQ +tSxgTmy9Sz/2BatXKztyCgMq2RBY96FSra9EpTA2z9uwavAtrYFozNdkXZxTTbiAh eHbPvuk5Tg6BMpLCPzI8WbiHLRHiuc/0RmX3tH9ZNBZmgaDdjPYD0IDyBySrfrxQKP AxprmnAi1JtLGvVRXlFtpMqwKt9lcNAAld6vQhqih995H7FiSY40Gt6YTU+9H5fkb8 YAwLsayRQfxrQ== Date: Thu, 4 Jun 2026 15:14:43 +0100 From: Lorenzo Stoakes To: Nico Pache Cc: "David Hildenbrand (Arm)" , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Usama Arif , usamaarif642@gmail.com Subject: Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support Message-ID: References: <20260522150009.121603-1-npache@redhat.com> <20260522150009.121603-12-npache@redhat.com> <19639b08-5bf1-4974-9635-c458d512fa38@redhat.com> Precedence: bulk X-Mailing-List: linux-trace-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <19639b08-5bf1-4974-9635-c458d512fa38@redhat.com> On Tue, Jun 02, 2026 at 11:23:35AM -0600, Nico Pache wrote: > > > On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote: > >>> > >>> Reading this, it is unclear why exactly do we need the stack. > >> > >> So I looked into your items below. It seems logical, and I think it > >> works the same way; however, your method seems slightly harder to > >> understand due to all the edge cases and more error-prone to future > >> changes (the stack holds implicit knowledge of the offset/order that > >> must now be tracked in the edge cases). > >> > >> Given the stack is 24 bytes, I'm not sure if the extra complexity is > >> worth saving that small amount of memory. Although we would also be > >> getting rid of (3?) functions, so both approaches have pros and cons. > > > > I consider a simple forward loop over the offset ... less complexity compared to > > a stack structure :) > > > >> > >> I will implement a patch comparing your solution against mine and send > >> it here, then we can decide which approach is better. > > > > Right, throw it over the fence and I'll see how to improve it further. > > Ok heres what the diff looks like on top of my V19. > > you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review. > > So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow. > > > From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001 > From: Nico Pache > Date: Tue, 2 Jun 2026 10:26:18 -0600 > Subject: [PATCH] convert from stack to forward loop > > Signed-off-by: Nico Pache > --- > mm/khugepaged.c | 96 ++++++++----------------------------------------- > 1 file changed, 15 insertions(+), 81 deletions(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 498eba009751..6de935e76ceb 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); > static struct kmem_cache *mm_slot_cache __ro_after_init; > > #define KHUGEPAGED_MIN_MTHP_ORDER 2 > -/* > - * mthp_collapse() does an iterative DFS over a binary tree, from > - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack > - * size needed for a DFS on a binary tree is height + 1, where > - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER. > - * > - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures > - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time. > - */ > -#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1) > - > -/* > - * Defines a range of PTE entries in a PTE page table which are being > - * considered for mTHP collapse. > - * > - * @offset: the offset of the first PTE entry in a PMD range. > - * @order: the order of the PTE entries being considered for collapse. > - */ > -struct mthp_range { > - u16 offset; > - u8 order; > -}; > > struct collapse_control { > bool is_khugepaged; > @@ -137,7 +115,6 @@ struct collapse_control { > > /* Each bit represents a single occupied (!none/zero) page. */ > DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE); > - struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE]; > }; > > /** > @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s > return result; > } > > -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size, > - u16 offset, u8 order) > -{ > - const int size = *stack_size; > - struct mthp_range *stack = &cc->mthp_bitmap_stack[size]; > - > - VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE); > - stack->order = order; > - stack->offset = offset; > - (*stack_size)++; > -} > - > -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc, > - int *stack_size) > -{ > - const int size = *stack_size; > - > - VM_WARN_ON_ONCE(size <= 0); > - (*stack_size)--; > - return cc->mthp_bitmap_stack[size - 1]; > -} > - > /* > * mthp_collapse() consumes the bitmap that is generated during > * collapse_scan_pmd() to determine what regions and mTHP orders fit best. > * > * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero) > - * page. A stack structure cc->mthp_bitmap_stack is used to check different > - * regions of the bitmap for collapse eligibility. The stack maintains a pair > - * of variables (offset, order), indicating the number of PTEs from the start > - * of the PMD, and the order of the potential collapse candidate respectively. > - * We start at the PMD order and check if it is eligible for collapse; if not, > - * we add two entries to the stack at a lower order to represent the left and > - * right halves of the PTE page table we are examining. > - * > - * offset mid_offset > - * | | > - * | | > - * v v > - * -------------------------------------- > - * | cc->mthp_present_ptes | > - * -------------------------------------- > - * <-------><-------> > - * order-1 order-1 > + * page. We start at the PMD order and check if it is eligible for collapse; > + * if not, we check the left and right halves of the PTE page table we are > + * examining at a lower order. Yeah this is not good enough sorry, before there was some kind of explanation of the algortihm, just because you can explain the _code_ more simply, that's not very useful. I had to sit down and spend quite a bit of time to figure out how the actual output looks so I think that should be explained. > * > * For each of these, we determine how many PTE entries are occupied in the > * range of PTE entries we propose to collapse, then we compare this to a > @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm, > { > unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none; > enum scan_result last_result = SCAN_FAIL; > - int collapsed = 0, stack_size = 0; > + int collapsed = 0; > bool alloc_failed = false; > unsigned long collapse_address; > - struct mthp_range range; > - u16 offset; > - u8 order; > + unsigned int offset = 0; > + unsigned int order = HPAGE_PMD_ORDER; > > - collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER); > > - while (stack_size) { > - range = collapse_mthp_stack_pop(cc, &stack_size); > - order = range.order; > - offset = range.offset; > + while (offset < HPAGE_PMD_NR) { > nr_ptes = 1UL << order; > > if (!test_bit(order, &enabled_orders)) > goto next_order; > > max_ptes_none = collapse_max_ptes_none(cc, NULL, order); > - > nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset, > offset + nr_ptes); > > @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm, > collapsed += nr_ptes; > fallthrough; > case SCAN_PTE_MAPPED_HUGEPAGE: > - continue; > + goto next_offset; > /* Cases where lower orders might still succeed */ > case SCAN_ALLOC_HUGE_PAGE_FAIL: > alloc_failed = true; > @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm, > } > This obviously needs some comments describing what you're doing here. I think David said so too. > next_order: > - if ((BIT(order) - 1) & enabled_orders) { > - const u8 next_order = order - 1; > - const u16 mid_offset = offset + (nr_ptes / 2); > - > - collapse_mthp_stack_push(cc, &stack_size, mid_offset, > - next_order); > - collapse_mthp_stack_push(cc, &stack_size, offset, > - next_order); > + if (order > KHUGEPAGED_MIN_MTHP_ORDER && > + (BIT(order) - 1) & enabled_orders) { Wait, if I disable an order this changes the way we get mTHP doesn't it? Let's say I disable order-4 but retain order-3 and order-2 for offset 0 we get: 9->8->7->6->5->5->6->5->5->7 And we simply can't get order-3 no? This seems broken doesn't it? Maybe I'm missing something? > + order = order - 1; order--? > + continue; > } > +next_offset: > + offset += nr_ptes; > + order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER); Also wouldn't, in the case where an enabled order check above skips an order--, we could have offset=0 here and end up just looping around checking from (0, HPAGE_PMD_ORDER) again? That also seems broken? Also, what's __ffs(0)? Isn't it undefined? We shouldn't be relying on undefined behaviour no? https://elixir.bootlin.com/linux/v7.0.10/source/include/asm-generic/bitops/builtin-__ffs.h#L5 Says as much? I guess we're assuming we're not going to get to 0 here, but that could do with a comment or a VM_WARN_ON_ONCE() at least. Also why aren't we making this a function? static inline unsigned int max_order_from_offset(unsigned int offset) { if (!offset) return HPAGE_PMD_ORDER; return __ffs(offset); } Though __ffs() works on unsigned long... probably... OK? > } > done: > if (collapsed) > -- > 2.54.0 > > > > > > > [...] > > > >>>> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE); > >>>> memset(cc->node_load, 0, sizeof(cc->node_load)); > >>>> nodes_clear(cc->alloc_nmask); > >>>> + > >>>> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags); > >>>> + > >>>> + /* > >>>> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise > >>>> + * scan all pages to populate the bitmap for mTHP collapse. > >>>> + */ > >>> > >>> You should note here, that we re-verify in mthp_collapse(). > >>> > >>> But the question is, whether we should relocate the check completely into > >>> mthp_collapse(), instead of conditionally duplicating it. > >>> > >>> What speaks against always populating the bitmap and making the decision in > >>> mthp_collapse()? > >>> > >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and > >>> I am not sure if scanning some more page table entries is really that critical here. > >> > >> Someone asked me to preserve the legacy behavior (PMD only). Although > >> rather trivial, if you set max_ptes_none=0 for example, we'd still > >> have to do 511 iterations for no reason if PMD collapse is the only > >> enabled order rather than bailing immediately. > >> > >> I'm ok with dropping it, but I think its the correct approach (despite > >> the extra complexity). @Usama Arif brought up this point here > >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@gmail.com/ > > > > We talk about regressions, but I am not sure if we care about scanning speed > > within a page table that much? > > > > After all, we locked it and already read some entries. > > > > Having the same check at two places to optimize for PMD order might right now > > feel like a good optimization, but likely an irrelevant one in a near future? > > > > Anyhow, won't push back, as long as we document why we are special casing things > > here. > > > Thanks, Lorenzo