Message-ID: <83a66442-b7c7-42e7-af4e-fd211d8ed6f8@linux.alibaba.com>
Date: Fri, 2 May 2025 09:23:42 +0800
Subject: Re: [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Nico Pache <npache@redhat.com>
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-trace-kernel@vger.kernel.org, akpm@linux-foundation.org, corbet@lwn.net,
 rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
 david@redhat.com, baohua@kernel.org, ryan.roberts@arm.com, willy@infradead.org,
 peterx@redhat.com, ziy@nvidia.com, wangkefeng.wang@huawei.com,
 usamaarif642@gmail.com, sunnanyong@huawei.com, vishal.moola@gmail.com,
 thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com,
 kirill.shutemov@linux.intel.com, aarcange@redhat.com, raquini@redhat.com,
 dev.jain@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
 tiwai@suse.de, will@kernel.org, dave.hansen@linux.intel.com, jack@suse.cz,
 cl@gentwo.org, jglisse@google.com, surenb@google.com, zokeefe@google.com,
 hannes@cmpxchg.org, rientjes@google.com, mhocko@suse.com,
 rdunlap@infradead.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com
References: <20250428181218.85925-1-npache@redhat.com>
 <20250428181218.85925-7-npache@redhat.com>
 <5feb1d57-e069-4469-9751-af4fb067e858@linux.alibaba.com>

On 2025/5/2 07:03, Nico Pache wrote:
> On Wed, Apr 30, 2025 at 12:56 PM Nico Pache <npache@redhat.com> wrote:
>>
>> On Wed, Apr 30, 2025 at 4:08 AM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>> On 2025/4/29 02:12, Nico Pache wrote:
>>>> khugepaged scans anon PMD ranges for potential collapse to a hugepage.
>>>> To add mTHP support we use this scan to instead record chunks of
>>>> utilized sections of the PMD.
>>>>
>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>>>> that represents chunks of utilized regions. We can then determine what
>>>> mTHP size fits best, and in the following patch, we set this bitmap
>>>> while scanning the anon PMD.
>>>>
>>>> max_ptes_none is used as a scale to determine how "full" an order must
>>>> be before being considered for collapse.
>>>>
>>>> When attempting to collapse an order that is set to "always", always
>>>> collapse to that order in a greedy manner, without considering the
>>>> number of bits set.
>>>>
>>>> Signed-off-by: Nico Pache
>>>> ---
>>>>   include/linux/khugepaged.h |  4 ++
>>>>   mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
>>>>   2 files changed, 89 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>>>> index 1f46046080f5..18fe6eb5051d 100644
>>>> --- a/include/linux/khugepaged.h
>>>> +++ b/include/linux/khugepaged.h
>>>> @@ -1,6 +1,10 @@
>>>>   /* SPDX-License-Identifier: GPL-2.0 */
>>>>   #ifndef _LINUX_KHUGEPAGED_H
>>>>   #define _LINUX_KHUGEPAGED_H
>>>> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
>>>
>>> It would still be better to add a comment explaining explicitly why 2
>>> was chosen as the MIN_MTHP_ORDER.
>>
>> Ok, I'll add a note explicitly stating that the minimum order of anon
>> mTHPs is 2.
>>>
>>>> +#define KHUGEPAGED_MIN_MTHP_NR (1 << KHUGEPAGED_MIN_MTHP_ORDER)
>>>> +#define MAX_MTHP_BITMAP_SIZE (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
>>>> +#define MTHP_BITMAP_SIZE (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
>>>>
>>>>   extern unsigned int khugepaged_max_ptes_none __read_mostly;
>>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index e21998a06253..6e67db86409a 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>>>
>>>>   static struct kmem_cache *mm_slot_cache __ro_after_init;
>>>>
>>>> +struct scan_bit_state {
>>>> +	u8 order;
>>>> +	u16 offset;
>>>> +};
>>>> +
>>>>   struct collapse_control {
>>>>   	bool is_khugepaged;
>>>>
>>>> @@ -102,6 +107,18 @@ struct collapse_control {
>>>>
>>>>   	/* nodemask for allocation fallback */
>>>>   	nodemask_t alloc_nmask;
>>>> +
>>>> +	/*
>>>> +	 * bitmap used to collapse mTHP sizes.
>>>> +	 * 1 bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
>>>> +	 */
>>>> +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
>>>> +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
>>>> +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
>>>> +};
>>>> +
>>>> +struct collapse_control khugepaged_collapse_control = {
>>>> +	.is_khugepaged = true,
>>>> +};
>>>>
>>>>   /**
>>>> @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
>>>>   	remove_wait_queue(&khugepaged_wait, &wait);
>>>>   }
>>>>
>>>> -struct collapse_control khugepaged_collapse_control = {
>>>> -	.is_khugepaged = true,
>>>> -};
>>>> -
>>>>   static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>>>>   {
>>>>   	int i;
>>>> @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>>>
>>>>   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>   			      int referenced, int unmapped,
>>>> -			      struct collapse_control *cc)
>>>> +			      struct collapse_control *cc, bool *mmap_locked,
>>>> +			      u8 order, u16 offset)
>>>>   {
>>>>   	LIST_HEAD(compound_pagelist);
>>>>   	pmd_t *pmd, _pmd;
>>>> @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>   	 * The allocation can take potentially a long time if it involves
>>>>   	 * sync compaction, and we do not need to hold the mmap_lock during
>>>>   	 * that. We will recheck the vma after taking it again in write mode.
>>>> +	 * If collapsing mTHPs we may have already released the read_lock.
>>>>   	 */
>>>> -	mmap_read_unlock(mm);
>>>> +	if (*mmap_locked) {
>>>> +		mmap_read_unlock(mm);
>>>> +		*mmap_locked = false;
>>>> +	}
>>>>
>>>>   	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>>>   	if (result != SCAN_SUCCEED)
>>>> @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>   out_up_write:
>>>>   	mmap_write_unlock(mm);
>>>>   out_nolock:
>>>> +	*mmap_locked = false;
>>>>   	if (folio)
>>>>   		folio_put(folio);
>>>>   	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>>>>   	return result;
>>>>   }
>>>>
>>>> +// Recursive function to consume the bitmap
>>>
>>> Nit: please use '/* Xxxx */' for comments in this patch.
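
(An aside before the scan function quoted below, for anyone checking the
math: assuming 4K base pages, so HPAGE_PMD_ORDER == 9 and HPAGE_PMD_NR ==
512, the bitmap geometry and the max_ptes_none scaling work out as in this
small standalone sketch. The numbers are mine, not part of the patch.)

#include <stdio.h>

/* Standalone sketch of the patch's bitmap geometry, assuming 4K base
 * pages (512 PTEs per PMD range); user-space demo, not kernel code. */
#define HPAGE_PMD_ORDER			9
#define HPAGE_PMD_NR			(1 << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MIN_MTHP_ORDER	2
#define KHUGEPAGED_MIN_MTHP_NR		(1 << KHUGEPAGED_MIN_MTHP_ORDER)
#define MTHP_BITMAP_SIZE		(1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))

int main(void)
{
	int max_ptes_none = 0;	/* also try 511, the khugepaged default */
	int order, state_order, num_chunks, threshold_bits;

	/* One bit covers one order-2 (4-PTE, 16K) chunk: 128 bits per PMD. */
	printf("PTEs per bit: %d, bits per PMD range: %d\n",
	       KHUGEPAGED_MIN_MTHP_NR, MTHP_BITMAP_SIZE);

	/* The "almost full" threshold the scan applies per candidate order. */
	for (order = KHUGEPAGED_MIN_MTHP_ORDER; order <= HPAGE_PMD_ORDER; order++) {
		state_order = order - KHUGEPAGED_MIN_MTHP_ORDER;
		num_chunks = 1 << state_order;
		threshold_bits = (HPAGE_PMD_NR - max_ptes_none - 1)
				 >> (HPAGE_PMD_ORDER - state_order);
		printf("order %d: collapse when more than %d of %d bits set\n",
		       order, threshold_bits, num_chunks);
	}
	return 0;
}

With the default max_ptes_none of 511 the numerator is zero, so
threshold_bits is 0 at every order and a single utilized chunk qualifies;
with max_ptes_none of 0, an order only qualifies once every chunk it
covers is utilized (e.g. all 128 bits at PMD order).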
>>>
>>>> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
>>>> +			int referenced, int unmapped, struct collapse_control *cc,
>>>> +			bool *mmap_locked, unsigned long enabled_orders)
>>>> +{
>>>> +	u8 order, next_order;
>>>> +	u16 offset, mid_offset;
>>>> +	int num_chunks;
>>>> +	int bits_set, threshold_bits;
>>>> +	int top = -1;
>>>> +	int collapsed = 0;
>>>> +	int ret;
>>>> +	struct scan_bit_state state;
>>>> +	bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
>>>> +
>>>> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
>>>> +		{ HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
>>>> +
>>>> +	while (top >= 0) {
>>>> +		state = cc->mthp_bitmap_stack[top--];
>>>> +		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
>>>> +		offset = state.offset;
>>>> +		num_chunks = 1 << (state.order);
>>>> +		// Skip mTHP orders that are not enabled
>>>> +		if (!test_bit(order, &enabled_orders))
>>>> +			goto next;
>>>> +
>>>> +		// copy the relevant section to a new bitmap
>>>> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
>>>> +				   MTHP_BITMAP_SIZE);
>>>> +
>>>> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
>>>> +		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
>>>> +				>> (HPAGE_PMD_ORDER - state.order);
>>>> +
>>>> +		// Check if the region is "almost full" based on the threshold
>>>> +		if (bits_set > threshold_bits || is_pmd_only
>>>> +			|| test_bit(order, &huge_anon_orders_always)) {
>>>
>>> When testing this patch, I disabled PMD-sized THP and enabled 64K-sized
>>> mTHP, but it still attempts to collapse into a PMD-sized THP (since
>>> bits_set > threshold_bits is true). This doesn't seem reasonable?
>>
>> We are still required to have PMD enabled for mTHP collapse to work.
>> It's a limitation of the current khugepaged code (it currently only
>> adds mm_slots when PMD is enabled).
>> We've discussed this in the past and are looking for a proper way
>> forward, but the solution becomes tricky.
>>
>> However, I'm surprised that it still collapses, given the code below.
>> I'll test this out later today.
>
> Following up: you are correct. If I disable the PMD size within the
> mTHP enabled settings (echo never >
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled), it still
> collapses to PMDs. I believe the global variable takes precedence. I'm
> not sure what the correct behavior is... I will look into it further.
>
>> +		if (!test_bit(order, &enabled_orders))
>> +			goto next;

IMO, we should respect the mTHP sysfs control interfaces and use the
'TVA_ENFORCE_SYSFS' flag when determining the allowable orders through
thp_vma_allowable_orders().
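
Concretely, I mean something along these lines when computing
enabled_orders (a sketch only: thp_vma_allowable_orders(),
TVA_ENFORCE_SYSFS and THP_ORDERS_ALL_ANON already exist in
<linux/huge_mm.h>, but the helper name and call site here are
illustrative, not the patch's code):

/*
 * Sketch: derive the orders khugepaged may collapse to from the
 * per-size sysfs knobs, rather than keying everything off the
 * PMD-size global setting.
 */
static unsigned long khugepaged_enabled_orders(struct vm_area_struct *vma)
{
	/*
	 * With TVA_ENFORCE_SYSFS set, any order disabled via
	 * /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
	 * is cleared from the returned mask, so khugepaged_scan_bitmap()
	 * would never test it.
	 */
	return thp_vma_allowable_orders(vma, vma->vm_flags,
					TVA_ENFORCE_SYSFS,
					THP_ORDERS_ALL_ANON);
}

That would also make the hugepages-2048kB/enabled=never case above behave
as expected, since the PMD order would simply never appear in
enabled_orders.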