From: Dev Jain <dev.jain@arm.com>
Date: Fri, 21 Feb 2025 10:27:14 +0530
Subject: Re: [RFC v2 0/9] khugepaged: mTHP support
To: Nico Pache
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linux-mm@kvack.org, anshuman.khandual@arm.com, catalin.marinas@arm.com,
 cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz,
 srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com,
 aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com,
 ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com,
 jglisse@google.com, surenb@google.com, vishal.moola@gmail.com,
 zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
 21cnbao@gmail.com, willy@infradead.org, kirill.shutemov@linux.intel.com,
 david@redhat.com, aarcange@redhat.com, raquini@redhat.com,
 sunnanyong@huawei.com, usamaarif642@gmail.com, audra@redhat.com,
 akpm@linux-foundation.org, rostedt@goodmis.org,
 mathieu.desnoyers@efficios.com, tiwai@suse.de
References: <20250211003028.213461-1-npache@redhat.com>
 <8a37f99b-f207-4688-bc90-7f8e6900e29d@arm.com>
 <867280bf-2ba1-4e83-8e16-9d93e1c41e08@arm.com>
On 21/02/25 12:42 am, Nico Pache wrote:
> On Wed, Feb 19, 2025 at 2:01 AM Dev Jain wrote:
>>
>> On 19/02/25 4:00 am, Nico Pache wrote:
>>> On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts wrote:
>>>>
>>>> On 11/02/2025 00:30, Nico Pache wrote:
>>>>> The following series provides khugepaged and madvise collapse with the
>>>>> capability to collapse regions to mTHPs.
>>>>>
>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>> on PMD_ORDER. Then, during the PMD scan, we keep track of chunks of pages
>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>>>> on max_ptes_none is removed during the scan, to make sure we account for
>>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>>>> order to determine how full a THP must be to be eligible. If an mTHP collapse
>>>>> is attempted but the range contains swapped-out or shared pages, we don't
>>>>> perform the collapse.
>>>>>
>>>>> With the default max_ptes_none=511, the code should keep most of its
>>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>>>
>>>> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
>>>> This is making my head hurt, but I *think* I agree with you that if
>>>> max_ptes_none is less than half of the number of PTEs in a PMD, then creep
>>>> doesn't happen.
>>> Haha, yea, the compressed bitmap does not make the math super easy to
>>> follow, but I'm glad we arrived at the same conclusion :)
>>>>
>>>> To make sure I've understood:
>>>>
>>>> - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
>>>> - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
>>>> - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
>>>> - ...
>>>>
>>>> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
>>>> have 4 PTEs in a 32K area, which is insufficient to collapse to 32K.
>>>>
>>>> Sounds good to me!
>>> Great! Another easy way to think about it is: with max_ptes_none =
>>> HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
>>> for it to collapse again. Each size is 2x the last, so if we hit one
>>> collapse, it will be eligible again next round.
>>
>> Please someone correct me if I am wrong.
>>
>
> max_ptes_none = 204
> scaled_none = 204 >> (9 - 3) = ~3.1
> so 4 pages need to be available in each chunk for the bit to be set, not 5.
>
> at 204 the bitmap check is
> 512 - 1 - 204 = 307
> (PMD)   307 >> 3 = 38
> (1024K) 307 >> 4 = 19
> (512K)  307 >> 5 = 9
> (256K)  307 >> 6 = 4
>
>> Consider this: you are collapsing a 256K folio. => #PTEs = 256K/4K = 64
>> => #chunks = 64/8 = 8.
>>
>> Let the PTE state within the chunks be as follows:
>>
>> Chunk 0: < 5 filled   Chunk 1: 5 filled     Chunk 2: 5 filled     Chunk 3: 5 filled
>> Chunk 4: 5 filled     Chunk 5: < 5 filled   Chunk 6: < 5 filled   Chunk 7: < 5 filled
>>
>> Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8, round down to 204,
>> which is < HPAGE_PMD_NR/2).
>> => To collapse we need at least 60% of the PTEs filled.
>>
>> Your algorithm marks chunks in the bitmap if 60% of the chunk is filled.
>> Then, if the number of chunks set is greater than 60%, we will
>> collapse.
>>
>> Chunk 0 will be marked zero because fewer than 5 PTEs are filled =>
>> percentage filled <= 50%.
>>
>> Right now the state is
>> 0111 1000
>> where the indices are the chunk numbers.
>> Since #1s = 4 => percent filled = 4/8 * 100 = 50%, the 256K folio collapse
>> won't happen.
>>
>> For the first 4 chunks, the percent filled is 75%, so the state becomes
>> 1111 1000
>> after the 128K collapse, and now the 256K collapse will happen.
>>
>> Either I got this correct, or I do not understand the utility of
>> maintaining chunks :) What you are doing is what I am doing, except that
>> my chunk size = 1.
>
> Ignoring all the math and just going off the 0111 1000:
> we do "creep", but it's not the same type of "creep" we've been
> describing. The collapse in the first half will allow the collapse at
> order++ to happen, but it stops there and doesn't keep getting
> promoted to a PMD size. That is, unless the adjacent 256k also has some
> bits set; then it can collapse to 512k. So I guess we still can creep,
> but it's way less aggressive, and only when there is actual memory being
> utilized in the adjacent chunk, so it's not like we are creating a
> huge waste.

I get you. You will creep when the adjacent chunk has at least 1 bit set.
I don't really have a strong opinion on this one.

>
>>
>>>>
>>>>> constantly promote mTHPs to the next available size.
>>>>>
>>>>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged
>>>>> Patch 2: Refactor/rename hpage_collapse
>>>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders
>>>>> Patch 6-9: The mTHP patches
>>>>>
>>>>> ---------
>>>>> Testing
>>>>> ---------
>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>> - selftests mm
>>>>> - I created a test script that I used to push khugepaged to its limits while
>>>>>   monitoring a number of stats and tracepoints.
>>>>>   The code is available here[1]
>>>>>   (run in legacy mode for these changes and set mTHP sizes to inherit).
>>>>>   The summary from my testing was that there was no significant regression
>>>>>   noticed through this test. In some cases my changes had better collapse
>>>>>   latencies and were able to scan more pages in the same amount of time/work,
>>>>>   but for the most part the results were consistent.
>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>   (see followup post for more details).
>>>>> - some basic testing on 64k page size.
>>>>> - lots of general use. These changes have been running in my VM for some time.
>>>>>
>>>>> Changes since V1 [2]:
>>>>> - Minor bug fixes discovered during review and testing
>>>>> - Removed dynamic allocations for bitmaps, and made them stack based
>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k page size
>>>>> - Updated trace events to include collapse order info
>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale
>>>>> - No longer require a chunk to be fully utilized before setting the bit; use
>>>>>   the same max_ptes_none scaling principle to achieve this
>>>>> - Skip mTHP collapses that require swapin or shared handling. This helps prevent
>>>>>   some of the "creep" that was discovered in v1.
>>>>>
>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>>>
>>>>> Nico Pache (9):
>>>>>   introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>>     madvise_collapse
>>>>>   khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>>   khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>   khugepaged: add mTHP support
>>>>>   khugepaged: improve tracepoints for mTHP orders
>>>>>   khugepaged: skip collapsing mTHP to smaller orders
>>>>>
>>>>>  include/linux/khugepaged.h         |   4 +
>>>>>  include/trace/events/huge_memory.h |  34 ++-
>>>>>  mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>>>  3 files changed, 306 insertions(+), 154 deletions(-)