From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 01 Jul 2025 17:45:51 +0530
From: siddhartha@kenip.in
To: Dev Jain
Cc: Lorenzo Stoakes, linux-mm@kvack.org, linux-kernel@vger.kernel.org, mgorman@suse.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
References: <4990838b-660d-46a2-b21c-67adcba61ff9@lucifer.local>
 <19714cae-6b73-43ec-af7a-1455196561d1@arm.com>
 <3ee2e7fea6f263aa884e3e715632b09f@kenip.in>
 <5816677a-705e-4a8f-b598-d74ff6198a02@arm.com>
 <80b849d4-faf3-47a9-8b8c-e8053299cfb2@arm.com>
 <2e99712b-8dac-4762-9fc5-fe3ef569b65e@lucifer.local>
Message-ID: <787639a1e6a27c0f3b0e3ae658e1b8e7@kenip.in>

On 2025-07-01 12:28, Dev Jain wrote:
> On 01/07/25 12:20 pm, Lorenzo Stoakes wrote:
>> On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>>> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
>>>> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>>>>> Sorry I am not following, don't know in detail about the VMA merge stuff.
>>>>> Are you saying that after the patch, the VMAs will eventually get merged?
>>>>> Is it possible in the kernel to get a merge in the "future"; as I understand
>>>>> it only happens at mmap() time?
>>>>>
>>>>> Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
>>>>> If they are able to get merged after the patch, why won't they be merged before
>>>>> the patch, since the VMA characteristics are the same?
>>>>>
>>>> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>>>>
>>>> 0             2MB           4MB           6MB           8MB           10MB
>>>> |-------------.------|      |-------------.------|      |-------------.------|
>>>> |             .      |      |             .      |      |             .      |
>>>> |             .      |      |             .      |      |             .      |
>>>> |-------------.------|      |-------------.------|      |-------------.------|
>>>>   huge mapped  4k m'd
>>> The effort to draw this is appreciated!
>>>
>>> I understood the alignment, what I am asking is this:
>>>
>>> In __get_unmapped_area(), we will return a THP-aligned addr from
>>> thp_get_unmapped_area_vmflags(). Now for the diagram you have
>>> drawn, suppose that before the patch, we first mmap() the
>>> 8MB-start chunk. Then we mmap the 4MB start chunk.
>>> We go to __mmap_region(), and we see that the 8MB-start chunk
>>> has mergeable characteristics, so we merge. So the gap goes away?
>> No because there's a gap, we only merge immediately adjacent VMAs. And obviously
>> gaps mean page tables wouldn't be adjacent either...
>
> Ah shoot. That is prev->vm_end == vmg->start in can_vma_merge_left().
> Thanks.
>
>>
>> The get_unmapped_area() would have otherwise given adjacent mappings. Vlasta's
>> patch means in this case we no longer bother trying to align these because their
>> _length_ isn't PMD aligned.

Hi Lorenzo, Dev, all

Thank you for raising excellent points. I’ll respond to each in order, to
clarify the mechanics and relevance of this behavior in the context of AI
inference workloads.

🧩 1. Does the patch cause VMAs to be merged eventually?

You're correct: VMA merging only happens at mmap() time (via __mmap_region()).
What the patch affects is the behavior of thp_get_unmapped_area_vmflags(),
which picks the address before the mapping is placed.

Before the patch (with Rik’s logic):

Every mmap() returned an address rounded up to the next 2MB boundary,
regardless of whether the requested size was 2MB-aligned.

Result: even consecutive mmap()s (e.g., 1.5MB + 1.5MB) were non-adjacent, so
merging was impossible, even if their VMA flags matched.

After this patch:

If the allocation is not PMD-aligned in size, the returned address is not
forcibly aligned. This increases the likelihood that the next mmap() lands
directly after the previous one → enabling merging.
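
To make the before/after placement concrete, here is a minimal toy model of the address selection and the adjacency check. It is a sketch under stated assumptions, not kernel code: place(), layout() and merge_adjacent() are hypothetical helpers, placement is modelled bottom-up from address 0 for readability, the PMD size is taken as 2 MiB, and the pre-patch rule is assumed to align any mapping of at least PMD size. It uses 3 MiB mappings, matching the (PMD, 2*PMD) sizes in the diagram quoted above.

```python
MIB = 1 << 20
PMD_SIZE = 2 * MIB  # assumption: 2 MiB PMD, as on x86-64 with 4 KiB pages


def round_up(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


def place(cursor, length, limit_to_pmd_multiples):
    """Pick a start address for a new anonymous mapping (toy policy).

    Assumption: before the patch, a mapping of at least PMD_SIZE is put on
    a PMD boundary; after the patch, this only happens when the length is
    also a multiple of PMD_SIZE.
    """
    align = length >= PMD_SIZE
    if limit_to_pmd_multiples:
        align = align and length % PMD_SIZE == 0
    return round_up(cursor, PMD_SIZE) if align else cursor


def layout(lengths, limit_to_pmd_multiples):
    """Place mappings one after another; return (start, end) pairs."""
    vmas, cursor = [], 0
    for length in lengths:
        start = place(cursor, length, limit_to_pmd_multiples)
        vmas.append((start, start + length))
        cursor = start + length
    return vmas


def merge_adjacent(vmas):
    """Merge only when prev end == next start (flags assumed identical)."""
    merged = []
    for start, end in vmas:
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    sizes = [3 * MIB] * 3
    for label, limit in (("before", False), ("after", True)):
        vmas = merge_adjacent(layout(sizes, limit))
        print(label, [(s // MIB, e // MIB) for s, e in vmas])
```

In this model the pre-patch policy leaves 1 MiB holes between the three mappings, so none of them can merge, while the post-patch policy places them back to back and they collapse into a single range. Mappings whose length is a multiple of the PMD size are placed identically under both policies.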

So, to be clear: this patch doesn’t cause merging. It prevents unnecessary
pre-mmap gaps, which previously stopped merges from ever happening. The
obstacle is removed; no new merge path is added.

📐 2. Why aren’t the VMAs mergeable before the patch?

Great question. Even if the VMA flags are identical, gaps introduced by forced
alignment from get_unmapped_area() break the precondition for merging:

can_vma_merge_left() → requires prev->vm_end == vmg->start

With Rik’s patch in place:

  Suppose you mmap() 1.5MB → gets aligned to 2MB
  Next 1.5MB → gets aligned to 4MB

→ The kernel sees: prev->vm_end = 3.5MB, new mapping start = 4MB → no merge

With this patch, non-aligned lengths don’t get forcibly aligned, so consecutive
mmap()s often fall exactly after the previous one, and merging becomes possible
again.

🤖 3. How does this impact AI workloads like Hugging Face Transformers?

Tokenization and dynamic batching create non-deterministic memory allocation
patterns:

Models like BERT and T5 dynamically allocate intermediate buffers per
token length, batch size, and attention window.

Hugging Face + ONNX Runtime use multiple small-ish anonymous mmap()s, often
512KB–1.8MB.

These allocations come in bursts, but due to forced alignment the kernel was
placing them with artificial gaps, defeating THP eligibility entirely.

By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. The
result is that:

a. VMAs remain adjacent → mergeable

b. Merged VMAs cover larger contiguous virtual ranges, which can span whole
   PMD-sized regions → eligible for khugepaged collapse

c. THP utilization increases → fewer TLB misses → lower latency → higher
   throughput

💡 4. Why this patch complements Rik’s rather than contradicts it:

Rik's patch made it easier to guarantee alignment for workloads that benefit
from explicit huge pages, but at the cost of breaking coalescence in workloads
with non-PMD-sized mappings, like ML inference.

This patch simply refines that logic:

  If the length is PMD-aligned → keep alignment
  If it’s not → don’t inject alignment gaps that block merging

So, for workloads that can’t benefit from THP due to misalignment, this patch
removes artificial fragmentation without harming the original intent.

⚙️ 5. mTHP note

Although this patch doesn’t target mTHP directly, I believe a similar logic
tweak could apply there too, especially with shmem-backed workloads (common in
model servers using shared tensor memory).

I’d be happy to help test any changes proposed there and report the results.

Thanks again for the detailed discussion. Let me know if you’d like a trace or
VMA map from a Hugging Face benchmark run (happy to generate one locally).

Best Regards,
Siddhartha Sharma
+91 9015185601