From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
Date: Tue, 01 Jul 2025 17:45:51 +0530
From: siddhartha@kenip.in
To: Dev Jain <dev.jain@arm.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, mgorman@suse.de
Subject: =?UTF-8?Q?Re=3A_=5BPATCH=5D_mm=3A_limit_THP_alignment_=E2=80=93_?=
 =?UTF-8?Q?performance_gain_observed_in_AI_inference_workloads?=
In-Reply-To: <afe95bb0-185b-4c4a-ae41-e02457422cc3@arm.com>
References: <4990838b-660d-46a2-b21c-67adcba61ff9@lucifer.local>
 <19714cae-6b73-43ec-af7a-1455196561d1@arm.com>
 <3ee2e7fea6f263aa884e3e715632b09f@kenip.in>
 <d8ffe547-5516-43e5-9f33-56b2698a0b4f@arm.com>
 <ba2c89bd-88de-48f8-abd0-b62d8b1d50b3@lucifer.local>
 <5816677a-705e-4a8f-b598-d74ff6198a02@arm.com>
 <ee92d6a9-529a-4ac5-b3d0-0ff4e9085786@lucifer.local>
 <e7152150-2f3e-4ad7-a6c5-f4b77e5c0e05@arm.com>
 <f746d3aa-17e7-4b42-9e08-97cdb2cad89b@lucifer.local>
 <80b849d4-faf3-47a9-8b8c-e8053299cfb2@arm.com>
 <2e99712b-8dac-4762-9fc5-fe3ef569b65e@lucifer.local>
 <afe95bb0-185b-4c4a-ae41-e02457422cc3@arm.com>
Message-ID: <787639a1e6a27c0f3b0e3ae658e1b8e7@kenip.in>
X-Sender: siddhartha@kenip.in
X-Priority: 1 (Highest)
Content-Type: text/plain; charset=UTF-8;
 format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On 2025-07-01 12:28, Dev Jain wrote:
> On 01/07/25 12:20 pm, Lorenzo Stoakes wrote:
>> On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>>> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
>>>> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>>>>> Sorry I am not following, don't know in detail about the VMA merge stuff.
>>>>> Are you saying that after the patch, the VMAs will eventually get merged?
>>>>> Is it possible in the kernel to get a merge in the "future"; as I understand
>>>>> it only happens at mmap() time?
>>>>>
>>>>> Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
>>>>> If they are able to get merged after the patch, why won't they be merged before the patch,
>>>>> since the VMA characteristics are the same?
>>>>> 
>>>>> 
>>>> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>>>> 
>>>> 
>>>>     0            2MB                  4MB          6MB                  8MB          10MB
>>>>     |-------------.------|            |-------------.------|            |-------------.------|
>>>>     |             .      |            |             .      |            |             .      |
>>>>     |             .      |            |             .      |            |             .      |
>>>>     |-------------.------|            |-------------.------|            |-------------.------|
>>>>       huge mapped  4k m'd
>>> The effort to draw this is appreciated!
>>> 
>>> I understood the alignment, what I am asking is this:
>>> 
>>> In __get_unmapped_area(), we will return a THP-aligned addr from
>>> thp_get_unmapped_area_vmflags(). Now for the diagram you have
>>> drawn, suppose that before the patch, we first mmap() the
>>> 8MB-start chunk. Then we mmap the 4MB start chunk.
>>> We go to __mmap_region(), and we see that the 8MB-start chunk
>>> has mergeable characteristics, so we merge. So the gap goes away?
>> No, because there's a gap; we only merge immediately adjacent VMAs. And obviously
>> gaps mean page tables wouldn't be adjacent either...
> 
> Ah shoot. That is prev->vm_end == vmg->start in can_vma_merge_left(). 
> Thanks.
> 
>> 
>> The get_unmapped_area() would have otherwise given adjacent mappings. Vlasta's
>> patch means in this case we no longer bother trying to align these because their
>> _length_ isn't PMD aligned.

Hi Lorenzo, Dev, all

Thank you for raising excellent points — I’ll respond to each in order 
to clarify the mechanics and relevance of this behavior in the context 
of AI inference workloads.

🧩 1. Does the patch cause VMAs to be merged eventually?
You're correct that for these mappings the merge is attempted at mmap()
time (via __mmap_region()). What the patch changes is the address that
thp_get_unmapped_area_vmflags() picks before the mapping is placed.

Before the patch (with Rik’s logic):

Every anonymous mmap() of at least PMD size (2MB) got its address
aligned to a 2MB boundary, regardless of whether the requested length
was a multiple of 2MB.

Result: consecutive mmap()s whose lengths are not multiples of 2MB
(e.g., 3MB + 3MB) ended up non-adjacent, so merging was impossible,
even if their VMA flags matched.

After this patch:

If the allocation is not PMD-aligned in size, the returned address is 
not forcibly aligned, increasing the likelihood that the next mmap() 
lands directly after the previous one → enabling merging.

So, to be clear: this patch doesn’t cause merging by itself. It avoids
the alignment gaps that previously made such mappings non-adjacent and
so ruled out a merge.
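
To make the before/after difference concrete, here is a minimal
userspace sketch of the placement decision as I understand it. It is a
toy model, not the kernel code: thp_should_align_before(),
thp_should_align_after() and place_topdown() are hypothetical names, a
2MB PMD size is assumed, and a trivial top-down allocator stands in for
get_unmapped_area().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_SIZE (2ULL << 20) /* 2 MiB; assumes 4 KiB base pages */

/* Toy model of the earlier rule: align any mapping of at least PMD_SIZE. */
static bool thp_should_align_before(uint64_t len)
{
	return len >= PMD_SIZE;
}

/* Toy model of the refined rule: the length must also be a PMD multiple. */
static bool thp_should_align_after(uint64_t len)
{
	return len >= PMD_SIZE && (len % PMD_SIZE) == 0;
}

/*
 * Hypothetical top-down placement: put a mapping of len bytes directly
 * below *top, rounding the start down to a PMD boundary when align is
 * set. Updates *top to the new start and returns that start address.
 */
static uint64_t place_topdown(uint64_t *top, uint64_t len, bool align)
{
	uint64_t start;

	assert(*top >= len);
	start = *top - len;
	if (align)
		start &= ~(PMD_SIZE - 1);
	*top = start;
	return start;
}
```

With a hypothetical top of 64MB and two 3MB requests, the "before" rule
places them at 60MB and 56MB, leaving a 1MB hole after each mapping,
while the "after" rule places them at 61MB and 58MB, back to back.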

📐 2. Why aren’t the VMAs mergeable before the patch?
Great question. Even if the VMA flags are identical, gaps introduced by
forced alignment from get_unmapped_area() break the precondition for
merging, as Dev noted:

can_vma_merge_left()
  → prev->vm_end == vmg->start

With Rik’s patch in place (using ascending addresses for simplicity):

Suppose you mmap() 3MB → gets aligned to 2MB, so it ends at 5MB

Next 3MB → gets aligned to 6MB
→ The kernel sees: prev->vm_end = 5MB, vmg->start = 6MB
→ No merge

With this patch, non-aligned lengths don’t get forcibly aligned, so 
consecutive mmap()s often fall exactly after the previous, and merging 
becomes possible again.
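
The adjacency precondition can be sketched in userspace as below. This
is only an illustration under my own assumptions: struct toy_vma and
toy_can_merge_left() are hypothetical stand-ins, and the real kernel
check looks at more than the end address and flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a VMA: [start, end) plus flags. */
struct toy_vma {
	uint64_t start;
	uint64_t end;
	unsigned long flags;
};

/*
 * Toy version of the left-merge precondition: the previous mapping must
 * end exactly where the new range starts, and the flags must match.
 * A gap of even one page between the two makes this return false.
 */
static bool toy_can_merge_left(const struct toy_vma *prev,
			       uint64_t new_start, unsigned long new_flags)
{
	if (prev == NULL)
		return false;
	assert(prev->start < prev->end);
	return prev->end == new_start && prev->flags == new_flags;
}
```

For a previous mapping covering 2MB to 5MB, a new range starting at 6MB
fails the check, while one starting at exactly 5MB with the same flags
passes.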

🤖 3. How does this impact AI workloads like Hugging Face Transformers?
Tokenization and dynamic batching create non-deterministic memory 
allocation patterns:

Models like BERT and T5 dynamically allocate intermediate buffers per 
token-length, batch size, and attention window.

Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, 
often 512KB–1.8MB.

These allocations come in bursts — but due to forced alignment, the 
kernel was placing them with artificial gaps, defeating THP eligibility 
entirely.

By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. 
The result is that:

a. VMAs remain adjacent → mergeable

b. The merged VMA covers full PMD-aligned 2MB ranges → eligible for
khugepaged collapse

c. THP utilization increases → fewer TLB misses → lower latency → higher 
throughput
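
To illustrate why adjacency matters for THP, the sketch below counts how
many full PMD-aligned 2MB ranges fit inside a virtual range. It assumes
a 2MB PMD size and that a huge page can only back a PMD-aligned range
lying entirely inside one VMA; pmd_ranges_in() is a hypothetical helper,
not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define MiB      (1ULL << 20)
#define PMD_SIZE (2 * MiB) /* assumes 4 KiB base pages */

/*
 * Number of whole PMD-aligned, PMD-sized ranges inside [start, end).
 * Only such ranges are candidates for a PMD-sized huge page.
 */
static uint64_t pmd_ranges_in(uint64_t start, uint64_t end)
{
	uint64_t first, last;

	assert(start <= end);
	first = (start + PMD_SIZE - 1) & ~(PMD_SIZE - 1); /* round start up */
	last = end & ~(PMD_SIZE - 1);                     /* round end down */
	return last > first ? (last - first) / PMD_SIZE : 0;
}
```

For example, four 1.5MB ranges at 1MB, 2.5MB, 4MB and 5.5MB each contain
zero such ranges when treated as separate VMAs, but the single merged
range from 1MB to 7MB contains two.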

💡 4. Why this patch complements Rik’s rather than contradicts it:

Rik's patch made it easier to get PMD-aligned addresses for large
anonymous mappings, which helps workloads that benefit from THP, but at
the cost of breaking coalescence in workloads with non-PMD-sized
mappings, like ML inference.

This patch simply refines that logic:

If the length is PMD-aligned → keep alignment

If it’s not → don’t inject alignment gaps that block merging

So, for workloads that can’t benefit from THP due to misalignment, this 
patch removes artificial fragmentation without harming the original 
intent.

⚙️ 5. mTHP note
Although this patch doesn’t target mTHP directly, I believe a similar
logic tweak could apply there too, especially with shmem-backed
workloads (common in model servers using shared tensor memory). I’d be
happy to help test any changes proposed there and share the results.

Thanks again for the detailed discussion. Let me know if you’d like a 
trace or VMA map from a Hugging Face benchmarked run (happy to generate 
one locally).

Best Regards,
Siddhartha Sharma