From: siddhartha@kenip.in
To: Dev Jain <dev.jain@arm.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	mgorman@suse.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
Date: Mon, 30 Jun 2025 06:13:28 +0530	[thread overview]
Message-ID: <3ee2e7fea6f263aa884e3e715632b09f@kenip.in> (raw)
In-Reply-To: <19714cae-6b73-43ec-af7a-1455196561d1@arm.com>

On 2025-06-28 09:19, Dev Jain wrote:
> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>> +cc Vlata
>> 
>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>>> Hi all,
>>> 
>>> I wanted to share validation data from a Hugging Face-based AI 
>>> inferencing
>>> workload,
>>> which was significantly impacted by the THP alignment logic 
>>> introduced in
>>> commit efa7df3e3bb5.
>>> 
>>> Using transformer models with dynamic input lengths on Intel Xeon 
>>> (Cooper
>>> Lake),
>>> we observed up to a 3200% throughput improvement after applying the 
>>> patch
>>> from Oct 2024:
>>> 
>>>    mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>> All congratulations are owed to Vlastimil Babka for doing this, cc'd 
>> :)
>> 
>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
> 
> I was wondering how the change can get us such a big optimization - the
> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
> something else I am missing?
> 
> I ask because when I was reading the code I was thinking whether a 
> similar
> change can be done for mTHPs.
> 
>> 
>>> Metrics:
>>> - Model: BERT-base
>>> - Inference engine: Transformers + ONNX Runtime
>>> - Kernel: 6.6 vs patched 6.6.8
>>> - Batch size: 8-32, input length: 64-512 tokens
>>> - Metric: inference throughput (samples/sec)
>>> 
>>> Thanks for the fix -- this change had real impact on a 
>>> production-relevant
>>> workload.
>>> 
>>> Best Regards,
>>> Siddhartha Sharma
>>> ISV @ Kenip
>>> Solution Link: 
>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>> 

Hi Dev Jain,

Thank you for reviewing and for your thoughtful question.

You're absolutely right that, in isolation, gaining one additional 
PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case 
(Hugging Face inference workloads with dynamic input sizes and many 
anonymous allocations in flight), the unconditional PMD alignment 
interacted badly with the allocation pattern and triggered a cascade 
of side effects.

The workloads were running on Intel Developer Cloud, and I no longer 
have access to that environment or the original profiling output, but 
I'd like to highlight why the patch had such an outsized effect.
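
For context, the fix (paraphrasing mm/mmap.c from memory, so treat 
this as a sketch rather than the literal upstream diff) gates the THP 
alignment on the mapping length actually being a multiple of PMD_SIZE:

	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
		   && !addr				/* no hint address */
		   && IS_ALIGNED(len, PMD_SIZE)) {	/* <-- the fix */
		/* Ensure larger anonymous mappings are THP aligned. */
		addr = thp_get_unmapped_area_vmflags(file, addr, len,
						     pgoff, flags, vm_flags);
	}

Before the fix there was no IS_ALIGNED(len, PMD_SIZE) test, so every 
large anonymous mapping was bumped to a 2 MB boundary even when its 
length was not a multiple of 2 MB. That is exactly what broke 
contiguity for us, for the following reasons: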

🔹 1. Fragmentation Avoidance
During model shard loading (e.g., large BERT or GPT-2 models split 
into multiple memory segments), many medium-sized anonymous 
allocations occur in rapid succession: token buffers and intermediate 
tensors of roughly 512 KB to 1.5 MB each. Aligning every one of them 
to a PMD boundary, even when its length wasn't PMD-aligned, left gaps 
between the mappings and defeated their natural coalescing into a 
single THP.
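
To make the layout effect concrete, here is a toy userspace program 
(hypothetical sizes, not our real allocator) that maps three 1.5 MB 
anonymous buffers and prints where they land. On a kernel that 
force-aligns each request to a 2 MB boundary, a ~0.5 MB hole follows 
every buffer; without forced alignment they can sit back to back:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 3 * 512 * 1024;	/* 1.5 MB: not PMD-aligned */

	for (int i = 0; i < 3; i++) {
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		printf("buf %d: %p .. %p\n", i, (void *)p,
		       (void *)(p + len));
	}
	return 0;
}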

🔹 2. TLB and Cache Index Pressure
The fragmented mappings caused frequent TLB misses and poor L1/L2 
cache reuse. The result looked like memory thrashing, with slow 
memory access dominating total inference time: because every mapping 
was PMD-aligned without being PMD-sized, the gaps between mappings 
prevented THPs from forming, which in turn meant fragmented page 
tables and higher memory overhead per shard.

🔹 3. Latency and Throughput Penalty from Misalignment
The elevated TLB miss rates, especially under multi-threaded load, 
dramatically slowed down token embedding and attention calculations, 
and shard loading became cache-unfriendly, with poor reuse across 
cores. This hurt not only inference latency but also model cold-start 
time, which is critical in autoscaling deployments.

🔹 4. Qualitative Observation
Without this patch: shard loading stuttered, warm-up was slow, and 
CPU cycles were dominated by page-fault and TLB-miss handling.

With this patch: shard loading smoothed out, THPs were applied as 
expected (verified via smaps), and throughput improved by over an 
order of magnitude.
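
For anyone who wants to verify the same thing, here is a minimal 
checker (illustrative only) that sums the AnonHugePages lines in 
smaps to show how much anonymous memory is actually THP-backed:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	long kb, total = 0;
	FILE *f;

	/* Pass a PID as argv[1], or omit it to inspect this process. */
	snprintf(path, sizeof(path), "/proc/%s/smaps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
			total += kb;
	fclose(f);
	printf("AnonHugePages total: %ld kB\n", total);
	return 0;
}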

🔹 5. Measured Impact
On Intel Xeon (Cooper Lake), the 6.6 kernel, which still PMD-aligned 
non-PMD-sized mappings, showed 11–32× worse throughput. With the 
patched kernel (which skips alignment unless the length is 
PMD-aligned), the memory layout was contiguous again and THP was 
consistently utilized.

This isn't about one extra THP; it's about preventing widespread THP 
fragmentation across the many adjacent, odd-length allocations 
typical of AI inference workloads, where high concurrency and dynamic 
shapes make layout and locality matter enormously. The forced 
alignment created a pattern of broken contiguity that led to:

1. TLB miss storms

2. Poor locality

3. Cache index thrashing

4. Degraded latency and throughput

and thereby defeated the THP benefits altogether.

In AI workloads using Hugging Face Transformers, model shards and 
intermediate tensors are allocated dynamically during inference, and 
the allocation sizes often fall just below or above the 2 MB PMD 
granule that THP relies on. Forcing such requests to PMD boundaries 
fragments the address space and disrupts huge page coalescence, as 
the diagram below shows.

📊 Memory Allocation Pattern Diagram

Without Patch (PMD Alignment Forced):

|<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->|
| Alloc A |         | Alloc B |         | Alloc C |

Each allocation is PMD-aligned, even if it’s not PMD-sized

Gaps prevent THP coalescence → TLB/cache fragmentation

With Patch (PMD Alignment Conditional):

|<---------6MB Contiguous Region--------->|
|  Alloc A  | Alloc B | Alloc C | Padding |

Contiguous anonymous memory region

Coalesced into one or more THPs

Improved locality and TLB efficiency

While I regret not having the raw perf output at hand, I’d be happy to 
replicate a similar test locally and share reproducible results if 
helpful.
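
Concretely, the skeleton I would start from looks something like this 
(a rough sketch, not the original harness; NBUF and LEN are 
placeholder values): time the first touch of many odd-sized anonymous 
buffers on both kernels, then compare the timings together with the 
AnonHugePages totals from smaps:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define NBUF	64
#define LEN	(3 * 512 * 1024UL)	/* 1.5 MB each */

int main(void)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NBUF; i++) {
		char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, LEN);	/* fault every page in */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("first-touch: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}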

Best Regards,

Siddhartha Sharma




Thread overview: 28+ messages
2025-06-27 10:39 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
2025-06-27 10:45 ` siddhartha
2025-06-27 15:30 ` Lorenzo Stoakes
2025-06-28  3:49   ` Dev Jain
2025-06-30  0:43     ` siddhartha [this message]
2025-06-30  5:25       ` Dev Jain
2025-06-30  5:28         ` Dev Jain
2025-06-30 10:54         ` Lorenzo Stoakes
2025-06-30 11:48           ` siddhartha
2025-07-01  5:23           ` Dev Jain
2025-07-01  5:28             ` Lorenzo Stoakes
2025-07-01  5:45               ` Dev Jain
2025-07-01  5:53                 ` Lorenzo Stoakes
2025-07-01  6:30                   ` Dev Jain
2025-07-01  6:50                     ` Lorenzo Stoakes
2025-07-01  6:58                       ` Dev Jain
2025-07-01 12:15                         ` siddhartha
2025-07-01 12:39                           ` Lorenzo Stoakes
2025-07-01 13:23                             ` siddhartha
2025-07-01 13:28                               ` Lorenzo Stoakes
2025-07-01 14:20                                 ` siddhartha
2025-07-01 16:20                             ` Dev Jain
2025-07-01 18:49                               ` Zi Yan
2025-07-07  8:56                                 ` Vlastimil Babka
2025-07-28  5:41                                   ` siddhartha
2025-07-28 11:00                                     ` Vlastimil Babka
2025-07-01 15:40                           ` Yang Shi
