[PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
@ 2025-06-27 10:39 siddhartha
  2025-06-27 10:45 ` siddhartha
  2025-06-27 15:30 ` Lorenzo Stoakes
  0 siblings, 2 replies; 28+ messages in thread
From: siddhartha @ 2025-06-27 10:39 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, mgorman

Hi all,

I wanted to share validation data from a Hugging Face-based AI 
inferencing workload,
which was significantly impacted by the THP alignment logic introduced 
in commit efa7df3e3bb5.

Using transformer models with dynamic input lengths on Intel Xeon 
(Cooper Lake),
we observed up to a 3200% throughput improvement after applying the 
patch from Oct 2024:

   mm: limit THP alignment of anonymous mappings to PMD-aligned sizes

Metrics:
- Model: BERT-base
- Inference engine: Transformers + ONNX Runtime
- Kernel: 6.6 vs patched 6.6.8
- Batch size: 8-32, input length: 64-512 tokens
- Metric: inference throughput (samples/sec)

Thanks for the fix -- this change had real impact on a 
production-relevant workload.

Best Regards,
Siddhartha Sharma
ISV @ Kenip
Solution Link: 
https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-27 10:39 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
@ 2025-06-27 10:45 ` siddhartha
  2025-06-27 15:30 ` Lorenzo Stoakes
  1 sibling, 0 replies; 28+ messages in thread
From: siddhartha @ 2025-06-27 10:45 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, mgorman

> Hi all,
> 
> I wanted to share validation data from a Hugging Face-based AI
> inferencing workload,
> which was significantly impacted by the THP alignment logic introduced
> in commit efa7df3e3bb5.
> 
> Using transformer models with dynamic input lengths on Intel Xeon 
> (Cooper Lake),
> we observed up to a 3200% throughput improvement after applying the
> patch from Oct 2024:
> 
>   mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
> 
> Metrics:
> - Model: BERT-base
> - Inference engine: Transformers + ONNX Runtime
> - Kernel: 6.6 vs patched 6.6.8
> - Batch size: 8-32, input length: 64-512 tokens
> - Metric: inference throughput (samples/sec)
> 
> Thanks for the fix -- this change had real impact on a
> production-relevant workload.
> 
> Best Regards,
> Siddhartha Sharma
> ISV @ Kenip
> Solution Link:
> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-27 10:39 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
  2025-06-27 10:45 ` siddhartha
@ 2025-06-27 15:30 ` Lorenzo Stoakes
  2025-06-28  3:49   ` Dev Jain
  1 sibling, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-06-27 15:30 UTC (permalink / raw)
  To: siddhartha; +Cc: linux-mm, linux-kernel, mgorman, Vlastimil Babka

+cc Vlata

On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
> Hi all,
>
> I wanted to share validation data from a Hugging Face-based AI inferencing
> workload,
> which was significantly impacted by the THP alignment logic introduced in
> commit efa7df3e3bb5.
>
> Using transformer models with dynamic input lengths on Intel Xeon (Cooper
> Lake),
> we observed up to a 3200% throughput improvement after applying the patch
> from Oct 2024:
>
>   mm: limit THP alignment of anonymous mappings to PMD-aligned sizes

All congratulations are owed to Vlastimil Babka for doing this, cc'd :)

I gather he enjoys novelty beer mugs as tokens of thanks ;)

>
> Metrics:
> - Model: BERT-base
> - Inference engine: Transformers + ONNX Runtime
> - Kernel: 6.6 vs patched 6.6.8
> - Batch size: 8-32, input length: 64-512 tokens
> - Metric: inference throughput (samples/sec)
>
> Thanks for the fix -- this change had real impact on a production-relevant
> workload.
>
> Best Regards,
> Siddhartha Sharma
> ISV @ Kenip
> Solution Link: https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-27 15:30 ` Lorenzo Stoakes
@ 2025-06-28  3:49   ` Dev Jain
  2025-06-30  0:43     ` siddhartha
  0 siblings, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-06-28  3:49 UTC (permalink / raw)
  To: Lorenzo Stoakes, siddhartha
  Cc: linux-mm, linux-kernel, mgorman, Vlastimil Babka


On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
> +cc Vlata
>
> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>> Hi all,
>>
>> I wanted to share validation data from a Hugging Face-based AI inferencing
>> workload,
>> which was significantly impacted by the THP alignment logic introduced in
>> commit efa7df3e3bb5.
>>
>> Using transformer models with dynamic input lengths on Intel Xeon (Cooper
>> Lake),
>> we observed up to a 3200% throughput improvement after applying the patch
>> from Oct 2024:
>>
>>    mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
> All congratulations are owed to Vlastimil Babka for doing this, cc'd :)
>
> I gather he enjoys novelty beer mugs as tokens of thanks ;)

I was wondering how the change can get us such a big optimization - the
alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
something else I am missing?

I ask because when I was reading the code I was thinking whether a similar
change can be done for mTHPs.

>
>> Metrics:
>> - Model: BERT-base
>> - Inference engine: Transformers + ONNX Runtime
>> - Kernel: 6.6 vs patched 6.6.8
>> - Batch size: 8-32, input length: 64-512 tokens
>> - Metric: inference throughput (samples/sec)
>>
>> Thanks for the fix -- this change had real impact on a production-relevant
>> workload.
>>
>> Best Regards,
>> Siddhartha Sharma
>> ISV @ Kenip
>> Solution Link: https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-28  3:49   ` Dev Jain
@ 2025-06-30  0:43     ` siddhartha
  2025-06-30  5:25       ` Dev Jain
  0 siblings, 1 reply; 28+ messages in thread
From: siddhartha @ 2025-06-30  0:43 UTC (permalink / raw)
  To: Dev Jain; +Cc: Lorenzo Stoakes, linux-mm, linux-kernel, mgorman

On 2025-06-28 09:19, Dev Jain wrote:
> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>> +cc Vlata
>> 
>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>>> Hi all,
>>> 
>>> I wanted to share validation data from a Hugging Face-based AI 
>>> inferencing
>>> workload,
>>> which was significantly impacted by the THP alignment logic 
>>> introduced in
>>> commit efa7df3e3bb5.
>>> 
>>> Using transformer models with dynamic input lengths on Intel Xeon 
>>> (Cooper
>>> Lake),
>>> we observed up to a 3200% throughput improvement after applying the 
>>> patch
>>> from Oct 2024:
>>> 
>>>    mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>> All congratulations are owed to Vlastimil Babka for doing this, cc'd 
>> :)
>> 
>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
> 
> I was wondering how the change can get us such a big optimization - the
> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
> something else I am missing?
> 
> I ask because when I was reading the code I was thinking whether a 
> similar
> change can be done for mTHPs.
> 
>> 
>>> Metrics:
>>> - Model: BERT-base
>>> - Inference engine: Transformers + ONNX Runtime
>>> - Kernel: 6.6 vs patched 6.6.8
>>> - Batch size: 8-32, input length: 64-512 tokens
>>> - Metric: inference throughput (samples/sec)
>>> 
>>> Thanks for the fix -- this change had real impact on a 
>>> production-relevant
>>> workload.
>>> 
>>> Best Regards,
>>> Siddhartha Sharma
>>> ISV @ Kenip
>>> Solution Link: 
>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>> 

Hi Dev Jain,

Thank you for reviewing and for your thoughtful question.

You're absolutely right that, in isolation, gaining one additional 
PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case 
(Hugging Face inference workloads with dynamic input sizes and many 
allocations), the original PMD alignment logic caused a cascade of side 
effects:

The performance improvement comes from how that interacts with dynamic 
memory allocation patterns in AI inference workloads, especially those 
using frameworks like Hugging Face Transformers.

In our specific use case, the workloads were running on Intel Developer 
Cloud, but I no longer have access to that particular environment or the 
original profiling output. However, I’d like to highlight why this patch 
had such an outsized effect:

🔹 1. Fragmentation Avoidance
In model shard loading (e.g., large BERT or GPT2 models split into 
multiple memory segments), many medium-sized anonymous allocations occur 
in rapid succession. These workloads tend to allocate many 512 KB – 1.5 
MB buffers dynamically (token buffers, intermediate tensors). Aligning 
each one to PMD size, even when their length wasn’t PMD-aligned, led to 
gaps between them — defeating natural coalescing into a single THP.

🔹 2. TLB aliasing and cache index pressure

These fragmented mappings caused frequent TLB misses and poor L1/L2 
cache reuse.

The result was what looks like “memory thrashing,” with slow memory 
access dominating total inference time.
When every mapping is PMD-aligned (even if not PMD-sized), the gaps 
between them prevent Transparent Huge Pages (THPs) from activating 
effectively.

This breaks THP coalescence and causes fragmented page tables and higher 
memory overhead per shard.

🔹 3. Latency & Throughput Penalty from Memory Misalignment
This leads to higher TLB miss rates, especially under multi-threaded 
load, which dramatically slows down token embedding and attention 
calculations.

When loading model shards, memory initialization becomes 
cache-unfriendly, with poor reuse across cores.

This affects not only inference latency but also model cold-start time — 
which is critical in autoscaling deployments.

🔹 4. Qualitative Observation
Without this patch: shard loading stuttered, warm-up was slow, and we 
saw CPU cycles dominated by page_fault and TLB miss handlers.

With this patch: shard loading smoothed out, THPs were correctly applied 
(based on smaps), and throughput shot up by an order of magnitude.

🔹 5. Measured Impact
On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on 
non-aligned sizes showed 11–32× worse performance.

With the patched kernel (which skips alignment unless the length is 
PMD-aligned), memory layout was contiguous again and THP was 
consistently utilized.

This isn’t about one extra THP — it’s about preventing widespread THP 
fragmentation and the resulting dramatic cache/TLB degradation. For AI 
workloads with high concurrency and dynamic shapes, this small patch has 
a massive effect on layout and locality.

So, it's not just “1 more huge page” — it's avoiding massive 
fragmentation that leads to:

1. TLB miss storms

2. Poor locality

3. Cache index thrashing

4. Improvement in latency and throughput

This applies across many adjacent, odd-length allocations typical of AI 
inference workloads.

The original alignment logic created a pattern of broken contiguity — 
defeating THP benefits altogether.

In AI workloads using Hugging Face Transformers, model shards and 
intermediate tensors are dynamically allocated during inference. These 
allocations often fall just below or above the 2MB threshold that THP 
relies on. Misalignment or forced alignment to PMD boundaries causes 
fragmentation and disrupts huge page coalescence, affecting performance.

📊 Memory Allocation Pattern Diagram

Without Patch (PMD Alignment Forced):

|<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
| Alloc A |         | Alloc B |         | Alloc C |

Each allocation is PMD-aligned, even if it’s not PMD-sized

Gaps prevent THP coalescence → TLB/cache fragmentation

With Patch (PMD Alignment Conditional):

|<---------6MB Contiguous Region--------->|
|  Alloc A  | Alloc B | Alloc C | Padding |

Contiguous anonymous memory region

Coalesced into one or more THPs

Improved locality and TLB efficiency

While I regret not having the raw perf output at hand, I’d be happy to 
replicate a similar test locally and share reproducible results if 
helpful.

Best Regards,

Siddhartha Sharma

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-30  0:43     ` siddhartha
@ 2025-06-30  5:25       ` Dev Jain
  2025-06-30  5:28         ` Dev Jain
  2025-06-30 10:54         ` Lorenzo Stoakes
  0 siblings, 2 replies; 28+ messages in thread
From: Dev Jain @ 2025-06-30  5:25 UTC (permalink / raw)
  To: siddhartha; +Cc: Lorenzo Stoakes, linux-mm, linux-kernel, mgorman


On 30/06/25 6:13 am, siddhartha@kenip.in wrote:
> On 2025-06-28 09:19, Dev Jain wrote:
>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>> +cc Vlata
>>>
>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>>>> Hi all,
>>>>
>>>> I wanted to share validation data from a Hugging Face-based AI 
>>>> inferencing
>>>> workload,
>>>> which was significantly impacted by the THP alignment logic 
>>>> introduced in
>>>> commit efa7df3e3bb5.
>>>>
>>>> Using transformer models with dynamic input lengths on Intel Xeon 
>>>> (Cooper
>>>> Lake),
>>>> we observed up to a 3200% throughput improvement after applying the 
>>>> patch
>>>> from Oct 2024:
>>>>
>>>>    mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>> All congratulations are owed to Vlastimil Babka for doing this, cc'd :)
>>>
>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>
>> I was wondering how the change can get us such a big optimization - the
>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>> something else I am missing?
>>
>> I ask because when I was reading the code I was thinking whether a 
>> similar
>> change can be done for mTHPs.
>>
>>>
>>>> Metrics:
>>>> - Model: BERT-base
>>>> - Inference engine: Transformers + ONNX Runtime
>>>> - Kernel: 6.6 vs patched 6.6.8
>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>> - Metric: inference throughput (samples/sec)
>>>>
>>>> Thanks for the fix -- this change had real impact on a 
>>>> production-relevant
>>>> workload.
>>>>
>>>> Best Regards,
>>>> Siddhartha Sharma
>>>> ISV @ Kenip
>>>> Solution Link: 
>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>
>
> Hi Dev Jain,
>
> Thank you for reviewing and for your thoughtful question.
>
> You're absolutely right that, in isolation, gaining one additional 
> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case 
> (Hugging Face inference workloads with dynamic input sizes and many 
> allocations), the original PMD alignment logic caused a cascade of 
> side effects:
>
> The performance improvement comes from how that interacts with dynamic 
> memory allocation patterns in AI inference workloads, especially those 
> using frameworks like Hugging Face Transformers.
>
> In our specific use case, the workloads were running on Intel 
> Developer Cloud, but I no longer have access to that particular 
> environment or the original profiling output. However, I’d like to 
> highlight why this patch had such an outsized effect:
>
> 🔹 1. Fragmentation Avoidance
> In model shard loading (e.g., large BERT or GPT2 models split into 
> multiple memory segments), many medium-sized anonymous allocations 
> occur in rapid succession. These workloads tend to allocate many 512 
> KB – 1.5 MB buffers dynamically (token buffers, intermediate tensors). 
> Aligning each one to PMD size, even when their length wasn’t 
> PMD-aligned, led to gaps between them — defeating natural coalescing 
> into a single THP.
>
> 🔹 2. TLB aliasing and cache index pressure
>
> These fragmented mappings caused frequent TLB misses and poor L1/L2 
> cache reuse.
>
> The result was what looks like “memory thrashing,” with slow memory 
> access dominating total inference time.
> When every mapping is PMD-aligned (even if not PMD-sized), the gaps 
> between them prevent Transparent Huge Pages (THPs) from activating 
> effectively.
>
> This breaks THP coalescence and causes fragmented page tables and 
> higher memory overhead per shard.
>
> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
> This leads to higher TLB miss rates, especially under multi-threaded 
> load, which dramatically slows down token embedding and attention 
> calculations.
>
> When loading model shards, memory initialization becomes 
> cache-unfriendly, with poor reuse across cores.
>
> This affects not only inference latency but also model cold-start time 
> — which is critical in autoscaling deployments.
>
> 🔹 4. Qualitative Observation
> Without this patch: shard loading stuttered, warm-up was slow, and we 
> saw CPU cycles dominated by page_fault and TLB miss handlers.
>
> With this patch: shard loading smoothed out, THPs were correctly 
> applied (based on smaps), and throughput shot up by an order of 
> magnitude.
>
> 🔹 5. Measured Impact
> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on 
> non-aligned sizes showed 11–32× worse performance.
>
> With the patched kernel (which skips alignment unless the length is 
> PMD-aligned), memory layout was contiguous again and THP was 
> consistently utilized.
>
> This isn’t about one extra THP — it’s about preventing widespread THP 
> fragmentation and the resulting dramatic cache/TLB degradation. For AI 
> workloads with high concurrency and dynamic shapes, this small patch 
> has a massive effect on layout and locality.
>
> So, it's not just “1 more huge page” — it's avoiding massive 
> fragmentation that leads to:
>
> 1. TLB miss storms
>
> 2. Poor locality
>
> 3. Cache index thrashing
>
> 4. Improvement in latency and throughput
>
> This applies across many adjacent, odd-length allocations typical of 
> AI inference workloads.
>
> The original alignment logic created a pattern of broken contiguity — 
> defeating THP benefits altogether.
>
> In AI workloads using Hugging Face Transformers, model shards and 
> intermediate tensors are dynamically allocated during inference. These 
> allocations often fall just below or above the 2MB threshold that THP 
> relies on. Misalignment or forced alignment to PMD boundaries causes 
> fragmentation and disrupts huge page coalescence, affecting performance.
>
> 📊 Memory Allocation Pattern Diagram
>
> Without Patch (PMD Alignment Forced):
>
> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
> | Alloc A |         | Alloc B |         | Alloc C |
>
> Each allocation is PMD-aligned, even if it’s not PMD-sized
>
> Gaps prevent THP coalescence → TLB/cache fragmentation
>
> With Patch (PMD Alignment Conditional):
>
> |<---------6MB Contiguous Region--------->|
> |  Alloc A  | Alloc B | Alloc C | Padding |
>
> Contiguous anonymous memory region
>
> Coalesced into one or more THPs
>
> Improved locality and TLB efficiency
>
> While I regret not having the raw perf output at hand, I’d be happy to 
> replicate a similar test locally and share reproducible results if 
> helpful.
>
> Best Regards,
>
> Siddhartha Sharma

Thanks for your detailed explanation! I misunderstood that the 
optimization you were talking about

was due to efa7df3e3bb5, instead it was due to the alignment. Your 
explanation makes a lot of

sense!


For this workload, do you enable mTHPs on your system? My plan is to 
make a similar patch for

the mTHP case and I'd be grateful if you can get me some results : )

>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-30  5:25       ` Dev Jain
@ 2025-06-30  5:28         ` Dev Jain
  2025-06-30 10:54         ` Lorenzo Stoakes
  1 sibling, 0 replies; 28+ messages in thread
From: Dev Jain @ 2025-06-30  5:28 UTC (permalink / raw)
  To: siddhartha; +Cc: Lorenzo Stoakes, linux-mm, linux-kernel, mgorman


On 30/06/25 10:55 am, Dev Jain wrote:
>
> On 30/06/25 6:13 am, siddhartha@kenip.in wrote:
>> On 2025-06-28 09:19, Dev Jain wrote:
>>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>>> +cc Vlata
>>>>
>>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>>>>> Hi all,
>>>>>
>>>>> I wanted to share validation data from a Hugging Face-based AI 
>>>>> inferencing
>>>>> workload,
>>>>> which was significantly impacted by the THP alignment logic 
>>>>> introduced in
>>>>> commit efa7df3e3bb5.
>>>>>
>>>>> Using transformer models with dynamic input lengths on Intel Xeon 
>>>>> (Cooper
>>>>> Lake),
>>>>> we observed up to a 3200% throughput improvement after applying 
>>>>> the patch
>>>>> from Oct 2024:
>>>>>
>>>>>    mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>>> All congratulations are owed to Vlastimil Babka for doing this, 
>>>> cc'd :)
>>>>
>>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>>
>>> I was wondering how the change can get us such a big optimization - the
>>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>>> something else I am missing?
>>>
>>> I ask because when I was reading the code I was thinking whether a 
>>> similar
>>> change can be done for mTHPs.
>>>
>>>>
>>>>> Metrics:
>>>>> - Model: BERT-base
>>>>> - Inference engine: Transformers + ONNX Runtime
>>>>> - Kernel: 6.6 vs patched 6.6.8
>>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>>> - Metric: inference throughput (samples/sec)
>>>>>
>>>>> Thanks for the fix -- this change had real impact on a 
>>>>> production-relevant
>>>>> workload.
>>>>>
>>>>> Best Regards,
>>>>> Siddhartha Sharma
>>>>> ISV @ Kenip
>>>>> Solution Link: 
>>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>>
>>
>> Hi Dev Jain,
>>
>> Thank you for reviewing and for your thoughtful question.
>>
>> You're absolutely right that, in isolation, gaining one additional 
>> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case 
>> (Hugging Face inference workloads with dynamic input sizes and many 
>> allocations), the original PMD alignment logic caused a cascade of 
>> side effects:
>>
>> The performance improvement comes from how that interacts with 
>> dynamic memory allocation patterns in AI inference workloads, 
>> especially those using frameworks like Hugging Face Transformers.
>>
>> In our specific use case, the workloads were running on Intel 
>> Developer Cloud, but I no longer have access to that particular 
>> environment or the original profiling output. However, I’d like to 
>> highlight why this patch had such an outsized effect:
>>
>> 🔹 1. Fragmentation Avoidance
>> In model shard loading (e.g., large BERT or GPT2 models split into 
>> multiple memory segments), many medium-sized anonymous allocations 
>> occur in rapid succession. These workloads tend to allocate many 512 
>> KB – 1.5 MB buffers dynamically (token buffers, intermediate 
>> tensors). Aligning each one to PMD size, even when their length 
>> wasn’t PMD-aligned, led to gaps between them — defeating natural 
>> coalescing into a single THP.
>>
>> 🔹 2. TLB aliasing and cache index pressure
>>
>> These fragmented mappings caused frequent TLB misses and poor L1/L2 
>> cache reuse.
>>
>> The result was what looks like “memory thrashing,” with slow memory 
>> access dominating total inference time.
>> When every mapping is PMD-aligned (even if not PMD-sized), the gaps 
>> between them prevent Transparent Huge Pages (THPs) from activating 
>> effectively.
>>
>> This breaks THP coalescence and causes fragmented page tables and 
>> higher memory overhead per shard.
>>
>> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
>> This leads to higher TLB miss rates, especially under multi-threaded 
>> load, which dramatically slows down token embedding and attention 
>> calculations.
>>
>> When loading model shards, memory initialization becomes 
>> cache-unfriendly, with poor reuse across cores.
>>
>> This affects not only inference latency but also model cold-start 
>> time — which is critical in autoscaling deployments.
>>
>> 🔹 4. Qualitative Observation
>> Without this patch: shard loading stuttered, warm-up was slow, and we 
>> saw CPU cycles dominated by page_fault and TLB miss handlers.
>>
>> With this patch: shard loading smoothed out, THPs were correctly 
>> applied (based on smaps), and throughput shot up by an order of 
>> magnitude.
>>
>> 🔹 5. Measured Impact
>> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on 
>> non-aligned sizes showed 11–32× worse performance.
>>
>> With the patched kernel (which skips alignment unless the length is 
>> PMD-aligned), memory layout was contiguous again and THP was 
>> consistently utilized.
>>
>> This isn’t about one extra THP — it’s about preventing widespread THP 
>> fragmentation and the resulting dramatic cache/TLB degradation. For 
>> AI workloads with high concurrency and dynamic shapes, this small 
>> patch has a massive effect on layout and locality.
>>
>> So, it's not just “1 more huge page” — it's avoiding massive 
>> fragmentation that leads to:
>>
>> 1. TLB miss storms
>>
>> 2. Poor locality
>>
>> 3. Cache index thrashing
>>
>> 4. Improvement in latency and throughput
>>
>> This applies across many adjacent, odd-length allocations typical of 
>> AI inference workloads.
>>
>> The original alignment logic created a pattern of broken contiguity — 
>> defeating THP benefits altogether.
>>
>> In AI workloads using Hugging Face Transformers, model shards and 
>> intermediate tensors are dynamically allocated during inference. 
>> These allocations often fall just below or above the 2MB threshold 
>> that THP relies on. Misalignment or forced alignment to PMD 
>> boundaries causes fragmentation and disrupts huge page coalescence, 
>> affecting performance.
>>
>> 📊 Memory Allocation Pattern Diagram
>>
>> Without Patch (PMD Alignment Forced):
>>
>> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
>> | Alloc A |         | Alloc B |         | Alloc C |
>>
>> Each allocation is PMD-aligned, even if it’s not PMD-sized
>>
>> Gaps prevent THP coalescence → TLB/cache fragmentation
>>
>> With Patch (PMD Alignment Conditional):
>>
>> |<---------6MB Contiguous Region--------->|
>> |  Alloc A  | Alloc B | Alloc C | Padding |
>>
>> Contiguous anonymous memory region
>>
>> Coalesced into one or more THPs
>>
>> Improved locality and TLB efficiency
>>
>> While I regret not having the raw perf output at hand, I’d be happy 
>> to replicate a similar test locally and share reproducible results if 
>> helpful.
>>
>> Best Regards,
>>
>> Siddhartha Sharma
>
> Thanks for your detailed explanation! I misunderstood that the 
> optimization you were talking about
>
> was due to efa7df3e3bb5, instead it was due to the alignment. Your 
> explanation makes a lot of
>
> sense!
>
>
> For this workload, do you enable mTHPs on your system? My plan is to 
> make a similar patch for
>
> the mTHP case and I'd be grateful if you can get me some results : )

Oh I see that you are using the 6.6 kernel, which probably won't have 
the mTHP patches.


>
>>
>>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-30  5:25       ` Dev Jain
  2025-06-30  5:28         ` Dev Jain
@ 2025-06-30 10:54         ` Lorenzo Stoakes
  2025-06-30 11:48           ` siddhartha
  2025-07-01  5:23           ` Dev Jain
  1 sibling, 2 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 10:54 UTC (permalink / raw)
  To: Dev Jain; +Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka

+cc Vlastimil, please keep him cc'd on discussions here as the author of this
fix in the conversation.

On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>
>
> For this workload, do you enable mTHPs on your system? My plan is to make a
> similar patch for
>
> the mTHP case and I'd be grateful if you can get me some results : )

I'd urge caution here.

The reason there was a big perf improvement is that, for certain workloads, the
original patch by Rik caused issues with VMA fragmentation. So rather than
getting adjacent VMAs that might later be khugepage'd, you'd get a bunch of VMAs
that were auto-aligned and thus fragmented from one another.

So while you got speed ups on some workloads, you got really bad perf impact on
some that were subject to this.

The observed speed up was on a very specific benchmark also. While it's a great
improvement, it's important to understand the context (see the original patch
for details [0]).

I do think it's worth considering changing thp_get_unmapped_area_vmflags() for
mTHP, as it's currently very limited (just PMD alignment) and it'd possibly be
sensible to change this to checking against allowed THP alignments, but I'd not
assume this is going to get some crazy speed up as observed here.

Note that any such change would probably require some refactoring in THP first
to make it not quite so awful.

I also think for Siddharta's usecase mTHP isn't really relevant is it, as intel
do not support mTHP currently do they?

Regards, Lorenzo

[0]: https://lore.kernel.org/all/20241024151228.101841-2-vbabka@suse.cz/T/#u

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-30 10:54         ` Lorenzo Stoakes
@ 2025-06-30 11:48           ` siddhartha
  2025-07-01  5:23           ` Dev Jain
  1 sibling, 0 replies; 28+ messages in thread
From: siddhartha @ 2025-06-30 11:48 UTC (permalink / raw)
  To: Lorenzo Stoakes; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman

On 2025-06-30 16:24, Lorenzo Stoakes wrote:
> +cc Vlastimil, please keep him cc'd on discussions here as the author 
> of this
> fix in the conversation.
> 
> On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>> 
>> 
>> For this workload, do you enable mTHPs on your system? My plan is to 
>> make a
>> similar patch for
>> 
>> the mTHP case and I'd be grateful if you can get me some results : )
> 
> I'd urge caution here.
> 
> The reason there was a big perf improvement is that, for certain 
> workloads, the
> original patch by Rik caused issues with VMA fragmentation. So rather 
> than
> getting adjacent VMAs that might later be khugepage'd, you'd get a 
> bunch of VMAs
> that were auto-aligned and thus fragmented from one another.
> 
> So while you got speed ups on some workloads, you got really bad perf 
> impact on
> some that were subject to this.
> 
> The observed speed up was on a very specific benchmark also. While it's 
> a great
> improvement, it's important to understand the context (see the original 
> patch
> for details [0]).
> 
> I do think it's worth considering changing 
> thp_get_unmapped_area_vmflags() for
> mTHP, as it's currently very limited (just PMD alignment) and it'd 
> possibly be
> sensible to change this to checking against allowed THP alignments, but 
> I'd not
> assume this is going to get some crazy speed up as observed here.
> 
> Note that any such change would probably require some refactoring in 
> THP first
> to make it not quite so awful.
> 
> I also think for Siddharta's usecase mTHP isn't really relevant is it, 
> as intel
> do not support mTHP currently do they?
> 
> Regards, Lorenzo
> 
> [0]: 
> https://lore.kernel.org/all/20241024151228.101841-2-vbabka@suse.cz/T/#u

Hi Lorenzo, Dev, All,

Thank you for the thoughtful responses and for engaging with the 
performance implications of the patch.

You're absolutely right that the observed speedup came from a specific 
class of workloads — in this case, token-length-variable AI inference 
pipelines based on Hugging Face Transformers and ONNX Runtime. These 
workloads trigger highly dynamic, anonymous memory allocation patterns, 
often in bursts aligned with model shard loading and attention map 
resizing. In such cases, VMA fragmentation due to PMD-aligned, 
non-PMD-sized mappings led to near-complete loss of THP utilization.

Once the alignment restriction was lifted (via Rik’s patch), we observed 
substantial restoration of THP behavior, which is where the performance 
gains came from. That said, I completely agree that:

Not all workloads benefit from this

Some may even regress if the underlying VMAs aren't THP-coalescible for 
other reasons

Still, for high-throughput inference workloads on modern Intel CPUs, 
this behavior isn’t a corner case. The shift toward multi-model 
concurrent serving (e.g., LLM-as-a-Service) means this dynamic 
allocation pattern is becoming common, especially in 
edge/latency-sensitive deployments.

🧠 On mTHP: Intel Does Support It
Regarding mTHP — yes, Intel platforms (especially server-grade Xeon 
processors from Cascade Lake onward) do support mapped transparent huge 
pages, including via:

tmpfs-backed files

madvise(MADV_HUGEPAGE) on file mappings

shmem usage with shmem_enabled in the kernel

So I’d say mTHP is certainly relevant for workloads where model weights 
or tensors are pre-cached or memory-mapped — a pattern we’re also seeing 
as Hugging Face, ONNX, and PyTorch ecosystems move toward zero-copy 
tensor sharing.

Given that, I'd absolutely be interested in testing any mTHP-targeted 
patch — and I’d be happy to help validate it, especially if it avoids 
the VMA fragmentation pitfall you rightly pointed out.

Thanks again for the detailed feedback, and I’ll try to replicate and 
share further traces (from my local testbed) since I currently don’t 
have access to the original Intel Developer Cloud logs.

Best regards,
Siddhartha Sharma

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-06-30 10:54         ` Lorenzo Stoakes
  2025-06-30 11:48           ` siddhartha
@ 2025-07-01  5:23           ` Dev Jain
  2025-07-01  5:28             ` Lorenzo Stoakes
  1 sibling, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01  5:23 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka


On 30/06/25 4:24 pm, Lorenzo Stoakes wrote:
> +cc Vlastimil, please keep him cc'd on discussions here as the author of this
> fix in the conversation.
>
> On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>>
>> For this workload, do you enable mTHPs on your system? My plan is to make a
>> similar patch for
>>
>> the mTHP case and I'd be grateful if you can get me some results : )
> I'd urge caution here.
>
> The reason there was a big perf improvement is that, for certain workloads, the
> original patch by Rik caused issues with VMA fragmentation. So rather than
> getting adjacent VMAs that might later be khugepage'd, you'd get a bunch of VMAs
> that were auto-aligned and thus fragmented from one another.

How does getting two different adjacent VMAs allow them to be khugepage'd if
both are less than PMD size? khugepaged operates per vma, I'm missing something.
  

>
> So while you got speed ups on some workloads, you got really bad perf impact on
> some that were subject to this.
>
> The observed speed up was on a very specific benchmark also. While it's a great
> improvement, it's important to understand the context (see the original patch
> for details [0]).
>
> I do think it's worth considering changing thp_get_unmapped_area_vmflags() for
> mTHP, as it's currently very limited (just PMD alignment) and it'd possibly be
> sensible to change this to checking against allowed THP alignments, but I'd not
> assume this is going to get some crazy speed up as observed here.
>
> Note that any such change would probably require some refactoring in THP first
> to make it not quite so awful.
>
> I also think for Siddharta's usecase mTHP isn't really relevant is it, as intel
> do not support mTHP currently do they?
>
> Regards, Lorenzo
>
> [0]: https://lore.kernel.org/all/20241024151228.101841-2-vbabka@suse.cz/T/#u


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01  5:23           ` Dev Jain
@ 2025-07-01  5:28             ` Lorenzo Stoakes
  2025-07-01  5:45               ` Dev Jain
  0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01  5:28 UTC (permalink / raw)
  To: Dev Jain; +Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka

On Tue, Jul 01, 2025 at 10:53:09AM +0530, Dev Jain wrote:
>
> On 30/06/25 4:24 pm, Lorenzo Stoakes wrote:
> > +cc Vlastimil, please keep him cc'd on discussions here as the author of this
> > fix in the conversation.
> >
> > On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
> > >
> > > For this workload, do you enable mTHPs on your system? My plan is to make a
> > > similar patch for
> > >
> > > the mTHP case and I'd be grateful if you can get me some results : )
> > I'd urge caution here.
> >
> > The reason there was a big perf improvement is that, for certain workloads, the
> > original patch by Rik caused issues with VMA fragmentation. So rather than
> > getting adjacent VMAs that might later be khugepage'd, you'd get a bunch of VMAs
> > that were auto-aligned and thus fragmented from one another.
>
> How does getting two different adjacent VMAs allow them to be khugepage'd if
> both are less than PMD size? khugepaged operates per vma, I'm missing something.

(future) VMA merge

Consider allocations that are >PMD but < 2*PMD for instance. Now you get
fragmentation. For some workloads you would have previously eventually got PMD
leaf mapping, PMD leaf mapping, PMD leaf mapping, etc. contiguouosly, with this
arragenement you get PMD mapping, <bunch of PTE mappings>, PMD mapping, etc.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01  5:28             ` Lorenzo Stoakes
@ 2025-07-01  5:45               ` Dev Jain
  2025-07-01  5:53                 ` Lorenzo Stoakes
  0 siblings, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01  5:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka


On 01/07/25 10:58 am, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 10:53:09AM +0530, Dev Jain wrote:
>> On 30/06/25 4:24 pm, Lorenzo Stoakes wrote:
>>> +cc Vlastimil, please keep him cc'd on discussions here as the author of this
>>> fix in the conversation.
>>>
>>> On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>>>> For this workload, do you enable mTHPs on your system? My plan is to make a
>>>> similar patch for
>>>>
>>>> the mTHP case and I'd be grateful if you can get me some results : )
>>> I'd urge caution here.
>>>
>>> The reason there was a big perf improvement is that, for certain workloads, the
>>> original patch by Rik caused issues with VMA fragmentation. So rather than
>>> getting adjacent VMAs that might later be khugepage'd, you'd get a bunch of VMAs
>>> that were auto-aligned and thus fragmented from one another.
>> How does getting two different adjacent VMAs allow them to be khugepage'd if
>> both are less than PMD size? khugepaged operates per vma, I'm missing something.
> (future) VMA merge
>
> Consider allocations that are >PMD but < 2*PMD for instance. Now you get
> fragmentation. For some workloads you would have previously eventually got PMD
> leaf mapping, PMD leaf mapping, PMD leaf mapping, etc. contiguouosly, with this
> arragenement you get PMD mapping, <bunch of PTE mappings>, PMD mapping, etc.

Sorry I am not following, don't know in detail about the VMA merge stuff.
Are you saying the after the patch, the VMAs will eventually get merged?
Is it possible in the kernel to get a merge in the "future"; as I understand
it only happens at mmap() time?

Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
If they are able to get merged after the patch, why won't they be merged before the patch,
since the VMA characteristics are the same?



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01  5:45               ` Dev Jain
@ 2025-07-01  5:53                 ` Lorenzo Stoakes
  2025-07-01  6:30                   ` Dev Jain
  0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01  5:53 UTC (permalink / raw)
  To: Dev Jain; +Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka

On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
> Sorry I am not following, don't know in detail about the VMA merge stuff.
> Are you saying the after the patch, the VMAs will eventually get merged?
> Is it possible in the kernel to get a merge in the "future"; as I understand
> it only happens at mmap() time?
>
> Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
> If they are able to get merged after the patch, why won't they be merged before the patch,
> since the VMA characteristics are the same?
>
>

Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:


  0            2MB                      4MB           6MB                      8MB          10MB
  |-------------.------|                 |-------------.------|                 |-------------.------|
  |             .      |		 |             .      |                 |             .      |
  |             .      |		 |             .      |                 |             .      |
  |-------------.------|		 |-------------.------|                 |-------------.------|
    huge mapped  4k m'd


If you don't force alignment then subsequent mappings will be adjacent to one
another and those non-huge page parts can be merged.

Vlasta's fix up means we only try to get the THP up-front if the length is
already aligned at which point you won't end up with these gaps.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01  5:53                 ` Lorenzo Stoakes
@ 2025-07-01  6:30                   ` Dev Jain
  2025-07-01  6:50                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01  6:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka


On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>> Sorry I am not following, don't know in detail about the VMA merge stuff.
>> Are you saying the after the patch, the VMAs will eventually get merged?
>> Is it possible in the kernel to get a merge in the "future"; as I understand
>> it only happens at mmap() time?
>>
>> Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
>> If they are able to get merged after the patch, why won't they be merged before the patch,
>> since the VMA characteristics are the same?
>>
>>
> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>
>
>    0            2MB                      4MB           6MB                      8MB          10MB
>    |-------------.------|                 |-------------.------|                 |-------------.------|
>    |             .      |		 |             .      |                 |             .      |
>    |             .      |		 |             .      |                 |             .      |
>    |-------------.------|		 |-------------.------|                 |-------------.------|
>      huge mapped  4k m'd

The effort to draw this is appreciated!

I understood the alignment, what I am asking is this:

In __get_unmapped_area(), we will return a THP-aligned addr from
thp_get_unmapped_area_vmflags(). Now for the diagram you have
drawn, suppose that before the patch, we first mmap() the
8MB-start chunk. Then we mmap the 4MB start chunk.
We go to __mmap_region(), and we see that the 8MB-start chunk
has mergeable characteristics, so we merge. So the gap goes away?

>
> If you don't force alignment then subsequent mappings will be adjacent to one
> another and those non-huge page parts can be merged.
>
> Vlasta's fix up means we only try to get the THP up-front if the length is
> already aligned at which point you won't end up with these gaps.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01  6:30                   ` Dev Jain
@ 2025-07-01  6:50                     ` Lorenzo Stoakes
  2025-07-01  6:58                       ` Dev Jain
  0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01  6:50 UTC (permalink / raw)
  To: Dev Jain; +Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka

On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>
> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
> > On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
> > > Sorry I am not following, don't know in detail about the VMA merge stuff.
> > > Are you saying the after the patch, the VMAs will eventually get merged?
> > > Is it possible in the kernel to get a merge in the "future"; as I understand
> > > it only happens at mmap() time?
> > >
> > > Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
> > > If they are able to get merged after the patch, why won't they be merged before the patch,
> > > since the VMA characteristics are the same?
> > >
> > >
> > Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
> >
> >
> >    0            2MB                 4MB           6MB                      8MB          10MB
> >    |-------------.------|            |-------------.------|                 |-------------.------|
> >    |             .      |		 |             .      |                 |             .      |
> >    |             .      |		 |             .      |                 |             .      |
> >    |-------------.------|		 |-------------.------|                 |-------------.------|
> >      huge mapped  4k m'd
>
> The effort to draw this is appreciated!
>
> I understood the alignment, what I am asking is this:
>
> In __get_unmapped_area(), we will return a THP-aligned addr from
> thp_get_unmapped_area_vmflags(). Now for the diagram you have
> drawn, suppose that before the patch, we first mmap() the
> 8MB-start chunk. Then we mmap the 4MB start chunk.
> We go to __mmap_region(), and we see that the 8MB-start chunk
> has mergeable characteristics, so we merge. So the gap goes away?

No because there's a gap, we only merge immedaitely adjacent VMAs. And obviously
gaps mean page tables wouldn't be adjacent either...

The get_unmmaped_area() would have otherwise given adjacent mappings. Vlasta's
patch means in this case we no longer bother trying to align these because their
_length_ isn't PMD aligned.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01  6:50                     ` Lorenzo Stoakes
@ 2025-07-01  6:58                       ` Dev Jain
  2025-07-01 12:15                         ` siddhartha
  0 siblings, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01  6:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka


On 01/07/25 12:20 pm, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
>>> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>>>> Sorry I am not following, don't know in detail about the VMA merge stuff.
>>>> Are you saying the after the patch, the VMAs will eventually get merged?
>>>> Is it possible in the kernel to get a merge in the "future"; as I understand
>>>> it only happens at mmap() time?
>>>>
>>>> Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
>>>> If they are able to get merged after the patch, why won't they be merged before the patch,
>>>> since the VMA characteristics are the same?
>>>>
>>>>
>>> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>>>
>>>
>>>     0            2MB                 4MB           6MB                      8MB          10MB
>>>     |-------------.------|            |-------------.------|                 |-------------.------|
>>>     |             .      |		 |             .      |                 |             .      |
>>>     |             .      |		 |             .      |                 |             .      |
>>>     |-------------.------|		 |-------------.------|                 |-------------.------|
>>>       huge mapped  4k m'd
>> The effort to draw this is appreciated!
>>
>> I understood the alignment, what I am asking is this:
>>
>> In __get_unmapped_area(), we will return a THP-aligned addr from
>> thp_get_unmapped_area_vmflags(). Now for the diagram you have
>> drawn, suppose that before the patch, we first mmap() the
>> 8MB-start chunk. Then we mmap the 4MB start chunk.
>> We go to __mmap_region(), and we see that the 8MB-start chunk
>> has mergeable characteristics, so we merge. So the gap goes away?
> No because there's a gap, we only merge immedaitely adjacent VMAs. And obviously
> gaps mean page tables wouldn't be adjacent either...

Ah shoot. That is prev->vm_end == vmg->start in can_vma_merge_left(). Thanks.

>
> The get_unmmaped_area() would have otherwise given adjacent mappings. Vlasta's
> patch means in this case we no longer bother trying to align these because their
> _length_ isn't PMD aligned.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01  6:58                       ` Dev Jain
@ 2025-07-01 12:15                         ` siddhartha
  2025-07-01 12:39                           ` Lorenzo Stoakes
  2025-07-01 15:40                           ` Yang Shi
  0 siblings, 2 replies; 28+ messages in thread
From: siddhartha @ 2025-07-01 12:15 UTC (permalink / raw)
  To: Dev Jain; +Cc: Lorenzo Stoakes, linux-mm, linux-kernel, mgorman

On 2025-07-01 12:28, Dev Jain wrote:
> On 01/07/25 12:20 pm, Lorenzo Stoakes wrote:
>> On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>>> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
>>>> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>>>>> Sorry I am not following, don't know in detail about the VMA merge 
>>>>> stuff.
>>>>> Are you saying the after the patch, the VMAs will eventually get 
>>>>> merged?
>>>>> Is it possible in the kernel to get a merge in the "future"; as I 
>>>>> understand
>>>>> it only happens at mmap() time?
>>>>> 
>>>>> Suppose before the patch, you have two consecutive VMAs between 
>>>>> (PMD, 2*PMD) size.
>>>>> If they are able to get merged after the patch, why won't they be 
>>>>> merged before the patch,
>>>>> since the VMA characteristics are the same?
>>>>> 
>>>>> 
>>>> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>>>> 
>>>> 
>>>>     0            2MB                 4MB           6MB               
>>>>        8MB          10MB
>>>>     |-------------.------|            |-------------.------|         
>>>>         |-------------.------|
>>>>     |             .      |		 |             .      |                 
>>>> |             .      |
>>>>     |             .      |		 |             .      |                 
>>>> |             .      |
>>>>     |-------------.------|		 |-------------.------|                 
>>>> |-------------.------|
>>>>       huge mapped  4k m'd
>>> The effort to draw this is appreciated!
>>> 
>>> I understood the alignment, what I am asking is this:
>>> 
>>> In __get_unmapped_area(), we will return a THP-aligned addr from
>>> thp_get_unmapped_area_vmflags(). Now for the diagram you have
>>> drawn, suppose that before the patch, we first mmap() the
>>> 8MB-start chunk. Then we mmap the 4MB start chunk.
>>> We go to __mmap_region(), and we see that the 8MB-start chunk
>>> has mergeable characteristics, so we merge. So the gap goes away?
>> No because there's a gap, we only merge immedaitely adjacent VMAs. And 
>> obviously
>> gaps mean page tables wouldn't be adjacent either...
> 
> Ah shoot. That is prev->vm_end == vmg->start in can_vma_merge_left(). 
> Thanks.
> 
>> 
>> The get_unmmaped_area() would have otherwise given adjacent mappings. 
>> Vlasta's
>> patch means in this case we no longer bother trying to align these 
>> because their
>> _length_ isn't PMD aligned.

Hi Lorenzo, Dev, all

Thank you for raising excellent points — I’ll respond to each in order 
to clarify the mechanics and relevance of this behavior in the context 
of AI inference workloads.

🧩 1. Does the patch cause VMAs to be merged eventually?
You're correct: VMA merging only happens at mmap() time (via 
__mmap_region()). What the patch affects is the behavior of 
thp_get_unmapped_area_vmflags() before the mmap is placed.

Before the patch (with Rik’s logic):

Every mmap() returned an address rounded up to the next 2MB boundary — 
regardless of whether the requested size was 2MB-aligned.

Result: even consecutive mmap()s (e.g., 1.5MB + 1.5MB) are now 
non-adjacent, so merging is impossible, even if their VMA flags match.

After this patch:

If the allocation is not PMD-aligned in size, the returned address is 
not forcibly aligned, increasing the likelihood that the next mmap() 
lands directly after the previous one → enabling merging.

So, to be clear: this patch doesn’t cause merging, but it prevents 
unnecessary pre-mmap gaps, which previously blocked merges from ever 
happening exactly like a deadlock which has been cleared now.

📐 2. Why aren’t the VMAs mergeable before the patch?
Great question. Even if the VMA flags are identical, gaps introduced by 
forced alignment from get_unmapped_area() break the precondition for 
merging:

can_vma_merge_left()
  → return prev->vm_end == vma->vm_start

With Rik’s patch in place:

Suppose you mmap() 1.5MB → gets aligned to 2MB

Next 1.5MB → gets aligned to 4MB
→ The kernel sees: prev->vm_end = 3.5MB, vma->vm_start = 4MB
→ No merge

With this patch, non-aligned lengths don’t get forcibly aligned, so 
consecutive mmap()s often fall exactly after the previous, and merging 
becomes possible again.

🤖 3. How does this impact AI workloads like Hugging Face Transformers?
Tokenization and dynamic batching create non-deterministic memory 
allocation patterns:

Models like BERT and T5 dynamically allocate intermediate buffers per 
token-length, batch size, and attention window.

Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, 
often 512KB–1.8MB.

These allocations come in bursts — but due to forced alignment, the 
kernel was placing them with artificial gaps, defeating THP eligibility 
entirely.

By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. 
The result is that:

a. VMAs remain adjacent → mergeable

b. Physical memory is contiguous → eligible for khugepaged collapse

c. THP utilization increases → fewer TLB misses → lower latency → higher 
throughput

💡 4. Why this patch complements Rik’s rather than contradicts it:

Rik's patch made it easier to guarantee alignment for workloads that 
benefit from explicit huge pages — but at the cost of breaking 
coalescence in workloads with non-PMD-sized mappings, like ML inference.

This patch simply refines that logic:

If the length is PMD-aligned → keep alignment

If it’s not → don’t inject alignment gaps that block merging

So, for workloads that can’t benefit from THP due to misalignment, this 
patch removes artificial fragmentation without harming the original 
intent.

⚙️ 5. mTHP note
Although this patch doesn’t target mTHP directly, I believe a similar 
logic tweak could apply there too — especially with shmem-backed 
workloads (common in model servers using shared tensor memory). I’d be 
happy to help test any changes proposed there to derive the consequent 
results.

Thanks again for the detailed discussion. Let me know if you’d like a 
trace or VMA map from a Hugging Face benchmarked run (happy to generate 
one locally).

Best Regards,
Siddhartha Sharma
+91 9015185601

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01 12:15                         ` siddhartha
@ 2025-07-01 12:39                           ` Lorenzo Stoakes
  2025-07-01 13:23                             ` siddhartha
  2025-07-01 16:20                             ` Dev Jain
  2025-07-01 15:40                           ` Yang Shi
  1 sibling, 2 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 12:39 UTC (permalink / raw)
  To: siddhartha; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman

On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@kenip.in wrote:
> 🧩 1. Does the patch cause VMAs to be merged eventually?
> You're correct: VMA merging only happens at mmap() time (via
> __mmap_region()). What the patch affects is the behavior of
> thp_get_unmapped_area_vmflags() before the mmap is placed.

[...]

>
> 📐 2. Why aren’t the VMAs mergeable before the patch?
> Great question. Even if the VMA flags are identical, gaps introduced by
> forced alignment from get_unmapped_area() break the precondition for
> merging:

[...]

> 💡 4. Why this patch complements Rik’s rather than contradicts it:

I'm really perplexed as to why you felt the need to (seemingly via LLM)
reply with the explanation I've already provided here?...

There's errors in things you say here too.

With respect, please don't do this.

(I'm the co-maintainer of pretty much all the relevant code here and wrote
the VMA merge logic you're referring to.)

>
> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
> Tokenization and dynamic batching create non-deterministic memory allocation
> patterns:
>
> Models like BERT and T5 dynamically allocate intermediate buffers per
> token-length, batch size, and attention window.
>
> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, often
> 512KB–1.8MB.
>
> These allocations come in bursts — but due to forced alignment, the kernel
> was placing them with artificial gaps, defeating THP eligibility entirely.
>
> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. The
> result is that:
>
> a. VMAs remain adjacent → mergeable
>
> b. Physical memory is contiguous → eligible for khugepaged collapse
>
> c. THP utilization increases → fewer TLB misses → lower latency → higher
> throughput
>

This is very useful information and it's appreciated! Let's not drown this
out with restatements of stuff already covered.

>
> ⚙️ 5. mTHP note
> Although this patch doesn’t target mTHP directly, I believe a similar logic
> tweak could apply there too — especially with shmem-backed workloads (common
> in model servers using shared tensor memory). I’d be happy to help test any
> changes proposed there to derive the consequent results.

Dev - could we hold off on any effort to do something like this until I've
had a chance to refactor THP somewhat? This is already a mess and I'd like
to avoid us piling on more complexity.

We can revisit this at a later stage.

>
> Thanks again for the detailed discussion. Let me know if you’d like a trace
> or VMA map from a Hugging Face benchmarked run (happy to generate one
> locally).
>

Thanks! Much appreciated.

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01 12:39                           ` Lorenzo Stoakes
@ 2025-07-01 13:23                             ` siddhartha
  2025-07-01 13:28                               ` Lorenzo Stoakes
  2025-07-01 16:20                             ` Dev Jain
  1 sibling, 1 reply; 28+ messages in thread
From: siddhartha @ 2025-07-01 13:23 UTC (permalink / raw)
  To: Lorenzo Stoakes; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman

On 2025-07-01 18:09, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@kenip.in wrote:
>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>> You're correct: VMA merging only happens at mmap() time (via
>> __mmap_region()). What the patch affects is the behavior of
>> thp_get_unmapped_area_vmflags() before the mmap is placed.
> 
> [...]
> 
>> 
>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>> Great question. Even if the VMA flags are identical, gaps introduced 
>> by
>> forced alignment from get_unmapped_area() break the precondition for
>> merging:
> 
> [...]
> 
>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
> 
> I'm really perplexed as to why you felt the need to (seemingly via LLM)
> reply with the explanation I've already provided here?...
> 
> There's errors in things you say here too.
> 
> With respect, please don't do this.
> 
> (I'm the co-maintainer of pretty much all the relevant code here and 
> wrote
> the VMA merge logic you're referring to.)
> 
>> 
>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>> Tokenization and dynamic batching create non-deterministic memory 
>> allocation
>> patterns:
>> 
>> Models like BERT and T5 dynamically allocate intermediate buffers per
>> token-length, batch size, and attention window.
>> 
>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, 
>> often
>> 512KB–1.8MB.
>> 
>> These allocations come in bursts — but due to forced alignment, the 
>> kernel
>> was placing them with artificial gaps, defeating THP eligibility 
>> entirely.
>> 
>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. 
>> The
>> result is that:
>> 
>> a. VMAs remain adjacent → mergeable
>> 
>> b. Physical memory is contiguous → eligible for khugepaged collapse
>> 
>> c. THP utilization increases → fewer TLB misses → lower latency → 
>> higher
>> throughput
>> 
> 
> This is very useful information and it's appreciated! Let's not drown 
> this
> out with restatements of stuff already covered.
> 
>> 
>> ⚙️ 5. mTHP note
>> Although this patch doesn’t target mTHP directly, I believe a similar 
>> logic
>> tweak could apply there too — especially with shmem-backed workloads 
>> (common
>> in model servers using shared tensor memory). I’d be happy to help 
>> test any
>> changes proposed there to derive the consequent results.
> 
> Dev - could we hold off on any effort to do something like this until 
> I've
> had a chance to refactor THP somewhat? This is already a mess and I'd 
> like
> to avoid us piling on more complexity.
> 
> We can revisit this at a later stage.
> 
>> 
>> Thanks again for the detailed discussion. Let me know if you’d like a 
>> trace
>> or VMA map from a Hugging Face benchmarked run (happy to generate one
>> locally).
>> 
> 
> Thanks! Much appreciated.
> 
> Cheers, Lorenzo

Hi Lorenzo,

Thanks for your clarification, and I appreciate your patience — 
especially given your role in maintaining and designing the VMA merge 
logic.

I understand now that my earlier phrasing may have repeated your 
explanation for VMA adjacency, and I regret unintentionally restating 
it.

I’ll make sure to be more careful and direct going forward.

As for the THP alignment condition now being `IS_ALIGNED(len, 
PMD_SIZE)`, I agree this resolves the regressions by removing alignment 
for non-aligned sizes, which was exactly what broke workloads like 
cactusBSSN or some AI inference loads.

Thanks again for the guidance — I’m learning a lot from this thread.

Best Regards,
Siddhartha Sharma



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01 13:23                             ` siddhartha
@ 2025-07-01 13:28                               ` Lorenzo Stoakes
  2025-07-01 14:20                                 ` siddhartha
  0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 13:28 UTC (permalink / raw)
  To: siddhartha; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman

On Tue, Jul 01, 2025 at 06:53:47PM +0530, siddhartha@kenip.in wrote:
> Hi Lorenzo,
>
> Thanks for your clarification, and I appreciate your patience — especially
> given your role in maintaining and designing the VMA merge logic.
>
> I understand now that my earlier phrasing may have repeated your explanation
> for VMA adjacency, and I regret unintentionally restating it.
>
> I’ll make sure to be more careful and direct going forward.

Thanks, no problem. Mostly avoids confusion.

>
> As for the THP alignment condition now being `IS_ALIGNED(len, PMD_SIZE)`, I
> agree this resolves the regressions by removing alignment for non-aligned
> sizes, which was exactly what broke workloads like cactusBSSN or some AI
> inference loads.

Ack - we're really happy to hear about workloads that this has helped as this
kind of input is very important as to getting insight into how THP-related stuff
impacts real users so we can best optimise especially for workloads that are
very important in the industry right now.

>
> Thanks again for the guidance — I’m learning a lot from this thread.

Glad to have helped, thanks again for reporting!

>
> Best Regards,
> Siddhartha Sharma
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01 13:28                               ` Lorenzo Stoakes
@ 2025-07-01 14:20                                 ` siddhartha
  0 siblings, 0 replies; 28+ messages in thread
From: siddhartha @ 2025-07-01 14:20 UTC (permalink / raw)
  To: Lorenzo Stoakes; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman

On 2025-07-01 18:58, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 06:53:47PM +0530, siddhartha@kenip.in wrote:
>> Hi Lorenzo,
>> 
>> Thanks for your clarification, and I appreciate your patience — 
>> especially
>> given your role in maintaining and designing the VMA merge logic.
>> 
>> I understand now that my earlier phrasing may have repeated your 
>> explanation
>> for VMA adjacency, and I regret unintentionally restating it.
>> 
>> I’ll make sure to be more careful and direct going forward.
> 
> Thanks, no problem. Mostly avoids confusion.
> 
>> 
>> As for the THP alignment condition now being `IS_ALIGNED(len, 
>> PMD_SIZE)`, I
>> agree this resolves the regressions by removing alignment for 
>> non-aligned
>> sizes, which was exactly what broke workloads like cactusBSSN or some 
>> AI
>> inference loads.
> 
> Ack - we're really happy to hear about workloads that this has helped 
> as this
> kind of input is very important as to getting insight into how 
> THP-related stuff
> impacts real users so we can best optimise especially for workloads 
> that are
> very important in the industry right now.
> 
>> 
>> Thanks again for the guidance — I’m learning a lot from this thread.
> 
> Glad to have helped, thanks again for reporting!
> 
>> 
>> Best Regards,
>> Siddhartha Sharma
>> 
> 
> Cheers, Lorenzo

Hi Lorenzo,

Thanks for the acknowledgement of my work, I really appreciate it. 
Please let me know if there is anything I can do here now moving 
forwards with integrating. Furthermore, once integrated and tested, I 
would like to see all performance metrics that have seen improvements if 
possible.

Best Regards,
Siddhartha Sharma


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01 12:39                           ` Lorenzo Stoakes
  2025-07-01 13:23                             ` siddhartha
@ 2025-07-01 16:20                             ` Dev Jain
  2025-07-01 18:49                               ` Zi Yan
  1 sibling, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01 16:20 UTC (permalink / raw)
  To: Lorenzo Stoakes, siddhartha; +Cc: linux-mm, linux-kernel, mgorman


On 01/07/25 6:09 pm, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@kenip.in wrote:
>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>> You're correct: VMA merging only happens at mmap() time (via
>> __mmap_region()). What the patch affects is the behavior of
>> thp_get_unmapped_area_vmflags() before the mmap is placed.
> [...]
>
>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>> Great question. Even if the VMA flags are identical, gaps introduced by
>> forced alignment from get_unmapped_area() break the precondition for
>> merging:
> [...]
>
>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
> I'm really perplexed as to why you felt the need to (seemingly via LLM)
> reply with the explanation I've already provided here?...
>
> There's errors in things you say here too.
>
> With respect, please don't do this.
>
> (I'm the co-maintainer of pretty much all the relevant code here and wrote
> the VMA merge logic you're referring to.)
>
>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>> Tokenization and dynamic batching create non-deterministic memory allocation
>> patterns:
>>
>> Models like BERT and T5 dynamically allocate intermediate buffers per
>> token-length, batch size, and attention window.
>>
>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, often
>> 512KB–1.8MB.
>>
>> These allocations come in bursts — but due to forced alignment, the kernel
>> was placing them with artificial gaps, defeating THP eligibility entirely.
>>
>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. The
>> result is that:
>>
>> a. VMAs remain adjacent → mergeable
>>
>> b. Physical memory is contiguous → eligible for khugepaged collapse
>>
>> c. THP utilization increases → fewer TLB misses → lower latency → higher
>> throughput
>>
> This is very useful information and it's appreciated! Let's not drown this
> out with restatements of stuff already covered.
>
>> ⚙️ 5. mTHP note
>> Although this patch doesn’t target mTHP directly, I believe a similar logic
>> tweak could apply there too — especially with shmem-backed workloads (common
>> in model servers using shared tensor memory). I’d be happy to help test any
>> changes proposed there to derive the consequent results.
> Dev - could we hold off on any effort to do something like this until I've
> had a chance to refactor THP somewhat? This is already a mess and I'd like
> to avoid us piling on more complexity.
>
> We can revisit this at a later stage.

Yes of course. I had run a small benchmark on a quick dumb patch I wrote and I
don't see any measurable perf improvement, probably because the highest THP order
getting chosen is always PMD size.

Out of curiosity, where do you plan to do the refactoring?

>
>> Thanks again for the detailed discussion. Let me know if you’d like a trace
>> or VMA map from a Hugging Face benchmarked run (happy to generate one
>> locally).
>>
> Thanks! Much appreciated.
>
> Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01 16:20                             ` Dev Jain
@ 2025-07-01 18:49                               ` Zi Yan
  2025-07-07  8:56                                 ` Vlastimil Babka
  0 siblings, 1 reply; 28+ messages in thread
From: Zi Yan @ 2025-07-01 18:49 UTC (permalink / raw)
  To: Dev Jain; +Cc: Lorenzo Stoakes, siddhartha, linux-mm, linux-kernel, mgorman

On 1 Jul 2025, at 12:20, Dev Jain wrote:

> On 01/07/25 6:09 pm, Lorenzo Stoakes wrote:
>> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@kenip.in wrote:
>>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>>> You're correct: VMA merging only happens at mmap() time (via
>>> __mmap_region()). What the patch affects is the behavior of
>>> thp_get_unmapped_area_vmflags() before the mmap is placed.
>> [...]
>>
>>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>>> Great question. Even if the VMA flags are identical, gaps introduced by
>>> forced alignment from get_unmapped_area() break the precondition for
>>> merging:
>> [...]
>>
>>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
>> I'm really perplexed as to why you felt the need to (seemingly via LLM)
>> reply with the explanation I've already provided here?...
>>
>> There's errors in things you say here too.
>>
>> With respect, please don't do this.
>>
>> (I'm the co-maintainer of pretty much all the relevant code here and wrote
>> the VMA merge logic you're referring to.)
>>
>>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>>> Tokenization and dynamic batching create non-deterministic memory allocation
>>> patterns:
>>>
>>> Models like BERT and T5 dynamically allocate intermediate buffers per
>>> token-length, batch size, and attention window.
>>>
>>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, often
>>> 512KB–1.8MB.
>>>
>>> These allocations come in bursts — but due to forced alignment, the kernel
>>> was placing them with artificial gaps, defeating THP eligibility entirely.
>>>
>>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. The
>>> result is that:
>>>
>>> a. VMAs remain adjacent → mergeable
>>>
>>> b. Physical memory is contiguous → eligible for khugepaged collapse
>>>
>>> c. THP utilization increases → fewer TLB misses → lower latency → higher
>>> throughput
>>>
>> This is very useful information and it's appreciated! Let's not drown this
>> out with restatements of stuff already covered.
>>
>>> ⚙️ 5. mTHP note
>>> Although this patch doesn’t target mTHP directly, I believe a similar logic
>>> tweak could apply there too — especially with shmem-backed workloads (common
>>> in model servers using shared tensor memory). I’d be happy to help test any
>>> changes proposed there to derive the consequent results.
>> Dev - could we hold off on any effort to do something like this until I've
>> had a chance to refactor THP somewhat? This is already a mess and I'd like
>> to avoid us piling on more complexity.
>>
>> We can revisit this at a later stage.
>
> Yes of course. I had run a small benchmark on a quick dumb patch I wrote and I
> don't see any measurable perf improvement, probably because the highest THP order
> getting chosen is always PMD size.

I think mTHP is much more complicated, since mTHP has many sizes.
Trying to adjust VMA alignments to get mTHP might not work well, since
you never know what sizes new VMAs are going to have.

IMHO, it might be better to align VMA to PMD or the largest mTHP size
(for example, on ARM64 with 64KB base page, PMD THP is 512MB, a 2MB
mTHP sounds more reasonable there) if possible and enable
VMA merging as much as possible for future huge page collapse.
mTHP can be used to fill the non faulted holes in VMAs if necessary.

>
> Out of curiosity, where do you plan to do the refactoring?


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01 18:49                               ` Zi Yan
@ 2025-07-07  8:56                                 ` Vlastimil Babka
  2025-07-28  5:41                                   ` siddhartha
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2025-07-07  8:56 UTC (permalink / raw)
  To: Zi Yan, Dev Jain
  Cc: Lorenzo Stoakes, siddhartha, linux-mm, linux-kernel, mgorman,
	Rik van Riel, Doug Smythies

On 7/1/25 20:49, Zi Yan wrote:
>>> This is very useful information and it's appreciated! Let's not drown this
>>> out with restatements of stuff already covered.
>>>
>>>> ⚙️ 5. mTHP note
>>>> Although this patch doesn’t target mTHP directly, I believe a similar logic
>>>> tweak could apply there too — especially with shmem-backed workloads (common
>>>> in model servers using shared tensor memory). I’d be happy to help test any
>>>> changes proposed there to derive the consequent results.
>>> Dev - could we hold off on any effort to do something like this until I've
>>> had a chance to refactor THP somewhat? This is already a mess and I'd like
>>> to avoid us piling on more complexity.
>>>
>>> We can revisit this at a later stage.
>>
>> Yes of course. I had run a small benchmark on a quick dumb patch I wrote and I
>> don't see any measurable perf improvement, probably because the highest THP order
>> getting chosen is always PMD size.
> 
> I think mTHP is much more complicated, since mTHP has many sizes.
> Trying to adjust VMA alignments to get mTHP might not work well, since
> you never know what sizes new VMAs are going to have.

Yes I agree it's more complicated. In case there would be a stream of
allocations of varying small-ish sizes, aligning each of them to its
smallest applicable mTHP could create gaps that wouldn't exist if we ignored
the alignment and just find any free area and in the end merge it to an
existing one. Basically we'd risk recreating the issue with gaps.

Sticking to one size (2MB) mitigates this to some extent. Unfortunately even
after my fix the heuristics might be prone to gaps:

- all allocations not multiple of 2MB - will merge freely

- all allocations multiple of 2MB - the alignment heuristic will kick in,
but as a result allocations should still merge as all boundaries are 2MB
alignned

- allocations alternate between multiple of 2MB and non-multiple of 2MB -
this will still create gaps

Note we already had a report about ebizzy regressing due to my commit [1]
and I suspect it might be due to this kind of scenario. A proper
investigation would be useful but I didn't get to it.

Maybe the solution is to first check if unaligned search gives us a range
that will merge with adjacent area, and only try the alignment heuristics if
it doesn't. This will still fail if mmap() is followed by e.g. mprotect() or
madvise() that will change an initially un-mergeable area to a mergeable
one. I have no ideas around that though. Just some thoughts to consider for
anyone wanting to change things here further :)

[1] https://lore.kernel.org/all/019401db769f%24961e7e20%24c25b7a60%24@telus.net/

> IMHO, it might be better to align VMA to PMD or the largest mTHP size
> (for example, on ARM64 with 64KB base page, PMD THP is 512MB, a 2MB
> mTHP sounds more reasonable there) if possible and enable
> VMA merging as much as possible for future huge page collapse.
> mTHP can be used to fill the non faulted holes in VMAs if necessary.
> 
>>
>> Out of curiosity, where do you plan to do the refactoring?
> 
> 
> Best Regards,
> Yan, Zi
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-07  8:56                                 ` Vlastimil Babka
@ 2025-07-28  5:41                                   ` siddhartha
  2025-07-28 11:00                                     ` Vlastimil Babka
  0 siblings, 1 reply; 28+ messages in thread
From: siddhartha @ 2025-07-28  5:41 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Zi Yan, linux-mm, Lorenzo Stoakes

On 2025-07-07 14:26, Vlastimil Babka wrote:
> On 7/1/25 20:49, Zi Yan wrote:
>>>> This is very useful information and it's appreciated! Let's not 
>>>> drown this
>>>> out with restatements of stuff already covered.
>>>> 
>>>>> ⚙️ 5. mTHP note
>>>>> Although this patch doesn’t target mTHP directly, I believe a 
>>>>> similar logic
>>>>> tweak could apply there too — especially with shmem-backed 
>>>>> workloads (common
>>>>> in model servers using shared tensor memory). I’d be happy to help 
>>>>> test any
>>>>> changes proposed there to derive the consequent results.
>>>> Dev - could we hold off on any effort to do something like this 
>>>> until I've
>>>> had a chance to refactor THP somewhat? This is already a mess and 
>>>> I'd like
>>>> to avoid us piling on more complexity.
>>>> 
>>>> We can revisit this at a later stage.
>>> 
>>> Yes of course. I had run a small benchmark on a quick dumb patch I 
>>> wrote and I
>>> don't see any measurable perf improvement, probably because the 
>>> highest THP order
>>> getting chosen is always PMD size.
>> 
>> I think mTHP is much more complicated, since mTHP has many sizes.
>> Trying to adjust VMA alignments to get mTHP might not work well, since
>> you never know what sizes new VMAs are going to have.
> 
> Yes I agree it's more complicated. In case there would be a stream of
> allocations of varying small-ish sizes, aligning each of them to its
> smallest applicable mTHP could create gaps that wouldn't exist if we 
> ignored
> the alignment and just find any free area and in the end merge it to an
> existing one. Basically we'd risk recreating the issue with gaps.
> 
> Sticking to one size (2MB) mitigates this to some extent. Unfortunately 
> even
> after my fix the heuristics might be prone to gaps:
> 
> - all allocations not multiple of 2MB - will merge freely
> 
> - all allocations multiple of 2MB - the alignment heuristic will kick 
> in,
> but as a result allocations should still merge as all boundaries are 
> 2MB
> alignned
> 
> - allocations alternate between multiple of 2MB and non-multiple of 2MB 
> -
> this will still create gaps
> 
> Note we already had a report about ebizzy regressing due to my commit 
> [1]
> and I suspect it might be due to this kind of scenario. A proper
> investigation would be useful but I didn't get to it.
> 
> Maybe the solution is to first check if unaligned search gives us a 
> range
> that will merge with adjacent area, and only try the alignment 
> heuristics if
> it doesn't. This will still fail if mmap() is followed by e.g. 
> mprotect() or
> madvise() that will change an initially un-mergeable area to a 
> mergeable
> one. I have no ideas around that though. Just some thoughts to consider 
> for
> anyone wanting to change things here further :)
> 
> [1] 
> https://lore.kernel.org/all/019401db769f%24961e7e20%24c25b7a60%24@telus.net/
> 
>> IMHO, it might be better to align VMA to PMD or the largest mTHP size
>> (for example, on ARM64 with 64KB base page, PMD THP is 512MB, a 2MB
>> mTHP sounds more reasonable there) if possible and enable
>> VMA merging as much as possible for future huge page collapse.
>> mTHP can be used to fill the non faulted holes in VMAs if necessary.
>> 
>>> 
>>> Out of curiosity, where do you plan to do the refactoring?
>> 
>> 
>> Best Regards,
>> Yan, Zi
>> 
Hi Lorenzo, Dev, Mel,

I'm following up on this patch submission from earlier this month:
"[PATCH] mm: limit THP alignment – performance gain observed in AI 
inference workloads."

The change limits THP alignment to PMD-sized mappings, avoiding 
unnecessary hugepage over-allocations in scenarios where 2MB alignment 
is not beneficial. We’ve observed consistent performance improvements in 
inference pipelines (specifically with OpenVINO) where the workload 
profile includes a mix of small and large allocations.

Please let me know if:
- There has been any progress or feedback from your end,
- The patch needs to align with ongoing THP refactoring efforts,
- Additional benchmarks, test traces, or system-level profiles would 
help.

Happy to revise or refine the patch based on further discussion. Thanks 
again for your time and input!

For your information, I have also posted the same at Openvino and 
Huggingface forums and currently waiting for review for the commit on 
the Openvino github repository.

Best regards,
Siddhartha Sharma


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-28  5:41                                   ` siddhartha
@ 2025-07-28 11:00                                     ` Vlastimil Babka
  0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2025-07-28 11:00 UTC (permalink / raw)
  To: Dev Jain, Lorenzo Stoakes, siddhartha; +Cc: Zi Yan, linux-mm@kvack.org, LKML

On 7/28/25 07:41, siddhartha@kenip.in wrote:
> On 2025-07-07 14:26, Vlastimil Babka wrote:
> Hi Lorenzo, Dev, Mel,
> 
> I'm following up on this patch submission from earlier this month:
> "[PATCH] mm: limit THP alignment – performance gain observed in AI 
> inference workloads."

I'm confused. That wasn't a patch submission, but reporting performance
results for my patch from late 2024? (and thanks for those!)

The patch was also already merged in late 2024:

commit d4148aeab412432bf928f311eca8a2ba52bb05df
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Oct 24 17:12:29 2024 +0200

    mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes


So there's nothing more to do here AFAIK.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
  2025-07-01 12:15                         ` siddhartha
  2025-07-01 12:39                           ` Lorenzo Stoakes
@ 2025-07-01 15:40                           ` Yang Shi
  1 sibling, 0 replies; 28+ messages in thread
From: Yang Shi @ 2025-07-01 15:40 UTC (permalink / raw)
  To: siddhartha
  Cc: Dev Jain, Lorenzo Stoakes, linux-mm, linux-kernel, mgorman,
	Vlastimil Babka, Rik van Riel

>
> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
> Tokenization and dynamic batching create non-deterministic memory
> allocation patterns:
>
> Models like BERT and T5 dynamically allocate intermediate buffers per
> token-length, batch size, and attention window.
>
> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s,
> often 512KB–1.8MB.

If I remember correctly, Rik's patch should just force PMD alignment
when the allocation size is greater than PMD size. Such VMA
fragmentation should be caused by allocations greater than 2M but not
PMD aligned, so they create 2M PMD + a bunch of 4K PTEs. Less than 2M
allocations should be right next to each other and mergeable. Did I
miss something?

Thanks,
Yang


>
> These allocations come in bursts — but due to forced alignment, the
> kernel was placing them with artificial gaps, defeating THP eligibility
> entirely.
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
@ 2025-08-11 22:14 siddhartha
  0 siblings, 0 replies; 28+ messages in thread
From: siddhartha @ 2025-08-11 22:14 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Dev Jain, Lorenzo Stoakes, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 3149 bytes --]

On 2025-07-28 16:30, Vlastimil Babka wrote:

> On 7/28/25 07:41, siddhartha@kenip.in wrote:
> 
>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>> Hi Lorenzo, Dev, Mel,
>> 
>> I'm following up on this patch submission from earlier this month:
>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>> inference workloads."
> 
> I'm confused. That wasn't a patch submission, but reporting performance
> results for my patch from late 2024? (and thanks for those!)
> 
> The patch was also already merged in late 2024:
> 
> commit d4148aeab412432bf928f311eca8a2ba52bb05df
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date:   Thu Oct 24 17:12:29 2024 +0200
> 
> mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned 
> sizes
> 
> So there's nothing more to do here AFAIK.

> Hello Vlastimil,
> 
> Hope you are doing great!
> 
> Sorry about the late reply, my inbox made your email invisible somehow.
> 
> Thank you for the clarification -- yes, I am aware that the mm, mmap: 
> limit THP alignment of anonymous mappings to PMD-aligned sizes patch 
> was merged in late 2024 (commit 
> d4148aeab412432bf928f311eca8a2ba52bb05df).
> 
> The performance results I shared were generated much later because of 
> my working setup:
> 
> *
> 
> The tests were conducted on Intel Developer Cloud workloads as part of 
> a broader benchmarking exercise involving OpenVINO-based inference 
> pipelines.
> *
> 
> The specific environment, dataset, and configuration scripts were 
> stored on an SSD that unfortunately suffered corruption. I am currently 
> working to recover them so I can share the exact test harness and 
> commit-specific diffs. If and when I get that access back from Intel 
> Developer Cloud, I can surely provide all those relevant files.
> 
> Although this is not a new patch submission, I thought the numbers 
> might still be valuable -- they show notable throughput and latency 
> changes when aligning the current behavior with OpenVINO's large 
> contiguous allocation preferences in certain inference scenarios.
> 
> Summary of observed improvements:
> 
> *
> 
> Throughput: +7.3% average increase in model inference throughput on 
> ResNet-50 with mixed batch sizes (64/128)
> *
> 
> Latency: -5.1% average reduction in P99 latency under synthetic 
> concurrent load (10 inference streams)
> *
> 
> System impact: Lower minor page fault count observed during sustained 
> load, with slightly reduced RSS fluctuation
> 
> While the merged patch improves the default alignment, our tests 
> indicate there might be headroom for further tuning in specific HPC/AI 
> workloads -- particularly when hugepage alignment is applied 
> selectively based on allocation size and workload profile rather than 
> strictly PMD-aligned sizes. I was also working on specifics and pseudo 
> diffs from the working Linux code that I can generate to send that 
> email via git send-email.
> 
> I'd be happy to collaborate on a deeper investigation once I recover 
> the original scripts -- or I can try to replicate the environment on a 
> fresh setup and collect new diffs for comparison.
> 
> Best regards,
> Siddhartha Sharma

[-- Attachment #2: Type: text/html, Size: 5027 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-08-11 22:15 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-27 10:39 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
2025-06-27 10:45 ` siddhartha
2025-06-27 15:30 ` Lorenzo Stoakes
2025-06-28  3:49   ` Dev Jain
2025-06-30  0:43     ` siddhartha
2025-06-30  5:25       ` Dev Jain
2025-06-30  5:28         ` Dev Jain
2025-06-30 10:54         ` Lorenzo Stoakes
2025-06-30 11:48           ` siddhartha
2025-07-01  5:23           ` Dev Jain
2025-07-01  5:28             ` Lorenzo Stoakes
2025-07-01  5:45               ` Dev Jain
2025-07-01  5:53                 ` Lorenzo Stoakes
2025-07-01  6:30                   ` Dev Jain
2025-07-01  6:50                     ` Lorenzo Stoakes
2025-07-01  6:58                       ` Dev Jain
2025-07-01 12:15                         ` siddhartha
2025-07-01 12:39                           ` Lorenzo Stoakes
2025-07-01 13:23                             ` siddhartha
2025-07-01 13:28                               ` Lorenzo Stoakes
2025-07-01 14:20                                 ` siddhartha
2025-07-01 16:20                             ` Dev Jain
2025-07-01 18:49                               ` Zi Yan
2025-07-07  8:56                                 ` Vlastimil Babka
2025-07-28  5:41                                   ` siddhartha
2025-07-28 11:00                                     ` Vlastimil Babka
2025-07-01 15:40                           ` Yang Shi
  -- strict thread matches above, loose matches on Subject: below --
2025-08-11 22:14 siddhartha

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).