* [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
@ 2025-06-27 10:39 siddhartha
2025-06-27 10:45 ` siddhartha
2025-06-27 15:30 ` Lorenzo Stoakes
0 siblings, 2 replies; 28+ messages in thread
From: siddhartha @ 2025-06-27 10:39 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, mgorman
Hi all,
I wanted to share validation data from a Hugging Face-based AI
inferencing workload,
which was significantly impacted by the THP alignment logic introduced
in commit efa7df3e3bb5.
Using transformer models with dynamic input lengths on Intel Xeon
(Cooper Lake),
we observed up to a 3200% throughput improvement after applying the
patch from Oct 2024:
mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
Metrics:
- Model: BERT-base
- Inference engine: Transformers + ONNX Runtime
- Kernel: 6.6 vs patched 6.6.8
- Batch size: 8-32, input length: 64-512 tokens
- Metric: inference throughput (samples/sec)
Thanks for the fix -- this change had real impact on a
production-relevant workload.
Best Regards,
Siddhartha Sharma
ISV @ Kenip
Solution Link:
https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-27 10:39 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
@ 2025-06-27 10:45 ` siddhartha
2025-06-27 15:30 ` Lorenzo Stoakes
1 sibling, 0 replies; 28+ messages in thread
From: siddhartha @ 2025-06-27 10:45 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, mgorman
> Hi all,
>
> I wanted to share validation data from a Hugging Face-based AI
> inferencing workload,
> which was significantly impacted by the THP alignment logic introduced
> in commit efa7df3e3bb5.
>
> Using transformer models with dynamic input lengths on Intel Xeon
> (Cooper Lake),
> we observed up to a 3200% throughput improvement after applying the
> patch from Oct 2024:
>
> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>
> Metrics:
> - Model: BERT-base
> - Inference engine: Transformers + ONNX Runtime
> - Kernel: 6.6 vs patched 6.6.8
> - Batch size: 8-32, input length: 64-512 tokens
> - Metric: inference throughput (samples/sec)
>
> Thanks for the fix -- this change had real impact on a
> production-relevant workload.
>
> Best Regards,
> Siddhartha Sharma
> ISV @ Kenip
> Solution Link:
> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-27 10:39 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
2025-06-27 10:45 ` siddhartha
@ 2025-06-27 15:30 ` Lorenzo Stoakes
2025-06-28 3:49 ` Dev Jain
1 sibling, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-06-27 15:30 UTC (permalink / raw)
To: siddhartha; +Cc: linux-mm, linux-kernel, mgorman, Vlastimil Babka
+cc Vlasta
On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
> Hi all,
>
> I wanted to share validation data from a Hugging Face-based AI inferencing
> workload,
> which was significantly impacted by the THP alignment logic introduced in
> commit efa7df3e3bb5.
>
> Using transformer models with dynamic input lengths on Intel Xeon (Cooper
> Lake),
> we observed up to a 3200% throughput improvement after applying the patch
> from Oct 2024:
>
> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
All congratulations are owed to Vlastimil Babka for doing this, cc'd :)
I gather he enjoys novelty beer mugs as tokens of thanks ;)
>
> Metrics:
> - Model: BERT-base
> - Inference engine: Transformers + ONNX Runtime
> - Kernel: 6.6 vs patched 6.6.8
> - Batch size: 8-32, input length: 64-512 tokens
> - Metric: inference throughput (samples/sec)
>
> Thanks for the fix -- this change had real impact on a production-relevant
> workload.
>
> Best Regards,
> Siddhartha Sharma
> ISV @ Kenip
> Solution Link: https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-27 15:30 ` Lorenzo Stoakes
@ 2025-06-28 3:49 ` Dev Jain
2025-06-30 0:43 ` siddhartha
0 siblings, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-06-28 3:49 UTC (permalink / raw)
To: Lorenzo Stoakes, siddhartha
Cc: linux-mm, linux-kernel, mgorman, Vlastimil Babka
On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
> +cc Vlata
>
> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>> Hi all,
>>
>> I wanted to share validation data from a Hugging Face-based AI inferencing
>> workload,
>> which was significantly impacted by the THP alignment logic introduced in
>> commit efa7df3e3bb5.
>>
>> Using transformer models with dynamic input lengths on Intel Xeon (Cooper
>> Lake),
>> we observed up to a 3200% throughput improvement after applying the patch
>> from Oct 2024:
>>
>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
> All congratulations are owed to Vlastimil Babka for doing this, cc'd :)
>
> I gather he enjoys novelty beer mugs as tokens of thanks ;)
I was wondering how the change can get us such a big optimization - the
alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
something else I am missing?
I ask because when I was reading the code I was thinking whether a similar
change can be done for mTHPs.
>
>> Metrics:
>> - Model: BERT-base
>> - Inference engine: Transformers + ONNX Runtime
>> - Kernel: 6.6 vs patched 6.6.8
>> - Batch size: 8-32, input length: 64-512 tokens
>> - Metric: inference throughput (samples/sec)
>>
>> Thanks for the fix -- this change had real impact on a production-relevant
>> workload.
>>
>> Best Regards,
>> Siddhartha Sharma
>> ISV @ Kenip
>> Solution Link: https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-28 3:49 ` Dev Jain
@ 2025-06-30 0:43 ` siddhartha
2025-06-30 5:25 ` Dev Jain
0 siblings, 1 reply; 28+ messages in thread
From: siddhartha @ 2025-06-30 0:43 UTC (permalink / raw)
To: Dev Jain; +Cc: Lorenzo Stoakes, linux-mm, linux-kernel, mgorman
On 2025-06-28 09:19, Dev Jain wrote:
> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>> +cc Vlata
>>
>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>>> Hi all,
>>>
>>> I wanted to share validation data from a Hugging Face-based AI
>>> inferencing
>>> workload,
>>> which was significantly impacted by the THP alignment logic
>>> introduced in
>>> commit efa7df3e3bb5.
>>>
>>> Using transformer models with dynamic input lengths on Intel Xeon
>>> (Cooper
>>> Lake),
>>> we observed up to a 3200% throughput improvement after applying the
>>> patch
>>> from Oct 2024:
>>>
>>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>> All congratulations are owed to Vlastimil Babka for doing this, cc'd
>> :)
>>
>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>
> I was wondering how the change can get us such a big optimization - the
> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
> something else I am missing?
>
> I ask because when I was reading the code I was thinking whether a
> similar
> change can be done for mTHPs.
>
>>
>>> Metrics:
>>> - Model: BERT-base
>>> - Inference engine: Transformers + ONNX Runtime
>>> - Kernel: 6.6 vs patched 6.6.8
>>> - Batch size: 8-32, input length: 64-512 tokens
>>> - Metric: inference throughput (samples/sec)
>>>
>>> Thanks for the fix -- this change had real impact on a
>>> production-relevant
>>> workload.
>>>
>>> Best Regards,
>>> Siddhartha Sharma
>>> ISV @ Kenip
>>> Solution Link:
>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>
Hi Dev Jain,
Thank you for reviewing and for your thoughtful question.
You're absolutely right that, in isolation, gaining one additional
PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case
(Hugging Face inference workloads with dynamic input sizes and many
allocations), the original PMD alignment logic caused a cascade of side
effects:
The performance improvement comes from how that interacts with dynamic
memory allocation patterns in AI inference workloads, especially those
using frameworks like Hugging Face Transformers.
In our specific use case, the workloads were running on Intel Developer
Cloud, but I no longer have access to that particular environment or the
original profiling output. However, I’d like to highlight why this patch
had such an outsized effect:
🔹 1. Fragmentation Avoidance
In model shard loading (e.g., large BERT or GPT2 models split into
multiple memory segments), many medium-sized anonymous allocations occur
in rapid succession. These workloads tend to allocate many 512 KB – 1.5
MB buffers dynamically (token buffers, intermediate tensors). Aligning
each one to PMD size, even when their length wasn’t PMD-aligned, led to
gaps between them — defeating natural coalescing into a single THP.
🔹 2. TLB aliasing and cache index pressure
These fragmented mappings caused frequent TLB misses and poor L1/L2
cache reuse.
The result was what looks like “memory thrashing,” with slow memory
access dominating total inference time.
When every mapping is PMD-aligned (even if not PMD-sized), the gaps
between them prevent Transparent Huge Pages (THPs) from activating
effectively.
This breaks THP coalescence and causes fragmented page tables and higher
memory overhead per shard.
🔹 3. Latency & Throughput Penalty from Memory Misalignment
This leads to higher TLB miss rates, especially under multi-threaded
load, which dramatically slows down token embedding and attention
calculations.
When loading model shards, memory initialization becomes
cache-unfriendly, with poor reuse across cores.
This affects not only inference latency but also model cold-start time —
which is critical in autoscaling deployments.
🔹 4. Qualitative Observation
Without this patch: shard loading stuttered, warm-up was slow, and we
saw CPU cycles dominated by page_fault and TLB miss handlers.
With this patch: shard loading smoothed out, THPs were correctly applied
(based on smaps), and throughput shot up by an order of magnitude.
🔹 5. Measured Impact
On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on
non-aligned sizes showed 11–32× worse performance.
With the patched kernel (which skips alignment unless the length is
PMD-aligned), memory layout was contiguous again and THP was
consistently utilized.
This isn’t about one extra THP — it’s about preventing widespread THP
fragmentation and the resulting dramatic cache/TLB degradation. For AI
workloads with high concurrency and dynamic shapes, this small patch has
a massive effect on layout and locality.
So, it's not just “1 more huge page” — it's avoiding massive
fragmentation that leads to:
1. TLB miss storms
2. Poor locality
3. Cache index thrashing
4. Degraded latency and throughput
This applies across many adjacent, odd-length allocations typical of AI
inference workloads.
The original alignment logic created a pattern of broken contiguity —
defeating THP benefits altogether.
In AI workloads using Hugging Face Transformers, model shards and
intermediate tensors are dynamically allocated during inference. These
allocations often fall just below or above the 2MB threshold that THP
relies on. Misalignment or forced alignment to PMD boundaries causes
fragmentation and disrupts huge page coalescence, affecting performance.
📊 Memory Allocation Pattern Diagram
Without Patch (PMD Alignment Forced):
|<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->|
| Alloc A |         | Alloc B |         | Alloc C |
Each allocation is PMD-aligned, even if it’s not PMD-sized
Gaps prevent THP coalescence → TLB/cache fragmentation
With Patch (PMD Alignment Conditional):
|<---------6MB Contiguous Region--------->|
| Alloc A | Alloc B | Alloc C | Padding |
Contiguous anonymous memory region
Coalesced into one or more THPs
Improved locality and TLB efficiency
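To make this concrete, here is a minimal, self-contained sketch (illustrative only, not taken from the original benchmark run) that mmaps a burst of 1.5 MB anonymous buffers, prints how far apart the kernel placed them, and reports the resulting AnonHugePages usage from /proc/self/smaps_rollup. On a kernel that force-aligns every anonymous mapping to 2 MB the buffers end up separated by roughly 0.5 MB holes; with the conditional alignment they tend to land adjacently and stay eligible for collapse:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define NBUF 4
#define LEN  (1536UL * 1024)   /* 1.5 MB: deliberately not a multiple of the 2 MB PMD size */

int main(void)
{
    void *buf[NBUF];

    for (int i = 0; i < NBUF; i++) {
        buf[i] = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf[i] == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        madvise(buf[i], LEN, MADV_HUGEPAGE);  /* matters when THP is in "madvise" mode */
        memset(buf[i], 1, LEN);               /* fault the pages in */
        printf("buf[%d] at %p\n", i, buf[i]);
        if (i > 0) {
            /* With the usual top-down mmap layout, adjacent placement means
             * the distance equals LEN; forced 2 MB alignment leaves
             * ~0.5 MB holes between the buffers. */
            long dist = (long)((uintptr_t)buf[i - 1] - (uintptr_t)buf[i]);
            printf("  distance to previous mapping: %ld bytes\n", dist);
        }
    }

    /* AnonHugePages shows how much of the above ended up THP-backed. */
    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    if (f) {
        char line[256];
        while (fgets(line, sizeof(line), f))
            if (strstr(line, "AnonHugePages"))
                fputs(line, stdout);
        fclose(f);
    }
    return 0;
}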
While I regret not having the raw perf output at hand, I’d be happy to
replicate a similar test locally and share reproducible results if
helpful.
Best Regards,
Siddhartha Sharma
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-30 0:43 ` siddhartha
@ 2025-06-30 5:25 ` Dev Jain
2025-06-30 5:28 ` Dev Jain
2025-06-30 10:54 ` Lorenzo Stoakes
0 siblings, 2 replies; 28+ messages in thread
From: Dev Jain @ 2025-06-30 5:25 UTC (permalink / raw)
To: siddhartha; +Cc: Lorenzo Stoakes, linux-mm, linux-kernel, mgorman
On 30/06/25 6:13 am, siddhartha@kenip.in wrote:
> On 2025-06-28 09:19, Dev Jain wrote:
>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>> +cc Vlata
>>>
>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>>>> Hi all,
>>>>
>>>> I wanted to share validation data from a Hugging Face-based AI
>>>> inferencing
>>>> workload,
>>>> which was significantly impacted by the THP alignment logic
>>>> introduced in
>>>> commit efa7df3e3bb5.
>>>>
>>>> Using transformer models with dynamic input lengths on Intel Xeon
>>>> (Cooper
>>>> Lake),
>>>> we observed up to a 3200% throughput improvement after applying the
>>>> patch
>>>> from Oct 2024:
>>>>
>>>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>> All congratulations are owed to Vlastimil Babka for doing this, cc'd :)
>>>
>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>
>> I was wondering how the change can get us such a big optimization - the
>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>> something else I am missing?
>>
>> I ask because when I was reading the code I was thinking whether a
>> similar
>> change can be done for mTHPs.
>>
>>>
>>>> Metrics:
>>>> - Model: BERT-base
>>>> - Inference engine: Transformers + ONNX Runtime
>>>> - Kernel: 6.6 vs patched 6.6.8
>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>> - Metric: inference throughput (samples/sec)
>>>>
>>>> Thanks for the fix -- this change had real impact on a
>>>> production-relevant
>>>> workload.
>>>>
>>>> Best Regards,
>>>> Siddhartha Sharma
>>>> ISV @ Kenip
>>>> Solution Link:
>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>
>
> Hi Dev Jain,
>
> Thank you for reviewing and for your thoughtful question.
>
> You're absolutely right that, in isolation, gaining one additional
> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case
> (Hugging Face inference workloads with dynamic input sizes and many
> allocations), the original PMD alignment logic caused a cascade of
> side effects:
>
> The performance improvement comes from how that interacts with dynamic
> memory allocation patterns in AI inference workloads, especially those
> using frameworks like Hugging Face Transformers.
>
> In our specific use case, the workloads were running on Intel
> Developer Cloud, but I no longer have access to that particular
> environment or the original profiling output. However, I’d like to
> highlight why this patch had such an outsized effect:
>
> 🔹 1. Fragmentation Avoidance
> In model shard loading (e.g., large BERT or GPT2 models split into
> multiple memory segments), many medium-sized anonymous allocations
> occur in rapid succession. These workloads tend to allocate many 512
> KB – 1.5 MB buffers dynamically (token buffers, intermediate tensors).
> Aligning each one to PMD size, even when their length wasn’t
> PMD-aligned, led to gaps between them — defeating natural coalescing
> into a single THP.
>
> 🔹 2. TLB aliasing and cache index pressure
>
> These fragmented mappings caused frequent TLB misses and poor L1/L2
> cache reuse.
>
> The result was what looks like “memory thrashing,” with slow memory
> access dominating total inference time.
> When every mapping is PMD-aligned (even if not PMD-sized), the gaps
> between them prevent Transparent Huge Pages (THPs) from activating
> effectively.
>
> This breaks THP coalescence and causes fragmented page tables and
> higher memory overhead per shard.
>
> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
> This leads to higher TLB miss rates, especially under multi-threaded
> load, which dramatically slows down token embedding and attention
> calculations.
>
> When loading model shards, memory initialization becomes
> cache-unfriendly, with poor reuse across cores.
>
> This affects not only inference latency but also model cold-start time
> — which is critical in autoscaling deployments.
>
> 🔹 4. Qualitative Observation
> Without this patch: shard loading stuttered, warm-up was slow, and we
> saw CPU cycles dominated by page_fault and TLB miss handlers.
>
> With this patch: shard loading smoothed out, THPs were correctly
> applied (based on smaps), and throughput shot up by an order of
> magnitude.
>
> 🔹 5. Measured Impact
> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on
> non-aligned sizes showed 11–32× worse performance.
>
> With the patched kernel (which skips alignment unless the length is
> PMD-aligned), memory layout was contiguous again and THP was
> consistently utilized.
>
> This isn’t about one extra THP — it’s about preventing widespread THP
> fragmentation and the resulting dramatic cache/TLB degradation. For AI
> workloads with high concurrency and dynamic shapes, this small patch
> has a massive effect on layout and locality.
>
> So, it's not just “1 more huge page” — it's avoiding massive
> fragmentation that leads to:
>
> 1. TLB miss storms
>
> 2. Poor locality
>
> 3. Cache index thrashing
>
> 4. Improvement in latency and throughput
>
> This applies across many adjacent, odd-length allocations typical of
> AI inference workloads.
>
> The original alignment logic created a pattern of broken contiguity —
> defeating THP benefits altogether.
>
> In AI workloads using Hugging Face Transformers, model shards and
> intermediate tensors are dynamically allocated during inference. These
> allocations often fall just below or above the 2MB threshold that THP
> relies on. Misalignment or forced alignment to PMD boundaries causes
> fragmentation and disrupts huge page coalescence, affecting performance.
>
> 📊 Memory Allocation Pattern Diagram
>
> Without Patch (PMD Alignment Forced):
>
> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->|
> | Alloc A |         | Alloc B |         | Alloc C |
>
> Each allocation is PMD-aligned, even if it’s not PMD-sized
>
> Gaps prevent THP coalescence → TLB/cache fragmentation
>
> With Patch (PMD Alignment Conditional):
>
> |<---------6MB Contiguous Region--------->|
> | Alloc A | Alloc B | Alloc C | Padding |
>
> Contiguous anonymous memory region
>
> Coalesced into one or more THPs
>
> Improved locality and TLB efficiency
>
> While I regret not having the raw perf output at hand, I’d be happy to
> replicate a similar test locally and share reproducible results if
> helpful.
>
> Best Regards,
>
> Siddhartha Sharma
Thanks for your detailed explanation! I misunderstood that the optimization
you were talking about was due to efa7df3e3bb5; instead it was due to the
alignment. Your explanation makes a lot of sense!

For this workload, do you enable mTHPs on your system? My plan is to make a
similar patch for the mTHP case and I'd be grateful if you can get me some
results : )
>
>
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-30 5:25 ` Dev Jain
@ 2025-06-30 5:28 ` Dev Jain
2025-06-30 10:54 ` Lorenzo Stoakes
1 sibling, 0 replies; 28+ messages in thread
From: Dev Jain @ 2025-06-30 5:28 UTC (permalink / raw)
To: siddhartha; +Cc: Lorenzo Stoakes, linux-mm, linux-kernel, mgorman
On 30/06/25 10:55 am, Dev Jain wrote:
>
> On 30/06/25 6:13 am, siddhartha@kenip.in wrote:
>> On 2025-06-28 09:19, Dev Jain wrote:
>>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>>> +cc Vlata
>>>>
>>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>>>>> Hi all,
>>>>>
>>>>> I wanted to share validation data from a Hugging Face-based AI
>>>>> inferencing
>>>>> workload,
>>>>> which was significantly impacted by the THP alignment logic
>>>>> introduced in
>>>>> commit efa7df3e3bb5.
>>>>>
>>>>> Using transformer models with dynamic input lengths on Intel Xeon
>>>>> (Cooper
>>>>> Lake),
>>>>> we observed up to a 3200% throughput improvement after applying
>>>>> the patch
>>>>> from Oct 2024:
>>>>>
>>>>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>>> All congratulations are owed to Vlastimil Babka for doing this,
>>>> cc'd :)
>>>>
>>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>>
>>> I was wondering how the change can get us such a big optimization - the
>>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>>> something else I am missing?
>>>
>>> I ask because when I was reading the code I was thinking whether a
>>> similar
>>> change can be done for mTHPs.
>>>
>>>>
>>>>> Metrics:
>>>>> - Model: BERT-base
>>>>> - Inference engine: Transformers + ONNX Runtime
>>>>> - Kernel: 6.6 vs patched 6.6.8
>>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>>> - Metric: inference throughput (samples/sec)
>>>>>
>>>>> Thanks for the fix -- this change had real impact on a
>>>>> production-relevant
>>>>> workload.
>>>>>
>>>>> Best Regards,
>>>>> Siddhartha Sharma
>>>>> ISV @ Kenip
>>>>> Solution Link:
>>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>>
>>
>> Hi Dev Jain,
>>
>> Thank you for reviewing and for your thoughtful question.
>>
>> You're absolutely right that, in isolation, gaining one additional
>> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case
>> (Hugging Face inference workloads with dynamic input sizes and many
>> allocations), the original PMD alignment logic caused a cascade of
>> side effects:
>>
>> The performance improvement comes from how that interacts with
>> dynamic memory allocation patterns in AI inference workloads,
>> especially those using frameworks like Hugging Face Transformers.
>>
>> In our specific use case, the workloads were running on Intel
>> Developer Cloud, but I no longer have access to that particular
>> environment or the original profiling output. However, I’d like to
>> highlight why this patch had such an outsized effect:
>>
>> 🔹 1. Fragmentation Avoidance
>> In model shard loading (e.g., large BERT or GPT2 models split into
>> multiple memory segments), many medium-sized anonymous allocations
>> occur in rapid succession. These workloads tend to allocate many 512
>> KB – 1.5 MB buffers dynamically (token buffers, intermediate
>> tensors). Aligning each one to PMD size, even when their length
>> wasn’t PMD-aligned, led to gaps between them — defeating natural
>> coalescing into a single THP.
>>
>> 🔹 2. TLB aliasing and cache index pressure
>>
>> These fragmented mappings caused frequent TLB misses and poor L1/L2
>> cache reuse.
>>
>> The result was what looks like “memory thrashing,” with slow memory
>> access dominating total inference time.
>> When every mapping is PMD-aligned (even if not PMD-sized), the gaps
>> between them prevent Transparent Huge Pages (THPs) from activating
>> effectively.
>>
>> This breaks THP coalescence and causes fragmented page tables and
>> higher memory overhead per shard.
>>
>> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
>> This leads to higher TLB miss rates, especially under multi-threaded
>> load, which dramatically slows down token embedding and attention
>> calculations.
>>
>> When loading model shards, memory initialization becomes
>> cache-unfriendly, with poor reuse across cores.
>>
>> This affects not only inference latency but also model cold-start
>> time — which is critical in autoscaling deployments.
>>
>> 🔹 4. Qualitative Observation
>> Without this patch: shard loading stuttered, warm-up was slow, and we
>> saw CPU cycles dominated by page_fault and TLB miss handlers.
>>
>> With this patch: shard loading smoothed out, THPs were correctly
>> applied (based on smaps), and throughput shot up by an order of
>> magnitude.
>>
>> 🔹 5. Measured Impact
>> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on
>> non-aligned sizes showed 11–32× worse performance.
>>
>> With the patched kernel (which skips alignment unless the length is
>> PMD-aligned), memory layout was contiguous again and THP was
>> consistently utilized.
>>
>> This isn’t about one extra THP — it’s about preventing widespread THP
>> fragmentation and the resulting dramatic cache/TLB degradation. For
>> AI workloads with high concurrency and dynamic shapes, this small
>> patch has a massive effect on layout and locality.
>>
>> So, it's not just “1 more huge page” — it's avoiding massive
>> fragmentation that leads to:
>>
>> 1. TLB miss storms
>>
>> 2. Poor locality
>>
>> 3. Cache index thrashing
>>
>> 4. Improvement in latency and throughput
>>
>> This applies across many adjacent, odd-length allocations typical of
>> AI inference workloads.
>>
>> The original alignment logic created a pattern of broken contiguity —
>> defeating THP benefits altogether.
>>
>> In AI workloads using Hugging Face Transformers, model shards and
>> intermediate tensors are dynamically allocated during inference.
>> These allocations often fall just below or above the 2MB threshold
>> that THP relies on. Misalignment or forced alignment to PMD
>> boundaries causes fragmentation and disrupts huge page coalescence,
>> affecting performance.
>>
>> 📊 Memory Allocation Pattern Diagram
>>
>> Without Patch (PMD Alignment Forced):
>>
>> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->|
>> | Alloc A |         | Alloc B |         | Alloc C |
>>
>> Each allocation is PMD-aligned, even if it’s not PMD-sized
>>
>> Gaps prevent THP coalescence → TLB/cache fragmentation
>>
>> With Patch (PMD Alignment Conditional):
>>
>> |<---------6MB Contiguous Region--------->|
>> | Alloc A | Alloc B | Alloc C | Padding |
>>
>> Contiguous anonymous memory region
>>
>> Coalesced into one or more THPs
>>
>> Improved locality and TLB efficiency
>>
>> While I regret not having the raw perf output at hand, I’d be happy
>> to replicate a similar test locally and share reproducible results if
>> helpful.
>>
>> Best Regards,
>>
>> Siddhartha Sharma
>
> Thanks for your detailed explanation! I misunderstood that the
> optimization you were talking about
>
> was due to efa7df3e3bb5, instead it was due to the alignment. Your
> explanation makes a lot of
>
> sense!
>
>
> For this workload, do you enable mTHPs on your system? My plan is to
> make a similar patch for
>
> the mTHP case and I'd be grateful if you can get me some results : )
Oh I see that you are using the 6.6 kernel, which probably won't have
the mTHP patches.
>
>>
>>
>
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-30 5:25 ` Dev Jain
2025-06-30 5:28 ` Dev Jain
@ 2025-06-30 10:54 ` Lorenzo Stoakes
2025-06-30 11:48 ` siddhartha
2025-07-01 5:23 ` Dev Jain
1 sibling, 2 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 10:54 UTC (permalink / raw)
To: Dev Jain; +Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka
+cc Vlastimil. Please keep him cc'd on discussions here, as he is the author
of this fix.
On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>
>
> For this workload, do you enable mTHPs on your system? My plan is to make a
> similar patch for
>
> the mTHP case and I'd be grateful if you can get me some results : )
I'd urge caution here.
The reason there was a big perf improvement is that, for certain workloads, the
original patch by Rik caused issues with VMA fragmentation. So rather than
getting adjacent VMAs that might later be khugepage'd, you'd get a bunch of VMAs
that were auto-aligned and thus fragmented from one another.
So while you got speed ups on some workloads, you got really bad perf impact on
some that were subject to this.
The observed speed up was on a very specific benchmark also. While it's a great
improvement, it's important to understand the context (see the original patch
for details [0]).
I do think it's worth considering changing thp_get_unmapped_area_vmflags() for
mTHP, as it's currently very limited (just PMD alignment) and it'd possibly be
sensible to change this to checking against allowed THP alignments, but I'd not
assume this is going to get some crazy speed up as observed here.
Note that any such change would probably require some refactoring in THP first
to make it not quite so awful.
I also think that for Siddhartha's use case mTHP isn't really relevant, as
Intel don't support mTHP currently, do they?
Regards, Lorenzo
[0]: https://lore.kernel.org/all/20241024151228.101841-2-vbabka@suse.cz/T/#u
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-30 10:54 ` Lorenzo Stoakes
@ 2025-06-30 11:48 ` siddhartha
2025-07-01 5:23 ` Dev Jain
1 sibling, 0 replies; 28+ messages in thread
From: siddhartha @ 2025-06-30 11:48 UTC (permalink / raw)
To: Lorenzo Stoakes; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman
On 2025-06-30 16:24, Lorenzo Stoakes wrote:
> +cc Vlastimil, please keep him cc'd on discussions here as the author
> of this
> fix in the conversation.
>
> On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>>
>>
>> For this workload, do you enable mTHPs on your system? My plan is to
>> make a
>> similar patch for
>>
>> the mTHP case and I'd be grateful if you can get me some results : )
>
> I'd urge caution here.
>
> The reason there was a big perf improvement is that, for certain
> workloads, the
> original patch by Rik caused issues with VMA fragmentation. So rather
> than
> getting adjacent VMAs that might later be khugepage'd, you'd get a
> bunch of VMAs
> that were auto-aligned and thus fragmented from one another.
>
> So while you got speed ups on some workloads, you got really bad perf
> impact on
> some that were subject to this.
>
> The observed speed up was on a very specific benchmark also. While it's
> a great
> improvement, it's important to understand the context (see the original
> patch
> for details [0]).
>
> I do think it's worth considering changing
> thp_get_unmapped_area_vmflags() for
> mTHP, as it's currently very limited (just PMD alignment) and it'd
> possibly be
> sensible to change this to checking against allowed THP alignments, but
> I'd not
> assume this is going to get some crazy speed up as observed here.
>
> Note that any such change would probably require some refactoring in
> THP first
> to make it not quite so awful.
>
> I also think for Siddharta's usecase mTHP isn't really relevant is it,
> as intel
> do not support mTHP currently do they?
>
> Regards, Lorenzo
>
> [0]:
> https://lore.kernel.org/all/20241024151228.101841-2-vbabka@suse.cz/T/#u
Hi Lorenzo, Dev, All,
Thank you for the thoughtful responses and for engaging with the
performance implications of the patch.
You're absolutely right that the observed speedup came from a specific
class of workloads — in this case, token-length-variable AI inference
pipelines based on Hugging Face Transformers and ONNX Runtime. These
workloads trigger highly dynamic, anonymous memory allocation patterns,
often in bursts aligned with model shard loading and attention map
resizing. In such cases, VMA fragmentation due to PMD-aligned,
non-PMD-sized mappings led to near-complete loss of THP utilization.
Once the forced alignment was limited (via Vlastimil's patch), we observed
substantial restoration of THP behavior, which is where the performance
gains came from. That said, I completely agree that:
Not all workloads benefit from this
Some may even regress if the underlying VMAs aren't THP-coalescible for
other reasons
Still, for high-throughput inference workloads on modern Intel CPUs,
this behavior isn’t a corner case. The shift toward multi-model
concurrent serving (e.g., LLM-as-a-Service) means this dynamic
allocation pattern is becoming common, especially in
edge/latency-sensitive deployments.
🧠 On mTHP: Intel Does Support It
Regarding mTHP — yes, Intel platforms (especially server-grade Xeon
processors from Cascade Lake onward) do support mapped transparent huge
pages, including via:
tmpfs-backed files
madvise(MADV_HUGEPAGE) on file mappings
shmem usage with shmem_enabled in the kernel
So I’d say mTHP is certainly relevant for workloads where model weights
or tensors are pre-cached or memory-mapped — a pattern we’re also seeing
as Hugging Face, ONNX, and PyTorch ecosystems move toward zero-copy
tensor sharing.
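For reference, the madvise-based hinting mentioned above looks roughly like the sketch below. This is illustrative only: memfd_create() stands in for any tmpfs-backed tensor file, and whether the mapping actually gets huge-page backing depends on the shmem_enabled setting (and the huge= mount option for tmpfs):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4UL << 20;                    /* 4 MB, a PMD-size multiple */
    int fd = memfd_create("tensor-cache", 0);  /* stand-in for a tmpfs-backed file */

    if (fd < 0 || ftruncate(fd, len) < 0) {
        perror("memfd_create/ftruncate");
        return 1;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    madvise(p, len, MADV_HUGEPAGE);  /* a hint, honoured only if shmem_enabled allows it */
    memset(p, 0, len);               /* fault the pages in */

    /* ShmemPmdMapped in /proc/meminfo indicates whether it worked. */
    printf("mapped %zu bytes at %p\n", len, p);
    return 0;
}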
Given that, I'd absolutely be interested in testing any mTHP-targeted
patch — and I’d be happy to help validate it, especially if it avoids
the VMA fragmentation pitfall you rightly pointed out.
Thanks again for the detailed feedback, and I’ll try to replicate and
share further traces (from my local testbed) since I currently don’t
have access to the original Intel Developer Cloud logs.
Best regards,
Siddhartha Sharma
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-06-30 10:54 ` Lorenzo Stoakes
2025-06-30 11:48 ` siddhartha
@ 2025-07-01 5:23 ` Dev Jain
2025-07-01 5:28 ` Lorenzo Stoakes
1 sibling, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01 5:23 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka
On 30/06/25 4:24 pm, Lorenzo Stoakes wrote:
> +cc Vlastimil, please keep him cc'd on discussions here as the author of this
> fix in the conversation.
>
> On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>>
>> For this workload, do you enable mTHPs on your system? My plan is to make a
>> similar patch for
>>
>> the mTHP case and I'd be grateful if you can get me some results : )
> I'd urge caution here.
>
> The reason there was a big perf improvement is that, for certain workloads, the
> original patch by Rik caused issues with VMA fragmentation. So rather than
> getting adjacent VMAs that might later be khugepage'd, you'd get a bunch of VMAs
> that were auto-aligned and thus fragmented from one another.
How does getting two different adjacent VMAs allow them to be khugepage'd if
both are less than PMD size? khugepaged operates per vma, I'm missing something.
>
> So while you got speed ups on some workloads, you got really bad perf impact on
> some that were subject to this.
>
> The observed speed up was on a very specific benchmark also. While it's a great
> improvement, it's important to understand the context (see the original patch
> for details [0]).
>
> I do think it's worth considering changing thp_get_unmapped_area_vmflags() for
> mTHP, as it's currently very limited (just PMD alignment) and it'd possibly be
> sensible to change this to checking against allowed THP alignments, but I'd not
> assume this is going to get some crazy speed up as observed here.
>
> Note that any such change would probably require some refactoring in THP first
> to make it not quite so awful.
>
> I also think for Siddharta's usecase mTHP isn't really relevant is it, as intel
> do not support mTHP currently do they?
>
> Regards, Lorenzo
>
> [0]: https://lore.kernel.org/all/20241024151228.101841-2-vbabka@suse.cz/T/#u
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 5:23 ` Dev Jain
@ 2025-07-01 5:28 ` Lorenzo Stoakes
2025-07-01 5:45 ` Dev Jain
0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 5:28 UTC (permalink / raw)
To: Dev Jain; +Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka
On Tue, Jul 01, 2025 at 10:53:09AM +0530, Dev Jain wrote:
>
> On 30/06/25 4:24 pm, Lorenzo Stoakes wrote:
> > +cc Vlastimil, please keep him cc'd on discussions here as the author of this
> > fix in the conversation.
> >
> > On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
> > >
> > > For this workload, do you enable mTHPs on your system? My plan is to make a
> > > similar patch for
> > >
> > > the mTHP case and I'd be grateful if you can get me some results : )
> > I'd urge caution here.
> >
> > The reason there was a big perf improvement is that, for certain workloads, the
> > original patch by Rik caused issues with VMA fragmentation. So rather than
> > getting adjacent VMAs that might later be khugepage'd, you'd get a bunch of VMAs
> > that were auto-aligned and thus fragmented from one another.
>
> How does getting two different adjacent VMAs allow them to be khugepage'd if
> both are less than PMD size? khugepaged operates per vma, I'm missing something.
(future) VMA merge
Consider allocations that are >PMD but <2*PMD, for instance. Now you get
fragmentation. For some workloads you would previously have eventually got PMD
leaf mapping, PMD leaf mapping, PMD leaf mapping, etc. contiguously; with this
arrangement you get PMD mapping, <bunch of PTE mappings>, PMD mapping, etc.
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 5:28 ` Lorenzo Stoakes
@ 2025-07-01 5:45 ` Dev Jain
2025-07-01 5:53 ` Lorenzo Stoakes
0 siblings, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01 5:45 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka
On 01/07/25 10:58 am, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 10:53:09AM +0530, Dev Jain wrote:
>> On 30/06/25 4:24 pm, Lorenzo Stoakes wrote:
>>> +cc Vlastimil, please keep him cc'd on discussions here as the author of this
>>> fix in the conversation.
>>>
>>> On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>>>> For this workload, do you enable mTHPs on your system? My plan is to make a
>>>> similar patch for
>>>>
>>>> the mTHP case and I'd be grateful if you can get me some results : )
>>> I'd urge caution here.
>>>
>>> The reason there was a big perf improvement is that, for certain workloads, the
>>> original patch by Rik caused issues with VMA fragmentation. So rather than
>>> getting adjacent VMAs that might later be khugepage'd, you'd get a bunch of VMAs
>>> that were auto-aligned and thus fragmented from one another.
>> How does getting two different adjacent VMAs allow them to be khugepage'd if
>> both are less than PMD size? khugepaged operates per vma, I'm missing something.
> (future) VMA merge
>
> Consider allocations that are >PMD but < 2*PMD for instance. Now you get
> fragmentation. For some workloads you would have previously eventually got PMD
> leaf mapping, PMD leaf mapping, PMD leaf mapping, etc. contiguouosly, with this
> arragenement you get PMD mapping, <bunch of PTE mappings>, PMD mapping, etc.
Sorry, I am not following; I don't know the VMA merge stuff in detail.
Are you saying that after the patch, the VMAs will eventually get merged?
Is it possible in the kernel to get a merge in the "future"? As I understand
it, merging only happens at mmap() time.

Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) in size.
If they are able to get merged after the patch, why won't they be merged before
the patch, since the VMA characteristics are the same?
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 5:45 ` Dev Jain
@ 2025-07-01 5:53 ` Lorenzo Stoakes
2025-07-01 6:30 ` Dev Jain
0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 5:53 UTC (permalink / raw)
To: Dev Jain; +Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka
On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
> Sorry I am not following, don't know in detail about the VMA merge stuff.
> Are you saying the after the patch, the VMAs will eventually get merged?
> Is it possible in the kernel to get a merge in the "future"; as I understand
> it only happens at mmap() time?
>
> Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
> If they are able to get merged after the patch, why won't they be merged before the patch,
> since the VMA characteristics are the same?
>
>
Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
0             2MB      4MB           6MB      8MB           10MB
|-------------.------| |-------------.------| |-------------.------|
|             .      | |             .      | |             .      |
|             .      | |             .      | |             .      |
|-------------.------| |-------------.------| |-------------.------|
  huge mapped   4k m'd
If you don't force alignment then subsequent mappings will be adjacent to one
another and those non-huge page parts can be merged.
Vlasta's fix up means we only try to get the THP up-front if the length is
already aligned at which point you won't end up with these gaps.
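Roughly, the shape of the check (simplified, not the literal diff; the real
logic lives in the thp_get_unmapped_area path in mm/huge_memory.c):

    /* Skip the PMD-aligned placement entirely unless the requested length
     * is itself a multiple of PMD_SIZE; bailing out here means the caller
     * falls back to the normal unmapped-area search, so the mapping can
     * land directly adjacent to its neighbours. */
    if (!IS_ALIGNED(len, PMD_SIZE))
        return 0;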
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 5:53 ` Lorenzo Stoakes
@ 2025-07-01 6:30 ` Dev Jain
2025-07-01 6:50 ` Lorenzo Stoakes
0 siblings, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01 6:30 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka
On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>> Sorry I am not following, don't know in detail about the VMA merge stuff.
>> Are you saying the after the patch, the VMAs will eventually get merged?
>> Is it possible in the kernel to get a merge in the "future"; as I understand
>> it only happens at mmap() time?
>>
>> Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
>> If they are able to get merged after the patch, why won't they be merged before the patch,
>> since the VMA characteristics are the same?
>>
>>
> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>
>
> 0             2MB      4MB           6MB      8MB           10MB
> |-------------.------| |-------------.------| |-------------.------|
> |             .      | |             .      | |             .      |
> |             .      | |             .      | |             .      |
> |-------------.------| |-------------.------| |-------------.------|
>   huge mapped   4k m'd
The effort to draw this is appreciated!
I understood the alignment, what I am asking is this:
In __get_unmapped_area(), we will return a THP-aligned addr from
thp_get_unmapped_area_vmflags(). Now for the diagram you have
drawn, suppose that before the patch, we first mmap() the
8MB-start chunk. Then we mmap the 4MB start chunk.
We go to __mmap_region(), and we see that the 8MB-start chunk
has mergeable characteristics, so we merge. So the gap goes away?
>
> If you don't force alignment then subsequent mappings will be adjacent to one
> another and those non-huge page parts can be merged.
>
> Vlasta's fix up means we only try to get the THP up-front if the length is
> already aligned at which point you won't end up with these gaps.
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 6:30 ` Dev Jain
@ 2025-07-01 6:50 ` Lorenzo Stoakes
2025-07-01 6:58 ` Dev Jain
0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 6:50 UTC (permalink / raw)
To: Dev Jain; +Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka
On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>
> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
> > On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
> > > Sorry I am not following, don't know in detail about the VMA merge stuff.
> > > Are you saying the after the patch, the VMAs will eventually get merged?
> > > Is it possible in the kernel to get a merge in the "future"; as I understand
> > > it only happens at mmap() time?
> > >
> > > Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
> > > If they are able to get merged after the patch, why won't they be merged before the patch,
> > > since the VMA characteristics are the same?
> > >
> > >
> > Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
> >
> >
> > 0             2MB      4MB           6MB      8MB           10MB
> > |-------------.------| |-------------.------| |-------------.------|
> > |             .      | |             .      | |             .      |
> > |             .      | |             .      | |             .      |
> > |-------------.------| |-------------.------| |-------------.------|
> >   huge mapped   4k m'd
>
> The effort to draw this is appreciated!
>
> I understood the alignment, what I am asking is this:
>
> In __get_unmapped_area(), we will return a THP-aligned addr from
> thp_get_unmapped_area_vmflags(). Now for the diagram you have
> drawn, suppose that before the patch, we first mmap() the
> 8MB-start chunk. Then we mmap the 4MB start chunk.
> We go to __mmap_region(), and we see that the 8MB-start chunk
> has mergeable characteristics, so we merge. So the gap goes away?
No, because there's a gap; we only merge immediately adjacent VMAs. And obviously
gaps mean the page tables wouldn't be adjacent either...

get_unmapped_area() would otherwise have given adjacent mappings. Vlasta's
patch means in this case we no longer bother trying to align these, because
their _length_ isn't PMD-aligned.
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 6:50 ` Lorenzo Stoakes
@ 2025-07-01 6:58 ` Dev Jain
2025-07-01 12:15 ` siddhartha
0 siblings, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01 6:58 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: siddhartha, linux-mm, linux-kernel, mgorman, Vlastimil Babka
On 01/07/25 12:20 pm, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
>>> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>>>> Sorry I am not following, don't know in detail about the VMA merge stuff.
>>>> Are you saying the after the patch, the VMAs will eventually get merged?
>>>> Is it possible in the kernel to get a merge in the "future"; as I understand
>>>> it only happens at mmap() time?
>>>>
>>>> Suppose before the patch, you have two consecutive VMAs between (PMD, 2*PMD) size.
>>>> If they are able to get merged after the patch, why won't they be merged before the patch,
>>>> since the VMA characteristics are the same?
>>>>
>>>>
>>> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>>>
>>>
>>> 0             2MB      4MB           6MB      8MB           10MB
>>> |-------------.------| |-------------.------| |-------------.------|
>>> |             .      | |             .      | |             .      |
>>> |             .      | |             .      | |             .      |
>>> |-------------.------| |-------------.------| |-------------.------|
>>>   huge mapped   4k m'd
>> The effort to draw this is appreciated!
>>
>> I understood the alignment, what I am asking is this:
>>
>> In __get_unmapped_area(), we will return a THP-aligned addr from
>> thp_get_unmapped_area_vmflags(). Now for the diagram you have
>> drawn, suppose that before the patch, we first mmap() the
>> 8MB-start chunk. Then we mmap the 4MB start chunk.
>> We go to __mmap_region(), and we see that the 8MB-start chunk
>> has mergeable characteristics, so we merge. So the gap goes away?
> No because there's a gap, we only merge immedaitely adjacent VMAs. And obviously
> gaps mean page tables wouldn't be adjacent either...
Ah shoot. That is prev->vm_end == vmg->start in can_vma_merge_left(). Thanks.
>
> The get_unmmaped_area() would have otherwise given adjacent mappings. Vlasta's
> patch means in this case we no longer bother trying to align these because their
> _length_ isn't PMD aligned.
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 6:58 ` Dev Jain
@ 2025-07-01 12:15 ` siddhartha
2025-07-01 12:39 ` Lorenzo Stoakes
2025-07-01 15:40 ` Yang Shi
0 siblings, 2 replies; 28+ messages in thread
From: siddhartha @ 2025-07-01 12:15 UTC (permalink / raw)
To: Dev Jain; +Cc: Lorenzo Stoakes, linux-mm, linux-kernel, mgorman
On 2025-07-01 12:28, Dev Jain wrote:
> On 01/07/25 12:20 pm, Lorenzo Stoakes wrote:
>> On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>>> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
>>>> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>>>>> Sorry I am not following, don't know in detail about the VMA merge
>>>>> stuff.
>>>>> Are you saying the after the patch, the VMAs will eventually get
>>>>> merged?
>>>>> Is it possible in the kernel to get a merge in the "future"; as I
>>>>> understand
>>>>> it only happens at mmap() time?
>>>>>
>>>>> Suppose before the patch, you have two consecutive VMAs between
>>>>> (PMD, 2*PMD) size.
>>>>> If they are able to get merged after the patch, why won't they be
>>>>> merged before the patch,
>>>>> since the VMA characteristics are the same?
>>>>>
>>>>>
>>>> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>>>>
>>>>
>>>> 0             2MB      4MB           6MB      8MB           10MB
>>>> |-------------.------| |-------------.------| |-------------.------|
>>>> |             .      | |             .      | |             .      |
>>>> |             .      | |             .      | |             .      |
>>>> |-------------.------| |-------------.------| |-------------.------|
>>>>   huge mapped   4k m'd
>>> The effort to draw this is appreciated!
>>>
>>> I understood the alignment, what I am asking is this:
>>>
>>> In __get_unmapped_area(), we will return a THP-aligned addr from
>>> thp_get_unmapped_area_vmflags(). Now for the diagram you have
>>> drawn, suppose that before the patch, we first mmap() the
>>> 8MB-start chunk. Then we mmap the 4MB start chunk.
>>> We go to __mmap_region(), and we see that the 8MB-start chunk
>>> has mergeable characteristics, so we merge. So the gap goes away?
>> No because there's a gap, we only merge immedaitely adjacent VMAs. And
>> obviously
>> gaps mean page tables wouldn't be adjacent either...
>
> Ah shoot. That is prev->vm_end == vmg->start in can_vma_merge_left().
> Thanks.
>
>>
>> The get_unmmaped_area() would have otherwise given adjacent mappings.
>> Vlasta's
>> patch means in this case we no longer bother trying to align these
>> because their
>> _length_ isn't PMD aligned.
Hi Lorenzo, Dev, all
Thank you for raising excellent points — I’ll respond to each in order
to clarify the mechanics and relevance of this behavior in the context
of AI inference workloads.
🧩 1. Does the patch cause VMAs to be merged eventually?
You're correct: VMA merging only happens at mmap() time (via
__mmap_region()). What the patch affects is the behavior of
thp_get_unmapped_area_vmflags() before the mmap is placed.
Before the patch (with Rik’s logic):
Every mmap() returned an address rounded up to the next 2MB boundary —
regardless of whether the requested size was 2MB-aligned.
Result: even consecutive mmap()s (e.g., 1.5MB + 1.5MB) are now
non-adjacent, so merging is impossible, even if their VMA flags match.
After this patch:
If the allocation is not PMD-aligned in size, the returned address is
not forcibly aligned, increasing the likelihood that the next mmap()
lands directly after the previous one → enabling merging.
So, to be clear: this patch doesn't cause merging, but it prevents the
unnecessary pre-mmap gaps that previously blocked merges from ever happening.
📐 2. Why aren’t the VMAs mergeable before the patch?
Great question. Even if the VMA flags are identical, gaps introduced by
forced alignment from get_unmapped_area() break the precondition for
merging:
can_vma_merge_left()
→ return prev->vm_end == vma->vm_start
With Rik’s patch in place:
Suppose you mmap() 1.5MB → gets aligned to 2MB
Next 1.5MB → gets aligned to 4MB
→ The kernel sees: prev->vm_end = 3.5MB, vma->vm_start = 4MB
→ No merge
With this patch, non-aligned lengths don’t get forcibly aligned, so
consecutive mmap()s often fall exactly after the previous, and merging
becomes possible again.
🤖 3. How does this impact AI workloads like Hugging Face Transformers?
Tokenization and dynamic batching create non-deterministic memory
allocation patterns:
Models like BERT and T5 dynamically allocate intermediate buffers per
token-length, batch size, and attention window.
Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s,
often 512KB–1.8MB.
These allocations come in bursts — but due to forced alignment, the
kernel was placing them with artificial gaps, defeating THP eligibility
entirely.
By not force-aligning non-PMD-sized mappings, we avoid injecting gaps.
The result is that:
a. VMAs remain adjacent → mergeable
b. Physical memory is contiguous → eligible for khugepaged collapse
c. THP utilization increases → fewer TLB misses → lower latency → higher
throughput
💡 4. Why this patch complements Rik’s rather than contradicts it:
Rik's patch made it easier to guarantee alignment for workloads that
benefit from explicit huge pages — but at the cost of breaking
coalescence in workloads with non-PMD-sized mappings, like ML inference.
This patch simply refines that logic:
If the length is PMD-aligned → keep alignment
If it’s not → don’t inject alignment gaps that block merging
So, for workloads that can’t benefit from THP due to misalignment, this
patch removes artificial fragmentation without harming the original
intent.
⚙️ 5. mTHP note
Although this patch doesn’t target mTHP directly, I believe a similar
logic tweak could apply there too — especially with shmem-backed
workloads (common in model servers using shared tensor memory). I’d be
happy to help test any changes proposed there to derive the consequent
results.
Thanks again for the detailed discussion. Let me know if you’d like a
trace or VMA map from a Hugging Face benchmarked run (happy to generate
one locally).
Best Regards,
Siddhartha Sharma
+91 9015185601
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 12:15 ` siddhartha
@ 2025-07-01 12:39 ` Lorenzo Stoakes
2025-07-01 13:23 ` siddhartha
2025-07-01 16:20 ` Dev Jain
2025-07-01 15:40 ` Yang Shi
1 sibling, 2 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 12:39 UTC (permalink / raw)
To: siddhartha; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman
On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@kenip.in wrote:
> 🧩 1. Does the patch cause VMAs to be merged eventually?
> You're correct: VMA merging only happens at mmap() time (via
> __mmap_region()). What the patch affects is the behavior of
> thp_get_unmapped_area_vmflags() before the mmap is placed.
[...]
>
> 📐 2. Why aren’t the VMAs mergeable before the patch?
> Great question. Even if the VMA flags are identical, gaps introduced by
> forced alignment from get_unmapped_area() break the precondition for
> merging:
[...]
> 💡 4. Why this patch complements Rik’s rather than contradicts it:
I'm really perplexed as to why you felt the need to (seemingly via LLM)
reply with the explanation I've already provided here?...
There's errors in things you say here too.
With respect, please don't do this.
(I'm the co-maintainer of pretty much all the relevant code here and wrote
the VMA merge logic you're referring to.)
>
> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
> Tokenization and dynamic batching create non-deterministic memory allocation
> patterns:
>
> Models like BERT and T5 dynamically allocate intermediate buffers per
> token-length, batch size, and attention window.
>
> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, often
> 512KB–1.8MB.
>
> These allocations come in bursts — but due to forced alignment, the kernel
> was placing them with artificial gaps, defeating THP eligibility entirely.
>
> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. The
> result is that:
>
> a. VMAs remain adjacent → mergeable
>
> b. Physical memory is contiguous → eligible for khugepaged collapse
>
> c. THP utilization increases → fewer TLB misses → lower latency → higher
> throughput
>
This is very useful information and it's appreciated! Let's not drown this
out with restatements of stuff already covered.
>
> ⚙️ 5. mTHP note
> Although this patch doesn’t target mTHP directly, I believe a similar logic
> tweak could apply there too — especially with shmem-backed workloads (common
> in model servers using shared tensor memory). I’d be happy to help test any
> changes proposed there to derive the consequent results.
Dev - could we hold off on any effort to do something like this until I've
had a chance to refactor THP somewhat? This is already a mess and I'd like
to avoid us piling on more complexity.
We can revisit this at a later stage.
>
> Thanks again for the detailed discussion. Let me know if you’d like a trace
> or VMA map from a Hugging Face benchmarked run (happy to generate one
> locally).
>
Thanks! Much appreciated.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 12:39 ` Lorenzo Stoakes
@ 2025-07-01 13:23 ` siddhartha
2025-07-01 13:28 ` Lorenzo Stoakes
2025-07-01 16:20 ` Dev Jain
1 sibling, 1 reply; 28+ messages in thread
From: siddhartha @ 2025-07-01 13:23 UTC (permalink / raw)
To: Lorenzo Stoakes; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman
On 2025-07-01 18:09, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@kenip.in wrote:
>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>> You're correct: VMA merging only happens at mmap() time (via
>> __mmap_region()). What the patch affects is the behavior of
>> thp_get_unmapped_area_vmflags() before the mmap is placed.
>
> [...]
>
>>
>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>> Great question. Even if the VMA flags are identical, gaps introduced
>> by
>> forced alignment from get_unmapped_area() break the precondition for
>> merging:
>
> [...]
>
>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
>
> I'm really perplexed as to why you felt the need to (seemingly via LLM)
> reply with the explanation I've already provided here?...
>
> There's errors in things you say here too.
>
> With respect, please don't do this.
>
> (I'm the co-maintainer of pretty much all the relevant code here and
> wrote
> the VMA merge logic you're referring to.)
>
>>
>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>> Tokenization and dynamic batching create non-deterministic memory
>> allocation
>> patterns:
>>
>> Models like BERT and T5 dynamically allocate intermediate buffers per
>> token-length, batch size, and attention window.
>>
>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s,
>> often
>> 512KB–1.8MB.
>>
>> These allocations come in bursts — but due to forced alignment, the
>> kernel
>> was placing them with artificial gaps, defeating THP eligibility
>> entirely.
>>
>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps.
>> The
>> result is that:
>>
>> a. VMAs remain adjacent → mergeable
>>
>> b. Physical memory is contiguous → eligible for khugepaged collapse
>>
>> c. THP utilization increases → fewer TLB misses → lower latency →
>> higher
>> throughput
>>
>
> This is very useful information and it's appreciated! Let's not drown
> this
> out with restatements of stuff already covered.
>
>>
>> ⚙️ 5. mTHP note
>> Although this patch doesn’t target mTHP directly, I believe a similar
>> logic
>> tweak could apply there too — especially with shmem-backed workloads
>> (common
>> in model servers using shared tensor memory). I’d be happy to help
>> test any
>> changes proposed there to derive the consequent results.
>
> Dev - could we hold off on any effort to do something like this until
> I've
> had a chance to refactor THP somewhat? This is already a mess and I'd
> like
> to avoid us piling on more complexity.
>
> We can revisit this at a later stage.
>
>>
>> Thanks again for the detailed discussion. Let me know if you’d like a
>> trace
>> or VMA map from a Hugging Face benchmarked run (happy to generate one
>> locally).
>>
>
> Thanks! Much appreciated.
>
> Cheers, Lorenzo
Hi Lorenzo,
Thanks for your clarification, and I appreciate your patience —
especially given your role in maintaining and designing the VMA merge
logic.
I understand now that my earlier phrasing may have repeated your
explanation for VMA adjacency, and I regret unintentionally restating
it.
I’ll make sure to be more careful and direct going forward.
As for the THP alignment condition now being `IS_ALIGNED(len,
PMD_SIZE)`, I agree this resolves the regressions by dropping the forced
alignment for non-PMD-multiple sizes -- the behaviour that broke
workloads like cactusBSSN and some AI inference loads.
Thanks again for the guidance — I’m learning a lot from this thread.
Best Regards,
Siddhartha Sharma
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 13:23 ` siddhartha
@ 2025-07-01 13:28 ` Lorenzo Stoakes
2025-07-01 14:20 ` siddhartha
0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 13:28 UTC (permalink / raw)
To: siddhartha; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman
On Tue, Jul 01, 2025 at 06:53:47PM +0530, siddhartha@kenip.in wrote:
> Hi Lorenzo,
>
> Thanks for your clarification, and I appreciate your patience — especially
> given your role in maintaining and designing the VMA merge logic.
>
> I understand now that my earlier phrasing may have repeated your explanation
> for VMA adjacency, and I regret unintentionally restating it.
>
> I’ll make sure to be more careful and direct going forward.
Thanks, no problem. Mostly avoids confusion.
>
> As for the THP alignment condition now being `IS_ALIGNED(len, PMD_SIZE)`, I
> agree this resolves the regressions by removing alignment for non-aligned
> sizes, which was exactly what broke workloads like cactusBSSN or some AI
> inference loads.
Ack - we're really happy to hear about workloads this has helped. This kind
of input is very important for getting insight into how THP-related changes
impact real users, so we can best optimise for the workloads that matter most
in the industry right now.
>
> Thanks again for the guidance — I’m learning a lot from this thread.
Glad to have helped, thanks again for reporting!
>
> Best Regards,
> Siddhartha Sharma
>
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 13:28 ` Lorenzo Stoakes
@ 2025-07-01 14:20 ` siddhartha
0 siblings, 0 replies; 28+ messages in thread
From: siddhartha @ 2025-07-01 14:20 UTC (permalink / raw)
To: Lorenzo Stoakes; +Cc: Dev Jain, linux-mm, linux-kernel, mgorman
On 2025-07-01 18:58, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 06:53:47PM +0530, siddhartha@kenip.in wrote:
>> Hi Lorenzo,
>>
>> Thanks for your clarification, and I appreciate your patience —
>> especially
>> given your role in maintaining and designing the VMA merge logic.
>>
>> I understand now that my earlier phrasing may have repeated your
>> explanation
>> for VMA adjacency, and I regret unintentionally restating it.
>>
>> I’ll make sure to be more careful and direct going forward.
>
> Thanks, no problem. Mostly avoids confusion.
>
>>
>> As for the THP alignment condition now being `IS_ALIGNED(len,
>> PMD_SIZE)`, I
>> agree this resolves the regressions by removing alignment for
>> non-aligned
>> sizes, which was exactly what broke workloads like cactusBSSN or some
>> AI
>> inference loads.
>
> Ack - we're really happy to hear about workloads that this has helped
> as this
> kind of input is very important as to getting insight into how
> THP-related stuff
> impacts real users so we can best optimise especially for workloads
> that are
> very important in the industry right now.
>
>>
>> Thanks again for the guidance — I’m learning a lot from this thread.
>
> Glad to have helped, thanks again for reporting!
>
>>
>> Best Regards,
>> Siddhartha Sharma
>>
>
> Cheers, Lorenzo
Hi Lorenzo,
Thanks for the acknowledgement of my work -- I really appreciate it.
Please let me know if there is anything I can do to help moving forward.
Once any further changes are integrated and tested, I would also like to
see the performance metrics that show improvements, if possible.
Best Regards,
Siddhartha Sharma
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 12:15 ` siddhartha
2025-07-01 12:39 ` Lorenzo Stoakes
@ 2025-07-01 15:40 ` Yang Shi
1 sibling, 0 replies; 28+ messages in thread
From: Yang Shi @ 2025-07-01 15:40 UTC (permalink / raw)
To: siddhartha
Cc: Dev Jain, Lorenzo Stoakes, linux-mm, linux-kernel, mgorman,
Vlastimil Babka, Rik van Riel
>
> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
> Tokenization and dynamic batching create non-deterministic memory
> allocation patterns:
>
> Models like BERT and T5 dynamically allocate intermediate buffers per
> token-length, batch size, and attention window.
>
> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s,
> often 512KB–1.8MB.
If I remember correctly, Rik's patch should just force PMD alignment
when the allocation size is greater than PMD size. Such VMA
fragmentation should be caused by allocations greater than 2M but not
PMD aligned, so they create a 2M PMD plus a bunch of 4K PTEs. Allocations
smaller than 2M should be placed right next to each other and remain
mergeable. Did I miss something?
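(To put rough numbers on it: a 3MB anonymous request would get its start
rounded up to the next 2MB boundary, so the first 2MB can become a PMD THP
while the trailing 1MB stays 4K PTEs, and the padding inserted to reach that
boundary is exactly the kind of gap that keeps it from merging with the
neighbouring VMA. The 3MB figure is only an illustrative example.)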
Thanks,
Yang
>
> These allocations come in bursts — but due to forced alignment, the
> kernel was placing them with artificial gaps, defeating THP eligibility
> entirely.
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 12:39 ` Lorenzo Stoakes
2025-07-01 13:23 ` siddhartha
@ 2025-07-01 16:20 ` Dev Jain
2025-07-01 18:49 ` Zi Yan
1 sibling, 1 reply; 28+ messages in thread
From: Dev Jain @ 2025-07-01 16:20 UTC (permalink / raw)
To: Lorenzo Stoakes, siddhartha; +Cc: linux-mm, linux-kernel, mgorman
On 01/07/25 6:09 pm, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@kenip.in wrote:
>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>> You're correct: VMA merging only happens at mmap() time (via
>> __mmap_region()). What the patch affects is the behavior of
>> thp_get_unmapped_area_vmflags() before the mmap is placed.
> [...]
>
>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>> Great question. Even if the VMA flags are identical, gaps introduced by
>> forced alignment from get_unmapped_area() break the precondition for
>> merging:
> [...]
>
>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
> I'm really perplexed as to why you felt the need to (seemingly via LLM)
> reply with the explanation I've already provided here?...
>
> There's errors in things you say here too.
>
> With respect, please don't do this.
>
> (I'm the co-maintainer of pretty much all the relevant code here and wrote
> the VMA merge logic you're referring to.)
>
>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>> Tokenization and dynamic batching create non-deterministic memory allocation
>> patterns:
>>
>> Models like BERT and T5 dynamically allocate intermediate buffers per
>> token-length, batch size, and attention window.
>>
>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, often
>> 512KB–1.8MB.
>>
>> These allocations come in bursts — but due to forced alignment, the kernel
>> was placing them with artificial gaps, defeating THP eligibility entirely.
>>
>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. The
>> result is that:
>>
>> a. VMAs remain adjacent → mergeable
>>
>> b. Physical memory is contiguous → eligible for khugepaged collapse
>>
>> c. THP utilization increases → fewer TLB misses → lower latency → higher
>> throughput
>>
> This is very useful information and it's appreciated! Let's not drown this
> out with restatements of stuff already covered.
>
>> ⚙️ 5. mTHP note
>> Although this patch doesn’t target mTHP directly, I believe a similar logic
>> tweak could apply there too — especially with shmem-backed workloads (common
>> in model servers using shared tensor memory). I’d be happy to help test any
>> changes proposed there to derive the consequent results.
> Dev - could we hold off on any effort to do something like this until I've
> had a chance to refactor THP somewhat? This is already a mess and I'd like
> to avoid us piling on more complexity.
>
> We can revisit this at a later stage.
Yes of course. I ran a small benchmark on a quick dumb patch I wrote and I
didn't see any measurable perf improvement, probably because the highest THP
order getting chosen is always PMD size.
Out of curiosity, where do you plan to do the refactoring?
>
>> Thanks again for the detailed discussion. Let me know if you’d like a trace
>> or VMA map from a Hugging Face benchmarked run (happy to generate one
>> locally).
>>
> Thanks! Much appreciated.
>
> Cheers, Lorenzo
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 16:20 ` Dev Jain
@ 2025-07-01 18:49 ` Zi Yan
2025-07-07 8:56 ` Vlastimil Babka
0 siblings, 1 reply; 28+ messages in thread
From: Zi Yan @ 2025-07-01 18:49 UTC (permalink / raw)
To: Dev Jain; +Cc: Lorenzo Stoakes, siddhartha, linux-mm, linux-kernel, mgorman
On 1 Jul 2025, at 12:20, Dev Jain wrote:
> On 01/07/25 6:09 pm, Lorenzo Stoakes wrote:
>> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@kenip.in wrote:
>>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>>> You're correct: VMA merging only happens at mmap() time (via
>>> __mmap_region()). What the patch affects is the behavior of
>>> thp_get_unmapped_area_vmflags() before the mmap is placed.
>> [...]
>>
>>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>>> Great question. Even if the VMA flags are identical, gaps introduced by
>>> forced alignment from get_unmapped_area() break the precondition for
>>> merging:
>> [...]
>>
>>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
>> I'm really perplexed as to why you felt the need to (seemingly via LLM)
>> reply with the explanation I've already provided here?...
>>
>> There's errors in things you say here too.
>>
>> With respect, please don't do this.
>>
>> (I'm the co-maintainer of pretty much all the relevant code here and wrote
>> the VMA merge logic you're referring to.)
>>
>>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>>> Tokenization and dynamic batching create non-deterministic memory allocation
>>> patterns:
>>>
>>> Models like BERT and T5 dynamically allocate intermediate buffers per
>>> token-length, batch size, and attention window.
>>>
>>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, often
>>> 512KB–1.8MB.
>>>
>>> These allocations come in bursts — but due to forced alignment, the kernel
>>> was placing them with artificial gaps, defeating THP eligibility entirely.
>>>
>>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. The
>>> result is that:
>>>
>>> a. VMAs remain adjacent → mergeable
>>>
>>> b. Physical memory is contiguous → eligible for khugepaged collapse
>>>
>>> c. THP utilization increases → fewer TLB misses → lower latency → higher
>>> throughput
>>>
>> This is very useful information and it's appreciated! Let's not drown this
>> out with restatements of stuff already covered.
>>
>>> ⚙️ 5. mTHP note
>>> Although this patch doesn’t target mTHP directly, I believe a similar logic
>>> tweak could apply there too — especially with shmem-backed workloads (common
>>> in model servers using shared tensor memory). I’d be happy to help test any
>>> changes proposed there to derive the consequent results.
>> Dev - could we hold off on any effort to do something like this until I've
>> had a chance to refactor THP somewhat? This is already a mess and I'd like
>> to avoid us piling on more complexity.
>>
>> We can revisit this at a later stage.
>
> Yes of course. I had run a small benchmark on a quick dumb patch I wrote and I
> don't see any measurable perf improvement, probably because the highest THP order
> getting chosen is always PMD size.
I think mTHP is much more complicated, since mTHP has many sizes.
Trying to adjust VMA alignments to get mTHP might not work well, since
you never know what sizes new VMAs are going to have.
IMHO, it might be better to align VMA to PMD or the largest mTHP size
(for example, on ARM64 with 64KB base page, PMD THP is 512MB, a 2MB
mTHP sounds more reasonable there) if possible and enable
VMA merging as much as possible for future huge page collapse.
mTHP can be used to fill the not-yet-faulted holes in VMAs if necessary.
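(For reference, the arithmetic behind that example: with a 64KB base page
each table level holds 64KB / 8 bytes = 8192 entries, so a single PMD entry
spans 8192 * 64KB = 512MB, which is why a smaller mTHP size such as 2MB
looks more practical there.)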
>
> Out of curiosity, where do you plan to do the refactoring?
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-01 18:49 ` Zi Yan
@ 2025-07-07 8:56 ` Vlastimil Babka
2025-07-28 5:41 ` siddhartha
0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2025-07-07 8:56 UTC (permalink / raw)
To: Zi Yan, Dev Jain
Cc: Lorenzo Stoakes, siddhartha, linux-mm, linux-kernel, mgorman,
Rik van Riel, Doug Smythies
On 7/1/25 20:49, Zi Yan wrote:
>>> This is very useful information and it's appreciated! Let's not drown this
>>> out with restatements of stuff already covered.
>>>
>>>> ⚙️ 5. mTHP note
>>>> Although this patch doesn’t target mTHP directly, I believe a similar logic
>>>> tweak could apply there too — especially with shmem-backed workloads (common
>>>> in model servers using shared tensor memory). I’d be happy to help test any
>>>> changes proposed there to derive the consequent results.
>>> Dev - could we hold off on any effort to do something like this until I've
>>> had a chance to refactor THP somewhat? This is already a mess and I'd like
>>> to avoid us piling on more complexity.
>>>
>>> We can revisit this at a later stage.
>>
>> Yes of course. I had run a small benchmark on a quick dumb patch I wrote and I
>> don't see any measurable perf improvement, probably because the highest THP order
>> getting chosen is always PMD size.
>
> I think mTHP is much more complicated, since mTHP has many sizes.
> Trying to adjust VMA alignments to get mTHP might not work well, since
> you never know what sizes new VMAs are going to have.
Yes I agree it's more complicated. If there were a stream of allocations of
varying small-ish sizes, aligning each of them to its smallest applicable
mTHP could create gaps that wouldn't exist if we ignored the alignment, just
found any free area, and in the end merged it into an existing one.
Basically we'd risk recreating the issue with gaps.
Sticking to one size (2MB) mitigates this to some extent. Unfortunately even
after my fix the heuristics might be prone to gaps:
- all allocations not a multiple of 2MB - these will merge freely
- all allocations a multiple of 2MB - the alignment heuristic will kick in,
but as a result allocations should still merge as all boundaries are
2MB-aligned
- allocations alternating between multiples of 2MB and non-multiples of 2MB -
this will still create gaps
Note we already had a report about ebizzy regressing due to my commit [1]
and I suspect it might be due to this kind of scenario. A proper
investigation would be useful but I didn't get to it.
Maybe the solution is to first check whether the unaligned search gives us a
range that will merge with an adjacent area, and only try the alignment
heuristics if it doesn't. This will still fail if mmap() is followed by e.g.
an mprotect() or madvise() that turns an initially un-mergeable area into a
mergeable one. I have no ideas around that though. Just some thoughts to
consider for anyone wanting to change things here further :)
[1] https://lore.kernel.org/all/019401db769f%24961e7e20%24c25b7a60%24@telus.net/
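To make the alternating-size scenario above a bit more concrete, here is a
rough userspace simulation (illustrative only: a simple bottom-up placer with
made-up request sizes, not kernel code):

/*
 * Rough simulation of the placement heuristic: addresses grow upward from a
 * fake base, only 2MB-multiple lengths get 2MB-aligned, and a "gap" is any
 * distance left between the end of the previous mapping and the start of
 * the next one.
 */
#include <stdio.h>

#define MB (1UL << 20)
#define PMD_SIZE (2 * MB)

static unsigned long align_up(unsigned long x, unsigned long a)
{
        return (x + a - 1) & ~(a - 1);
}

/* Place one mapping: start at the first free address, optionally 2MB-aligned. */
static unsigned long place(unsigned long first_free, unsigned long len,
                           int use_thp_alignment)
{
        if (use_thp_alignment && (len % PMD_SIZE) == 0)
                return align_up(first_free, PMD_SIZE);
        return first_free;
}

int main(void)
{
        /* Alternating 2MB-multiple and non-multiple requests (third bullet). */
        unsigned long lens[] = { 2 * MB, 1200 * 1024, 4 * MB, 600 * 1024, 2 * MB };
        unsigned long first_free = 0, gaps = 0;

        for (unsigned long i = 0; i < sizeof(lens) / sizeof(lens[0]); i++) {
                unsigned long start = place(first_free, lens[i], 1);

                gaps += start - first_free;
                printf("len %7lu KB -> start %8lu KB (gap %lu KB)\n",
                       lens[i] / 1024, start / 1024, (start - first_free) / 1024);
                first_free = start + lens[i];
        }
        printf("total unusable gap: %lu KB\n", gaps / 1024);
        return 0;
}

Running the same sequence with the alignment heuristic disabled places every
request at first_free with zero total gap, which is roughly what the "check
for a mergeable unaligned range first" idea would give for this pattern.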
> IMHO, it might be better to align VMA to PMD or the largest mTHP size
> (for example, on ARM64 with 64KB base page, PMD THP is 512MB, a 2MB
> mTHP sounds more reasonable there) if possible and enable
> VMA merging as much as possible for future huge page collapse.
> mTHP can be used to fill the non faulted holes in VMAs if necessary.
>
>>
>> Out of curiosity, where do you plan to do the refactoring?
>
>
> Best Regards,
> Yan, Zi
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-07 8:56 ` Vlastimil Babka
@ 2025-07-28 5:41 ` siddhartha
2025-07-28 11:00 ` Vlastimil Babka
0 siblings, 1 reply; 28+ messages in thread
From: siddhartha @ 2025-07-28 5:41 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Zi Yan, linux-mm, Lorenzo Stoakes
On 2025-07-07 14:26, Vlastimil Babka wrote:
> On 7/1/25 20:49, Zi Yan wrote:
>>>> This is very useful information and it's appreciated! Let's not
>>>> drown this
>>>> out with restatements of stuff already covered.
>>>>
>>>>> ⚙️ 5. mTHP note
>>>>> Although this patch doesn’t target mTHP directly, I believe a
>>>>> similar logic
>>>>> tweak could apply there too — especially with shmem-backed
>>>>> workloads (common
>>>>> in model servers using shared tensor memory). I’d be happy to help
>>>>> test any
>>>>> changes proposed there to derive the consequent results.
>>>> Dev - could we hold off on any effort to do something like this
>>>> until I've
>>>> had a chance to refactor THP somewhat? This is already a mess and
>>>> I'd like
>>>> to avoid us piling on more complexity.
>>>>
>>>> We can revisit this at a later stage.
>>>
>>> Yes of course. I had run a small benchmark on a quick dumb patch I
>>> wrote and I
>>> don't see any measurable perf improvement, probably because the
>>> highest THP order
>>> getting chosen is always PMD size.
>>
>> I think mTHP is much more complicated, since mTHP has many sizes.
>> Trying to adjust VMA alignments to get mTHP might not work well, since
>> you never know what sizes new VMAs are going to have.
>
> Yes I agree it's more complicated. In case there would be a stream of
> allocations of varying small-ish sizes, aligning each of them to its
> smallest applicable mTHP could create gaps that wouldn't exist if we
> ignored
> the alignment and just find any free area and in the end merge it to an
> existing one. Basically we'd risk recreating the issue with gaps.
>
> Sticking to one size (2MB) mitigates this to some extent. Unfortunately
> even
> after my fix the heuristics might be prone to gaps:
>
> - all allocations not multiple of 2MB - will merge freely
>
> - all allocations multiple of 2MB - the alignment heuristic will kick
> in,
> but as a result allocations should still merge as all boundaries are
> 2MB
> alignned
>
> - allocations alternate between multiple of 2MB and non-multiple of 2MB
> -
> this will still create gaps
>
> Note we already had a report about ebizzy regressing due to my commit
> [1]
> and I suspect it might be due to this kind of scenario. A proper
> investigation would be useful but I didn't get to it.
>
> Maybe the solution is to first check if unaligned search gives us a
> range
> that will merge with adjacent area, and only try the alignment
> heuristics if
> it doesn't. This will still fail if mmap() is followed by e.g.
> mprotect() or
> madvise() that will change an initially un-mergeable area to a
> mergeable
> one. I have no ideas around that though. Just some thoughts to consider
> for
> anyone wanting to change things here further :)
>
> [1]
> https://lore.kernel.org/all/019401db769f%24961e7e20%24c25b7a60%24@telus.net/
>
>> IMHO, it might be better to align VMA to PMD or the largest mTHP size
>> (for example, on ARM64 with 64KB base page, PMD THP is 512MB, a 2MB
>> mTHP sounds more reasonable there) if possible and enable
>> VMA merging as much as possible for future huge page collapse.
>> mTHP can be used to fill the non faulted holes in VMAs if necessary.
>>
>>>
>>> Out of curiosity, where do you plan to do the refactoring?
>>
>>
>> Best Regards,
>> Yan, Zi
>>
Hi Lorenzo, Dev, Mel,
I'm following up on this patch submission from earlier this month:
"[PATCH] mm: limit THP alignment – performance gain observed in AI
inference workloads."
The change limits forced THP alignment to mappings with PMD-aligned sizes,
avoiding the artificial gaps introduced in scenarios where 2MB alignment
is not beneficial. We’ve observed consistent performance improvements in
inference pipelines (specifically with OpenVINO) where the workload
profile includes a mix of small and large allocations.
Please let me know if:
- There has been any progress or feedback from your end,
- The patch needs to align with ongoing THP refactoring efforts,
- Additional benchmarks, test traces, or system-level profiles would
help.
Happy to revise or refine the patch based on further discussion. Thanks
again for your time and input!
For your information, I have also posted the same information on the OpenVINO
and Hugging Face forums and am currently waiting for review of the commit on
the OpenVINO GitHub repository.
Best regards,
Siddhartha Sharma
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
2025-07-28 5:41 ` siddhartha
@ 2025-07-28 11:00 ` Vlastimil Babka
0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2025-07-28 11:00 UTC (permalink / raw)
To: Dev Jain, Lorenzo Stoakes, siddhartha; +Cc: Zi Yan, linux-mm@kvack.org, LKML
On 7/28/25 07:41, siddhartha@kenip.in wrote:
> On 2025-07-07 14:26, Vlastimil Babka wrote:
> Hi Lorenzo, Dev, Mel,
>
> I'm following up on this patch submission from earlier this month:
> "[PATCH] mm: limit THP alignment – performance gain observed in AI
> inference workloads."
I'm confused. That wasn't a patch submission, but reporting performance
results for my patch from late 2024? (and thanks for those!)
The patch was also already merged in late 2024:
commit d4148aeab412432bf928f311eca8a2ba52bb05df
Author: Vlastimil Babka <vbabka@suse.cz>
Date: Thu Oct 24 17:12:29 2024 +0200
mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
So there's nothing more to do here AFAIK.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
@ 2025-08-11 22:14 siddhartha
0 siblings, 0 replies; 28+ messages in thread
From: siddhartha @ 2025-08-11 22:14 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Dev Jain, Lorenzo Stoakes, linux-mm, LKML
[-- Attachment #1: Type: text/plain, Size: 3149 bytes --]
On 2025-07-28 16:30, Vlastimil Babka wrote:
> On 7/28/25 07:41, siddhartha@kenip.in wrote:
>
>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>> Hi Lorenzo, Dev, Mel,
>>
>> I'm following up on this patch submission from earlier this month:
>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>> inference workloads."
>
> I'm confused. That wasn't a patch submission, but reporting performance
> results for my patch from late 2024? (and thanks for those!)
>
> The patch was also already merged in late 2024:
>
> commit d4148aeab412432bf928f311eca8a2ba52bb05df
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date: Thu Oct 24 17:12:29 2024 +0200
>
> mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned
> sizes
>
> So there's nothing more to do here AFAIK.
Hello Vlastimil,

Hope you are doing great!

Sorry about the late reply -- my inbox somehow hid your email.

Thank you for the clarification -- yes, I am aware that the "mm, mmap:
limit THP alignment of anonymous mappings to PMD-aligned sizes" patch
was merged in late 2024 (commit
d4148aeab412432bf928f311eca8a2ba52bb05df).

The performance results I shared were generated much later because of
my working setup:

* The tests were conducted on Intel Developer Cloud workloads as part
  of a broader benchmarking exercise involving OpenVINO-based inference
  pipelines.

* The specific environment, dataset, and configuration scripts were
  stored on an SSD that unfortunately suffered corruption. I am
  currently working to recover them so I can share the exact test
  harness and commit-specific diffs. If and when I regain that access
  from Intel Developer Cloud, I can provide all the relevant files.

Although this is not a new patch submission, I thought the numbers
might still be valuable -- they show notable throughput and latency
changes when aligning the current behavior with OpenVINO's large
contiguous allocation preferences in certain inference scenarios.

Summary of observed improvements:

* Throughput: +7.3% average increase in model inference throughput on
  ResNet-50 with mixed batch sizes (64/128)

* Latency: -5.1% average reduction in P99 latency under synthetic
  concurrent load (10 inference streams)

* System impact: lower minor page fault count observed during sustained
  load, with slightly reduced RSS fluctuation

While the merged patch improves the default alignment, our tests
indicate there might be headroom for further tuning in specific HPC/AI
workloads -- particularly when hugepage alignment is applied
selectively based on allocation size and workload profile rather than
strictly on PMD-aligned sizes. I have also been working on specific
pseudo-diffs against the current Linux code, which I can generate and
send via git send-email.

I'd be happy to collaborate on a deeper investigation once I recover
the original scripts -- or I can try to replicate the environment on a
fresh setup and collect new diffs for comparison.

Best regards,
Siddhartha Sharma
[-- Attachment #2: Type: text/html, Size: 5027 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2025-08-11 22:15 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-27 10:39 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
2025-06-27 10:45 ` siddhartha
2025-06-27 15:30 ` Lorenzo Stoakes
2025-06-28 3:49 ` Dev Jain
2025-06-30 0:43 ` siddhartha
2025-06-30 5:25 ` Dev Jain
2025-06-30 5:28 ` Dev Jain
2025-06-30 10:54 ` Lorenzo Stoakes
2025-06-30 11:48 ` siddhartha
2025-07-01 5:23 ` Dev Jain
2025-07-01 5:28 ` Lorenzo Stoakes
2025-07-01 5:45 ` Dev Jain
2025-07-01 5:53 ` Lorenzo Stoakes
2025-07-01 6:30 ` Dev Jain
2025-07-01 6:50 ` Lorenzo Stoakes
2025-07-01 6:58 ` Dev Jain
2025-07-01 12:15 ` siddhartha
2025-07-01 12:39 ` Lorenzo Stoakes
2025-07-01 13:23 ` siddhartha
2025-07-01 13:28 ` Lorenzo Stoakes
2025-07-01 14:20 ` siddhartha
2025-07-01 16:20 ` Dev Jain
2025-07-01 18:49 ` Zi Yan
2025-07-07 8:56 ` Vlastimil Babka
2025-07-28 5:41 ` siddhartha
2025-07-28 11:00 ` Vlastimil Babka
2025-07-01 15:40 ` Yang Shi
-- strict thread matches above, loose matches on Subject: below --
2025-08-11 22:14 siddhartha