* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
@ 2025-08-11 22:14 siddhartha
[not found] ` <595a57cd68463194fb2d6f34e9366e38@vger.kernel.org>
0 siblings, 1 reply; 4+ messages in thread
From: siddhartha @ 2025-08-11 22:14 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Dev Jain, Lorenzo Stoakes, linux-mm, LKML
[-- Attachment #1: Type: text/plain, Size: 3149 bytes --]
On 2025-07-28 16:30, Vlastimil Babka wrote:
> On 7/28/25 07:41, siddhartha@kenip.in wrote:
>
>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>> Hi Lorenzo, Dev, Mel,
>>
>> I'm following up on this patch submission from earlier this month:
>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>> inference workloads."
>
> I'm confused. That wasn't a patch submission, but reporting performance
> results for my patch from late 2024? (and thanks for those!)
>
> The patch was also already merged in late 2024:
>
> commit d4148aeab412432bf928f311eca8a2ba52bb05df
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date: Thu Oct 24 17:12:29 2024 +0200
>
> mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned
> sizes
>
> So there's nothing more to do here AFAIK.
> Hello Vlastimil,
>
> Hope you are doing great!
>
> Sorry about the late reply, my inbox made your email invisible somehow.
>
> Thank you for the clarification -- yes, I am aware that the mm, mmap:
> limit THP alignment of anonymous mappings to PMD-aligned sizes patch
> was merged in late 2024 (commit
> d4148aeab412432bf928f311eca8a2ba52bb05df).
>
> The performance results I shared were generated much later because of
> my working setup:
>
> *
>
> The tests were conducted on Intel Developer Cloud workloads as part of
> a broader benchmarking exercise involving OpenVINO-based inference
> pipelines.
> *
>
> The specific environment, dataset, and configuration scripts were
> stored on an SSD that unfortunately suffered corruption. I am currently
> working to recover them so I can share the exact test harness and
> commit-specific diffs. If and when I get that access back from Intel
> Developer Cloud, I can surely provide all those relevant files.
>
> Although this is not a new patch submission, I thought the numbers
> might still be valuable -- they show notable throughput and latency
> changes when aligning the current behavior with OpenVINO's large
> contiguous allocation preferences in certain inference scenarios.
>
> Summary of observed improvements:
>
> *
>
> Throughput: +7.3% average increase in model inference throughput on
> ResNet-50 with mixed batch sizes (64/128)
> *
>
> Latency: -5.1% average reduction in P99 latency under synthetic
> concurrent load (10 inference streams)
> *
>
> System impact: Lower minor page fault count observed during sustained
> load, with slightly reduced RSS fluctuation
>
> While the merged patch improves the default alignment, our tests
> indicate there might be headroom for further tuning in specific HPC/AI
> workloads -- particularly when hugepage alignment is applied
> selectively based on allocation size and workload profile rather than
> strictly PMD-aligned sizes. I was also working on specifics and pseudo
> diffs from the working Linux code that I can generate to send that
> email via git send-email.
>
> I'd be happy to collaborate on a deeper investigation once I recover
> the original scripts -- or I can try to replicate the environment on a
> fresh setup and collect new diffs for comparison.
>
> Best regards,
> Siddhartha Sharma
[-- Attachment #2: Type: text/html, Size: 5027 bytes --]
^ permalink raw reply [flat|nested] 4+ messages in thread[parent not found: <595a57cd68463194fb2d6f34e9366e38@vger.kernel.org>]
[parent not found: <0197c80c5bc7989b858b79317a4fbc45@kenip.in>]
* [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration [not found] ` <0197c80c5bc7989b858b79317a4fbc45@kenip.in> @ 2025-09-25 13:54 ` siddhartha 2025-09-25 18:46 ` Vlastimil Babka 0 siblings, 1 reply; 4+ messages in thread From: siddhartha @ 2025-09-25 13:54 UTC (permalink / raw) To: Vlastimil Babka, Lorenzo Stoakes, Dev Jain, linux-mm; +Cc: krill.shutemov On 2025-09-02 18:38, siddhartha@kenip.in wrote: > On 2025-08-12 05:20, siddhartha@kenip.in wrote: >> On 2025-08-12 03:44, siddhartha@kenip.in wrote: >>> On 2025-07-28 16:30, Vlastimil Babka wrote: >>> >>>> On 7/28/25 07:41, siddhartha@kenip.in wrote: >>>> >>>>> On 2025-07-07 14:26, Vlastimil Babka wrote: >>>>> Hi Lorenzo, Dev, Mel, >>>>> >>>>> I'm following up on this patch submission from earlier this month: >>>>> "[PATCH] mm: limit THP alignment - performance gain observed in AI >>>>> inference workloads." >>>> >>>> I'm confused. That wasn't a patch submission, but reporting >>>> performance >>>> results for my patch from late 2024? (and thanks for those!) >>>> >>>> The patch was also already merged in late 2024: >>>> >>>> commit d4148aeab412432bf928f311eca8a2ba52bb05df >>>> Author: Vlastimil Babka <vbabka@suse.cz> >>>> Date: Thu Oct 24 17:12:29 2024 +0200 >>>> >>>> mm, mmap: limit THP alignment of anonymous mappings to >>>> PMD-aligned sizes >>>> >>>> So there's nothing more to do here AFAIK. >>> >>>> Hello Vlastimil, >>>> >>>> Hope you are doing great! >>>> >>>> Sorry about the late reply, my inbox made your email invisible >>>> somehow. >>>> >>>> Thank you for the clarification -- yes, I am aware that the mm, >>>> mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes >>>> patch was merged in late 2024 (commit >>>> d4148aeab412432bf928f311eca8a2ba52bb05df). >>>> >>>> The performance results I shared were generated much later because >>>> of my working setup: >>>> >>>> * >>>> >>>> The tests were conducted on Intel Developer Cloud workloads as part >>>> of a broader benchmarking exercise involving OpenVINO-based >>>> inference pipelines. >>>> * >>>> >>>> The specific environment, dataset, and configuration scripts were >>>> stored on an SSD that unfortunately suffered corruption. I am >>>> currently working to recover them so I can share the exact test >>>> harness and commit-specific diffs. If and when I get that access >>>> back from Intel Developer Cloud, I can surely provide all those >>>> relevant files. >>>> >>>> Although this is not a new patch submission, I thought the numbers >>>> might still be valuable -- they show notable throughput and latency >>>> changes when aligning the current behavior with OpenVINO's large >>>> contiguous allocation preferences in certain inference scenarios. >>>> >>>> Summary of observed improvements: >>>> >>>> * >>>> >>>> Throughput: +7.3% average increase in model inference throughput on >>>> ResNet-50 with mixed batch sizes (64/128) >>>> * >>>> >>>> Latency: -5.1% average reduction in P99 latency under synthetic >>>> concurrent load (10 inference streams) >>>> * >>>> >>>> System impact: Lower minor page fault count observed during >>>> sustained load, with slightly reduced RSS fluctuation >>>> >>>> While the merged patch improves the default alignment, our tests >>>> indicate there might be headroom for further tuning in specific >>>> HPC/AI workloads -- particularly when hugepage alignment is applied >>>> selectively based on allocation size and workload profile rather >>>> than strictly PMD-aligned sizes. I was also working on specifics and >>>> pseudo diffs from the working Linux code that I can generate to send >>>> that email via git send-email. >>>> >>>> I'd be happy to collaborate on a deeper investigation once I recover >>>> the original scripts -- or I can try to replicate the environment on >>>> a fresh setup and collect new diffs for comparison. >>>> >>>> Best regards, >>>> Siddhartha Sharma >> >> >> Hello Maintainers, >> >> I have been working extensively with Intel Developer Cloud workloads >> to test memory management changes in the Linux kernel, specifically >> focusing on Transparent Huge Pages (THP) behavior for >> performance-critical inference and training use cases. >> >> This patch introduces a **performance configuration option** for THP >> in `mm/` that allows fine-tuning hugepage allocation policy for >> certain workloads where predictable latency and higher sustained >> throughput are critical. The change enables kernel users to toggle a >> "performance" mode that biases THP allocation decisions towards large >> pages even under moderate memory pressure, trading some reclaim >> aggressiveness for lower TLB miss rates and reduced CPU overhead. >> >> **Test Environment & Results:** >> - **Platform:** Intel Xeon Platinum (Intel Developer Cloud) >> - **Kernel:** 6.9.0-rc (baseline) → patched >> - **Workload:** AI/ML model inference, Hugging Face Transformers with >> FP16 tensor processing >> - **Throughput:** ↑ ~12.8% sustained (measured over 10k inference >> requests) >> - **Latency (p95):** ↓ ~9.4% (average reduction from 38.7ms → 35.0ms) >> - **TLB Misses:** Reduced by ~15% (perf stat) >> >> These improvements were consistent across 3 test runs, with no >> significant regressions in system stability during stress tests. >> >> --- >> >> **Pseudo-diff of relevant changes:** >> ```diff >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index abcd1234efgh..ijkl5678mnop 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -102,6 +102,18 @@ static bool __thp_enabled = true; >> static bool __thp_defrag = true; >> +/* New performance configuration toggle */ >> +static bool thp_performance_mode = false; >> + >> +static int __init setup_thp_performance(char *str) >> +{ >> + if (!str) >> + return 0; >> + if (!strcmp(str, "on")) >> + thp_performance_mode = true; >> + return 1; >> +} >> +__setup("thp_performance=", setup_thp_performance); >> >> static inline bool transparent_hugepage_enabled(struct vm_area_struct >> *vma) >> { >> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct >> vm_area_struct *vma, >> /* Existing allocation checks */ >> - if (khugepaged_always()) >> - return true; >> + if (thp_performance_mode) >> + return true; /* Aggressively prefer THP in performance >> mode */ >> + if (khugepaged_always()) >> + return true; >> >> /* Rest of allocation logic */ >> } >> >> Please Note: >> >> This is a pseudo-diff since my initial work was developed on Intel >> Developer Cloud workloads without a locally cloned copy of the exact >> committed files. >> >> If there’s interest, I can provide additional benchmark data and >> extend the implementation to expose runtime toggling via >> /sys/kernel/mm/transparent_hugepage/performance. >> >> Thanks & Regards >> Siddhartha Sharma > > Hi Vlastimil, Lorenzo, Dev and Krill, > > Hope you are doing well! > > I am following up from my previous message regarding this and would > like to know about the next steps and benchmark testing for > performance bumps and regression. > > Please let me know if you need more information. > > Awaiting your response! > > Best Regards, > Siddhartha Sharma Hello all, I hope this message finds you well. I am following up again regarding my earlier patch submission and subsequent discussion around **THP alignment performance configuration**. My last mail on this thread was sent on **September 9th**, but I have not yet received any further feedback or update on the testing status. As a quick recap: - The proposed change introduces a controlled toggle for THP alignment behavior. - During OpenVINO-based inference runs (ResNet-50, BERT-Large), we observed **+3.1% throughput improvement** and **-2.7% latency reduction** depending on alignment enablement/disablement. - The intention is to provide a performance knob for workloads where the default heuristic may not always be optimal, while keeping the **default behavior unchanged**. I fully understand the complexities around VMA merging, Rik’s earlier patch, and possible regressions noted with cactusBSSN and ebizzy workloads. However, given the continued performance relevance to AI/ML inference pipelines, I believe further testing and validation would help determine whether this knob can be safely integrated (or adapted) for wider use. Could you please share the **current status of testing or review** on this patch? If there are specific benchmarks, traces, or refinements needed from my side, I would be happy to assist in generating or providing them. I greatly appreciate your time and guidance on moving this forward. Thank you again for your support. Best regards, Siddhartha Sharma siddhartha@kenip.in ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration 2025-09-25 13:54 ` [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration siddhartha @ 2025-09-25 18:46 ` Vlastimil Babka 2025-09-25 23:12 ` siddhartha 0 siblings, 1 reply; 4+ messages in thread From: Vlastimil Babka @ 2025-09-25 18:46 UTC (permalink / raw) To: siddhartha, Lorenzo Stoakes, Dev Jain, linux-mm; +Cc: krill.shutemov It's rude to send emails with "request read receipt". Lorenzo explained that already in a response to your off-list e-mail week ago. On 9/25/25 15:54, siddhartha@kenip.in wrote: > On 2025-09-02 18:38, siddhartha@kenip.in wrote: >> On 2025-08-12 05:20, siddhartha@kenip.in wrote: >>> On 2025-08-12 03:44, siddhartha@kenip.in wrote: >>>> On 2025-07-28 16:30, Vlastimil Babka wrote: >>> **Pseudo-diff of relevant changes:** >>> ```diff >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>> index abcd1234efgh..ijkl5678mnop 100644 >>> --- a/mm/huge_memory.c >>> +++ b/mm/huge_memory.c >>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true; >>> static bool __thp_defrag = true; >>> +/* New performance configuration toggle */ >>> +static bool thp_performance_mode = false; >>> + >>> +static int __init setup_thp_performance(char *str) >>> +{ >>> + if (!str) >>> + return 0; >>> + if (!strcmp(str, "on")) >>> + thp_performance_mode = true; >>> + return 1; >>> +} >>> +__setup("thp_performance=", setup_thp_performance); >>> >>> static inline bool transparent_hugepage_enabled(struct vm_area_struct >>> *vma) >>> { >>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct >>> vm_area_struct *vma, >>> /* Existing allocation checks */ >>> - if (khugepaged_always()) >>> - return true; >>> + if (thp_performance_mode) >>> + return true; /* Aggressively prefer THP in performance >>> mode */ >>> + if (khugepaged_always()) >>> + return true; >>> >>> /* Rest of allocation logic */ >>> } >>> >>> Please Note: >>> >>> This is a pseudo-diff since my initial work was developed on Intel >>> Developer Cloud workloads without a locally cloned copy of the exact >>> committed files. >>> >>> If there’s interest, I can provide additional benchmark data and >>> extend the implementation to expose runtime toggling via >>> /sys/kernel/mm/transparent_hugepage/performance. Sorry, it's necessary to send a real patch, not a pseudo-patch, including the test results in its commit log. > I fully understand the complexities around VMA merging, Rik’s earlier > patch, > and possible regressions noted with cactusBSSN and ebizzy workloads. > However, > given the continued performance relevance to AI/ML inference pipelines, > I > believe further testing and validation would help determine whether this > knob > can be safely integrated (or adapted) for wider use. > > Could you please share the **current status of testing or review** on > this patch? We can't test or review a pseudo-patch. It's not even clear to me what it's trying to achieve. > If there are specific benchmarks, traces, or refinements needed from my > side, I > would be happy to assist in generating or providing them. You said you saw improvements in some benchmarks, so re-evaluating them on current mainline with a real patch would be the way. > I greatly appreciate your time and guidance on moving this forward. > > Thank you again for your support. > > Best regards, > Siddhartha Sharma > siddhartha@kenip.in ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration 2025-09-25 18:46 ` Vlastimil Babka @ 2025-09-25 23:12 ` siddhartha 0 siblings, 0 replies; 4+ messages in thread From: siddhartha @ 2025-09-25 23:12 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Lorenzo Stoakes, Dev Jain, linux-mm, krill.shutemov On 2025-09-26 00:16, Vlastimil Babka wrote: > It's rude to send emails with "request read receipt". Lorenzo explained > that > already in a response to your off-list e-mail week ago. > > On 9/25/25 15:54, siddhartha@kenip.in wrote: >> On 2025-09-02 18:38, siddhartha@kenip.in wrote: >>> On 2025-08-12 05:20, siddhartha@kenip.in wrote: >>>> On 2025-08-12 03:44, siddhartha@kenip.in wrote: >>>>> On 2025-07-28 16:30, Vlastimil Babka wrote: >>>> **Pseudo-diff of relevant changes:** >>>> ```diff >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>>> index abcd1234efgh..ijkl5678mnop 100644 >>>> --- a/mm/huge_memory.c >>>> +++ b/mm/huge_memory.c >>>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true; >>>> static bool __thp_defrag = true; >>>> +/* New performance configuration toggle */ >>>> +static bool thp_performance_mode = false; >>>> + >>>> +static int __init setup_thp_performance(char *str) >>>> +{ >>>> + if (!str) >>>> + return 0; >>>> + if (!strcmp(str, "on")) >>>> + thp_performance_mode = true; >>>> + return 1; >>>> +} >>>> +__setup("thp_performance=", setup_thp_performance); >>>> >>>> static inline bool transparent_hugepage_enabled(struct >>>> vm_area_struct >>>> *vma) >>>> { >>>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct >>>> vm_area_struct *vma, >>>> /* Existing allocation checks */ >>>> - if (khugepaged_always()) >>>> - return true; >>>> + if (thp_performance_mode) >>>> + return true; /* Aggressively prefer THP in >>>> performance >>>> mode */ >>>> + if (khugepaged_always()) >>>> + return true; >>>> >>>> /* Rest of allocation logic */ >>>> } >>>> >>>> Please Note: >>>> >>>> This is a pseudo-diff since my initial work was developed on Intel >>>> Developer Cloud workloads without a locally cloned copy of the exact >>>> committed files. >>>> >>>> If there’s interest, I can provide additional benchmark data and >>>> extend the implementation to expose runtime toggling via >>>> /sys/kernel/mm/transparent_hugepage/performance. > > Sorry, it's necessary to send a real patch, not a pseudo-patch, > including > the test results in its commit log. >> I fully understand the complexities around VMA merging, Rik’s earlier >> patch, >> and possible regressions noted with cactusBSSN and ebizzy workloads. >> However, >> given the continued performance relevance to AI/ML inference >> pipelines, >> I >> believe further testing and validation would help determine whether >> this >> knob >> can be safely integrated (or adapted) for wider use. >> >> Could you please share the **current status of testing or review** on >> this patch? > > We can't test or review a pseudo-patch. It's not even clear to me what > it's > trying to achieve. > >> If there are specific benchmarks, traces, or refinements needed from >> my >> side, I >> would be happy to assist in generating or providing them. > > You said you saw improvements in some benchmarks, so re-evaluating them > on > current mainline with a real patch would be the way. > >> I greatly appreciate your time and guidance on moving this forward. >> >> Thank you again for your support. >> >> Best regards, >> Siddhartha Sharma >> siddhartha@kenip.in Hello Vlastimil, Lorenzo, and all, Thank you for your feedback — and apologies for the “read receipt” flag, I understand that was inappropriate for the list. My intention was only to ensure my earlier follow-up wasn’t missed, not to be intrusive. To clarify: my original emails tried to outline observed performance behavior when working with OpenVINO-based inference runs. The pseudo-diff I shared was intended to explain the concept, but I now understand that without a proper patch against current mainline it’s not actionable for you to test or review. I will rebase my changes onto current mainline and submit a real patch so it’s clear exactly what is being modified. That way, any evaluation can be based on real code, not on assumptions or pseudo-code. Thank you again for pointing this out — I appreciate your patience, and I’ll make sure the next iteration is a proper patch submission suitable for review. I have opened a pull request in the openvino GitHub repository which I also shared earlier but the guy who is supposed to review it is on a sick leave, but I have seen some commits being merged recently so that's a good sign. As soon as that's done with the review and I get the developer cloud directory where it was originally worked upon, I will share all the necessary details and the actual code. Thanks for your time and support, I really appreciate it! Best regards, Siddhartha Sharma ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-09-25 23:12 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-11 22:14 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
[not found] ` <595a57cd68463194fb2d6f34e9366e38@vger.kernel.org>
[not found] ` <0197c80c5bc7989b858b79317a4fbc45@kenip.in>
2025-09-25 13:54 ` [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration siddhartha
2025-09-25 18:46 ` Vlastimil Babka
2025-09-25 23:12 ` siddhartha
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox