From: Usama Arif <usamaarif642@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, hannes@cmpxchg.org,
gutierrez.asier@huawei-partners.com, willy@infradead.org,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
ameryhung@gmail.com, rientjes@google.com, bpf@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection
Date: Tue, 19 Aug 2025 11:44:51 +0100
Message-ID: <a24d632d-4b11-4c88-9ed0-26fa12a0fce4@gmail.com>
In-Reply-To: <CALOAHbAQ=51mfszBN+Bvb9z+ZDyTBuCW_s0EKi+5rDghFvRZzg@mail.gmail.com>
On 19/08/2025 03:41, Yafang Shao wrote:
> On Mon, Aug 18, 2025 at 10:35 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 18/08/2025 06:55, Yafang Shao wrote:
>>> Background
>>> ----------
>>>
>>> Our production servers consistently configure THP to "never" due to
>>> historical incidents caused by its behavior. Key issues include:
>>> - Increased Memory Consumption
>>> THP significantly raises overall memory usage, reducing available memory
>>> for workloads.
>>>
>>> - Latency Spikes
>>> Random latency spikes occur due to frequent memory compaction triggered
>>> by THP.
>>>
>>> - Lack of Fine-Grained Control
>>> THP tuning is globally configured, making it unsuitable for containerized
>>> environments. When multiple workloads share a host, enabling THP without
>>> per-workload control leads to unpredictable behavior.
>>>
>>> Due to these issues, administrators avoid switching to the madvise or
>>> always modes unless per-workload THP control is available.
>>>
>>> To address this, we propose a BPF-based THP policy for flexible,
>>> per-workload adjustment. Additionally, as David mentioned [0], this
>>> mechanism can also serve as a policy prototyping tool (testing policies
>>> via BPF before upstreaming them).
>>
>> Hi Yafang,
>>
>> A few points:
>>
>> The link [0] is mentioned a couple of times in the cover letter, but the
>> link itself doesn't seem to be defined anywhere in it.
>
> Oops, my bad.
>
>>
>> I am probably missing something here, but the current version won't accomplish
>> the use case you describe at the start of the cover letter and are aiming for,
>> right? i.e. global THP policy "never", but hugepages granted on a madvise or
>> always basis.
>
> In "never" mode, THP allocation is entirely disabled (except via
> MADV_COLLAPSE). However, we can achieve the same behavior—and
> more—using a BPF program, even in "madvise" or "always" mode. Instead
> of introducing a new THP mode, we dynamically enforce policy via BPF.
>
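(Side note for other readers: the MADV_COLLAPSE escape hatch mentioned above
is just the normal madvise(2) call. A minimal fragment, assuming addr/len
describe an existing mapping:

    #include <stdio.h>
    #include <sys/mman.h>   /* MADV_COLLAPSE; may need <linux/mman.h> on older glibc */

    /* Ask the kernel to collapse an existing range into THPs; per the
     * above, this is honored even under the global "never" policy. */
    static int collapse_range(void *addr, size_t len)
    {
            if (madvise(addr, len, MADV_COLLAPSE) != 0) {
                    perror("madvise(MADV_COLLAPSE)");
                    return -1;
            }
            return 0;
    }
)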
> Deployment steps on our production servers:
>
> 1. Initial Setup:
> - Set THP mode to "never" (disabling THP by default).
> - Attach the BPF program and pin the BPF maps and links.
> - Pinning ensures persistence (like a kernel module), preventing
> disruption under system pressure.
> - A THP whitelist map tracks allowed cgroups (initially empty → no THP
> allocations).
>
> 2. Enable THP Control:
> - Switch THP mode to "always" or "madvise" (BPF now governs actual allocations).
Ah OK, so this is the part I was missing. With this solution you still have to change
the system policy to madvise or always, and then effectively disable THP for everyone
apart from the cgroups that want it?
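Just to check my understanding, a minimal sketch of what such a whitelist
policy could look like on the BPF side (the map layout, section name and
return-value convention are my guesses, not the actual code from this
series):

    /* thp_policy.bpf.c -- illustrative only */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u64);     /* cgroup ID */
            __type(value, __u8);    /* allowed flag */
    } thp_whitelist SEC(".maps");

    /* Suggest a THP order for the faulting task: PMD order if its
     * cgroup is whitelisted, order 0 (base pages) otherwise.
     * Arguments elided; I haven't checked the exact hook signature. */
    SEC("struct_ops/get_suggested_order")
    int BPF_PROG(suggest_order)
    {
            __u64 cgid = bpf_get_current_cgroup_id();

            if (bpf_map_lookup_elem(&thp_whitelist, &cgid))
                    return 9;       /* PMD order on x86-64 */
            return 0;
    }

(The struct_ops registration map is omitted for brevity.)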
>
> 3. Dynamic Management:
> - To permit THP for a cgroup, add its ID to the whitelist map.
> - To revoke permission, remove the cgroup ID from the map.
> - The BPF program can be updated live (policy adjustments require no
> task interruption).
>
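For anyone wanting to reproduce that flow, the whitelist updates should be
doable with plain bpftool against the pinned map. The pin path below is made
up, and the 8 key bytes are a placeholder for the cgroup ID encoded as a
host-endian u64:

    # the cgroup ID is the inode number of the cgroup directory
    stat -c %i /sys/fs/cgroup/myservice

    # allow THP for that cgroup ...
    bpftool map update pinned /sys/fs/bpf/thp_whitelist \
            key <8 key bytes> value 0x01

    # ... and revoke it again later
    bpftool map delete pinned /sys/fs/bpf/thp_whitelist \
            key <8 key bytes>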
>> I think there was a new THP mode introduced in an earlier revision, where you could switch
>> to it from "never" and then use BPF programs with it, but it's not in this revision?
>> It might be useful to add your specific use case as a selftest.
>>
>> Do we have some numbers on the overhead of calling the BPF program in the
>> page-fault path, given that it's a critical path?
>
> In our current implementation, THP allocation occurs in the page-fault
> path, but I have not yet evaluated the performance impact for this
> specific case.
> The overhead is expected to be workload-dependent, primarily influenced by:
> - Memory availability: The presence (or absence) of higher-order free pages
> - System pressure: Contention for memory compaction, NUMA balancing,
> or direct reclaim
>
Yes, I think it might be worth checking whether perf shows more time being spent
in __handle_mm_fault with this series plus the BPF program attached, compared to
without.
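Something along these lines should do it (command lines are illustrative;
<pid> stands for the workload process under test):

    # profile the same fault-heavy workload with and without the
    # BPF program attached, then compare the fault-path samples
    perf record -g -p <pid> -- sleep 10
    perf report --stdio | grep -A2 __handle_mm_fault

    # or just compare raw counts between the two runs
    perf stat -e page-faults,cycles -p <pid> -- sleep 10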
>>
>> I remember there was a discussion on this in the earlier revisions, and I have mentioned this in patch 1
>> as well, but I think making this feature experimental with warnings might not be a great idea.
>
> The experimental status of this feature was requested by David and
> Lorenzo, who likely have specific technical considerations behind this
> requirement.
>
>> It could lead to two paths:
>> - people don't deploy this in their fleet because it's marked as experimental and they don't want
>> their machines to break once they upgrade the kernel and this is changed. We will have a difficult
>> time improving upon this, as it will only be used for prototyping and won't be driven by
>> production data.
>> - people are careless and deploy it on their production machines, and you get reports that this
>> has broken after kernel upgrades (despite being marked as experimental :)).
>> This is just my opinion (which can be wrong :)), but I think we should try to have this merged
>> as a stable interface that won't change. There might be bugs reported down the line, but I am hoping
>> we can get the interface of get_suggested_order right in the first implementation that gets merged.
>
> We may eventually remove the experimental status or deprecate this
> feature entirely, depending on its adoption. However, the first
> critical step is to make it available for broader usage and
> evaluation.
>
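One more thought on the pinning point in step 1: the userspace side of that
should be nothing more than standard libbpf calls, something like the sketch
below. The skeleton and map names are made up (thp_ops being the struct_ops
registration map omitted from the earlier sketch), and I am assuming the
series exposes the policy as a struct_ops map:

    /* loader.c -- load the policy once, pin it, then exit */
    #include <bpf/libbpf.h>
    #include "thp_policy.skel.h"    /* hypothetical skeleton */

    int main(void)
    {
            struct thp_policy *skel = thp_policy__open_and_load();
            struct bpf_link *link;

            if (!skel)
                    return 1;

            link = bpf_map__attach_struct_ops(skel->maps.thp_ops);
            if (!link)
                    return 1;

            /* pin the link and map so the policy outlives this process */
            if (bpf_link__pin(link, "/sys/fs/bpf/thp_policy_link") ||
                bpf_map__pin(skel->maps.thp_whitelist, "/sys/fs/bpf/thp_whitelist"))
                    return 1;

            return 0;
    }

That way a crash or restart of the management daemon can't drop the policy,
which I think is the persistence property you are after.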