bpf.vger.kernel.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Yafang Shao <laoar.shao@gmail.com>,
	akpm@linux-foundation.org, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
	gutierrez.asier@huawei-partners.com, willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
Date: Wed, 16 Jul 2025 00:42:22 +0200	[thread overview]
Message-ID: <b2fc85fb-1c7b-40ab-922b-9351114aa994@redhat.com> (raw)
In-Reply-To: <20250608073516.22415-1-laoar.shao@gmail.com>

On 08.06.25 09:35, Yafang Shao wrote:

Sorry for not replying earlier, I was caught up with all other stuff.

I still consider this a very interesting approach, although I think we 
should think more about what a reasonable policy would look like 
medium-term (in particular, multiple THP sizes, and not always falling 
back to small pages if that means splitting excessively in the buddy 
allocator, etc.).

> Background
> ----------
> 
> We have consistently configured THP to "never" on our production servers
> due to past incidents caused by its behavior:
> 
> - Increased memory consumption
>    THP significantly raises overall memory usage.
> 
> - Latency spikes
>    Random latency spikes occur due to more frequent memory compaction
>    activity triggered by THP.
> 
> - Lack of Fine-Grained Control
>    THP tuning knobs are globally configured, making them unsuitable for
>    containerized environments. When different workloads run on the same
>    host, enabling THP globally (without per-workload control) can cause
>    unpredictable behavior.
> 
> Due to these issues, system administrators remain hesitant to switch to
> "madvise" or "always" modes—unless finer-grained control over THP
> behavior is implemented.
> 
> New Motivation
> --------------
> 
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we’ve also verified that some
> workloads see little to no benefit—or are even negatively impacted—by THP.
> 
> In our Kubernetes environment, we deploy mixed workloads on a single server
> to maximize resource utilization. Our goal is to selectively enable THP for
> services that benefit from it while keeping it disabled for others. This
> approach allows us to incrementally enable THP for additional services and
> assess how to make it more viable in production.
> 
> Proposed Solution
> -----------------
> 
> To enable fine-grained control over THP behavior, we propose dynamically
> adjusting THP policies using BPF. This approach allows per-workload THP
> tuning, providing greater flexibility and precision.
> 
> The BPF-based THP adjustment mechanism introduces two new APIs for granular
> policy control:
> 
> - THP allocator
> 
>    int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> 
>    The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
>    indicating whether THP allocation should be performed synchronously
>    (current task) or asynchronously (khugepaged).
> 
>    The decision is based on the current task context, VMA flags, and TVA
>    flags.
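
To make the proposed hook concrete, the decision could look roughly like 
the following userspace sketch. The constants and flag bits here are 
assumptions for illustration, not the actual kernel API from the patch 
series:

```c
/* Assumed return values of the proposed allocator hook. */
#define THP_ALLOC_CURRENT    0  /* allocate the THP synchronously in the faulting task */
#define THP_ALLOC_KHUGEPAGED 1  /* defer collapse to khugepaged */

/* Hypothetical flag bits, for illustration only. */
#define VM_HUGEPAGE_HINT (1UL << 0)  /* VMA was madvise(MADV_HUGEPAGE)'d */
#define TVA_IN_PF        (1UL << 1)  /* called from the page-fault path */

/*
 * Example policy: only pay the synchronous allocation (and possible
 * compaction) latency when userspace explicitly asked for huge pages;
 * otherwise let khugepaged collapse pages in the background.
 */
static int thp_allocator(unsigned long vm_flags, unsigned long tva_flags)
{
	if ((vm_flags & VM_HUGEPAGE_HINT) && (tva_flags & TVA_IN_PF))
		return THP_ALLOC_CURRENT;
	return THP_ALLOC_KHUGEPAGED;
}
```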

I think we should go one step further and actually get advice about the 
orders (THP sizes) to use. It might be helpful if the program had 
access to system stats, to make an educated decision.

Given page fault information and system information, the program could 
then decide which orders to try to allocate.

That means one would query, during page faults and during khugepaged 
scans, which order to try -- compared to our current approach of "start 
with the largest order that is enabled and fits".
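
Such an order-advice interface could be sketched like this (the helper 
and its parameters are invented for illustration): the program returns a 
bitmask of advised orders, and the fault path picks the largest advised 
order that also fits:

```c
/*
 * Hypothetical helper: given a bitmask of orders the BPF program
 * advises (bit N set means order N is worth trying) and the largest
 * order that actually fits the faulting VMA, pick the order to
 * attempt first.
 */
static int highest_advised_order(unsigned long advised_mask,
				 int max_fitting_order)
{
	for (int order = max_fitting_order; order > 0; order--)
		if (advised_mask & (1UL << order))
			return order;
	return 0;	/* nothing advised: fall back to base pages */
}
```

With orders 4 and 9 advised but only order 6 fitting, this attempts 
order 4 rather than insisting on PMD size or falling straight back to 
base pages.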

> 
> - THP reclaimer
> 
>    int (*reclaimer)(bool vma_madvised);
> 
>    The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
>    determining whether memory reclamation is handled by the current task or
>    kswapd.
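
Illustratively (again, with assumed constants rather than the patch's 
actual definitions), the reclaimer decision might amount to:

```c
/* Assumed return values of the proposed reclaimer hook. */
#define RECLAIMER_CURRENT 0  /* reclaim synchronously in the allocating task */
#define RECLAIMER_KSWAPD  1  /* wake kswapd and reclaim asynchronously */

/*
 * Example policy: a madvised VMA signals the workload values THP
 * enough to absorb direct-reclaim latency; everything else defers
 * to kswapd.
 */
static int thp_reclaimer(int vma_madvised)
{
	return vma_madvised ? RECLAIMER_CURRENT : RECLAIMER_KSWAPD;
}
```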

Not sure about that, will have to look into the details.

But what could be interesting is deciding how to deal with underutilized 
THPs: for now we will try replacing zero-filled pages with the shared 
zeropage during a split. *Maybe* some workloads could benefit from ... 
not doing that, and instead optimizing the split.

That will maybe be a bit more tricky, though.


-- 
Cheers,

David / dhildenb



Thread overview: 33+ messages
2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma() Yafang Shao
2025-07-17 14:48   ` Usama Arif
2025-07-20  2:37     ` Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 2/5] mm, thp: add bpf thp hook to determine thp allocator Yafang Shao
2025-07-17 15:30   ` Usama Arif
2025-07-20  3:00     ` Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 3/5] mm, thp: add bpf thp hook to determine thp reclaimer Yafang Shao
2025-07-17 16:06   ` Usama Arif
2025-07-20  3:03     ` Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops Yafang Shao
2025-07-17 16:25   ` Usama Arif
2025-07-17 18:21   ` Amery Hung
2025-07-20  3:07     ` Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 5/5] selftests/bpf: Add selftest for THP adjustment Yafang Shao
2025-07-15 22:42 ` David Hildenbrand [this message]
2025-07-17  3:09   ` [RFC PATCH v3 0/5] mm, bpf: BPF based " Yafang Shao
2025-07-17  8:52     ` David Hildenbrand
2025-07-17  9:05       ` Lorenzo Stoakes
2025-07-20  2:32       ` Yafang Shao
2025-07-20 15:56         ` David Hildenbrand
2025-07-22  2:40           ` Yafang Shao
2025-07-22  7:28             ` David Hildenbrand
2025-07-22 10:09               ` Lorenzo Stoakes
2025-07-22 11:56                 ` Yafang Shao
2025-07-22 12:04                   ` Lorenzo Stoakes
2025-07-22 12:16                     ` Yafang Shao
2025-07-22 11:46               ` Yafang Shao
2025-07-22 11:54                 ` Lorenzo Stoakes
2025-07-22 12:02                   ` Yafang Shao
2025-07-22 12:08                     ` Lorenzo Stoakes
2025-07-17 16:35 ` Usama Arif
2025-07-20  2:54   ` Yafang Shao
