From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
gutierrez.asier@huawei-partners.com, willy@infradead.org,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org,
Yafang Shao <laoar.shao@gmail.com>
Subject: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
Date: Sun, 8 Jun 2025 15:35:11 +0800
Message-ID: <20250608073516.22415-1-laoar.shao@gmail.com>
Background
----------
We have consistently configured THP to "never" on our production servers
due to past incidents caused by its behavior:
- Increased memory consumption
THP significantly raises overall memory usage.
- Latency spikes
Random latency spikes occur due to more frequent memory compaction
activity triggered by THP.
- Lack of fine-grained control
THP tuning knobs are globally configured, making them unsuitable for
containerized environments. When different workloads run on the same
host, enabling THP globally (without per-workload control) can cause
unpredictable behavior.
Due to these issues, system administrators remain hesitant to switch to
"madvise" or "always" mode unless finer-grained control over THP
behavior is available.
New Motivation
--------------
We have now identified that certain AI workloads achieve substantial
performance gains with THP enabled. However, we have also verified that
some workloads see little to no benefit from THP, and some are even
negatively impacted by it.
In our Kubernetes environment, we deploy mixed workloads on a single server
to maximize resource utilization. Our goal is to selectively enable THP for
services that benefit from it while keeping it disabled for others. This
approach allows us to incrementally enable THP for additional services and
assess how to make it more viable in production.
Proposed Solution
-----------------
To enable fine-grained control over THP behavior, we propose dynamically
adjusting THP policies using BPF. This approach allows per-workload THP
tuning, providing greater flexibility and precision.
The BPF-based THP adjustment mechanism introduces two new APIs for granular
policy control:
- THP allocator
int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
indicating whether THP allocation should be performed synchronously
(current task) or asynchronously (khugepaged).
The decision is based on the current task context, VMA flags, and TVA
flags.
- THP reclaimer
int (*reclaimer)(bool vma_madvised);
The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
determining whether memory reclamation is handled by the current task or
kswapd.
We may explore implementing fine-grained tuning for khugepaged in future
iterations.
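As a rough illustration of how the two callbacks above might be used, here
is a minimal user-space sketch in plain C. It is not the code from this
series. The numeric values of THP_ALLOC_CURRENT, THP_ALLOC_KHUGEPAGED,
RECLAIMER_CURRENT and RECLAIMER_KSWAPD are placeholders, since only the
names are given above. SKETCH_VM_HUGEPAGE, struct thp_policy_sketch and the
sketch_* functions are hypothetical names introduced for this example. The
policy shown (do the work in the current task only for madvised regions) is
just one possible choice.

```c
#include <stdbool.h>

/* Placeholder values: only the constant names appear in the cover letter. */
enum thp_alloc_choice { THP_ALLOC_CURRENT = 0, THP_ALLOC_KHUGEPAGED = 1 };
enum thp_reclaim_choice { RECLAIMER_CURRENT = 0, RECLAIMER_KSWAPD = 1 };

/* Hypothetical stand-in for the kernel's VM_HUGEPAGE bit (assumed value). */
#define SKETCH_VM_HUGEPAGE 0x20000000UL

/* Hypothetical container with the same callback signatures as proposed. */
struct thp_policy_sketch {
	int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
	int (*reclaimer)(bool vma_madvised);
};

/* Allocate synchronously only for madvised VMAs; defer the rest. */
static int sketch_allocator(unsigned long vm_flags, unsigned long tva_flags)
{
	(void)tva_flags; /* not consulted by this example policy */

	if (vm_flags & SKETCH_VM_HUGEPAGE)
		return THP_ALLOC_CURRENT;
	return THP_ALLOC_KHUGEPAGED;
}

/* Reclaim in the current task only when the VMA was madvised. */
static int sketch_reclaimer(bool vma_madvised)
{
	return vma_madvised ? RECLAIMER_CURRENT : RECLAIMER_KSWAPD;
}

/* One policy instance wiring both callbacks together. */
static const struct thp_policy_sketch sketch_policy = {
	.allocator = sketch_allocator,
	.reclaimer = sketch_reclaimer,
};
```

With this example policy, only regions that explicitly opted in via
madvise pay the synchronous allocation and reclaim cost, while all other
regions leave the work to khugepaged and kswapd.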
Alternative Proposals
---------------------
- Gutierrez’s cgroup-based approach [1]
- Proposed adding a new cgroup file to control THP policy.
- However, as Johannes noted, cgroups are designed for hierarchical
resource allocation, not arbitrary policy settings [2].
- Usama’s per-task THP proposal based on prctl() [3]
- Enabling THP per task via prctl().
- This provides an alternative approach for per-workload THP tuning,
though it lacks dynamic policy adjustment capabilities and thus offers
limited flexibility.
This is currently a PoC implementation with limited testing. Feedback of
any kind is welcome.
Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
RFC v2->v3:
Thanks to the valuable input from David and Lorenzo:
- Finer-grained tuning based on madvise or always mode
- Use BPF to write more advanced policies / allocation logic
RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows:
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log
RFC v1: https://lwn.net/Articles/1019290/
Yafang Shao (5):
mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma()
mm, thp: add bpf thp hook to determine thp allocator
mm, thp: add bpf thp hook to determine thp reclaimer
mm: thp: add bpf thp struct ops
selftests/bpf: Add selftest for THP adjustment
include/linux/huge_mm.h | 8 +
mm/Makefile | 3 +
mm/bpf_thp.c | 184 ++++++++++++++++++
mm/huge_memory.c | 5 +
mm/khugepaged.c | 6 +-
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/thp_adjust.c | 158 +++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 38 ++++
8 files changed, 401 insertions(+), 2 deletions(-)
create mode 100644 mm/bpf_thp.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
--
2.43.5