* [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
@ 2025-05-20  6:04 Yafang Shao
From: Yafang Shao @ 2025-05-20  6:04 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm, Yafang Shao

Background
----------

At my current employer, PDD, we have consistently configured THP to "never"
on our production servers due to past incidents caused by its behavior:

- Increased memory consumption
  THP significantly raises overall memory usage.

- Latency spikes
  Random latency spikes occur due to more frequent memory compaction
  activity triggered by THP.

These issues have made sysadmins hesitant to switch to "madvise" or
"always" modes.

New Motivation
--------------

We have now identified that certain AI workloads achieve substantial
performance gains with THP enabled. However, we’ve also verified that some
workloads see little to no benefit—or are even negatively impacted—by THP.

In our Kubernetes environment, we deploy mixed workloads on a single server
to maximize resource utilization. Our goal is to selectively enable THP for
services that benefit from it while keeping it disabled for others. This
approach allows us to incrementally enable THP for additional services and
assess how to make it more viable in production.

Proposed Solution
-----------------

For this use case, Johannes suggested introducing a dedicated mode [0]. In
this new mode, we could implement BPF-based THP adjustment for fine-grained
control over tasks or cgroups. If no BPF program is attached, THP remains
in "never" mode. This solution elegantly meets our needs while avoiding the
complexity of managing BPF alongside other THP modes.

A selftest example demonstrates how to enable THP for the current task
while keeping it disabled for others.

Alternative Proposals
---------------------

- Gutierrez’s cgroup-based approach [1]
  - Proposed adding a new cgroup file to control THP policy.
  - However, as Johannes noted, cgroups are designed for hierarchical
    resource allocation, not arbitrary policy settings [2].
  
- Usama’s per-task THP proposal based on prctl() [3]
  - Enables THP per task via prctl().
  - As David pointed out, neither madvise() nor prctl() works in "never"
    mode [4], making this solution insufficient for our needs.

Conclusion
----------

Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
most effective solution for our requirements. This approach represents a
small but meaningful step toward making THP truly usable—and manageable—in
production environments.

This is currently a PoC implementation. Feedback of any kind is welcome.

Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0] 
Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]

RFC v1->v2:
The main changes are as follows:
- Use struct_ops instead of fmod_ret (Alexei) 
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (5):
  mm: thp: Add a new mode "bpf"
  mm: thp: Add hook for BPF based THP adjustment
  mm: thp: add struct ops for BPF based THP adjustment
  bpf: Add get_current_comm to bpf_base_func_proto
  selftests/bpf: Add selftest for THP adjustment

 include/linux/huge_mm.h                       |  15 +-
 kernel/bpf/cgroup.c                           |   2 -
 kernel/bpf/helpers.c                          |   2 +
 mm/Makefile                                   |   3 +
 mm/bpf_thp.c                                  | 120 ++++++++++++
 mm/huge_memory.c                              |  65 ++++++-
 mm/khugepaged.c                               |   3 +
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 175 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
 10 files changed, 414 insertions(+), 11 deletions(-)
 create mode 100644 mm/bpf_thp.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

-- 
2.43.5



