From: Yafang Shao <laoar.shao@gmail.com>
To: Zi Yan <ziy@nvidia.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, npache@redhat.com,
ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org,
usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
willy@infradead.org, ast@kernel.org, daniel@iogearbox.net,
andrii@kernel.org, bpf@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
Date: Tue, 27 May 2025 13:53:09 +0800 [thread overview]
Message-ID: <CALOAHbA=5cRHJV8hBS18oQ0C_aFx2f4JLJS0gazJgRzTNV99Ww@mail.gmail.com> (raw)
In-Reply-To: <7570019E-1FF1-47E0-82CD-D28378EBD8B6@nvidia.com>
On Mon, May 26, 2025 at 10:32 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 24 May 2025, at 23:01, Yafang Shao wrote:
>
> > On Tue, May 20, 2025 at 2:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> Background
> >> ----------
> >>
> >> At my current employer, PDD, we have consistently configured THP to "never"
> >> on our production servers due to past incidents caused by its behavior:
> >>
> >> - Increased memory consumption
> >> THP significantly raises overall memory usage.
> >>
> >> - Latency spikes
> >> Random latency spikes occur due to more frequent memory compaction
> >> activity triggered by THP.
> >>
> >> These issues have made sysadmins hesitant to switch to "madvise" or
> >> "always" modes.
> >>
> >> New Motivation
> >> --------------
> >>
> >> We have now identified that certain AI workloads achieve substantial
> >> performance gains with THP enabled. However, we’ve also verified that some
> >> workloads see little to no benefit—or are even negatively impacted—by THP.
> >>
> >> In our Kubernetes environment, we deploy mixed workloads on a single server
> >> to maximize resource utilization. Our goal is to selectively enable THP for
> >> services that benefit from it while keeping it disabled for others. This
> >> approach allows us to incrementally enable THP for additional services and
> >> assess how to make it more viable in production.
> >>
> >> Proposed Solution
> >> -----------------
> >>
> >> For this use case, Johannes suggested introducing a dedicated mode [0]. In
> >> this new mode, we could implement BPF-based THP adjustment for fine-grained
> >> control over tasks or cgroups. If no BPF program is attached, THP remains
> >> in "never" mode. This solution elegantly meets our needs while avoiding the
> >> complexity of managing BPF alongside other THP modes.
> >>
> >> A selftest example demonstrates how to enable THP for the current task
> >> while keeping it disabled for others.
> >>
> >> Alternative Proposals
> >> ---------------------
> >>
> >> - Gutierrez’s cgroup-based approach [1]
> >> - Proposed adding a new cgroup file to control THP policy.
> >> - However, as Johannes noted, cgroups are designed for hierarchical
> >> resource allocation, not arbitrary policy settings [2].
> >>
> >> - Usama’s per-task THP proposal based on prctl() [3]:
> >> - Enabling THP per task via prctl().
> >> - As David pointed out, neither madvise() nor prctl() works in "never"
> >> mode [4], making this solution insufficient for our needs.
> >>
> >> Conclusion
> >> ----------
> >>
> >> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> >> most effective solution for our requirements. This approach represents a
> >> small but meaningful step toward making THP truly usable—and manageable—in
> >> production environments.
> >>
> >> This is currently a PoC implementation. Feedback of any kind is welcome.
> >>
> >> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> >> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> >> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> >> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> >> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
> >>
> >> RFC v1->v2:
> >> The main changes are as follows,
> >> - Use struct_ops instead of fmod_ret (Alexei)
> >> - Introduce a new THP mode (Johannes)
> >> - Introduce new helpers for BPF hook (Zi)
> >> - Refine the commit log
> >>
> >> RFC v1: https://lwn.net/Articles/1019290/
> >>
> >> Yafang Shao (5):
> >> mm: thp: Add a new mode "bpf"
> >> mm: thp: Add hook for BPF based THP adjustment
> >> mm: thp: add struct ops for BPF based THP adjustment
> >> bpf: Add get_current_comm to bpf_base_func_proto
> >> selftests/bpf: Add selftest for THP adjustment
> >>
> >> include/linux/huge_mm.h | 15 +-
> >> kernel/bpf/cgroup.c | 2 -
> >> kernel/bpf/helpers.c | 2 +
> >> mm/Makefile | 3 +
> >> mm/bpf_thp.c | 120 ++++++++++++
> >> mm/huge_memory.c | 65 ++++++-
> >> mm/khugepaged.c | 3 +
> >> tools/testing/selftests/bpf/config | 1 +
> >> .../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
> >> .../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
> >> 10 files changed, 414 insertions(+), 11 deletions(-)
> >> create mode 100644 mm/bpf_thp.c
> >> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> >> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> >>
> >> --
> >> 2.43.5
> >>
> >
> > Hi all,
> >
> > Let’s summarize the current state of the discussion and identify how
> > to move forward.
> >
> > - Global-Only Control is Not Viable
> > We all seem to agree that a global-only control for THP is unwise. In
> > practice, some workloads benefit from THP while others do not, so a
> > one-size-fits-all approach doesn’t work.
> >
> > - Should We Use "Always" or "Madvise"?
> > I suspect no one would choose 'always' in its current state. ;)
> > Both Lorenzo and David propose relying on the madvise mode. However,
> > since madvise is an unprivileged userspace mechanism, any user can
> > freely adjust their THP policy. This makes fine-grained control
> > impossible without breaking userspace compatibility—an undesirable
> > tradeoff.
> > Given these limitations, the community should consider introducing a
> > new "admin" mode for privileged THP policy management.
> >
>
> I agree with the above two points.
>
> > - Can the Kernel Automatically Manage THP Without User Input?
> > In practice, users define their own success metrics—such as latency
> > (RT), queries per second (QPS), or throughput—to evaluate a feature’s
> > usefulness. If a feature fails to improve these metrics, it provides
> > no practical value.
> > Currently, the kernel lacks visibility into user-defined metrics,
> > making fully automated optimization impossible (at least without user
> > input). More importantly, automatic management offers no benefit if it
> > doesn’t align with user needs.
>
> Yes, kernel is basically guessing what userspace wants with some hints
> like MADV_HUGEPAGE/MADV_NOHUGEPAGE. But kernel has the global view
> of memory fragmentation, which userspace cannot get easily.

Correct, memory fragmentation is another critical factor in
determining whether to allocate THP.

> I wonder
> if it is possible that userspace tuning might benefit one set of
> applications but hurt others or overall performance. Right now,
> THP tuning is 0 or 1, either an application wants THPs or not.
> We might need a way of ranking THP requests from userspace to
> let kernel prioritize them (I am not sure if we can add another
> user input parameter, like THP_nice, to get this done, since
> apparently everyone will set THP_nice to -100 to get themselves
> at the top of the list).

Interesting idea. Perhaps we could make this configurable only by sysadmins.
--
Regards
Yafang