From: Yafang Shao <laoar.shao@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
David Hildenbrand <david@redhat.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Zi Yan <ziy@nvidia.com>,
Liam Howlett <Liam.Howlett@oracle.com>,
npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
Johannes Weiner <hannes@cmpxchg.org>,
usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
Matthew Wilcox <willy@infradead.org>,
Amery Hung <ameryhung@gmail.com>,
David Rientjes <rientjes@google.com>,
Jonathan Corbet <corbet@lwn.net>, Barry Song <21cnbao@gmail.com>,
Shakeel Butt <shakeel.butt@linux.dev>, Tejun Heo <tj@kernel.org>,
lance.yang@linux.dev, Randy Dunlap <rdunlap@infradead.org>,
Chris Mason <clm@meta.com>, bpf <bpf@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
Date: Thu, 30 Oct 2025 10:40:43 +0800 [thread overview]
Message-ID: <CALOAHbC3p-zUZBs7S2r78nCPZm3F=AjJY7cL3RezTPtOXmLb6g@mail.gmail.com> (raw)
In-Reply-To: <CAADnVQK9kp_5zh0gYvXdJ=3MSuXTbmZT+cah5uhZiGk5qYfckw@mail.gmail.com>
On Thu, Oct 30, 2025 at 8:57 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 28, 2025 at 7:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 9:33 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Sun, Oct 26, 2025 at 3:03 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > The per-process BPF-THP mode is unsuitable for managing shared resources
> > > > such as shmem THP and file-backed THP. This aligns with known cgroup
> > > > limitations for similar scenarios [0].
> > > >
> > > > Introduce a global BPF-THP mode to address this gap. When registered:
> > > > - All existing per-process instances are disabled
> > > > - New per-process registrations are blocked
> > > > - Existing per-process instances remain registered (no forced unregistration)
> > > >
> > > > The global mode takes precedence over per-process instances. Updates are
> > > > type-isolated: global instances can only be updated by new global
> > > > instances, and per-process instances by new per-process instances.
> > >
> > > ...
> > >
> > > > spin_lock(&thp_ops_lock);
> > > > - /* Each process is exclusively managed by a single BPF-THP. */
> > > > - if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
> > > > + /* Each process is exclusively managed by a single BPF-THP.
> > > > + * Global mode disables per-process instances.
> > > > + */
> > > > + if (rcu_access_pointer(mm->bpf_mm.bpf_thp) || rcu_access_pointer(bpf_thp_global)) {
> > > > err = -EBUSY;
> > > > goto out;
> > > > }
> > >
> > > You didn't address the issue and instead doubled down
> > > on this broken global approach.
> > >
> > > This bait-and-switch patchset is frankly disingenuous.
> > > 'lets code up some per-mm hack, since people will hate it anyway,
> > > and I'm not going to use it either, and add this global mode
> > > as a fake "fallback"...'
> > >
> > > The way the previous thread evolved and this followup hack
> > > I don't see a genuine desire to find a solution.
> > > Just relentless push for global mode.
> > >
> > > Nacked-by: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Please carry it in all future patches.
> >
> > To move forward, I'm happy to set the global mode aside for now and
> > potentially drop it in the next version. I'd really like to hear your
> > perspective on the per-process mode. Does this implementation meet
> > your needs?
>
> Attaching st_ops to task_struct or to mm_struct is a can of worms.
The feedback suggests there may not have been an opportunity to review
patch #3 in detail yet. I would appreciate it if you could take a look
at the specific changes in that patch, as it addresses the core of the
implementation.
> With cgroup-bpf we went through painful bugs with lifetime
> of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> problems are behind us.
The attachment-based design of cgroup-bpf creates significant
operational challenges. It lacks visibility, making it difficult to
identify which cgroups have active attachments, and requires explicit
author knowledge for clean detachment.
> With st_ops in mm_struct it will be more
> painful.
To save your time, I've pasted the relevant portion of patch #3 below:
When registering a BPF-THP, we specify the PID of a target task. The
BPF-THP is then installed in the task's `mm_struct`
struct mm_struct {
struct bpf_thp_ops __rcu *thp_thp;
};
Inheritance Behavior:
- Existing child processes are unaffected
- Newly forked children inherit the BPF-THP from their parent
- The BPF-THP persists across execve() calls
A new linked list tracks all tasks managed by each BPF-THP instance:
- Newly managed tasks are added to the list
- Exiting tasks are automatically removed from the list
- During BPF-THP unregistration (e.g., when the BPF link is removed), all
managed tasks have their bpf_thp pointer set to NULL
- BPF-THP instances can be dynamically updated, with all tracked tasks
automatically migrating to the new version.
This design simplifies BPF-THP management in production environments by
providing clear lifecycle management and preventing conflicts between
multiple BPF-THP instances.
To clarify, this design has no lifecycle issues. It provides clear
traceability: you can always identify who loaded the program and which
processes it's attached to. Moreover, removing either the loader or
the pinned bpf_link will completely remove the program and all its
associated state.
> I'd rather not go that route.
I'm glad we can talk about this directly—it saves us both a lot of guesswork.
>
> And revist cgroup instead, since you were way too quick
> to accept the pushback because all you wanted is global mode.
>
> The main reason for pushback was:
> "
> Cgroup was designed for resource management not for grouping processes and
> tune those processes
> "
>
> which was true when cgroup-v2 was designed, but that ship sailed
> years ago when we introduced cgroup-bpf.
> None of the progs are doing resource management and lots of infrastructure,
> container management, and open source projects use cgroup-bpf
> as a grouping of processes. bpf progs attached to cgroup/hook tuple
> only care about processes within that cgroup. No resource management.
> See __cgroup_bpf_check_dev_permission or __cgroup_bpf_run_filter_sysctl
> and others.
> The path is current->cgroup->bpf_progs and progs do exactly
> what cgroup wasn't designed to do. They tune a set of processes.
>
> You should do the same.
I'm fully supportive of a cgroup-based approach, as it simplifies
integration by requiring only a kubelet plugin instead of
modifications to containerd.
However, my primary concern is the potential for maintainer pushback,
given the historical precedent. For instance, a similar discussion in
the NUMA-balancing context saw cgroup maintainers insisting on a
process-based method (see link below):
https://lore.kernel.org/lkml/ldyynnd3ngxnu3bie7ezuavewshgfepro5kjids6cuxy4imzy5@nt5id7nj5kt7/
To proactively address this, what alternative plan would you recommend
if we encounter such resistance? It's unclear what a viable path
forward would be if we are committed to a cgroup-based approach but it
is ultimately rejected by the maintainers.
(Adding Michal to CC for visibility)
>
> Also I really don't see a compelling use case for bpf in THP.
I'd recommend familiarizing yourself with the THP implementation. This
would be beneficial for our discussion on the specific changes.
> Your selftest is beyond primitive:
> +int pmd_order;
> +
> +SEC("struct_ops/thp_get_order")
> +int BPF_PROG(thp_not_eligible, struct vm_area_struct *vma, enum tva_type type,
> + unsigned long orders)
> +{
> + /* THPeligible in /proc/pid/smaps is 0 */
> + if (type == TVA_SMAPS)
> + return 0;
> + return pmd_order;
> +}
> hard code this thing. Don't bother with bpf.
A prior implementation that combined these components existed in an
earlier version:
https://lore.kernel.org/linux-mm/20250729091807.84310-5-laoar.shao@gmail.com/
However, based on your previous guidance that fexit and struct_ops
should not be mixed, the current approach was adopted.
In summary, I'm happy to proceed with a cgroup-based implementation. I
would appreciate your support in addressing any concerns the cgroup
maintainers might have.
--
Regards
Yafang
next prev parent reply other threads:[~2025-10-30 2:41 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 01/10] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 02/10] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 03/10] mm: thp: add support for BPF based THP order selection Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
2025-10-27 4:07 ` Barry Song
2025-10-26 10:01 ` [PATCH v12 mm-new 05/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode Yafang Shao
2025-10-29 1:32 ` Alexei Starovoitov
2025-10-29 2:13 ` Yafang Shao
2025-10-30 0:57 ` Alexei Starovoitov
2025-10-30 2:40 ` Yafang Shao [this message]
2025-11-27 11:48 ` David Hildenbrand (Red Hat)
2025-11-28 2:53 ` Yafang Shao
2025-11-28 7:57 ` Lorenzo Stoakes
2025-11-28 8:18 ` Yafang Shao
2025-11-28 8:31 ` Lorenzo Stoakes
2025-11-28 11:56 ` Yafang Shao
2025-11-28 12:18 ` Lorenzo Stoakes
2025-11-28 12:51 ` Yafang Shao
2025-11-28 8:39 ` David Hildenbrand (Red Hat)
2025-11-28 8:55 ` Lorenzo Stoakes
2025-11-30 13:06 ` Yafang Shao
2025-11-26 15:13 ` Rik van Riel
2025-11-27 2:35 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 07/10] Documentation: add BPF THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 08/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 09/10] selftests/bpf: add test case to update " Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 10/10] selftests/bpf: add test case for BPF-THP inheritance across fork Yafang Shao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CALOAHbC3p-zUZBs7S2r78nCPZm3F=AjJY7cL3RezTPtOXmLb6g@mail.gmail.com' \
--to=laoar.shao@gmail.com \
--cc=21cnbao@gmail.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alexei.starovoitov@gmail.com \
--cc=ameryhung@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=clm@meta.com \
--cc=corbet@lwn.net \
--cc=daniel@iogearbox.net \
--cc=david@redhat.com \
--cc=dev.jain@arm.com \
--cc=eddyz87@gmail.com \
--cc=gutierrez.asier@huawei-partners.com \
--cc=hannes@cmpxchg.org \
--cc=haoluo@google.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=lance.yang@linux.dev \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=martin.lau@linux.dev \
--cc=npache@redhat.com \
--cc=rdunlap@infradead.org \
--cc=rientjes@google.com \
--cc=ryan.roberts@arm.com \
--cc=sdf@fomichev.me \
--cc=shakeel.butt@linux.dev \
--cc=song@kernel.org \
--cc=tj@kernel.org \
--cc=usamaarif642@gmail.com \
--cc=willy@infradead.org \
--cc=yonghong.song@linux.dev \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).