Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Yafang Shao <laoar.shao@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Alexei Starovoitov <ast@kernel.org>,
	 Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	 David Hildenbrand <david@redhat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	 Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	 Yonghong Song <yonghong.song@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	 KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	 Jiri Olsa <jolsa@kernel.org>, Zi Yan <ziy@nvidia.com>,
	 Liam Howlett <Liam.Howlett@oracle.com>,
	npache@redhat.com, ryan.roberts@arm.com,  dev.jain@arm.com,
	Johannes Weiner <hannes@cmpxchg.org>,
	usamaarif642@gmail.com,  gutierrez.asier@huawei-partners.com,
	Matthew Wilcox <willy@infradead.org>,
	 Amery Hung <ameryhung@gmail.com>,
	David Rientjes <rientjes@google.com>,
	 Jonathan Corbet <corbet@lwn.net>, Barry Song <21cnbao@gmail.com>,
	 Shakeel Butt <shakeel.butt@linux.dev>, Tejun Heo <tj@kernel.org>,
	lance.yang@linux.dev,  Randy Dunlap <rdunlap@infradead.org>,
	Chris Mason <clm@meta.com>, bpf <bpf@vger.kernel.org>,
	 linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
Date: Thu, 30 Oct 2025 10:40:43 +0800	[thread overview]
Message-ID: <CALOAHbC3p-zUZBs7S2r78nCPZm3F=AjJY7cL3RezTPtOXmLb6g@mail.gmail.com> (raw)
In-Reply-To: <CAADnVQK9kp_5zh0gYvXdJ=3MSuXTbmZT+cah5uhZiGk5qYfckw@mail.gmail.com>

On Thu, Oct 30, 2025 at 8:57 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 28, 2025 at 7:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 9:33 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Sun, Oct 26, 2025 at 3:03 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > The per-process BPF-THP mode is unsuitable for managing shared resources
> > > > such as shmem THP and file-backed THP. This aligns with known cgroup
> > > > limitations for similar scenarios [0].
> > > >
> > > > Introduce a global BPF-THP mode to address this gap. When registered:
> > > > - All existing per-process instances are disabled
> > > > - New per-process registrations are blocked
> > > > - Existing per-process instances remain registered (no forced unregistration)
> > > >
> > > > The global mode takes precedence over per-process instances. Updates are
> > > > type-isolated: global instances can only be updated by new global
> > > > instances, and per-process instances by new per-process instances.
> > >
> > > ...
> > >
> > > >         spin_lock(&thp_ops_lock);
> > > > -       /* Each process is exclusively managed by a single BPF-THP. */
> > > > -       if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
> > > > +       /* Each process is exclusively managed by a single BPF-THP.
> > > > +        * Global mode disables per-process instances.
> > > > +        */
> > > > +       if (rcu_access_pointer(mm->bpf_mm.bpf_thp) || rcu_access_pointer(bpf_thp_global)) {
> > > >                 err = -EBUSY;
> > > >                 goto out;
> > > >         }
> > >
> > > You didn't address the issue and instead doubled down
> > > on this broken global approach.
> > >
> > > This bait-and-switch patchset is frankly disingenuous.
> > > 'lets code up some per-mm hack, since people will hate it anyway,
> > > and I'm not going to use it either, and add this global mode
> > > as a fake "fallback"...'
> > >
> > > The way the previous thread evolved and this followup hack
> > > I don't see a genuine desire to find a solution.
> > > Just relentless push for global mode.
> > >
> > > Nacked-by: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Please carry it in all future patches.
> >
> > To move forward, I'm happy to set the global mode aside for now and
> > potentially drop it in the next version. I'd really like to hear your
> > perspective on the per-process mode. Does this implementation meet
> > your needs?
>
> Attaching st_ops to task_struct or to mm_struct is a can of worms.

The feedback suggests there may not have been an opportunity to review
patch #3 in detail yet. I would appreciate it if you could take a look
at the specific changes in that patch, as it addresses the core of the
implementation.

> With cgroup-bpf we went through painful bugs with lifetime
> of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> problems are behind us.

The attachment-based design of cgroup-bpf creates significant
operational challenges. It lacks visibility, making it difficult to
identify which cgroups have active attachments, and requires explicit
author knowledge for clean detachment.

> With st_ops in mm_struct it will be more
> painful.

To save your time, I've pasted the relevant portion of patch #3 below:

  When registering a BPF-THP, we specify the PID of a target task. The
  BPF-THP is then installed in the task's `mm_struct`

    struct mm_struct {
        struct bpf_thp_ops __rcu *thp_thp;
    };

  Inheritance Behavior:

  - Existing child processes are unaffected
  - Newly forked children inherit the BPF-THP from their parent
  -  The BPF-THP persists across execve() calls

    A new linked list tracks all tasks managed by each BPF-THP instance:

  - Newly managed tasks are added to the list
  - Exiting tasks are automatically removed from the list
  - During BPF-THP unregistration (e.g., when the BPF link is removed), all
    managed tasks have their bpf_thp pointer set to NULL
  - BPF-THP instances can be dynamically updated, with all tracked tasks
    automatically migrating to the new version.

  This design simplifies BPF-THP management in production environments by
  providing clear lifecycle management and preventing conflicts between
  multiple BPF-THP instances.

To clarify, this design has no lifecycle issues. It provides clear
traceability: you can always identify who loaded the program and which
processes it's attached to. Moreover, removing either the loader or
the pinned bpf_link will completely remove the program and all its
associated state.

> I'd rather not go that route.

I'm glad we can talk about this directly—it saves us both a lot of guesswork.

>
> And revist cgroup instead, since you were way too quick
> to accept the pushback because all you wanted is global mode.
>
> The main reason for pushback was:
> "
> Cgroup was designed for resource management not for grouping processes and
> tune those processes
> "
>
> which was true when cgroup-v2 was designed, but that ship sailed
> years ago when we introduced cgroup-bpf.
> None of the progs are doing resource management and lots of infrastructure,
> container management, and open source projects use cgroup-bpf
> as a grouping of processes. bpf progs attached to cgroup/hook tuple
> only care about processes within that cgroup. No resource management.
> See __cgroup_bpf_check_dev_permission or __cgroup_bpf_run_filter_sysctl
> and others.
> The path is current->cgroup->bpf_progs and progs do exactly
> what cgroup wasn't designed to do. They tune a set of processes.
>
> You should do the same.

I'm fully supportive of a cgroup-based approach, as it simplifies
integration by requiring only a kubelet plugin instead of
modifications to containerd.

However, my primary concern is the potential for maintainer pushback,
given the historical precedent. For instance, a similar discussion in
the NUMA-balancing context saw cgroup maintainers insisting on a
process-based method (see link below):

  https://lore.kernel.org/lkml/ldyynnd3ngxnu3bie7ezuavewshgfepro5kjids6cuxy4imzy5@nt5id7nj5kt7/

To proactively address this, what alternative plan would you recommend
if we encounter such resistance? It's unclear what a viable path
forward would be if we are committed to a cgroup-based approach but it
is ultimately rejected by the maintainers.

(Adding Michal to CC for visibility)

>
> Also I really don't see a compelling use case for bpf in THP.

I'd recommend familiarizing yourself with the THP implementation. This
would be beneficial for our discussion on the specific changes.

> Your selftest is beyond primitive:
> +int pmd_order;
> +
> +SEC("struct_ops/thp_get_order")
> +int BPF_PROG(thp_not_eligible, struct vm_area_struct *vma, enum tva_type type,
> +     unsigned long orders)
> +{
> + /* THPeligible in /proc/pid/smaps is 0 */
> + if (type == TVA_SMAPS)
> + return 0;
> + return pmd_order;
> +}
> hard code this thing. Don't bother with bpf.

A prior implementation that combined these components existed in an
earlier version:

  https://lore.kernel.org/linux-mm/20250729091807.84310-5-laoar.shao@gmail.com/

However, based on your previous guidance that fexit and struct_ops
should not be mixed, the current approach was adopted.

In summary, I'm happy to proceed with a cgroup-based implementation. I
would appreciate your support in addressing any concerns the cgroup
maintainers might have.

-- 
Regards
Yafang

next prev parent reply	other threads:[~2025-10-30  2:41 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 01/10] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 02/10] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 03/10] mm: thp: add support for BPF based THP order selection Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
2025-10-27  4:07   ` Barry Song
2025-10-26 10:01 ` [PATCH v12 mm-new 05/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode Yafang Shao
2025-10-29  1:32   ` Alexei Starovoitov
2025-10-29  2:13     ` Yafang Shao
2025-10-30  0:57       ` Alexei Starovoitov
2025-10-30  2:40         ` Yafang Shao [this message]
2025-11-27 11:48         ` David Hildenbrand (Red Hat)
2025-11-28  2:53           ` Yafang Shao
2025-11-28  7:57             ` Lorenzo Stoakes
2025-11-28  8:18               ` Yafang Shao
2025-11-28  8:31                 ` Lorenzo Stoakes
2025-11-28 11:56                   ` Yafang Shao
2025-11-28 12:18                     ` Lorenzo Stoakes
2025-11-28 12:51                       ` Yafang Shao
2025-11-28  8:39             ` David Hildenbrand (Red Hat)
2025-11-28  8:55               ` Lorenzo Stoakes
2025-11-30 13:06               ` Yafang Shao
2025-11-26 15:13     ` Rik van Riel
2025-11-27  2:35       ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 07/10] Documentation: add BPF THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 08/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 09/10] selftests/bpf: add test case to update " Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 10/10] selftests/bpf: add test case for BPF-THP inheritance across fork Yafang Shao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALOAHbC3p-zUZBs7S2r78nCPZm3F=AjJY7cL3RezTPtOXmLb6g@mail.gmail.com' \
    --to=laoar.shao@gmail.com \
    --cc=21cnbao@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexei.starovoitov@gmail.com \
    --cc=ameryhung@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=clm@meta.com \
    --cc=corbet@lwn.net \
    --cc=daniel@iogearbox.net \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=eddyz87@gmail.com \
    --cc=gutierrez.asier@huawei-partners.com \
    --cc=hannes@cmpxchg.org \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=lance.yang@linux.dev \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=martin.lau@linux.dev \
    --cc=npache@redhat.com \
    --cc=rdunlap@infradead.org \
    --cc=rientjes@google.com \
    --cc=ryan.roberts@arm.com \
    --cc=sdf@fomichev.me \
    --cc=shakeel.butt@linux.dev \
    --cc=song@kernel.org \
    --cc=tj@kernel.org \
    --cc=usamaarif642@gmail.com \
    --cc=willy@infradead.org \
    --cc=yonghong.song@linux.dev \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).