* [LSF/MM/BPF TOPIC] Using BPF in MM
From: Roman Gushchin @ 2026-04-27 23:57 UTC (permalink / raw)
To: bpf, linux-mm, Vlastimil Babka
Cc: Shakeel Butt, Andrew Morton, David Hildenbrand, lsf-pc,
Daniel Borkmann
[LSF/MM/BPF TOPIC] Using BPF in MM
----------------------------------
Over the last decade, BPF has successfully penetrated multiple kernel
subsystems: it started as a feature to filter (out) networking packets
and has since captured its place in networking, tracing, security, HID
drivers, and scheduling. Memory management is a logical next step, and
recently we have seen a growing number of proposals in this area.
In (approximately) historical order:
- BPF OOM
- BPF-based memcg stats access (landed)
- BPF-based NUMA balancing
- eBPF-mm
- cache_ext (BPF Page Cache)
- memcg_ext
There are some obvious targets which haven't been covered yet:
- BPF-driven readahead control
- BPF-driven KSM
- BPF-driven guest memory control
Despite the large number of suggestions, only one relatively small feature
(querying memcg statistics from BPF) has made it upstream.
It looks like using BPF in the MM subsystem comes with a set of somewhat
unique challenges and questions to be answered.
Problem 1. In-Tree/Out-of-Tree BPF Programs
-------------------------------------------
Historically, BPF was used to build relatively simple programs
which implemented custom policies; these are arguably mostly user-specific
and of limited value when shared. So keeping them outside of the Linux
source tree was totally reasonable. In the tree we had relatively simple
programs which played the role of examples, tests, and documentation. But with
the growing capabilities of BPF, more and more complex BPF programs and
sets of programs are becoming viable. Arguably, sched_ext and the specific
scheduler implementations built on it are the most complex BPF interfaces now.
Sched_ext developers decided to keep minimalist reference schedulers
in-tree, while production-grade schedulers are developed outside.
There are pros and cons: this allows for much faster iteration, but
at the cost of a fragmentation risk.
It seems like memory management maintainers (at least Andrew Morton)
are willing to see production-grade BPF programs in the tree. This solves
the fragmentation concern and brings more attention and collaborators,
but somewhat undermines the strengths of BPF: speed of iteration
and ease of customization. And some of the programs are simply too
business-specific to upstream (e.g., an OOM policy which relies on
cloud orchestrator logic for the victim selection).
So I expect to see both in practice: policy-heavy programs
will live outside the tree, while generic mechanisms (e.g., BPF-driven
memory tiering or a cgroup-aware OOM killer) will live within it.
Keeping complex BPF programs in-tree requires some help from the BPF
community: we need to decide where to keep them, what the maintenance
policy is, and potentially how to ship them with the kernel binary.
Problem 2. Performance in Hot Paths & Cgroup Hierarchy
------------------------------------------------------
BPF was always optimized for speed, and it's really fast. However,
for *some* MM use cases this might not be enough, especially if we
simultaneously want to keep it safe (see the next problem). Traffic
control programs which run for every packet need to be very fast,
but at least there is usually no state to manage. If we allow BPF
programs to actually manipulate low-level MM data types in a safe way
(e.g., a folio's LRU pointers), it almost inevitably hurts performance.
Also, the lifetime tracking of objects becomes more complex: BPF often
relies on RCU to guarantee memory safety, but it's not trivial and
certainly not free to provide RCU guarantees for, e.g., all folios.
And if we do it using reference counting, that's a performance overhead.
I believe the solution is to provide safe and performant kfuncs
for operating on low-level data structures, but there is likely a tradeoff
to make between performance, safety guarantees, and flexibility.
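To make this more concrete, here is a rough sketch of what such a kfunc
could look like (bpf_folio_try_isolate() is a made-up name, not an existing
interface; kfunc registration boilerplate is omitted):

/* Hypothetical kernel-side kfunc: BPF never touches folio->lru directly,
 * the kfunc pins the folio and performs the LRU manipulation itself. */
__bpf_kfunc bool bpf_folio_try_isolate(struct folio *folio)
{
	bool isolated = false;

	/* Pin the folio so it can't be freed while we're working on it. */
	if (!folio_try_get(folio))
		return false;

	/* folio_isolate_lru() takes its own reference on success. */
	if (folio_test_lru(folio))
		isolated = folio_isolate_lru(folio);

	/* Drop the temporary pin. */
	folio_put(folio);
	return isolated;
}

Even in this trivial sketch the refcounting isn't free, which is exactly
the performance/safety tradeoff mentioned above.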
For MM programs which operate on memory cgroups, there is a separate
question: how to implement attachment to cgroups? For ordinary BPF programs
there is a complex infrastructure to propagate attached programs to all
cgroups in a sub-tree. For struct_ops, which is increasingly used to
implement complex BPF mechanisms, there is no such mechanism yet. And
it's not obvious what the best way to implement it is: there might be
some state at a specific cgroup level, different mechanisms require
different hierarchical behavior, etc. E.g., for BPF OOM, it's perfectly
fine and even desirable to have it attached at some levels and traverse
the hierarchy when it needs to be invoked. But for some programs on very
hot paths, this overhead might not be acceptable.
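For illustration, the "attach at some level, traverse on invocation" model
could look roughly like this (struct bpf_oom_ops and memcg->bpf_oom_ops are
made-up names standing in for whatever attachment point gets implemented):

/* Hypothetical: walk up the memcg hierarchy and use the nearest attached
 * handler. Acceptable for a slow path like OOM, too costly for hot paths. */
static struct bpf_oom_ops *bpf_oom_ops_for(struct mem_cgroup *memcg)
{
	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		struct bpf_oom_ops *ops = READ_ONCE(memcg->bpf_oom_ops);

		if (ops)
			return ops;
	}
	return NULL;	/* nothing attached: fall back to the default behavior */
}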
Finally, MM heavily relies on batching to minimize the performance
overhead, but that comes with its own set of tradeoffs. E.g., for large
machines with hundreds of CPUs which are running thousands of cgroups,
it's really hard to come up with memcg statistics which are reasonably
accurate but also not slowing everything down. If we add BPF on top of
the batching, it's somewhat limited: e.g., a user can't implement a custom
batching mechanism. But most likely we can't do otherwise: the performance
overhead is simply too high.
Problem 3. Safety guarantees and fallback mechanisms
----------------------------------------------------
Safety guarantees were always one of the main, if not the main, selling
points for BPF. Otherwise, why not simply use kernel modules? But what
exactly do the BPF verifier and runtime engine guarantee? For networking,
tracing, and even the scheduler, the answer is the stability of the
kernel itself (no oopses, UAF, or data corruption).
But the quality of service or usefulness of the system from a user's
perspective is not strictly guaranteed. A malformed BPF program which
drops all the traffic and makes the system unreachable over SSH is
considered acceptable. Sched_ext falls back to CFS if the BPF scheduler
is doing an obviously poor job scheduling tasks, but it takes time and,
of course, it doesn't guarantee performance, so a particularly bad BPF
scheduler can make the system barely usable.
What's the acceptable level of service for MM?
Given how critical MM is to the functioning of the system, it's hard to
guarantee system stability without sacrificing flexibility.
A trivial example: if we allow a BPF OOM handler to do nothing and let
the system deadlock on memory, is it still acceptable? And if not,
how do we implement the safety guarantee? One way is to add a layer of kfuncs
which limit what BPF can achieve and also record what it does. E.g.,
BPF programs are allowed to kill processes only via a special helper,
and a BPF program has to invoke it at least once. But this is complicated
even for OOM handling; for hotter paths, adding such a layer will likely
come with an unacceptable performance overhead.
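A sketch of that idea, reusing the bpf_oom_ops_for() helper from above
(all names here are illustrative, not an existing interface; task
refcounting is glossed over, which is exactly the kind of detail such a
layer would have to get right):

/* The only way a BPF OOM handler is allowed to kill: a kfunc which reuses
 * the regular OOM kill path and records that progress was made. */
__bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
				     struct task_struct *task)
{
	oc->chosen = task;
	oom_kill_process(oc, "BPF OOM handler");
	oc->bpf_memory_freed = true;	/* hypothetical bookkeeping field */
	return 0;
}

/* Invocation site: if the handler returns without having used the kfunc,
 * fall back to the in-kernel OOM killer instead of risking a deadlock. */
static bool bpf_handle_oom(struct oom_control *oc)
{
	struct bpf_oom_ops *ops = bpf_oom_ops_for(oc->memcg);

	if (!ops)
		return false;

	oc->bpf_memory_freed = false;
	ops->handle_out_of_memory(oc);
	return oc->bpf_memory_freed;	/* false => kernel OOM killer runs */
}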
A scheduler-like time-based fallback is also not easily applicable:
MM has historically had no notion of time, relying on refault distances,
LRU lengths, the ratio of scanned vs. reclaimed pages, etc. So time-based
fallback mechanisms will not work well without more systematic changes.
In MM, it's usually not trivial to determine if things are really off
(even without BPF). The kernel has historically had trouble deciding when
it's actually time to invoke the OOM killer. The effectiveness of,
for example, a specific readahead implementation or a certain reclaim
policy is not trivial to measure, and it is even harder to derive
acceptance criteria which can be calculated dynamically with an
acceptable overhead. If things are mildly off, it can be written off
as sub-optimal performance. But if a faulty BPF program
is leading to heavy thrashing, how do we make sure the system ends up
unloading the BPF program instead of killing all userspace programs?
And to make things worse, BPF itself can't be totally isolated from
relying on MM. BPF maps are backed by slabs and/or vmalloc. How can we
make sure there are no circular dependencies and associated memory leaks?
--
It seems obvious at this point that there is huge potential and a lot
of interest in using BPF in MM. Answering the questions above seems to be
required for the initial adoption, but I bet adding more use cases
afterwards will go faster and more smoothly.
* Re: [LSF/MM/BPF TOPIC] Using BPF in MM
From: David Hildenbrand (Arm) @ 2026-04-28 8:12 UTC (permalink / raw)
To: Roman Gushchin, bpf, linux-mm, Vlastimil Babka
Cc: Shakeel Butt, Andrew Morton, lsf-pc, Daniel Borkmann
On 4/28/26 01:57, Roman Gushchin wrote:
> [LSF/MM/BPF TOPIC] Using BPF in MM
> ----------------------------------
>
> Over the last decade, BPF has successfully penetrated multiple kernel
> subsystems: it started as a feature to filter (out) networking packets
> and has since captured its place in networking, tracing, security, HID
> drivers, and scheduling. Memory management is a logical next step, and
> recently we have seen a growing number of proposals in this area.
>
> In (approximately) historical order:
> - BPF OOM
> - BPF-based memcg stats access (landed)
> - BPF-based NUMA balancing
> - eBPF-mm
> - cache_ext (BPF Page Cache)
> - memcg_ext
There was also the BPF THP control.
>
> There are some obvious targets which haven't been covered yet:
> - BPF-driven readahead control
> - BPF-driven KSM
> - BPF-driven guest memory control
>
> Despite the large number of suggestions, only one relatively small feature
> (querying memcg statistics from BPF) has made it upstream.
>
> It looks like using BPF in the MM subsystem comes with a set of somewhat
> unique challenges and questions to be answered.
[...]
I think you are missing one of the most important points: unclear ABI stability
guarantees.
On the one hand, we are told that there are no ABI stability guarantees, and
that we can change hooks (add/remove/modify) any time we want.
On the other hand, as soon as there is some ebpf program out there that we
break, you can rest assured that there will be trouble.
In the area of THP, where we don't even know which hooks we will need long term
and what they would look like, that was one of the reasons why the BPF THP
control was rejected.
--
Cheers,
David
* Re: [LSF/MM/BPF TOPIC] Using BPF in MM
From: Roman Gushchin @ 2026-04-28 16:35 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Andrew Morton,
lsf-pc, Daniel Borkmann
"David Hildenbrand (Arm)" <david@kernel.org> writes:
> On 4/28/26 01:57, Roman Gushchin wrote:
>> [LSF/MM/BPF TOPIC] Using BPF in MM
>> ----------------------------------
>>
>> Over the last decade, BPF has successfully penetrated multiple kernel
>> subsystems: it started as a feature to filter (out) networking packets
>> and has since captured its place in networking, tracing, security, HID
>> drivers, and scheduling. Memory management is a logical next step, and
>> recently we have seen a growing number of proposals in this area.
>>
>> In (approximately) historical order:
>> - BPF OOM
>> - BPF-based memcg stats access (landed)
>> - BPF-based NUMA balancing
>> - eBPF-mm
>> - cache_ext (BPF Page Cache)
>> - memcg_ext
>
> There was also the BPF THP control.
Thanks, missed that.
>
>>
>> There are some obvious targets which haven't been covered yet:
>> - BPF-driven readahead control
>> - BPF-driven KSM
>> - BPF-driven guest memory control
>>
>> Despite the large number of suggestions, only one relatively small feature
>> (querying memcg statistics from BPF) has made it upstream.
>>
>> It looks like using BPF in the MM subsystem comes with a set of somewhat
>> unique challenges and questions to be answered.
>
> [...]
>
> I think you are missing one of the most important points: Unclear ABI stability
> guarantees.
Totally agree; it's just not specific to MM and is not very new.
Arguments about ABI stability are almost as old as BPF itself; e.g.,
a quick search gave me an LWN article from 2019:
https://lwn.net/Articles/787856/ .
>
> On the one hand, we are told that there are no ABI stability guarantees, and
> that we can change hooks (add/remove/modify) any time we want.
>
> On the other hand, as soon as there is some ebpf program out there that we
> break, you can rest assured that there will be trouble.
>
> In the area of THP, where we don't even know which hooks we will need long term
> and what they would look like, that was one of the reasons why the BPF THP
> control was rejected.
Agree. And in my mind it's also related to the safety/performance tradeoff:
if we use very generic hooks/interfaces, it's easier to keep them stable
and meaningful, but then it's hard to guarantee safety without
performance sacrifices. Or, if we use very targeted policy hooks, it's
much easier to make them safe and performant, but they may become
obsolete very quickly.
* Re: [LSF/MM/BPF TOPIC] Using BPF in MM
From: Yafang Shao @ 2026-04-29 2:43 UTC (permalink / raw)
To: Roman Gushchin, Song Liu, Petr Mladek
Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Andrew Morton,
David Hildenbrand, lsf-pc, Daniel Borkmann
On Tue, Apr 28, 2026 at 7:58 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> [LSF/MM/BPF TOPIC] Using BPF in MM
> ----------------------------------
>
> Over the last decade, BPF has successfully penetrated multiple kernel
> subsystems: it started as a feature to filter (out) networking packets
> and has since captured its place in networking, tracing, security, HID
> drivers, and scheduling. Memory management is a logical next step, and
> recently we have seen a growing number of proposals in this area.
>
> In (approximately) historical order:
> - BPF OOM
> - BPF-based memcg stats access (landed)
> - BPF-based NUMA balancing
> - eBPF-mm
> - cache_ext (BPF Page Cache)
> - memcg_ext
>
> There are some obvious targets which haven't been covered yet:
> - BPF-driven readahead control
> - BPF-driven KSM
> - BPF-driven guest memory control
>
> Despite the large number of suggestions, only one relatively small feature
> (querying memcg statistics from BPF) has made it upstream.
We are exploring an alternative approach that leverages livepatch
combined with BPF to modularize struct_ops-based kernel hooks. This
allows us to deploy these hooks as out-of-tree modules without direct
kernel modification or the immediate need for upstreaming.
https://lore.kernel.org/live-patching/20260402092607.96430-1-laoar.shao@gmail.com/
https://lore.kernel.org/live-patching/20260416001628.2062468-1-song@kernel.org/
https://lore.kernel.org/bpf/CAPhsuW53pymgmFsHSkSwDvEAJ=+Rp2T102JYe4i9kgdePpR=6Q@mail.gmail.com/
In practice, we first introduce the BPF hooks via a livepatch and
subsequently attach BPF programs to them. Below is a recent use case
from our production environment (though not MM related) for reference:
https://lore.kernel.org/live-patching/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
In one of our clusters, we needed to route BGP traffic through
specific NICs based on destination IP addresses. To achieve this
without service interruption, we applied a livepatch to
bond_xmit_3ad_xor_slave_get() to introduce a new hook,
bond_get_slave_hook(). We then attached a BPF program to this hook to
select the outgoing NIC by parsing the SKB. Because the destination
IPs must be adjusted on demand, a static livepatch alone was
insufficient; the BPF integration provided the necessary dynamic
flexibility.
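For reference, a condensed sketch of the livepatch half of this approach
(the original hash-based selection and the BPF attachment details are
elided; only the klp_* plumbing is the standard livepatch API, the hook
itself is specific to our patch):

#include <linux/livepatch.h>
#include <linux/module.h>
#include <net/bonding.h>

/* New attachment point introduced by the livepatch: returns NULL unless
 * a BPF program attached to it overrides the slave selection. */
noinline struct slave *bond_get_slave_hook(struct bonding *bond,
					   struct sk_buff *skb)
{
	return NULL;
}

static struct slave *livepatch_bond_xmit_3ad_xor_slave_get(struct bonding *bond,
							    struct sk_buff *skb)
{
	struct slave *slave = bond_get_slave_hook(bond, skb);

	if (slave)
		return slave;
	/* ... original hash-based slave selection, elided here ... */
	return NULL;
}

static struct klp_func funcs[] = {
	{
		.old_name = "bond_xmit_3ad_xor_slave_get",
		.new_func = livepatch_bond_xmit_3ad_xor_slave_get,
	},
	{ }
};

static struct klp_object objs[] = {
	{
		.name = "bonding",	/* module owning the patched function */
		.funcs = funcs,
	},
	{ }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
};

static int __init bond_hook_init(void)
{
	return klp_enable_patch(&patch);
}
module_init(bond_hook_init);

MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");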
To fully support this architecture, several work-in-progress
enhancements are being developed for both the livepatch and BPF
subsystems.
+Song, Petr (CC'd for their expertise in BPF and livepatch).
--
Regards
Yafang
* Re: [LSF/MM/BPF TOPIC] Using BPF in MM
From: Vernon Yang @ 2026-05-03 17:25 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Roman Gushchin, bpf, linux-mm, Vlastimil Babka, Shakeel Butt,
Andrew Morton, lsf-pc, Daniel Borkmann
On Tue, Apr 28, 2026 at 10:12:16AM +0200, David Hildenbrand (Arm) wrote:
> On 4/28/26 01:57, Roman Gushchin wrote:
> > [LSF/MM/BPF TOPIC] Using BPF in MM
> > ----------------------------------
> >
> > Over the last decade, BPF has successfully penetrated multiple kernel
> > subsystems: it started as a feature to filter (out) networking packets
> > and has since captured its place in networking, tracing, security, HID
> > drivers, and scheduling. Memory management is a logical next step, and
> > recently we have seen a growing number of proposals in this area.
> >
> > In (approximately) historical order:
> > - BPF OOM
> > - BPF-based memcg stats access (landed)
> > - BPF-based NUMA balancing
> > - eBPF-mm
> > - cache_ext (BPF Page Cache)
> > - memcg_ext
>
> There was also the BPF THP control.
Hi David, Roman,
I have submitted a new series, "mm: introduce mthp_ext via cgroup-bpf to make
mTHP more transparent"[1], to implement BPF-THP, which shows excellent
performance data under stress testing. For details, please refer to
the patchset cover letter.
Although I did not attend the conference in person, I have been online
throughout. Please feel free to discuss any of the latest progress with me
online. Thank you!
[1] https://lore.kernel.org/linux-mm/20260503165024.1526680-1-vernon2gm@gmail.com/
--
Cheers,
Vernon
> >
> > There are some obvious targets which haven't been covered yet:
> > - BPF-driven readahead control
> > - BPF-driven KSM
> > - BPF-driven guest memory control
> >
> > Despite the large number of suggestions, only one relatively small feature
> > (querying memcg statistics from BPF) has made it upstream.
> >
> > It looks like using BPF in the MM subsystem comes with a set of somewhat
> > unique challenges and questions to be answered.
>
> [...]
>
> I think you are missing one of the most important points: Unclear ABI stability
> guarantees.
>
> On the one hand, we are told that there are no ABI stability guarantees, and
> that we can change hooks (add/remove/modify) any time we want.
>
> On the other hand, as soon as there is some ebpf program out there that we
> break, you can rest assured that there will be trouble.
>
> In the area of THP, where we don't even know which hooks we will need long term
> and what they would look like, that was one of the reasons why the BPF THP
> control was rejected.
>
> --
> Cheers,
>
> David
>