* bpf: adding BPF linked list iteration support
@ 2024-11-19 12:22 Matt Bobrowski
2024-11-21 0:51 ` Alexei Starovoitov
0 siblings, 1 reply; 5+ messages in thread
From: Matt Bobrowski @ 2024-11-19 12:22 UTC (permalink / raw)
To: bpf; +Cc: ast, daniel, andrii, martin.lau, song, kpsingh, jolsa, memxor
Hi,
Currently, we have BPF kfuncs which allow BPF programs to add and
remove elements from a BPF linked list. However, we're currently
missing other simple capabilities, like being able to iterate over the
elements within the BPF linked lists. What is our current appetite
with regards to adding new BPF kfuncs that support this kind of
capability to BPF linked lists?
I know that we're now somewhat advocating for using BPF arenas
whenever and wherever possible, especially when it comes to building
out and supporting more complicated data structures in BPF. However,
IMO BPF linked lists still have their place. Specifically, and as of
now, I'd argue that the BPF linked list implementation could be
considered more memory efficient when compared to a BPF arena backed
linked list implementation. This is purely due to the fact that the
BPF linked list implementation can perform more constrained memory
allocations for elements via bpf_obj_new_impl() based on the demand,
whereas for a BPF arena based implementation a BPF program needs to
allocate memory upfront in terms of the number of pages (modulo the
fact that not all pages for the BPF arena will necessarily be reserved
upfront). The fact that allocations are performed in terms of
multiples of PAGE_SIZE can lead to unnecessary memory wastage.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: bpf: adding BPF linked list iteration support
2024-11-19 12:22 bpf: adding BPF linked list iteration support Matt Bobrowski
@ 2024-11-21 0:51 ` Alexei Starovoitov
2024-11-22 9:12 ` Matt Bobrowski
0 siblings, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2024-11-21 0:51 UTC (permalink / raw)
To: Matt Bobrowski
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, KP Singh, Jiri Olsa,
Kumar Kartikeya Dwivedi
On Tue, Nov 19, 2024 at 4:22 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> Hi,
>
> Currently, we have BPF kfuncs which allow BPF programs to add and
> remove elements from a BPF linked list. However, we're currently
> missing other simple capabilities, like being able to iterate over the
> elements within the BPF linked lists. What is our current appetite
> with regards to adding new BPF kfuncs that support this kind of
> capability to BPF linked lists?
What kind of kfuncs do you have in mind for link lists ?
So far the only user of bpf_rbtree is sched-ext.
It was used in one scheduler and the experience was painful.
There is a chance that we will remove rbtree and link list
support from the verifier to reduce complexity.
So new link list kfuncs may be ok, but potentially not for too long.
> I know that we're now somewhat advocating for using BPF arenas
> whenever and wherever possible, especially when it comes to building
> out and supporting more complicated data structures in BPF. However,
> IMO BPF linked lists still have their place. Specifically, and as of
> now, I'd argue that the BPF linked list implementation could be
> considered more memory efficient when compared to a BPF arena backed
> linked list implementation. This is purely due to the fact that the
> BPF linked list implementation can perform more constrained memory
> allocations for elements via bpf_obj_new_impl() based on the demand,
> whereas for a BPF arena based implementation a BPF program needs to
> allocate memory upfront in terms of the number of pages (modulo the
> fact that not all pages for the BPF arena will necessarily be reserved
> upfront). The fact that allocations are performed in terms of
> multiples of PAGE_SIZE can lead to unnecessary memory wastage.
I don't follow this logic.
bpf_mem_alloc is relying on slab that relies on page alloc.
So either arena or bpf_ma allocates a page at a time.
So from that pov the cost is the same.
But in practice bpf_ma needs extra 8 bytes for every allocation
whereas arena allocs don't have this overhead.
Right now arena allocs need to be sleepable and that is
a severe limitation for tracing use cases.
We're working on lifting that. Once that happens
allocs in arena will be more usable than bpf_ma.
kptr-s in arena is another required feature.
There were few proposals. So it will be done as well. Eventually.
So new link list kfuncs are ok, but might get removed in the future.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: bpf: adding BPF linked list iteration support
2024-11-21 0:51 ` Alexei Starovoitov
@ 2024-11-22 9:12 ` Matt Bobrowski
2024-11-22 21:54 ` Alexei Starovoitov
0 siblings, 1 reply; 5+ messages in thread
From: Matt Bobrowski @ 2024-11-22 9:12 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, KP Singh, Jiri Olsa,
Kumar Kartikeya Dwivedi
On Wed, Nov 20, 2024 at 04:51:36PM -0800, Alexei Starovoitov wrote:
> On Tue, Nov 19, 2024 at 4:22 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > Hi,
> >
> > Currently, we have BPF kfuncs which allow BPF programs to add and
> > remove elements from a BPF linked list. However, we're currently
> > missing other simple capabilities, like being able to iterate over the
> > elements within the BPF linked lists. What is our current appetite
> > with regards to adding new BPF kfuncs that support this kind of
> > capability to BPF linked lists?
>
> What kind of kfuncs do you have in mind for link lists ?
At this point, it'd have to be some kind of iterator based BPF kfunc
that allows a BPF program to traverse over the supplied BPF linked
list, coupled with a delete based BPF kfunc such that elements from
the list can be removed whilst performing the iteration
i.e. list_for_each_safe/list_del. Now that I know you're not
completely opposed to adding such BPF kfuncs, I can concretely start
thinking about what this will actually end up looking like. But
essentially, it'd need to be BPF kfuncs that support those 2
previously mentioned capabilities, being traversal and arbitrary
removal of an element whilst performing the traversal.
> So far the only user of bpf_rbtree is sched-ext.
> It was used in one scheduler and the experience was painful.
> There is a chance that we will remove rbtree and link list
> support from the verifier to reduce complexity.
> So new link list kfuncs may be ok, but potentially not for too long.
Noted.
> > I know that we're now somewhat advocating for using BPF arenas
> > whenever and wherever possible, especially when it comes to building
> > out and supporting more complicated data structures in BPF. However,
> > IMO BPF linked lists still have their place. Specifically, and as of
> > now, I'd argue that the BPF linked list implementation could be
> > considered more memory efficient when compared to a BPF arena backed
> > linked list implementation. This is purely due to the fact that the
> > BPF linked list implementation can perform more constrained memory
> > allocations for elements via bpf_obj_new_impl() based on the demand,
> > whereas for a BPF arena based implementation a BPF program needs to
> > allocate memory upfront in terms of the number of pages (modulo the
> > fact that not all pages for the BPF arena will necessarily be reserved
> > upfront). The fact that allocations are performed in terms of
> > multiples of PAGE_SIZE can lead to unnecessary memory wastage.
>
> I don't follow this logic.
> bpf_mem_alloc is relying on slab that relies on page alloc.
> So either arena or bpf_ma allocates a page at a time.
> So from that pov the cost is the same.
Oh, what? So, both are actually performing full page sized allocations
whenever there's a need to fetch more memory? My shallow understanding
at this point was that the BPF specific memory allocator simply acts
as a front-end cache to kmalloc(), and depending on the size of your
allocation request i.e. via bpf_obj_new_impl for example, depends on
what freelist that allocation is fulfilled from. Any needs for
refilling a freelist due to exhaustion pressure are performed at
freelist size granularity i.e. 16, 32, 64, 128, 256..., 4096.
So, based on my current understanding and assuming that your BPF
program at most requests 128 byte sized allocations via
bpf_obj_new_impl, the BPF specific memory allocator will via kmalloc()
only allocate that much more memory at a time when needing to refill
the freelist caches (maybe it'd be a multiple of that because of
batching or something, but I don't fully understand the details of
that just yet). In comparison to using something like a BPF arena, if
I had to build out my own in-BPF program memory allocator that was
backed by a BPF arena, in the need of a refill like situation, the
allocation would be performed at page size granularity as that is the
amount of memory that the BPF arena pulls in at a time.
This is what I was basically trying to get at, but my trail of thought
may be flawed by my lack of understanding on how something actually
works in practice.
> But in practice bpf_ma needs extra 8 bytes for every allocation
> whereas arena allocs don't have this overhead.
Makes sense, freelist objects need to be tracked somehow by the BPF
specific memory allocator...
> Right now arena allocs need to be sleepable and that is
> a severe limitation for tracing use cases.
> We're working on lifting that. Once that happens
> allocs in arena will be more usable than bpf_ma.
Cool, sonuds good. I'm following this closely too BTW, because at some
point I think we'll end up using BPF arenas for a bunch of our stuff.
> kptr-s in arena is another required feature.
> There were few proposals. So it will be done as well. Eventually.
> So new link list kfuncs are ok, but might get removed in the future.
Let me think about this a little more and send out an RFC patch series
when time permits.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: bpf: adding BPF linked list iteration support
2024-11-22 9:12 ` Matt Bobrowski
@ 2024-11-22 21:54 ` Alexei Starovoitov
2024-11-25 15:52 ` Matt Bobrowski
0 siblings, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2024-11-22 21:54 UTC (permalink / raw)
To: Matt Bobrowski
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, KP Singh, Jiri Olsa,
Kumar Kartikeya Dwivedi
On Fri, Nov 22, 2024 at 1:12 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> On Wed, Nov 20, 2024 at 04:51:36PM -0800, Alexei Starovoitov wrote:
> > On Tue, Nov 19, 2024 at 4:22 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > >
> > > Hi,
> > >
> > > Currently, we have BPF kfuncs which allow BPF programs to add and
> > > remove elements from a BPF linked list. However, we're currently
> > > missing other simple capabilities, like being able to iterate over the
> > > elements within the BPF linked lists. What is our current appetite
> > > with regards to adding new BPF kfuncs that support this kind of
> > > capability to BPF linked lists?
> >
> > What kind of kfuncs do you have in mind for link lists ?
>
> At this point, it'd have to be some kind of iterator based BPF kfunc
> that allows a BPF program to traverse over the supplied BPF linked
> list, coupled with a delete based BPF kfunc such that elements from
> the list can be removed whilst performing the iteration
> i.e. list_for_each_safe/list_del. Now that I know you're not
> completely opposed to adding such BPF kfuncs, I can concretely start
> thinking about what this will actually end up looking like. But
> essentially, it'd need to be BPF kfuncs that support those 2
> previously mentioned capabilities, being traversal and arbitrary
> removal of an element whilst performing the traversal.
iterator and removal would need to be done while the lock is held.
There is no support for such things in the verifier.
I don't think it will be easy.
> > So far the only user of bpf_rbtree is sched-ext.
> > It was used in one scheduler and the experience was painful.
> > There is a chance that we will remove rbtree and link list
> > support from the verifier to reduce complexity.
> > So new link list kfuncs may be ok, but potentially not for too long.
>
> Noted.
>
> > > I know that we're now somewhat advocating for using BPF arenas
> > > whenever and wherever possible, especially when it comes to building
> > > out and supporting more complicated data structures in BPF. However,
> > > IMO BPF linked lists still have their place. Specifically, and as of
> > > now, I'd argue that the BPF linked list implementation could be
> > > considered more memory efficient when compared to a BPF arena backed
> > > linked list implementation. This is purely due to the fact that the
> > > BPF linked list implementation can perform more constrained memory
> > > allocations for elements via bpf_obj_new_impl() based on the demand,
> > > whereas for a BPF arena based implementation a BPF program needs to
> > > allocate memory upfront in terms of the number of pages (modulo the
> > > fact that not all pages for the BPF arena will necessarily be reserved
> > > upfront). The fact that allocations are performed in terms of
> > > multiples of PAGE_SIZE can lead to unnecessary memory wastage.
> >
> > I don't follow this logic.
> > bpf_mem_alloc is relying on slab that relies on page alloc.
> > So either arena or bpf_ma allocates a page at a time.
> > So from that pov the cost is the same.
>
> Oh, what? So, both are actually performing full page sized allocations
> whenever there's a need to fetch more memory? My shallow understanding
> at this point was that the BPF specific memory allocator simply acts
> as a front-end cache to kmalloc(), and depending on the size of your
> allocation request i.e. via bpf_obj_new_impl for example, depends on
> what freelist that allocation is fulfilled from. Any needs for
> refilling a freelist due to exhaustion pressure are performed at
> freelist size granularity i.e. 16, 32, 64, 128, 256..., 4096.
Correct, but how do you think kmalloc works?
It's a slab on top of the buddy page allocator.
Same thing at the end.
> So, based on my current understanding and assuming that your BPF
> program at most requests 128 byte sized allocations via
> bpf_obj_new_impl, the BPF specific memory allocator will via kmalloc()
> only allocate that much more memory at a time when needing to refill
128 byte sized bpf_ma would need 128+8 which will be in 196 bucket.
While page frag on top of arena will allocate exactly 128.
> the freelist caches (maybe it'd be a multiple of that because of
> batching or something, but I don't fully understand the details of
> that just yet). In comparison to using something like a BPF arena, if
> I had to build out my own in-BPF program memory allocator that was
> backed by a BPF arena, in the need of a refill like situation, the
> allocation would be performed at page size granularity as that is the
> amount of memory that the BPF arena pulls in at a time.
kmalloc/slab->page_alloc vs page_frag->arena_alloc have
their pros and cons. But in most cases the difference can be ignored.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: bpf: adding BPF linked list iteration support
2024-11-22 21:54 ` Alexei Starovoitov
@ 2024-11-25 15:52 ` Matt Bobrowski
0 siblings, 0 replies; 5+ messages in thread
From: Matt Bobrowski @ 2024-11-25 15:52 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, KP Singh, Jiri Olsa,
Kumar Kartikeya Dwivedi
On Fri, Nov 22, 2024 at 01:54:58PM -0800, Alexei Starovoitov wrote:
> On Fri, Nov 22, 2024 at 1:12 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > On Wed, Nov 20, 2024 at 04:51:36PM -0800, Alexei Starovoitov wrote:
> > > On Tue, Nov 19, 2024 at 4:22 AM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Currently, we have BPF kfuncs which allow BPF programs to add and
> > > > remove elements from a BPF linked list. However, we're currently
> > > > missing other simple capabilities, like being able to iterate over the
> > > > elements within the BPF linked lists. What is our current appetite
> > > > with regards to adding new BPF kfuncs that support this kind of
> > > > capability to BPF linked lists?
> > >
> > > What kind of kfuncs do you have in mind for link lists ?
> >
> > At this point, it'd have to be some kind of iterator based BPF kfunc
> > that allows a BPF program to traverse over the supplied BPF linked
> > list, coupled with a delete based BPF kfunc such that elements from
> > the list can be removed whilst performing the iteration
> > i.e. list_for_each_safe/list_del. Now that I know you're not
> > completely opposed to adding such BPF kfuncs, I can concretely start
> > thinking about what this will actually end up looking like. But
> > essentially, it'd need to be BPF kfuncs that support those 2
> > previously mentioned capabilities, being traversal and arbitrary
> > removal of an element whilst performing the traversal.
>
> iterator and removal would need to be done while the lock is held.
> There is no support for such things in the verifier.
> I don't think it will be easy.
Hm, I was under the impression that this would rather be trivial to
add, but you'd obviously have more of an idea than I. Let me think
about how I want to go about adding these BPF kfuncs and come back to
you with a proposal.
> > > So far the only user of bpf_rbtree is sched-ext.
> > > It was used in one scheduler and the experience was painful.
> > > There is a chance that we will remove rbtree and link list
> > > support from the verifier to reduce complexity.
> > > So new link list kfuncs may be ok, but potentially not for too long.
> >
> > Noted.
> >
> > > > I know that we're now somewhat advocating for using BPF arenas
> > > > whenever and wherever possible, especially when it comes to building
> > > > out and supporting more complicated data structures in BPF. However,
> > > > IMO BPF linked lists still have their place. Specifically, and as of
> > > > now, I'd argue that the BPF linked list implementation could be
> > > > considered more memory efficient when compared to a BPF arena backed
> > > > linked list implementation. This is purely due to the fact that the
> > > > BPF linked list implementation can perform more constrained memory
> > > > allocations for elements via bpf_obj_new_impl() based on the demand,
> > > > whereas for a BPF arena based implementation a BPF program needs to
> > > > allocate memory upfront in terms of the number of pages (modulo the
> > > > fact that not all pages for the BPF arena will necessarily be reserved
> > > > upfront). The fact that allocations are performed in terms of
> > > > multiples of PAGE_SIZE can lead to unnecessary memory wastage.
> > >
> > > I don't follow this logic.
> > > bpf_mem_alloc is relying on slab that relies on page alloc.
> > > So either arena or bpf_ma allocates a page at a time.
> > > So from that pov the cost is the same.
> >
> > Oh, what? So, both are actually performing full page sized allocations
> > whenever there's a need to fetch more memory? My shallow understanding
> > at this point was that the BPF specific memory allocator simply acts
> > as a front-end cache to kmalloc(), and depending on the size of your
> > allocation request i.e. via bpf_obj_new_impl for example, depends on
> > what freelist that allocation is fulfilled from. Any needs for
> > refilling a freelist due to exhaustion pressure are performed at
> > freelist size granularity i.e. 16, 32, 64, 128, 256..., 4096.
>
> Correct, but how do you think kmalloc works?
> It's a slab on top of the buddy page allocator.
> Same thing at the end.
Right, I can see what you mean by this. So, whether you're memory
allocation request is being fulfilled by the BPF specific memory
allocator or BPF arena, in the end if you're in need of more memory
you're technically interfacing with the same backend i.e. Buddy
Allocator, and that in itself still ends up managing memory in
page-sized chunk granularity.
I just thought that given that kmalloc() interfaces with the
middle-end, being the SLUB/SLAB/SLOB allocator in this case, it too
has its allocation requests served from pre-allocated caches. So, in
theory, it'd mean that when the BPF specific memory allocator needs
more memory those memory requests would be fulfilled from this layer
of cache first prior to actually going all the way down to the Buddy
Allocator and getting more pages of memory directly from
it. Admittedly though, I have not looked at how BPF arena memory
allocation requests are fulfilled, so it could actually just end up
taking the exact same route. Maybe I'll have a look at that tonight as
I'm curious now.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-11-25 15:52 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-19 12:22 bpf: adding BPF linked list iteration support Matt Bobrowski
2024-11-21 0:51 ` Alexei Starovoitov
2024-11-22 9:12 ` Matt Bobrowski
2024-11-22 21:54 ` Alexei Starovoitov
2024-11-25 15:52 ` Matt Bobrowski
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox