* kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
@ 2026-05-12 12:25 Harry Yoo (Oracle)
2026-05-12 13:46 ` Levi Zim
0 siblings, 1 reply; 4+ messages in thread
From: Harry Yoo (Oracle) @ 2026-05-12 12:25 UTC (permalink / raw)
To: linux-mm, rcu, bpf
Cc: Vlastimil Babka, Hao Li, Paul E. McKenney, Uladzislau Rezki,
Joel Fernandes, Alexei Starovoitov, Andrii Nakryiko,
Puranjay Mohan, Shakeel Butt, Amery Hung, Kumar Kartikeya Dwivedi
Hello everybody. This is a follow-up discussion of
"kmalloc_nolock() follow-ups, including kfree_rcu_nolock()" topic at
LSFMMBPF 2026 last week. Unfortunately, many RCU folks were not there,
but we can still discuss over email ;)
The slides: https://docs.google.com/presentation/d/1kpaLd7D1dwRvIqRwQfSjJVVJL0CC2gwb-AV56yCMqXw/edit?usp=sharing
I'm copying the slides here to make it easier to reply.
kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
=========================================================
Today's goal
============
1. Present why and what we're doing
2. Demystify BPF's requirements for memory allocation
3. Discuss solutions
Motivation
==========
BPF map preallocation wastes memory for correctness
- Preallocate all elements by default, unless explicitly opted out
(BPF_F_NO_PREALLOC)
- Typically not all elements are used, wasting memory
The BPF memory allocator was invented to avoid that
- Allocate elements on demand at BPF runtime
- kmalloc isn't safe in some BPF contexts (in NMI, or in a critical
section), so a new allocator was invented
Challenges with the BPF memory allocator
- Memory is tied to the BPF subsystem and can't be used elsewhere
- A burst of allocations can cause failures until async refill catches up
- Trade-off between memory waste and allocation failures at large sizes
- Reinventing every memory allocator feature is a maintenance burden
The end goal
============
- Drop the BPF memory allocator
- Avoid preallocation as much as possible in BPF
- Use kmalloc_nolock() and kfree_{,rcu_}nolock() (and friends) instead
To achieve that, we need to define requirements & expectations from BPF
Background - RCU and BPF programs
=================================
- Non-sleepable BPF progs run in an RCU critical section
- Sleepable BPF progs run in an RCU Tasks Trace critical section
- Freeing by RCU for both non-sleepable (RCU) and sleepable
(RCU-tt) progs means we need to wait for both GPs before releasing memory
- RCU Tasks Trace (RCU-tt) is an RCU flavor designed for sleepable
BPF progs, lighter than SRCU
- Since v7.0, RCU Tasks Trace is implemented using SRCU-fast
- Since v7.1-rc1, RCU Tasks Trace GP is contractually guaranteed to
imply an RCU GP
- Yes, that means waiting for an SRCU-fast GP automatically covers RCU
readers
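In code, the new guarantee means a single Tasks Trace wait now suffices
where both flavors previously had to be waited for explicitly. A minimal
sketch using the existing synchronize_rcu{,_tasks_trace}() APIs (function
names here are illustrative, not from any patch):

```c
/* Before the v7.1-rc1 guarantee: when memory may be referenced by both
 * reader types, wait for both GP flavors explicitly. */
static void release_after_both_gps_old(void)
{
        synchronize_rcu();
        synchronize_rcu_tasks_trace();
}

/* With the contractual guarantee, an RCU Tasks Trace GP implies an
 * RCU GP, so one wait covers both non-sleepable and sleepable progs. */
static void release_after_both_gps_new(void)
{
        synchronize_rcu_tasks_trace();
}
```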
Background - BPF memory lifetime
================================
Objects allocated for BPF may be referenced by either 1) non-sleepable
or 2) sleepable BPF programs, or both
Memory allocated for BPF may be:
1. Freed immediately <- This is supported today via kfree_nolock()
2. Freed immediately, but can be recycled with typesafety-by-rcu
semantics (for both RCU, RCU-tt)
- bpf_mem{,_cache}_free()
3. Freed after RCU GP
- sleepable progs not allowed
4. Freed after RCU GP + RCU-tt GP
- sleepable progs allowed
- bpf_mem{,_cache}_free_rcu()
If you need something not listed here, please let us know!
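As a rough sketch, the four classes above map onto allocator calls as
follows. Note that kfree_rcu_nolock() and kfree_srcu_fast_nolock() are
the names proposed later in these slides, not merged APIs, and the
enum/variable names here are purely illustrative:

```c
/* Illustration only: the four lifetime classes as allocator calls.
 * Cases 3 and 4 use proposed, not-yet-merged interfaces. */
switch (lifetime) {
case FREE_IMMEDIATE:
        kfree_nolock(e);                 /* 1. merged today */
        break;
case FREE_TYPESAFE_RECYCLE:
        bpf_mem_cache_free(ma, e);       /* 2. today's typesafe-by-rcu */
        break;
case FREE_AFTER_RCU_GP:
        kfree_rcu_nolock(e, rcu);        /* 3. proposed; non-sleepable
                                          *    readers only            */
        break;
case FREE_AFTER_RCU_AND_TT_GP:
        kfree_srcu_fast_nolock(e, rcu);  /* 4. proposed; sleepable
                                          *    readers allowed         */
        break;
}
```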
The big picture (today, within the BPF memory allocator)
========================================================
- bpf_mem{,_cache}_free()
- Analogous to SLAB_TYPESAFE_BY_RCU, but for BPF
- Insert objects into free_llist or free_llist_extra
- When the high watermark is hit, move objects to free_by_rcu_ttrace
and then return them to slab after an RCU-tt GP
- However, they can be reused before being returned to slab (again,
analogous to SLAB_TYPESAFE_BY_RCU)
- bpf_mem{,_cache}_free_rcu()
- Analogous to kfree_rcu(), but for BPF
- Objects are inserted into free_by_rcu
- Moved to the waiting_for_gp list to wait for an RCU GP
- Then moved to the waiting_for_gp_ttrace list to wait for an RCU-tt
GP, then returned to slab
- Objects remain intact for the RCU GP and the RCU-tt GP
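The two-stage deferral in bpf_mem{,_cache}_free_rcu() amounts to
chaining the two grace periods. A minimal sketch of that pattern using
the existing call_rcu()/call_rcu_tasks_trace() APIs (the real allocator
batches objects on per-CPU lists rather than queueing one callback per
object, and these function names are illustrative):

```c
struct elem {
        struct rcu_head rcu;
        /* payload ... */
};

static void elem_free_final(struct rcu_head *head)
{
        struct elem *e = container_of(head, struct elem, rcu);

        kfree(e);       /* both GP flavors have now elapsed */
}

static void elem_free_after_rcu(struct rcu_head *head)
{
        /* RCU GP done; now wait for an RCU Tasks Trace GP as well */
        call_rcu_tasks_trace(head, elem_free_final);
}

static void elem_free_rcu(struct elem *e)
{
        call_rcu(&e->rcu, elem_free_after_rcu);
}
```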
The big picture (today, outside the BPF memory allocator)
=========================================================
- This slide is intentionally left blank :)
The big picture (in the future - with kmalloc_nolock() follow-ups)
==================================================================
Let's drop the BPF memory allocator completely!
Case A: Free immediately
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock() -> alloc_pages_nolock()
- Free immediately: kfree_nolock() -> free_pages_nolock()
Case B: Non-sleepable readers only, free by RCU
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock() -> alloc_pages_nolock()
- Free by RCU: kfree_rcu_nolock(obj, rf) -> call_rcu_nolock()
Case C: Both sleepable and non-sleepable readers, with free by RCU
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock()
- Free by RCU: kfree_srcu_fast_nolock() -> call_srcu_fast_nolock()
Case D: Both sleepable and non-sleepable readers*, with typesafety-
by-rcu semantics
- Cache: a fixed-size kmem_cache with SLAB_TYPESAFE_BY_SRCU_FAST
- Slab freeing deferred until SRCU-fast GP but objects can be
reused (analogous to SLAB_TYPESAFE_BY_RCU)
- Alloc: kmem_cache_alloc_nolock()
- Free immediately: kmem_cache_free_nolock() ->
call_srcu_fast_nolock() (to free slabs)
- Need slab dtor support to release resources when freeing slabs
after SRCU-fast GP
Unlike with kmalloc_nolock(), the "try the next bucket" trick doesn't
work here. Instead, create two caches: one for normal allocations, the
other for fallback. Free objects with kfree_nolock() without passing
the cache pointer.
*Even when only non-sleepable readers are allowed, you can still
use this!
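The two-cache fallback for Case D might look roughly like this. All the
names below are illustrative, and kmem_cache_alloc_nolock() plus
SLAB_TYPESAFE_BY_SRCU_FAST are proposals from this talk, not merged
interfaces (so the signature shown is an assumption):

```c
/* Sketch of the Case D two-cache scheme; illustration only. */
static struct kmem_cache *elem_cache;     /* normal allocations */
static struct kmem_cache *elem_cache_fb;  /* fallback when the first
                                           * cache can't be used      */

static struct elem *elem_alloc(void)
{
        struct elem *e = kmem_cache_alloc_nolock(elem_cache);

        /* Fixed-size caches can't "try the next bucket", so fall back
         * to the second cache instead. */
        if (!e)
                e = kmem_cache_alloc_nolock(elem_cache_fb);
        return e;
}

static void elem_free(struct elem *e)
{
        /* kfree_nolock() finds the owning slab/cache itself, so the
         * caller need not remember which of the two caches was used. */
        kfree_nolock(e);
}
```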
Progress since last year
========================
- alloc_pages_nolock() / free_pages_nolock() merged in v6.15
- kmalloc_nolock() / kfree_nolock() merged in v6.18
- RCU Tasks Trace re-implemented on top of SRCU-fast in v7.0
- The transition to SRCU-fast was not smooth; we ended up fixing bugs
- RCU Tasks Trace GP now explicitly implies RCU GP
- implicit since SRCU-fast was introduced in v6.15
- explicit contractual guarantee in v7.1-rc1
Things to do
============
- Define clear requirements/expectations from BPF (for memory allocation)
- Introduce kfree_rcu_nolock() (in RFC)
- Introduce kfree_srcu_fast_nolock()
- Add SLAB_TYPESAFE_BY_SRCU_FAST support
- Need slab destructor support to clean up when freeing slabs
- Need call_srcu_fast_nolock() to submit free slabs to RCU
- Migrate remaining bpf_mem_alloc users to kmalloc_nolock()
- Introduce call_rcu_nolock() and call_srcu_fast_nolock()
- Allow kmalloc_nolock() with large kmalloc sizes via
alloc_pages_nolock()
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 4+ messages in thread

* Re: kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
2026-05-12 12:25 kmalloc_nolock() follow-ups, including kfree_rcu_nolock() Harry Yoo (Oracle)
@ 2026-05-12 13:46 ` Levi Zim
2026-05-13 1:42 ` Harry Yoo (Oracle)
0 siblings, 1 reply; 4+ messages in thread
From: Levi Zim @ 2026-05-12 13:46 UTC (permalink / raw)
To: Harry Yoo (Oracle), linux-mm, rcu, bpf
Cc: Vlastimil Babka, Hao Li, Paul E. McKenney, Uladzislau Rezki,
Joel Fernandes, Alexei Starovoitov, Andrii Nakryiko,
Puranjay Mohan, Shakeel Butt, Amery Hung, Kumar Kartikeya Dwivedi
On 5/12/26 8:25 PM, Harry Yoo (Oracle) wrote:
> Hello everybody. This is a follow-up discussion of
> "kmalloc_nolock() follow-ups, including kfree_rcu_nolock()" topic at
> LSFMMBPF 2026 last week. Unfortunately, many RCU folks were not there,
> but we can still discuss over email ;)
>
> The slides: https://docs.google.com/presentation/d/1kpaLd7D1dwRvIqRwQfSjJVVJL0CC2gwb-AV56yCMqXw/edit?usp=sharing
>
> I'm copying the slides here to make it easier to reply.
>
> kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
> =========================================================
>
> Today's goal
> ============
>
> 1. Present why and what we're doing
> 2. Demystify BPF's requirements for memory allocation
> 3. Discuss solutions
>
> Motivation
> ==========
>
> BPF map preallocation wastes memory for correctness
> - preallocate all elements by default, unless opted out explicitly
> (BPF_F_NO_PREALLOC)
> - Typically not all elements are used, wasting memory
>
> The BPF memory allocator was invented to avoid that
> - Allocate elements on demand at BPF runtime
> - kmalloc isn't safe in some BPF contexts (in NMI, or in a critical
> section), so a new allocator was invented
>
> Challenges with the BPF memory allocator
> - Memory is tied to the BPF subsystem and can't be used elsewhere
> - A burst of allocations can cause failures until async refill catches up
> - Trade-off between memory waste and allocation failures at large sizes
> - Reinventing every memory allocator feature is a maintenance burden
>
> The end goal
> ============
>
> - Drop the BPF memory allocator
> - Avoid preallocation as much as possible in BPF
> - Use kmalloc_nolock() and kfree_{,rcu_}nolock() (and friends) instead
Switching to kmalloc_nolock() causes a regression on architectures
without HAVE_CMPXCHG_DOUBLE. For reference, currently only x86, arm64,
s390 and loongarch select HAVE_CMPXCHG_DOUBLE.
For example, this has already caused bpf_task_storage_get() with the
BPF_LOCAL_STORAGE_GET_F_CREATE flag to always fail on riscv64 with the
6.19 kernel.
I attempted to fix it in
https://lists.infradead.org/pipermail/linux-riscv/2026-March/087159.html,
but as pointed out in the thread, the approach is not sound.
After that, I thought about using the BPF memory allocator instead of
kmalloc_nolock() on such architectures, but I haven't had time to
implement it.
I don't know how we could fix it otherwise once the BPF memory
allocator is removed completely. Could we find a path forward without
causing regressions on architectures without HAVE_CMPXCHG_DOUBLE?
Thanks,
Levi
* Re: kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
2026-05-12 13:46 ` Levi Zim
@ 2026-05-13 1:42 ` Harry Yoo (Oracle)
2026-05-13 13:34 ` Levi Zim
0 siblings, 1 reply; 4+ messages in thread
From: Harry Yoo (Oracle) @ 2026-05-13 1:42 UTC (permalink / raw)
To: Levi Zim
Cc: linux-mm, rcu, bpf, Vlastimil Babka, Hao Li, Paul E. McKenney,
Uladzislau Rezki, Joel Fernandes, Alexei Starovoitov,
Andrii Nakryiko, Puranjay Mohan, Shakeel Butt, Amery Hung,
Kumar Kartikeya Dwivedi
On Tue, May 12, 2026 at 09:46:33PM +0800, Levi Zim wrote:
> On 5/12/26 8:25 PM, Harry Yoo (Oracle) wrote:
> > Hello everybody. This is a follow-up discussion of
> > "kmalloc_nolock() follow-ups, including kfree_rcu_nolock()" topic at
> > LSFMMBPF 2026 last week. Unfortunately, many RCU folks were not there,
> > but we can still discuss over email ;)
> >
> > The slides: https://docs.google.com/presentation/d/1kpaLd7D1dwRvIqRwQfSjJVVJL0CC2gwb-AV56yCMqXw/edit?usp=sharing
> >
> > I'm copying the slides here to make it easier to reply.
[...]
> > The end goal
> > ============
> >
> > - Drop the BPF memory allocator
> > - Avoid preallocation as much as possible in BPF
> > - Use kmalloc_nolock() and kfree_{,rcu_}nolock() (and friends) instead
>
> By using kmalloc_nolock, a regression happens on architectures without HAVE_CMPXCHG_DOUBLE.
> For reference, currently only x86, arm64, s390 and loongarch selects HAVE_CMPXCHG_DOUBLE
>
> For example, this has already caused bpf_task_storage_get with flag
> BPF_LOCAL_STORAGE_GET_F_CREATE to always fail on riscv64 6.19 kernel.
Ouch.
> I attempted to fix it in https://lists.infradead.org/pipermail/linux-riscv/2026-March/087159.html,
> but as pointed out in the threads, the approach is not sound.
>
> After that, I thought about using the BPF memory allocator instead of kmalloc_nolock on such
> architectures to fix it. But I haven't got time to implement it.
Oh please, let's not go in that direction :)
> I don't know how could we fix it otherwise after removing BPF memory allocator completely.
> Could we find a path to move forward without causing regressions on architectures without HAVE_CMPXCHG_DOUBLE?
Probably we can. Could you please see if this works for you?
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=slab-kmalloc-nolock-without-cmpxchg-double-rfc-v1r1-wip
--
Cheers,
Harry / Hyeonggon
* Re: kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
2026-05-13 1:42 ` Harry Yoo (Oracle)
@ 2026-05-13 13:34 ` Levi Zim
0 siblings, 0 replies; 4+ messages in thread
From: Levi Zim @ 2026-05-13 13:34 UTC (permalink / raw)
To: Harry Yoo (Oracle)
Cc: linux-mm, rcu, bpf, Vlastimil Babka, Hao Li, Paul E. McKenney,
Uladzislau Rezki, Joel Fernandes, Alexei Starovoitov,
Andrii Nakryiko, Puranjay Mohan, Shakeel Butt, Amery Hung,
Kumar Kartikeya Dwivedi
On 5/13/26 9:42 AM, Harry Yoo (Oracle) wrote:
> On Tue, May 12, 2026 at 09:46:33PM +0800, Levi Zim wrote:
>> On 5/12/26 8:25 PM, Harry Yoo (Oracle) wrote:
[...]
>> By using kmalloc_nolock, a regression happens on architectures without HAVE_CMPXCHG_DOUBLE.
>> For reference, currently only x86, arm64, s390 and loongarch selects HAVE_CMPXCHG_DOUBLE
>>
>> For example, this has already caused bpf_task_storage_get with flag
>> BPF_LOCAL_STORAGE_GET_F_CREATE to always fail on riscv64 6.19 kernel.
>
> Ouch.
>
>> I attempted to fix it in https://lists.infradead.org/pipermail/linux-riscv/2026-March/087159.html,
>> but as pointed out in the threads, the approach is not sound.
>>
>> After that, I thought about using the BPF memory allocator instead of kmalloc_nolock on such
>> architectures to fix it. But I haven't got time to implement it.
>
> Oh please, let's not go in that direction :)
>
>> I don't know how could we fix it otherwise after removing BPF memory allocator completely.
>> Could we find a path to move forward without causing regressions on architectures without HAVE_CMPXCHG_DOUBLE?
>
> Probably we can. Could you please see if this works for you?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=slab-kmalloc-nolock-without-cmpxchg-double-rfc-v1r1-wip
Thanks a lot! I tested it and can confirm that it fixes the failure of
bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE) on riscv64.
The commit message says the allocation may still fail if the slab lock
cannot be acquired on the first try, but this is still a great
improvement over the previous code, which always failed.
Thanks,
Levi