* kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
@ 2026-05-12 12:25 Harry Yoo (Oracle)
From: Harry Yoo (Oracle) @ 2026-05-12 12:25 UTC (permalink / raw)
To: linux-mm, rcu, bpf
Cc: Vlastimil Babka, Hao Li, Paul E. McKenney, Uladzislau Rezki,
Joel Fernandes, Alexei Starovoitov, Andrii Nakryiko,
Puranjay Mohan, Shakeel Butt, Amery Hung, Kumar Kartikeya Dwivedi
Hello everybody. This is a follow-up discussion of
"kmalloc_nolock() follow-ups, including kfree_rcu_nolock()" topic at
LSFMMBPF 2026 last week. Unfortunately, many RCU folks were not there,
but we can still discuss over email ;)
The slides: https://docs.google.com/presentation/d/1kpaLd7D1dwRvIqRwQfSjJVVJL0CC2gwb-AV56yCMqXw/edit?usp=sharing
I'm copying the slides here to make it easier to reply.
kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
=========================================================
Today's goal
============
1. Present why and what we're doing
2. Demystify BPF's requirements for memory allocation
3. Discuss solutions
Motivation
==========
BPF map preallocation wastes memory for correctness
- preallocate all elements by default, unless opted out explicitly
(BPF_F_NO_PREALLOC)
- Typically not all elements are used, wasting memory
The BPF memory allocator was invented to avoid that
- Allocate elements on demand at BPF runtime
- kmalloc isn't safe in some BPF contexts (in NMI, or in a critical
section), so a new allocator was invented
Challenges with the BPF memory allocator
- Memory is tied to the BPF subsystem and can't be used elsewhere
- A burst of allocations can cause failures until async refill catches up
- Trade-off between memory waste and allocation failures at large sizes
- Reinventing every memory allocator feature is a maintenance burden
The end goal
============
- Drop the BPF memory allocator
- Avoid preallocation as much as possible in BPF
- Use kmalloc_nolock() and kfree_{,rcu_}nolock() (and friends) instead
To achieve that, we need to define requirements & expectations from BPF
Background - RCU and BPF programs
=================================
- Non-sleepable BPF progs run in an RCU critical section
- Sleepable BPF progs run in an RCU Tasks Trace critical section
- Freeing by RCU for both non-sleepable (RCU) and sleepable
(RCU-tt) progs means we need to wait for both GPs before releasing memory
- RCU Tasks Trace (RCU-tt) is an RCU flavor designed for sleepable
BPF progs, lighter than SRCU
- Since v7.0, RCU Tasks Trace is implemented using SRCU-fast
- Since v7.1-rc1, RCU Tasks Trace GP is contractually guaranteed to
imply an RCU GP
- Yes, that means waiting for an SRCU-fast GP automatically covers RCU
readers
Background - BPF memory lifetime
================================
Objects allocated for BPF may be referenced by either 1) non-sleepable
or 2) sleepable BPF programs, or both
Memory allocated for BPF may be:
1. Freed immediately <- This is supported today via kfree_nolock()
2. Freed immediately, but can be recycled with typesafety-by-rcu
semantics (for both RCU, RCU-tt)
- bpf_mem{,_cache}_free()
3. Freed after RCU GP
- sleepable progs not allowed
4. Freed after RCU GP + RCU-tt GP
- sleepable progs allowed
- bpf_mem{,_cache}_free_rcu()
If you need something not listed here, please let us know!
The big picture (today, within the BPF memory allocator)
========================================================
- bpf_mem{,_cache}_free()
- Analogous to SLAB_TYPESAFE_BY_RCU, but for BPF
- Insert objects to free_llist or free_llist_extra
- When high watermark is hit, move objects to free_by_rcu_ttrace
and then return objects to slab after RCU-tt GP
- However, they can be reused before being returned to slab (again,
analogous to SLAB_TYPESAFE_BY_RCU)
- bpf_mem{,_cache}_free_rcu()
- Analogous to kfree_rcu(), but for BPF
- Objects are inserted to free_by_rcu
- Moved to waiting_for_gp list, then wait for RCU GP
- Moved to waiting_for_gp_ttrace list, then wait for RCU-tt GP,
then returned to slab
- Objects remain intact for RCU GP and RCU-tt GP
The big picture (today, outside the BPF memory allocator)
=========================================================
- This slide is intentionally left blank :)
The big picture (in the future - with kmalloc_nolock() follow-ups)
==================================================================
Let's drop the BPF memory allocator completely!
Case A: Free immediately
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock() -> alloc_pages_nolock()
- Free immediately: kfree_nolock() -> free_pages_nolock()
Case B: Non-sleepable readers only, free by RCU
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock() -> alloc_pages_nolock()
- Free by RCU: kfree_rcu_nolock(obj, rf) -> call_rcu_nolock()
Case C: Both sleepable and non-sleepable readers, with free by RCU
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock()
- Free by RCU: kfree_srcu_fast_nolock() -> call_srcu_fast_nolock()
Case D: Both sleepable and non-sleepable readers*, with typesafety-
by-rcu semantics
- Cache: a fixed-size kmem_cache with SLAB_TYPESAFE_BY_SRCU_FAST
- Slab freeing deferred until SRCU-fast GP but objects can be
reused (analogous to SLAB_TYPESAFE_BY_RCU)
- Alloc: kmem_cache_alloc_nolock()
- Free immediately: kmem_cache_free_nolock() ->
call_srcu_fast_nolock() (to free slabs)
- Need slab dtor support to release resources when freeing slabs
after SRCU-fast GP
Unlike kmalloc_nolock(), the "try the next bucket" trick doesn't work
for a fixed-size cache. Instead, create two caches: one for normal
allocations, the other as a fallback. Free objects with kfree_nolock()
without passing the cache pointer.
*Even when only non-sleepable readers are allowed, you can still
use this!
Progress since last year
========================
- alloc_pages_nolock() / free_pages_nolock() merged in v6.15
- kmalloc_nolock() / kfree_nolock() merged in v6.18
- RCU Tasks Trace re-implemented on top of SRCU-fast in v7.0
- The transition to SRCU-fast was not smooth; several bugs were
found and fixed along the way
- RCU Tasks Trace GP now explicitly implies RCU GP
- implicit since SRCU-fast was introduced in v6.15
- explicit contractual guarantee in v7.1-rc1
Things to do
============
- Define clear requirements/expectations from BPF (for memory allocation)
- Introduce kfree_rcu_nolock() (in RFC)
- Introduce kfree_srcu_fast_nolock()
- Add SLAB_TYPESAFE_BY_SRCU_FAST support
- Need slab destructor support to clean up when freeing slabs
- Need call_srcu_fast_nolock() to submit free slabs to RCU
- Migrate remaining bpf_mem_alloc users to kmalloc_nolock()
- Introduce call_rcu_nolock() and call_srcu_fast_nolock()
- Allow kmalloc_nolock() with large kmalloc sizes via
alloc_pages_nolock()
--
Cheers,
Harry / Hyeonggon
* Re: kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
@ 2026-05-12 13:46 ` Levi Zim
From: Levi Zim @ 2026-05-12 13:46 UTC (permalink / raw)
To: Harry Yoo (Oracle), linux-mm, rcu, bpf
Cc: Vlastimil Babka, Hao Li, Paul E. McKenney, Uladzislau Rezki,
Joel Fernandes, Alexei Starovoitov, Andrii Nakryiko,
Puranjay Mohan, Shakeel Butt, Amery Hung, Kumar Kartikeya Dwivedi
On 5/12/26 8:25 PM, Harry Yoo (Oracle) wrote:
> Hello everybody. This is a follow-up discussion of
> "kmalloc_nolock() follow-ups, including kfree_rcu_nolock()" topic at
> LSFMMBPF 2026 last week. Unfortunately, many RCU folks were not there,
> but we can still discuss over email ;)
>
> The slides: https://docs.google.com/presentation/d/1kpaLd7D1dwRvIqRwQfSjJVVJL0CC2gwb-AV56yCMqXw/edit?usp=sharing
>
> I'm copying the slides here to make it easier to reply.
>
> kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
> =========================================================
>
> Today's goal
> ============
>
> 1. Present why and what we're doing
> 2. Demystify BPF's requirements for memory allocation
> 3. Discuss solutions
>
> Motivation
> ==========
>
> BPF map preallocation wastes memory for correctness
> - preallocate all elements by default, unless opted out explicitly
> (BPF_F_NO_PREALLOC)
> - Typically not all elements are used, wasting memory
>
> The BPF memory allocator was invented to avoid that
> - Allocate elements on demand at BPF runtime
> - kmalloc isn't safe in some BPF contexts (in NMI, or in a critical
> section), so a new allocator was invented
>
> Challenges with the BPF memory allocator
> - Memory is tied to the BPF subsystem and can't be used elsewhere
> - A burst of allocations can cause failures until async refill catches up
> - Trade-off between memory waste and allocation failures at large sizes
> - Reinventing every memory allocator feature is a maintenance burden
>
> The end goal
> ============
>
> - Drop the BPF memory allocator
> - Avoid preallocation as much as possible in BPF
> - Use kmalloc_nolock() and kfree_{,rcu_}nolock() (and friends) instead
Using kmalloc_nolock() causes a regression on architectures without HAVE_CMPXCHG_DOUBLE.
For reference, currently only x86, arm64, s390 and loongarch select HAVE_CMPXCHG_DOUBLE.
For example, this has already caused bpf_task_storage_get() with the
BPF_LOCAL_STORAGE_GET_F_CREATE flag to always fail on riscv64 with the 6.19 kernel.
I attempted to fix it in https://lists.infradead.org/pipermail/linux-riscv/2026-March/087159.html,
but as pointed out in the thread, the approach is not sound.
After that, I considered falling back to the BPF memory allocator instead of kmalloc_nolock()
on such architectures, but I haven't had time to implement it.
I don't see how we could fix this otherwise once the BPF memory allocator is removed completely.
Could we find a path forward that avoids regressions on architectures without HAVE_CMPXCHG_DOUBLE?
Thanks,
Levi