Linux-mm Archive on lore.kernel.org
* kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
From: Harry Yoo (Oracle) @ 2026-05-12 12:25 UTC (permalink / raw)
  To: linux-mm, rcu, bpf
  Cc: Vlastimil Babka, Hao Li, Paul E. McKenney, Uladzislau Rezki,
	Joel Fernandes, Alexei Starovoitov, Andrii Nakryiko,
	Puranjay Mohan, Shakeel Butt, Amery Hung, Kumar Kartikeya Dwivedi

Hello everybody. This is a follow-up discussion of the
"kmalloc_nolock() follow-ups, including kfree_rcu_nolock()" topic at
LSFMMBPF 2026 last week. Unfortunately, many RCU folks were not there,
but we can still discuss over email ;)

The slides: https://docs.google.com/presentation/d/1kpaLd7D1dwRvIqRwQfSjJVVJL0CC2gwb-AV56yCMqXw/edit?usp=sharing

I'm copying the slides here to make it easier to reply.

kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
=========================================================

Today's goal
============

1. Explain what we're doing and why
2. Demystify BPF's requirements for memory allocation
3. Discuss solutions

Motivation
==========

BPF map preallocation wastes memory for correctness
- Preallocate all elements by default, unless explicitly opted out
  (BPF_F_NO_PREALLOC)
- Typically not all elements are used, wasting memory

The BPF memory allocator was invented to avoid that
- Allocate elements on demand at BPF runtime
- kmalloc isn't safe in some BPF contexts (in NMI, or in a critical
  section), so a new allocator was invented

Challenges with the BPF memory allocator
- Memory is tied to the BPF subsystem and can't be used elsewhere
- A burst of allocations can cause failures until async refill catches up
- Trade-off between memory waste and allocation failures at large sizes
- Reinventing every memory allocator feature is a maintenance burden

The end goal
============

- Drop the BPF memory allocator
- Avoid preallocation as much as possible in BPF
- Use kmalloc_nolock() and kfree_{,rcu_}nolock() (and friends) instead

To achieve that, we need to define requirements & expectations from BPF

Background - RCU and BPF programs
=================================

- Non-sleepable BPF progs run in an RCU critical section
- Sleepable BPF progs run in an RCU Tasks Trace critical section
- Freeing by RCU for both non-sleepable (RCU) and sleepable (RCU-tt)
  progs means we need to wait for both GPs before releasing memory
- RCU Tasks Trace (RCU-tt) is an RCU flavor designed for sleepable
  BPF progs, lighter than SRCU
- Since v7.0, RCU Tasks Trace is implemented using SRCU-fast
- Since v7.1-rc1, RCU Tasks Trace GP is contractually guaranteed to
  imply an RCU GP
  - Yes, that means waiting for an SRCU-fast GP automatically covers RCU
    readers
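The grace-period contract above can be sketched as a toy userspace model (not kernel code; all names here are invented for illustration): a plain RCU GP advances only the RCU sequence, while an RCU Tasks Trace GP advances both, reflecting the v7.1-rc1 guarantee that an RCU-tt GP implies an RCU GP.

```c
/*
 * Toy userspace model (invented names, not kernel code) of the
 * grace-period contract: completing an RCU Tasks Trace GP is
 * guaranteed to imply an RCU GP, so a single RCU-tt wait covers
 * both reader types.
 */
struct gp_state {
	unsigned long rcu_seq;		/* completed RCU GPs */
	unsigned long ttrace_seq;	/* completed RCU Tasks Trace GPs */
};

/* A plain RCU GP advances only the RCU sequence. */
static void model_synchronize_rcu(struct gp_state *s)
{
	s->rcu_seq++;
}

/* An RCU-tt GP advances both: it contractually implies an RCU GP. */
static void model_synchronize_rcu_tasks_trace(struct gp_state *s)
{
	s->ttrace_seq++;
	s->rcu_seq++;
}
```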

Background - BPF memory lifetime
================================

Objects allocated for BPF may be referenced by either 1) non-sleepable
or 2) sleepable BPF programs, or both

Memory allocated for BPF may be:
1. Freed immediately <- This is supported today via kfree_nolock()
2. Freed immediately, but can be recycled with typesafety-by-rcu
   semantics (for both RCU, RCU-tt)
  - bpf_mem{,_cache}_free()
3. Freed after RCU GP
  - sleepable progs not allowed
4. Freed after RCU GP + RCU-tt GP
  - sleepable progs allowed
  - bpf_mem{,_cache}_free_rcu()

If you need something not listed here, please let us know!
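For concreteness, the four lifetime classes above can be sketched as a mapping to the grace periods that must elapse before the underlying slab memory is released (a toy sketch with invented names, not a kernel API; in the typesafe case 2, objects may be recycled earlier since the GPs gate only the slab release):

```c
/*
 * Toy sketch (invented names) mapping the four BPF memory lifetime
 * classes to the grace periods that must elapse before slab memory
 * is released.
 */
#define GP_NONE		0u
#define GP_RCU		(1u << 0)
#define GP_RCU_TT	(1u << 1)

enum bpf_lifetime {
	LT_FREE_IMMEDIATE,	/* case 1: kfree_nolock() */
	LT_TYPESAFE_REUSE,	/* case 2: bpf_mem{,_cache}_free() */
	LT_FREE_AFTER_RCU,	/* case 3: sleepable progs not allowed */
	LT_FREE_AFTER_BOTH,	/* case 4: bpf_mem{,_cache}_free_rcu() */
};

static unsigned int gps_before_slab_release(enum bpf_lifetime lt)
{
	switch (lt) {
	case LT_FREE_IMMEDIATE:
		return GP_NONE;
	case LT_TYPESAFE_REUSE:
		/* objects reusable earlier; slab release still waits */
		return GP_RCU | GP_RCU_TT;
	case LT_FREE_AFTER_RCU:
		return GP_RCU;
	case LT_FREE_AFTER_BOTH:
		return GP_RCU | GP_RCU_TT;
	}
	return GP_NONE;
}
```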

The big picture (today, within the BPF memory allocator)
========================================================

- bpf_mem{,_cache}_free()
  - Analogous to SLAB_TYPESAFE_BY_RCU, but for BPF
  - Insert objects into free_llist or free_llist_extra
  - When the high watermark is hit, move objects to free_by_rcu_ttrace,
    then return them to the slab after an RCU-tt GP
  - However, they can be reused before being returned to the slab
    (again, analogous to SLAB_TYPESAFE_BY_RCU)
- bpf_mem{,_cache}_free_rcu()
  - Analogous to kfree_rcu(), but for BPF
  - Objects are inserted into free_by_rcu
  - Moved to the waiting_for_gp list, then wait for an RCU GP
  - Moved to the waiting_for_gp_ttrace list, then wait for an RCU-tt GP,
    then returned to the slab
  - Objects remain intact for both the RCU GP and the RCU-tt GP
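The bpf_mem{,_cache}_free_rcu() pipeline above can be modeled as a tiny state machine (a userspace sketch with invented names, not the actual kernel lists): free_by_rcu, then waiting_for_gp after a flush, then waiting_for_gp_ttrace after an RCU GP, then back to the slab after an RCU-tt GP.

```c
/*
 * Toy state machine (invented names) of the bpf_mem_free_rcu()
 * pipeline: free_by_rcu -> waiting_for_gp (RCU GP) ->
 * waiting_for_gp_ttrace (RCU-tt GP) -> slab.
 */
enum obj_state {
	ST_FREE_BY_RCU,			/* batched, no GP started yet */
	ST_WAITING_FOR_GP,		/* waiting for an RCU GP */
	ST_WAITING_FOR_GP_TTRACE,	/* waiting for an RCU-tt GP */
	ST_SLAB,			/* returned to the slab allocator */
};

enum gp_event { EVT_FLUSH, EVT_RCU_GP, EVT_RCU_TT_GP };

static enum obj_state advance(enum obj_state st, enum gp_event ev)
{
	if (st == ST_FREE_BY_RCU && ev == EVT_FLUSH)
		return ST_WAITING_FOR_GP;
	if (st == ST_WAITING_FOR_GP && ev == EVT_RCU_GP)
		return ST_WAITING_FOR_GP_TTRACE;
	if (st == ST_WAITING_FOR_GP_TTRACE && ev == EVT_RCU_TT_GP)
		return ST_SLAB;
	return st;	/* other events don't move the object */
}
```

Note that an RCU-tt GP alone cannot skip the RCU GP step: objects reach the slab only after both GPs have elapsed in order.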

The big picture (today, outside the BPF memory allocator)
=========================================================

- This slide is intentionally left blank :)

The big picture (in the future - with kmalloc_nolock() follow-ups)
==================================================================

Let's drop the BPF memory allocator completely!

Case A: Free immediately
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock() -> alloc_pages_nolock()
- Free immediately: kfree_nolock() -> free_pages_nolock()

Case B: Non-sleepable readers only, free by RCU
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock() -> alloc_pages_nolock()
- Free by RCU: kfree_rcu_nolock(obj, rf) -> call_rcu_nolock()

Case C: Both sleepable and non-sleepable readers, with free by RCU
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock()
- Free by RCU: kfree_srcu_fast_nolock() -> call_srcu_fast_nolock()

Case D: Both sleepable and non-sleepable readers*, with typesafety-
by-rcu semantics
- Cache: a fixed-size kmem_cache with SLAB_TYPESAFE_BY_SRCU_FAST
- Slab freeing deferred until SRCU-fast GP but objects can be
  reused (analogous to SLAB_TYPESAFE_BY_RCU)
- Alloc: kmem_cache_alloc_nolock()
- Free immediately: kmem_cache_free_nolock() ->
  call_srcu_fast_nolock() (to free slabs)
- Need slab dtor support to release resources when freeing slabs
  after SRCU-fast GP

Unlike with kmalloc_nolock(), the "try the next bucket" trick doesn't
work here. Instead, create two caches: one for normal allocations, the
other for fallback. Free objects with kfree_nolock() without passing
the cache pointer.

*Even when only non-sleepable readers are allowed, you can still
use this!
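The two-cache fallback can be sketched in userspace (invented names throughout; kfree_nolock() actually derives the cache from the object's slab, which is modeled here as a pointer-range scan):

```c
/*
 * Toy sketch (invented names) of the two-cache fallback: allocate
 * from the normal cache, fall back to the second cache, and free
 * without a cache pointer by deriving the owner from the object.
 */
#include <stddef.h>

#define POOL_OBJS 2

struct pool {
	int used[POOL_OBJS];
	char objs[POOL_OBJS][64];
};

static struct pool normal_pool, fallback_pool;

static void *pool_alloc(struct pool *p)
{
	for (int i = 0; i < POOL_OBJS; i++)
		if (!p->used[i]) {
			p->used[i] = 1;
			return p->objs[i];
		}
	return NULL;
}

static void *alloc_nolock_model(void)
{
	void *obj = pool_alloc(&normal_pool);

	/* normal cache exhausted: fall back to the second cache */
	return obj ? obj : pool_alloc(&fallback_pool);
}

/* Free without a cache pointer: find the owning pool from the object. */
static void free_nolock_model(void *obj)
{
	struct pool *pools[] = { &normal_pool, &fallback_pool };

	for (int c = 0; c < 2; c++)
		for (int i = 0; i < POOL_OBJS; i++)
			if ((void *)pools[c]->objs[i] == obj) {
				pools[c]->used[i] = 0;
				return;
			}
}
```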

Progress since last year
========================

- alloc_pages_nolock() / free_pages_nolock() merged in v6.15
- kmalloc_nolock() / kfree_nolock() merged in v6.18
- RCU Tasks Trace re-implemented on top of SRCU-fast in v7.0
  - The transition to SRCU-fast was not smooth; we ended up fixing bugs
- RCU Tasks Trace GP now explicitly implies an RCU GP
  - Implicit since SRCU-fast was introduced in v6.15
  - An explicit contractual guarantee since v7.1-rc1

Things to do
============

- Define clear requirements/expectations from BPF (for memory allocation)
- Introduce kfree_rcu_nolock() (in RFC)
- Introduce kfree_srcu_fast_nolock()
- Add SLAB_TYPESAFE_BY_SRCU_FAST support
  - Need slab destructor support to clean up when freeing slabs
  - Need call_srcu_fast_nolock() to submit free slabs to RCU
- Migrate remaining bpf_mem_alloc users to kmalloc_nolock()
- Introduce call_rcu_nolock() and call_srcu_fast_nolock()
- Allow kmalloc_nolock() with large kmalloc sizes via
  alloc_pages_nolock()

-- 
Cheers,
Harry / Hyeonggon



* Re: kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
From: Levi Zim @ 2026-05-12 13:46 UTC (permalink / raw)
  To: Harry Yoo (Oracle), linux-mm, rcu, bpf
  Cc: Vlastimil Babka, Hao Li, Paul E. McKenney, Uladzislau Rezki,
	Joel Fernandes, Alexei Starovoitov, Andrii Nakryiko,
	Puranjay Mohan, Shakeel Butt, Amery Hung, Kumar Kartikeya Dwivedi



On 5/12/26 8:25 PM, Harry Yoo (Oracle) wrote:
> [...]
> 
> The end goal
> ============
> 
> - Drop the BPF memory allocator
> - Avoid preallocation as much as possible in BPF
> - Use kmalloc_nolock() and kfree_{,rcu_}nolock() (and friends) instead

Switching to kmalloc_nolock() causes a regression on architectures
without HAVE_CMPXCHG_DOUBLE. For reference, currently only x86, arm64,
s390, and loongarch select HAVE_CMPXCHG_DOUBLE.

For example, this has already caused bpf_task_storage_get() with the
BPF_LOCAL_STORAGE_GET_F_CREATE flag to always fail on riscv64 with the
6.19 kernel. I attempted to fix it in
https://lists.infradead.org/pipermail/linux-riscv/2026-March/087159.html,
but as pointed out in that thread, the approach is not sound.

After that, I considered using the BPF memory allocator instead of
kmalloc_nolock() on such architectures, but I haven't had time to
implement it.

I don't know how we could fix this once the BPF memory allocator is
removed completely. Could we find a path forward that avoids
regressions on architectures without HAVE_CMPXCHG_DOUBLE?
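For readers unfamiliar with why HAVE_CMPXCHG_DOUBLE matters here: SLUB's lockless fastpath compares and swaps a freelist pointer together with a counter in one double-width cmpxchg to defeat ABA. A toy userspace sketch of that idea (invented names; it packs two 32-bit fields into one 64-bit word, whereas the real pointer + counter pair on 64-bit kernels is 16 bytes, hence the double-width requirement):

```c
/*
 * Toy sketch (invented names) of the ABA problem that a double-width
 * cmpxchg solves: the freelist id and a transaction counter must be
 * compared and swapped together, so a stale snapshot fails even if
 * the same freelist id has reappeared in the meantime.
 */
#include <stdatomic.h>
#include <stdint.h>

static uint64_t pack(uint32_t freelist, uint32_t counter)
{
	return ((uint64_t)counter << 32) | freelist;
}

/* Pop succeeds only if both the freelist id and the counter still match. */
static int try_pop(_Atomic uint64_t *head, uint32_t old_freelist,
		   uint32_t old_counter, uint32_t new_freelist)
{
	uint64_t expected = pack(old_freelist, old_counter);

	return atomic_compare_exchange_strong(head, &expected,
					      pack(new_freelist,
						   old_counter + 1));
}
```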

Thanks,
Levi

