Linux-mm Archive on lore.kernel.org
* kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
@ 2026-05-12 12:25 Harry Yoo (Oracle)
From: Harry Yoo (Oracle) @ 2026-05-12 12:25 UTC
  To: linux-mm, rcu, bpf
  Cc: Vlastimil Babka, Hao Li, Paul E. McKenney, Uladzislau Rezki,
	Joel Fernandes, Alexei Starovoitov, Andrii Nakryiko,
	Puranjay Mohan, Shakeel Butt, Amery Hung, Kumar Kartikeya Dwivedi

Hello everybody. This is a follow-up discussion of the
"kmalloc_nolock() follow-ups, including kfree_rcu_nolock()" topic from
LSFMMBPF 2026 last week. Unfortunately, many RCU folks were not there,
but we can still discuss over email ;)

The slides: https://docs.google.com/presentation/d/1kpaLd7D1dwRvIqRwQfSjJVVJL0CC2gwb-AV56yCMqXw/edit?usp=sharing

I'm copying the slides here to make it easier to reply.

kmalloc_nolock() follow-ups, including kfree_rcu_nolock()
=========================================================

Today's goal
============

1. Present why and what we're doing
2. Demystify BPF's requirements for memory allocation
3. Discuss solutions

Motivation
==========

BPF map preallocation wastes memory for correctness
- All elements are preallocated by default, unless explicitly opted out
  (BPF_F_NO_PREALLOC)
- Typically not all elements are used, wasting memory

The BPF memory allocator was invented to avoid that
- Allocate elements on demand at BPF runtime
- kmalloc() isn't safe in some BPF contexts (e.g. in NMI, or inside a
  critical section with locks held), so a new allocator was invented

Challenges with the BPF memory allocator
- Memory is tied to the BPF subsystem and can't be used elsewhere
- A burst of allocations can cause failures until async refill catches up
- Trade-off between memory waste and allocation failures at large sizes
- Reinventing every memory allocator feature is a maintenance burden

The end goal
============

- Drop the BPF memory allocator
- Avoid preallocation as much as possible in BPF
- Use kmalloc_nolock() and kfree_{,rcu_}nolock() (and friends) instead

To achieve that, we need to define requirements & expectations from BPF

Background - RCU and BPF programs
=================================

- Non-sleepable BPF progs run in an RCU critical section
- Sleepable BPF progs run in an RCU Tasks Trace critical section
- Freeing by RCU for both non-sleepable (RCU) and sleepable (RCU-tt)
  progs means we need to wait for both GPs before releasing memory
- RCU Tasks Trace (RCU-tt) is an RCU flavor designed for sleepable
  BPF progs, lighter than SRCU
- Since v7.0, RCU Tasks Trace is implemented using SRCU-fast
- Since v7.1-rc1, RCU Tasks Trace GP is contractually guaranteed to
  imply an RCU GP
  - Yes, that means waiting for an SRCU-fast GP automatically covers RCU
    readers
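
The guarantee in the last bullet can be sketched in kernel-context C
(illustrative only, not buildable standalone; synchronize_rcu() and
synchronize_rcu_tasks_trace() are existing kernel APIs, while the helper
function itself is made up for the sketch):

```c
/* Illustrative kernel-context sketch, not buildable standalone. */
static void free_after_both_gps(void *obj)
{
	/*
	 * Before the v7.1-rc1 contractual guarantee, memory referenced
	 * by both non-sleepable (RCU) and sleepable (RCU-tt) BPF progs
	 * conceptually needed two chained waits:
	 *
	 *	synchronize_rcu();		(non-sleepable readers)
	 *	synchronize_rcu_tasks_trace();	(sleepable readers)
	 *
	 * Now an RCU Tasks Trace GP contractually implies an RCU GP,
	 * so a single wait covers both reader types:
	 */
	synchronize_rcu_tasks_trace();
	kfree(obj);
}
```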

Background - BPF memory lifetime
================================

Objects allocated for BPF may be referenced by either 1) non-sleepable
or 2) sleepable BPF programs, or both

Memory allocated for BPF may be:
1. Freed immediately <- This is supported today via kfree_nolock()
2. Freed immediately, but can be recycled with typesafety-by-rcu
   semantics (for both RCU, RCU-tt)
  - bpf_mem{,_cache}_free()
3. Freed after RCU GP
  - sleepable progs not allowed
4. Freed after RCU GP + RCU-tt GP
  - sleepable progs allowed
  - bpf_mem{,_cache}_free_rcu()

If you need something not listed here, please let us know!
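
To make the four classes concrete, here is a rough mapping of each one
to an API (kernel-context sketch, not buildable standalone; 'ma' is a
struct bpf_mem_alloc, 'rcu' is an assumed rcu_head field name in the
object, and kfree_rcu() stands in for class 3 since no API is named
there above):

```c
/* Illustrative mapping of the four lifetime classes (sketch only). */
static void lifetime_classes(struct bpf_mem_alloc *ma, struct elem *obj)
{
	/* 1. Freed immediately -- no reader may still hold a pointer: */
	kfree_nolock(obj);

	/* 2. Freed now, but recyclable with typesafety-by-rcu semantics: */
	bpf_mem_free(ma, obj);

	/* 3. Freed after an RCU GP -- non-sleepable readers only
	 *    (kfree_rcu() shown as the closest existing analogue):
	 */
	kfree_rcu(obj, rcu);

	/* 4. Freed after RCU GP + RCU-tt GP -- sleepable readers allowed: */
	bpf_mem_free_rcu(ma, obj);
}
```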

The big picture (today, within the BPF memory allocator)
========================================================

- bpf_mem{,_cache}_free()
  - Analogous to SLAB_TYPESAFE_BY_RCU, but for BPF
  - Objects are inserted into free_llist or free_llist_extra
  - When the high watermark is hit, objects are moved to
    free_by_rcu_ttrace and returned to slab after an RCU-tt GP
  - However, they can be reused before being returned to slab (again,
    analogous to SLAB_TYPESAFE_BY_RCU)
- bpf_mem{,_cache}_free_rcu()
  - Analogous to kfree_rcu(), but for BPF
  - Objects are inserted into free_by_rcu
  - Moved to the waiting_for_gp list, then wait for an RCU GP
  - Moved to the waiting_for_gp_ttrace list, then wait for an RCU-tt GP,
    then returned to slab
  - Objects remain intact across both the RCU GP and the RCU-tt GP
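
Reuse-before-return means readers must follow the usual
SLAB_TYPESAFE_BY_RCU contract: an object found under rcu_read_lock()
may already have been freed and recycled, so its identity has to be
re-validated under a per-object lock. A generic sketch of that pattern
(the 'lock' and 'key' fields and the lookup() helper are illustrative,
not from any specific map implementation):

```c
/* Generic typesafe-by-RCU reader pattern (illustrative sketch). */
static struct elem *lookup_stable(unsigned long key)
{
	struct elem *e;

	rcu_read_lock();
	e = lookup(key);		/* may race with free + reuse */
	if (e) {
		spin_lock(&e->lock);
		if (e->key != key) {	/* object was recycled: bail out */
			spin_unlock(&e->lock);
			e = NULL;
		}
	}
	rcu_read_unlock();
	return e;			/* if non-NULL, e->lock is held */
}
```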

The big picture (today, outside the BPF memory allocator)
=========================================================

- This slide is intentionally left blank :)

The big picture (in the future - with kmalloc_nolock() follow-ups)
==================================================================

Let's drop the BPF memory allocator completely!

Case A: Free immediately
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock() -> alloc_pages_nolock()
- Free immediately: kfree_nolock() -> free_pages_nolock()

Case B: Non-sleepable readers only, free by RCU
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock() -> alloc_pages_nolock()
- Free by RCU: kfree_rcu_nolock(obj, rf) -> call_rcu_nolock()
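
Cases A and B might be used roughly as follows (a sketch: the
three-argument kmalloc_nolock() form matches the merged API as I
understand it, with flags/node kept trivial for brevity, while
kfree_rcu_nolock() is still RFC, so its kfree_rcu()-style
(ptr, rcu_head-field) signature is an assumption):

```c
struct elem {
	struct rcu_head rcu;
	int data;
};

static void cases_a_and_b(void)
{
	/* Case A: allocate and free immediately, from any context. */
	struct elem *e = kmalloc_nolock(sizeof(*e), 0, NUMA_NO_NODE);

	if (e)
		kfree_nolock(e);

	/* Case B: non-sleepable readers only; defer the free past an
	 * RCU GP via the proposed (RFC) kfree_rcu_nolock().
	 */
	e = kmalloc_nolock(sizeof(*e), 0, NUMA_NO_NODE);
	if (e)
		kfree_rcu_nolock(e, rcu);
}
```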

Case C: Both sleepable and non-sleepable readers, with free by RCU
- Cache: existing kmalloc-<size> family
- Alloc: kmalloc_nolock()
- Free by RCU: kfree_srcu_fast_nolock() -> call_srcu_fast_nolock()

Case D: Both sleepable and non-sleepable readers*, with typesafety-
by-rcu semantics
- Cache: a fixed-size kmem_cache with SLAB_TYPESAFE_BY_SRCU_FAST
- Slab freeing deferred until SRCU-fast GP but objects can be
  reused (analogous to SLAB_TYPESAFE_BY_RCU)
- Alloc: kmem_cache_alloc_nolock()
- Free immediately: kmem_cache_free_nolock() ->
  call_srcu_fast_nolock() (to free slabs)
- Need slab dtor support to release resources when freeing slabs
  after SRCU-fast GP

Unlike with kmalloc_nolock(), the "try the next bucket" trick doesn't
work for a fixed-size kmem_cache. Instead, create two caches: one for
normal allocations, the other for fallback. Free objects with
kfree_nolock() without passing the cache pointer.

*Even when only non-sleepable readers are allowed, you can still
use this!
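
Case D, sketched with the proposed APIs (nothing below is merged:
SLAB_TYPESAFE_BY_SRCU_FAST, kmem_cache_alloc_nolock(),
kmem_cache_free_nolock(), and the slab dtor hook are all names taken
from the slides and their signatures are assumptions; kmem_cache_create()
is real, the flag is not):

```c
/* Case D sketch -- proposed APIs only, nothing here is merged. */
struct elem {
	int data;
};

static struct kmem_cache *elem_cache;

static int elem_cache_init(void)
{
	/* SLAB_TYPESAFE_BY_SRCU_FAST is the proposed flag: objects may
	 * be reused immediately, but the backing slab is only freed
	 * after an SRCU-fast GP (which also covers plain RCU readers).
	 * A slab dtor (not shown) would release per-object resources
	 * at that point.
	 */
	elem_cache = kmem_cache_create("elem", sizeof(struct elem), 0,
				       SLAB_TYPESAFE_BY_SRCU_FAST, NULL);
	return elem_cache ? 0 : -ENOMEM;
}

static void elem_cycle(void)
{
	/* Proposed _nolock variants of the kmem_cache APIs: */
	struct elem *e = kmem_cache_alloc_nolock(elem_cache, 0);

	if (e)
		kmem_cache_free_nolock(elem_cache, e);
}
```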

Progress since last year
========================

- alloc_pages_nolock() / free_pages_nolock() merged in v6.15
- kmalloc_nolock() / kfree_nolock() merged in v6.18
- RCU Tasks Trace re-implemented on top of SRCU-fast in v7.0
  - The transition to SRCU-fast was not smooth; we ended up fixing bugs
- RCU Tasks Trace GP now explicitly implies an RCU GP
  - implicit since SRCU-fast was introduced in v6.15
  - explicit contractual guarantee since v7.1-rc1

Things to do
============

- Define clear requirements/expectations from BPF (for memory allocation)
- Introduce kfree_rcu_nolock() (in RFC)
- Introduce kfree_srcu_fast_nolock()
- Add SLAB_TYPESAFE_BY_SRCU_FAST support
  - Need slab destructor support to clean up when freeing slabs
  - Need call_srcu_fast_nolock() to submit free slabs to RCU
- Migrate remaining bpf_mem_alloc users to kmalloc_nolock()
- Introduce call_rcu_nolock() and call_srcu_fast_nolock()
- Allow kmalloc_nolock() with large kmalloc sizes via
  alloc_pages_nolock()

-- 
Cheers,
Harry / Hyeonggon

