Re: [PATCH bpf-next 3/4] bpf,slab: Add slub-backed allocator for bpf_arena

BPF List
 help / color / mirror / Atom feed

From: "Alexei Starovoitov" <alexei.starovoitov@gmail.com>
To: <sashiko-reviews@lists.linux.dev>
Cc: <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next 3/4] bpf,slab: Add slub-backed allocator for bpf_arena
Date: Fri, 29 May 2026 17:04:26 -0700	[thread overview]
Message-ID: <DIVKLHOSYUP3.MMDGEURVFLHD@gmail.com> (raw)
In-Reply-To: <20260529220410.1251F1F00893@smtp.kernel.org>

On Fri May 29, 2026 at 3:04 PM PDT, sashiko-bot wrote:
> Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
> - [Critical] BPF-induced freelist double-frees artificially drive SLUB's inuse counter to zero, prematurely freeing active slabs and causing Use-After-Free.
> - [Critical] Skipping alloc_slab_obj_exts_early() for arena caches leaves slab->obj_exts uninitialized, causing a kernel panic when SLUB attempts to kfree() the leftover mem_cgroup pointer.
> - [Critical] An out-of-bounds array access in arena_free_pages allows a malicious BPF program to read and overwrite arbitrary kernel memory by bypassing the bounds check.
> - [Critical] A TOCTOU race condition in bpf_arena_free for multi-page fallback allocations allows concurrent frees to corrupt kernel page allocator metadata.
> - [High] BPF program can permanently brick the arena allocator by calling bpf_arena_free_pages() on an active slab page.
> - [High] Invalid wait context: calling apply_to_page_range() under a raw spinlock causes scheduling while atomic on PREEMPT_RT.
> - [Medium] Trylock failures in __slab_free permanently leak orphaned empty slabs.

that's a lot of slop to unpack :)
I wish sashiko learned from these explanations, but here they are.. at least for humans.

>>  static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
>>  {
>>  	struct clear_range_data *d = data;
>> -	pte_t old_pte;
>> +	pte_t old_pte, cur;
>>  	struct page *page;
>>  
>> +	/*
>> +	 * Skip slub-owned pages: BPF must use bpf_arena_free() for per-object
>> +	 * slab frees. The PTE stays; slub releases it via arena_free_slab_page()
>> +	 * after __ClearPageSlab(). Non-atomic ptep_get() is safe -- ptep_try_set()
>> +	 * only fires on pte_none, and arena_free_slab_page() can't race on this
>> +	 * offset (range stays allocated in range_tree for our walk).
>> +	 */
>> +	cur = ptep_get(pte);
>> +	if (pte_none(cur) || !pte_present(cur))
>> +		return 0;
>> +	if (PageSlab(pte_page(cur)))
>> +		return 0;
>
> [Severity: High]
> If this skips clearing the kernel PTEs for PageSlab pages, could it permanently
> brick the arena allocator?

brick arena allocator? no.

> If a BPF program calls bpf_arena_free_pages() on an active slab page,
> arena_free_pages() adds the page offset back to the arena's free range tree
> unconditionally.

which is a bug in bpf prog.

> Since the physical page remains mapped, won't subsequent allocations via
> arena_alloc_pages() repeatedly pick this slot from the interval tree, hit
> -EBUSY in apply_to_page_range() because the PTE is not empty, and abort?

they won't "abort". subsequent arena_alloc_pages() will return NULL.
Works as designed.

>
> [ ... ]
>
>> @@ -625,7 +686,8 @@ static u64 clear_lo32(u64 val)
>>   * Later the pages will be mmaped into user space vma.
>>   */
>>  static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id,
>> -			      bool sleepable)
>> +			      bool sleepable, bool set_page_slab,
>> +			      struct page **out_page)
>
> [Severity: High]
> Does this code acquire a sleeping lock inside a raw spinlock?
>
> Both arena_alloc_pages() and arena_free_pages() acquire arena->spinlock using
> raw_res_spin_lock_irqsave(), which is a raw spinlock. While holding this, they
> call apply_to_page_range() or apply_to_existing_page_range().
>
> These functions walk the page tables and acquire the init_mm page table lock
> (ptl), which is a standard spinlock_t. On PREEMPT_RT kernels, standard
> spinlocks are sleepable rt_mutexes. Could this trigger a BUG/panic on
> PREEMPT_RT due to an invalid wait context?

Nope. There won't be any attempts to take locks.
sashiko is missing that populate_pgtable_except_pte() was called at arena creation time.

> [ ... ]
>
>> @@ -778,6 +854,21 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt,
>>  
>>  	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
>>  	pgoff = compute_pgoff(arena, uaddr);
>> +
>> +	/*
>> +	 * Drop bookkeeping for any bpf_arena_alloc() fallback pages within the
>> +	 * freed range. PageSlab entries are owned by slub and must not be
>> +	 * cleared here; slub clears them via bpf_arena_free_slab_page() when
>> +	 * the slab page is released.
>> +	 */
>> +	for (i = 0; i < page_cnt; i++) {
>> +		fb_page = READ_ONCE(arena->slab_pages[pgoff + i]);
>
> [Severity: Critical]
> Is there a risk of an out-of-bounds array access here that could bypass the
> bounds check and corrupt memory?
>
> If a BPF program passes ptr__ign = 0, and arena->user_vm_start has non-zero
> lower 32 bits, clear_lo32() rounds it down to a 4GB boundary, making
> full_uaddr this rounded-down value.

sashiko is missing that bpf_arena_free() has if (!ptr__ign) return;
but lack of page alignment check is indeed missing bpf_arena_free().
It shouldn't be possible to pass ptr__ign == 1 and go down
arena_free_pages() path.
Will fix like:
      if (!IS_ALIGNED(arena_off, PAGE_SIZE))
              goto violation;

>> +__bpf_kfunc void bpf_arena_free(void *p__map, void *ptr__ign)
>> +{
> [ ... ]
>> +	if (!PageSlab(page)) {
>> +		/*
>> +		 * Multi-page allocation from the bpf_arena_alloc() fallback.
>> +		 * page->private holds page_cnt stashed at allocation time.
>> +		 */
>> +		u32 page_cnt = page_private(page);
>> +
>> +		WRITE_ONCE(arena->slab_pages[pgoff], NULL);
>> +		set_page_private(page, 0);
>> +		arena_free_pages(arena, arena_off, page_cnt, false);
>> +		return;
>> +	}
>
> [Severity: Critical]
> Is there a TOCTOU race condition here when freeing multi-page fallback
> allocations?
>
> If a BPF program executes bpf_arena_free() concurrently from multiple threads
> on the same valid pointer, they might both read the page pointer locklessly and
> pass the !PageSlab(page) check.
>
> If the first thread calls arena_free_pages(), unmapping the page and freeing it
> back to the kernel buddy allocator, the second thread will then execute
> set_page_private(page, 0). 
>
> Since the page is already in the buddy allocator at that point, could writing 0
> to its private field corrupt the buddy allocator's internal order-tracking
> metadata?

that's a good one. it's a race indeed.
Fixing like this:
      if (cmpxchg(&arena->slab_pages[pgoff], page, NULL) != page)
              goto violation;

>> diff --git a/mm/slub.c b/mm/slub.c
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>
> [ ... ]
>
>> @@ -543,7 +558,12 @@ static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
>>  	unsigned long freeptr_addr = (unsigned long)object + s->offset;
>>  
>>  #ifdef CONFIG_SLAB_FREELIST_HARDENED
>> -	BUG_ON(object == fp); /* naive detection of double free or corruption */
>> +	if (unlikely(object == fp)) {
>> +		/* BPF double-free of arena objects must not panic the kernel. */
>> +		if (s->bpf_arena)
>> +			return;
>> +		BUG_ON(object == fp); /* naive detection of double free or corruption */
>> +	}
>>  #endif
>
> [Severity: Critical]
> Does returning early here to suppress the double-free warning lead to a
> corrupted inuse counter?
>
> If a BPF program calls bpf_arena_free() multiple times on the same object,
> this early return bypasses the double-free detection for arena caches. However,
> __slab_free() still unconditionally executes new.inuse -= cnt.
>
> Repeated double-frees will drive the inuse counter to 0. When inuse hits 0,
> SLUB assumes the slab is empty and frees the page back to the buddy allocator,
> even though other objects in the slab may still be in use by the BPF program
> or cached in SLUB's per-CPU sheaves. Could this lead to arbitrary memory
> corruption if the page is reused for kernel data structures?

Not an issue. sashiko missed the key safety concept of the algorithm
that is described in the commit log.
__slab_free() calls bpf_arena_free_slab_page() which frees PTE for that page
in arena, but bpf progs never see kernel direct-map virtual addresses.
bpf progs operate on kern_vm_start + uaddr32.
So once arena page is freed, bpf has no way to access it.
new.inuse for this arena based kmem_cache can be garbage.
It doesn't affect kernel at all. Only bpf prog that caused corruption
will get garbage back.

>
> [ ... ]
>
>> @@ -3493,9 +3516,11 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>>  	init_slab_obj_exts(slab);
>>  	/*
>>  	 * Poison the slab before initializing the slabobj_ext array
>> -	 * to prevent the array from being overwritten.
>> +	 * to prevent the array from being overwritten. Arena caches
>> +	 * stash uaddr32 in slab->stride; let them keep it.
>>  	 */
>> -	alloc_slab_obj_exts_early(s, slab);
>> +	if (!(s->flags & SLAB_BPF_ARENA))
>> +		alloc_slab_obj_exts_early(s, slab);
>
> [Severity: Critical]
> Does skipping alloc_slab_obj_exts_early() for arena caches leave
> slab->obj_exts uninitialized?
>
> Pages allocated for the BPF arena via bpf_map_alloc_pages() use __GFP_ACCOUNT,
> which sets folio->memcg_data to point to the memcg. In SLUB, slab->obj_exts
> shares the same memory location as folio->memcg_data.
>
> If alloc_slab_obj_exts_early() is skipped, slab->obj_exts is never zeroed and
> retains the memcg pointer. When the slab is freed, won't unaccount_slab() call
> free_slab_obj_exts(), which reads the non-null slab->obj_exts and passes it to
> kfree(), causing a kernel panic?

sashiko is clearly confused in the last paragraph.
The code it quoted does init_slab_obj_exts() that zeros it.
But the point about sharing the same memory location as folio->memcg_data
is valid. As implemented it skips memcg uncharging when page is freed.
Fixing it with extra arena specific wrapper that does:
    if (!page->memcg_data && objcg && !obj_cgroup_is_root(objcg))
            page->memcg_data = (unsigned long)objcg | MEMCG_DATA_KMEM;
    __free_page(page);

>
> [ ... ]
>
>> @@ -5589,14 +5644,19 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>>  
>>  			n = get_node(s, slab_nid(slab));
>>  			/*
>> -			 * Speculatively acquire the list_lock.
>> -			 * If the cmpxchg does not succeed then we may
>> -			 * drop the list_lock without any processing.
>> -			 *
>> -			 * Otherwise the list_lock will synchronize with
>> -			 * other processors updating the list of slabs.
>> +			 * Speculatively acquire the list_lock. If the cmpxchg
>> +			 * does not succeed we drop the lock without processing.
>>  			 *
>> +			 * Arena caches may reach here from kfree_nolock() in
>> +			 * NMI/irq-off context; trylock and orphan the slab on
>> +			 * failure. A later allow_spin caller adopts it.
>>  			 */
>> -			spin_lock_irqsave(&n->list_lock, flags);
>> +			if (unlikely(s->bpf_arena)) {
>> +				if (!spin_trylock_irqsave(&n->list_lock, flags))
>> +					n = NULL;
>> +			} else {
>> +				spin_lock_irqsave(&n->list_lock, flags);
>> +			}
>
> [Severity: Medium]
> Can this trylock failure result in permanently leaking orphaned empty slabs?
>
> If spin_trylock_irqsave() fails for an arena cache, n is set to NULL, and the
> function returns early without adding the slab to the partial list, orphaning
> the slab.
>
> Later, when the slab becomes completely empty (new.inuse == 0), __slab_free()
> acquires the list lock but evaluates:
>
> if (!was_full && !on_node_partial)
>
> Since the slab was orphaned, on_node_partial is false, causing the function
> to unlock and return early. The slab_empty discard logic is bypassed. Does
> this permanently leak the empty slab until the entire arena map is destroyed?

No, it doesn't leak. That was also explained in the comment quoted above
and elsewhere throught arena.c. The whole arena_destroy_slab_caches() is about that.

next prev parent reply	other threads:[~2026-05-30  0:04 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-29 20:24 [PATCH bpf-next 0/4] bpf,slab: Introduce bpf_arena_alloc() kfuncs Alexei Starovoitov
2026-05-29 20:24 ` [PATCH bpf-next 1/4] slab: Introduce kmem_cache_alloc_nolock() Alexei Starovoitov
2026-05-29 21:20   ` sashiko-bot
2026-05-29 22:37     ` Alexei Starovoitov
2026-05-29 20:24 ` [PATCH bpf-next 2/4] slub: Pass kmem_cache to alloc_slab_page() Alexei Starovoitov
2026-05-29 20:24 ` [PATCH bpf-next 3/4] bpf,slab: Add slub-backed allocator for bpf_arena Alexei Starovoitov
2026-05-29 22:04   ` sashiko-bot
2026-05-30  0:04     ` Alexei Starovoitov [this message]
2026-05-29 20:24 ` [PATCH bpf-next 4/4] selftests/bpf: Add tests for arena slub-backed allocator Alexei Starovoitov
2026-05-29 22:39   ` sashiko-bot
2026-05-30  0:13     ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DIVKLHOSYUP3.MMDGEURVFLHD@gmail.com \
    --to=alexei.starovoitov@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox