Re: [PATCH bpf-next 3/4] bpf,slab: Add slub-backed allocator for bpf_arena

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Alexei Starovoitov" <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 3/4] bpf,slab: Add slub-backed allocator for bpf_arena
Date: Fri, 29 May 2026 22:04:09 +0000	[thread overview]
Message-ID: <20260529220410.1251F1F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260529202438.16408-4-alexei.starovoitov@gmail.com>

Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
- [Critical] BPF-induced freelist double-frees artificially drive SLUB's inuse counter to zero, prematurely freeing active slabs and causing Use-After-Free.
- [Critical] Skipping alloc_slab_obj_exts_early() for arena caches leaves slab->obj_exts uninitialized, causing a kernel panic when SLUB attempts to kfree() the leftover mem_cgroup pointer.
- [Critical] An out-of-bounds array access in arena_free_pages allows a malicious BPF program to read and overwrite arbitrary kernel memory by bypassing the bounds check.
- [Critical] A TOCTOU race condition in bpf_arena_free for multi-page fallback allocations allows concurrent frees to corrupt kernel page allocator metadata.
- [High] BPF program can permanently brick the arena allocator by calling bpf_arena_free_pages() on an active slab page.
- [High] Invalid wait context: calling apply_to_page_range() under a raw spinlock causes scheduling while atomic on PREEMPT_RT.
- [Medium] Trylock failures in __slab_free permanently leak orphaned empty slabs.
--

commit 0bd3f7f14bd0e0d83546ab9ece9d884e9f9bd2e5
Author: Alexei Starovoitov <ast@kernel.org>

bpf,slab: Add slub-backed allocator for bpf_arena

Let BPF programs allocate typed objects in a bpf_arena via a
kvmalloc-style API: bpf_arena_alloc() routes requests up to
PAGE_SIZE through per-arena slab buckets, and falls back to
arena_alloc_pages() for larger sizes.

> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c

[ ... ]

> @@ -179,9 +207,22 @@ static void flush_vmap_cache(unsigned long start, unsigned long size)
>  static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
>  {
>  	struct clear_range_data *d = data;
> -	pte_t old_pte;
> +	pte_t old_pte, cur;
>  	struct page *page;
>  
> +	/*
> +	 * Skip slub-owned pages: BPF must use bpf_arena_free() for per-object
> +	 * slab frees. The PTE stays; slub releases it via arena_free_slab_page()
> +	 * after __ClearPageSlab(). Non-atomic ptep_get() is safe -- ptep_try_set()
> +	 * only fires on pte_none, and arena_free_slab_page() can't race on this
> +	 * offset (range stays allocated in range_tree for our walk).
> +	 */
> +	cur = ptep_get(pte);
> +	if (pte_none(cur) || !pte_present(cur))
> +		return 0;
> +	if (PageSlab(pte_page(cur)))
> +		return 0;

[Severity: High]
If this skips clearing the kernel PTEs for PageSlab pages, could it permanently
brick the arena allocator?

If a BPF program calls bpf_arena_free_pages() on an active slab page,
arena_free_pages() adds the page offset back to the arena's free range tree
unconditionally. 

Since the physical page remains mapped, won't subsequent allocations via
arena_alloc_pages() repeatedly pick this slot from the interval tree, hit
-EBUSY in apply_to_page_range() because the PTE is not empty, and abort?

[ ... ]

> @@ -625,7 +686,8 @@ static u64 clear_lo32(u64 val)
>   * Later the pages will be mmaped into user space vma.
>   */
>  static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id,
> -			      bool sleepable)
> +			      bool sleepable, bool set_page_slab,
> +			      struct page **out_page)

[Severity: High]
Does this code acquire a sleeping lock inside a raw spinlock?

Both arena_alloc_pages() and arena_free_pages() acquire arena->spinlock using
raw_res_spin_lock_irqsave(), which is a raw spinlock. While holding this, they
call apply_to_page_range() or apply_to_existing_page_range().

These functions walk the page tables and acquire the init_mm page table lock
(ptl), which is a standard spinlock_t. On PREEMPT_RT kernels, standard
spinlocks are sleepable rt_mutexes. Could this trigger a BUG/panic on
PREEMPT_RT due to an invalid wait context?

[ ... ]

> @@ -778,6 +854,21 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt,
>  
>  	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
>  	pgoff = compute_pgoff(arena, uaddr);
> +
> +	/*
> +	 * Drop bookkeeping for any bpf_arena_alloc() fallback pages within the
> +	 * freed range. PageSlab entries are owned by slub and must not be
> +	 * cleared here; slub clears them via bpf_arena_free_slab_page() when
> +	 * the slab page is released.
> +	 */
> +	for (i = 0; i < page_cnt; i++) {
> +		fb_page = READ_ONCE(arena->slab_pages[pgoff + i]);

[Severity: Critical]
Is there a risk of an out-of-bounds array access here that could bypass the
bounds check and corrupt memory?

If a BPF program passes ptr__ign = 0, and arena->user_vm_start has non-zero
lower 32 bits, clear_lo32() rounds it down to a 4GB boundary, making
full_uaddr this rounded-down value.

The bounds check:
if (full_uaddr >= uaddr_end)

evaluates to false because full_uaddr is less than uaddr_end. However,
compute_pgoff() evaluates (u32)(0 - (u32)user_vm_start) >> PAGE_SHIFT, which
underflows to a massive positive index.

Could this massive pgoff allow the loop to iterate over
arena->slab_pages[pgoff + i] out of bounds, potentially writing NULL to
arbitrary kernel memory if it finds a value that passes the !PageSlab() check?

[ ... ]

> +__bpf_kfunc void bpf_arena_free(void *p__map, void *ptr__ign)
> +{
[ ... ]
> +	if (!PageSlab(page)) {
> +		/*
> +		 * Multi-page allocation from the bpf_arena_alloc() fallback.
> +		 * page->private holds page_cnt stashed at allocation time.
> +		 */
> +		u32 page_cnt = page_private(page);
> +
> +		WRITE_ONCE(arena->slab_pages[pgoff], NULL);
> +		set_page_private(page, 0);
> +		arena_free_pages(arena, arena_off, page_cnt, false);
> +		return;
> +	}

[Severity: Critical]
Is there a TOCTOU race condition here when freeing multi-page fallback
allocations?

If a BPF program executes bpf_arena_free() concurrently from multiple threads
on the same valid pointer, they might both read the page pointer locklessly and
pass the !PageSlab(page) check.

If the first thread calls arena_free_pages(), unmapping the page and freeing it
back to the kernel buddy allocator, the second thread will then execute
set_page_private(page, 0). 

Since the page is already in the buddy allocator at that point, could writing 0
to its private field corrupt the buddy allocator's internal order-tracking
metadata?

> diff --git a/mm/slub.c b/mm/slub.c
> --- a/mm/slub.c
> +++ b/mm/slub.c

[ ... ]

> @@ -543,7 +558,12 @@ static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
>  	unsigned long freeptr_addr = (unsigned long)object + s->offset;
>  
>  #ifdef CONFIG_SLAB_FREELIST_HARDENED
> -	BUG_ON(object == fp); /* naive detection of double free or corruption */
> +	if (unlikely(object == fp)) {
> +		/* BPF double-free of arena objects must not panic the kernel. */
> +		if (s->bpf_arena)
> +			return;
> +		BUG_ON(object == fp); /* naive detection of double free or corruption */
> +	}
>  #endif

[Severity: Critical]
Does returning early here to suppress the double-free warning lead to a
corrupted inuse counter?

If a BPF program calls bpf_arena_free() multiple times on the same object,
this early return bypasses the double-free detection for arena caches. However,
__slab_free() still unconditionally executes new.inuse -= cnt.

Repeated double-frees will drive the inuse counter to 0. When inuse hits 0,
SLUB assumes the slab is empty and frees the page back to the buddy allocator,
even though other objects in the slab may still be in use by the BPF program
or cached in SLUB's per-CPU sheaves. Could this lead to arbitrary memory
corruption if the page is reused for kernel data structures?

[ ... ]

> @@ -3493,9 +3516,11 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  	init_slab_obj_exts(slab);
>  	/*
>  	 * Poison the slab before initializing the slabobj_ext array
> -	 * to prevent the array from being overwritten.
> +	 * to prevent the array from being overwritten. Arena caches
> +	 * stash uaddr32 in slab->stride; let them keep it.
>  	 */
> -	alloc_slab_obj_exts_early(s, slab);
> +	if (!(s->flags & SLAB_BPF_ARENA))
> +		alloc_slab_obj_exts_early(s, slab);

[Severity: Critical]
Does skipping alloc_slab_obj_exts_early() for arena caches leave
slab->obj_exts uninitialized?

Pages allocated for the BPF arena via bpf_map_alloc_pages() use __GFP_ACCOUNT,
which sets folio->memcg_data to point to the memcg. In SLUB, slab->obj_exts
shares the same memory location as folio->memcg_data.

If alloc_slab_obj_exts_early() is skipped, slab->obj_exts is never zeroed and
retains the memcg pointer. When the slab is freed, won't unaccount_slab() call
free_slab_obj_exts(), which reads the non-null slab->obj_exts and passes it to
kfree(), causing a kernel panic?

[ ... ]

> @@ -5589,14 +5644,19 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  
>  			n = get_node(s, slab_nid(slab));
>  			/*
> -			 * Speculatively acquire the list_lock.
> -			 * If the cmpxchg does not succeed then we may
> -			 * drop the list_lock without any processing.
> -			 *
> -			 * Otherwise the list_lock will synchronize with
> -			 * other processors updating the list of slabs.
> +			 * Speculatively acquire the list_lock. If the cmpxchg
> +			 * does not succeed we drop the lock without processing.
>  			 *
> +			 * Arena caches may reach here from kfree_nolock() in
> +			 * NMI/irq-off context; trylock and orphan the slab on
> +			 * failure. A later allow_spin caller adopts it.
>  			 */
> -			spin_lock_irqsave(&n->list_lock, flags);
> +			if (unlikely(s->bpf_arena)) {
> +				if (!spin_trylock_irqsave(&n->list_lock, flags))
> +					n = NULL;
> +			} else {
> +				spin_lock_irqsave(&n->list_lock, flags);
> +			}

[Severity: Medium]
Can this trylock failure result in permanently leaking orphaned empty slabs?

If spin_trylock_irqsave() fails for an arena cache, n is set to NULL, and the
function returns early without adding the slab to the partial list, orphaning
the slab.

Later, when the slab becomes completely empty (new.inuse == 0), __slab_free()
acquires the list lock but evaluates:

if (!was_full && !on_node_partial)

Since the slab was orphaned, on_node_partial is false, causing the function
to unlock and return early. The slab_empty discard logic is bypassed. Does
this permanently leak the empty slab until the entire arena map is destroyed?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529202438.16408-1-alexei.starovoitov@gmail.com?part=3

next prev parent reply	other threads:[~2026-05-29 22:04 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-29 20:24 [PATCH bpf-next 0/4] bpf,slab: Introduce bpf_arena_alloc() kfuncs Alexei Starovoitov
2026-05-29 20:24 ` [PATCH bpf-next 1/4] slab: Introduce kmem_cache_alloc_nolock() Alexei Starovoitov
2026-05-29 21:20   ` sashiko-bot
2026-05-29 22:37     ` Alexei Starovoitov
2026-05-29 20:24 ` [PATCH bpf-next 2/4] slub: Pass kmem_cache to alloc_slab_page() Alexei Starovoitov
2026-05-29 20:24 ` [PATCH bpf-next 3/4] bpf,slab: Add slub-backed allocator for bpf_arena Alexei Starovoitov
2026-05-29 22:04   ` sashiko-bot [this message]
2026-05-30  0:04     ` Alexei Starovoitov
2026-05-29 20:24 ` [PATCH bpf-next 4/4] selftests/bpf: Add tests for arena slub-backed allocator Alexei Starovoitov
2026-05-29 22:39   ` sashiko-bot
2026-05-30  0:13     ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260529220410.1251F1F00893@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=alexei.starovoitov@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.