Re: [PATCH] bpf: Fix use-after-free in htab elem with bpf_task_work

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Shengzhuo Wei" <me@cherr.cc>
Cc: bpf@vger.kernel.org
Subject: Re: [PATCH] bpf: Fix use-after-free in htab elem with bpf_task_work
Date: Wed, 24 Jun 2026 19:31:48 +0000	[thread overview]
Message-ID: <20260624193149.3EE921F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260625-bpf-htab-task-work-uaf-v1-1-31e711ccb159@cherr.cc>

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:

New issues:
- [Critical] Unconditional call to `check_and_init_map_value` on inline element memory causes out-of-bounds writes for per-CPU maps.
- [High] Dynamic memory allocation (`kmalloc(GFP_ATOMIC)`) in the element free path breaks NMI safety and causes memory exhaustion.
- [High] Permanent bricking of `BPF_TASK_WORK` fields during in-place map updates.

Pre-existing issues:
- [Critical] LRU map eviction synchronously reuses elements containing `BPF_TASK_WORK`, causing UAF and data corruption.
- [High] Incomplete fix for UAF in preallocated HASH maps due to `extra_elems` cache bypass.
--

commit 3d208f380a67702d8970ff27139db174f1482e4a
Author: Shengzhuo Wei <me@cherr.cc>

bpf: Fix use-after-free in htab elem with bpf_task_work

This commit attempts to fix a use-after-free vulnerability when deleting
hash map elements that embed a bpf_task_work. It defers recycling through
call_rcu_tasks_trace() to wait for rcu_read_lock_trace() readers. It also
changes the sentinel value in cancel_and_free() to ERR_PTR(-EBUSY).

Fixes: 38aa7003e369 ("bpf: task work scheduling kfuncs")
Signed-off-by: Shengzhuo Wei <me@cherr.cc>

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 9f394e1aa2e85..19fa61909e4fe 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -310,6 +310,7 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
>  		bpf_map_inc_elem_count(&htab->map);
>  		l = container_of(node, struct htab_elem, lru_node);
>  		memcpy(l->key, key, htab->map.key_size);

[Severity: Critical]
This is a pre-existing issue, but does this synchronous eviction path in
kernel/bpf/hashtab.c:prealloc_lru_pop() leave LRU maps vulnerable to a
use-after-free?

When an LRU map is full, it triggers synchronous eviction. The victim node is
removed and immediately returned for reuse.

Since there is no RCU tasks trace grace period between eviction and reuse,
could a concurrent BPF program holding rcu_read_lock_trace() have its memory
overwritten while still in use?

> +		check_and_init_map_value(&htab->map, htab_elem_value(l, htab->map.key_size));

[Severity: Critical]
Can this unconditional initialization cause out-of-bounds memory corruption
for per-CPU maps?

For per-CPU maps, the actual map values are allocated separately per CPU, and
the inline memory returned by htab_elem_value() is only large enough to hold
the 8-byte pointer to the per-CPU allocation.

If a per-CPU map contains a special field at an offset greater than 8, does
check_and_init_map_value() write zeroes out of bounds of the htab_elem
inline storage, corrupting adjacent slab memory?

>  		return l;
>  	}
>  

[ ... ]

> @@ -950,12 +951,69 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>  	return -ENOENT;
>  }
>  
> +/* Deferred htab_elem free for bpf_task_work maps.  cancel_and_free()
> + * returns while the task_work callback may still be accessing map_val;
> + * the callback holds guard(rcu_tasks_trace), so deferring the recycle

[ ... ]

> +static void htab_elem_defer_free(struct bpf_htab *htab, struct htab_elem *l)
> +{
> +	struct htab_elem_free_rcu *fr;
> +
> +	fr = kmalloc_obj(*fr, GFP_ATOMIC);

[Severity: High]
Does this dynamic memory allocation break the strict NMI safety guarantees of
preallocated BPF maps?

Preallocated maps are designed to avoid runtime dynamic allocations, ensuring
they are safe for NMI contexts like perf events. By unconditionally calling
htab_elem_defer_free() (which calls kmalloc) for elements with a task work
field, does this introduce a risk of deadlocks in SLUB?

Additionally, during map destruction of non-preallocated maps,
delete_all_elements() loops over all elements and frees them:

kernel/bpf/hashtab.c:delete_all_elements() {
    for (i = 0; i < htab->n_buckets; i++) {
        ...
        hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
            htab_elem_free(htab, l);
        }
    }
}

For a map with millions of elements, could this tight loop exhaust atomic
memory reserves and risk an OOM panic?

> +	if (WARN_ON_ONCE(!fr)) {
> +		/* Fallback: immediate recycle, small UAF risk */
> +		htab_elem_recycle(htab, l);
> +		return;
> +	}
> +	fr->htab = htab;
> +	fr->elem = l;
> +	call_rcu_tasks_trace(&fr->rcu, htab_elem_free_rcu_cb);
> +}

[ ... ]

> @@ -1118,6 +1180,11 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
>  	}
>  
>  	memcpy(l_new->key, key, key_size);
> +	/* Re-initialize special fields for recycled elements.  copy_map_value()
> +	 * skips btf_record fields, so a stale ERR_PTR(-EBUSY) left by
> +	 * bpf_task_work_cancel_and_free would persist and block new scheduling.
> +	 */
> +	check_and_init_map_value(&htab->map, htab_elem_value(l_new, key_size));

[Severity: Critical]
Are we at risk of out-of-bounds writes here as well for per-CPU maps?

As with the call in prealloc_lru_pop() above, if the map is a per-CPU
hash map, htab_elem_value() points to an 8-byte space.

If the map has special fields at a larger offset, check_and_init_map_value()
might corrupt slab memory. If the offset is zero, could it overwrite the
per-CPU pointer itself and cause a NULL pointer dereference in
pcpu_init_value()?

[Severity: High]
This is a pre-existing issue, but does this allocation path bypass the new
RCU tasks trace deferral for preallocated hash maps during element updates?

When updating an existing element in a preallocated map, alloc_htab_elem()
stashes the old element synchronously into the CPU local cache:

kernel/bpf/hashtab.c:alloc_htab_elem() {
    if (prealloc) {
        if (old_elem) {
            pl_new = this_cpu_ptr(htab->extra_elems);
            l_new = *pl_new;
            *pl_new = old_elem;
        }
    ...
}

Then, htab_map_update_elem() intentionally skips calling free_htab_elem()
for preallocated maps.

A subsequent update on the same CPU can immediately pop the old element from
extra_elems and overwrite it. Does this entirely bypass the RCU grace period
provided by htab_elem_defer_free(), leaving the original race condition
intact if a concurrent callback is executing?

>  	if (percpu) {
>  		if (prealloc) {
>  			pptr = htab_elem_get_ptr(l_new, key_size);

[ ... ]

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index c18f1e16edee4..07088ac1f0b2d 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c

[ ... ]

> @@ -4784,8 +4787,8 @@ void bpf_task_work_cancel_and_free(void *val)
>  	struct bpf_task_work_ctx *ctx;
>  	enum bpf_task_work_state state;
>  
> -	ctx = xchg(&twk->ctx, NULL);
> -	if (!ctx)
> +	ctx = xchg(&twk->ctx, ERR_PTR(-EBUSY));

[Severity: High]
Could this change permanently brick task work fields for map types that
update elements in place?

The patch relies on the newly added check_and_init_map_value() calls in hash
map element allocators to clear this ERR_PTR(-EBUSY) sentinel when an element
is reused.

However, map types that update elements in place, such as array maps, never
allocate a new element. They use copy_map_value() followed by
bpf_obj_cancel_fields():

kernel/bpf/arraymap.c:array_map_update_elem() {
    ...
    copy_map_value(map, val, value);
    bpf_obj_cancel_fields(map, val);
    ...
}

Since check_and_init_map_value() is never called for these in-place updates,
twk->ctx appears permanently left in the ERR_PTR(-EBUSY) state. Would any
future bpf_task_work_schedule() calls on that element permanently fail?

> +	if (IS_ERR_OR_NULL(ctx))
>  		return;
>  
>  	state = xchg(&ctx->state, BPF_TW_FREED);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260625-bpf-htab-task-work-uaf-v1-1-31e711ccb159@cherr.cc?part=1

next prev parent reply	other threads:[~2026-06-24 19:31 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-24 19:11 [PATCH] bpf: Fix use-after-free in htab elem with bpf_task_work Shengzhuo Wei
2026-06-24 19:31 ` sashiko-bot [this message]
2026-06-24 20:54   ` Shengzhuo Wei
2026-06-24 21:07     ` Alexei Starovoitov
2026-06-24 20:22 ` bot+bpf-ci

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260624193149.3EE921F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=me@cherr.cc \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.