From: sashiko-bot@kernel.org
To: "Justin Suess" <utilityemal77@gmail.com>
Cc: bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi
Date: Wed, 29 Apr 2026 02:29:55 +0000
Message-ID: <20260429022955.D0D0AC2BCB7@smtp.kernel.org>
In-Reply-To: <20260428201422.1518903-4-utilityemal77@gmail.com>

Thank you for your contribution! Sashiko AI review found 10 potential issues to consider:
- [Critical] Use-After-Free and lockless list corruption due to independent irq_work execution.
- [Critical] Kernel pointer leak and arbitrary code execution via mmapable array maps.
- [Critical] Heap out-of-bounds write for dynamically allocated BPF local objects.
- [Critical] Use-after-free during map teardown due to premature element destruction.
- [High] Heap out-of-bounds access in local storage maps due to alignment mismatch.
- [High] List corruption and double free in `btf_record_dup` error path.
- [High] Out-of-bounds write during map creation due to suppressed `btf_parse_fields` errors.
- [Medium] Preallocated map elements are permanently leaked during NMI updates.
- [Medium] Use of uninitialized synchronization primitives in dummy BTF record cleanup paths.
- [Low] The `free_list` member in `struct btf_field_kptr` is unused dead code.
--
commit 94e8c46e991deb52c20e3b7e90fd68827ad56fd4
Author: Justin Suess <utilityemal77@gmail.com>
bpf: Fix deadlock in kptr dtor in nmi
This commit defers freeing of referenced kptrs using an irq_work queue
to prevent deadlocks in BPF tracing programs running under NMI. It adds
auxiliary storage to map elements to queue kptrs for deletion outside
the NMI context.
Link: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 715b6df9c403..037bdadbed96 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
[ ... ]
> @@ -234,6 +236,10 @@ struct btf_field_kptr {
> * program-allocated, dtor is NULL, and __bpf_obj_drop_impl is used
> */
> btf_dtor_kfunc_t dtor;
> + struct irq_work irq_work;
> + struct llist_head irq_work_items;
> + struct llist_head free_list;
Is the free_list member used anywhere?
It appears to be initialized and cleared in the patch, but no elements are
ever added or removed from it, leaving it as dead code.
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 5e25e0353509..919861b553c2 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
[ ... ]
> @@ -74,8 +75,12 @@ int array_map_alloc_check(union bpf_attr *attr)
> /* avoid overflow on round_up(map->value_size) */
> if (attr->value_size > INT_MAX)
> return -E2BIG;
> + aux_size = bpf_map_attr_ref_kptr_aux_size(attr);
> + if (aux_size < 0)
> + return aux_size;
> /* percpu map value size is bound by PCPU_MIN_UNIT_SIZE */
> - if (percpu && round_up(attr->value_size, 8) > PCPU_MIN_UNIT_SIZE)
> + if (percpu &&
> + round_up(attr->value_size, 8) + aux_size > PCPU_MIN_UNIT_SIZE)
> return -E2BIG;
Does this expose kernel pointers to userspace if the map is created with
the BPF_F_MMAPABLE flag?
Since aux_size is added to the element size of array maps without restricting
BPF_F_MMAPABLE, the internal struct bpf_kptr_dtor_aux could be mapped
directly into userspace.
Could this allow a malicious userspace application to overwrite the
llist_node.next pointer and potentially alter kernel control flow when
the irq_work processes the list?
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 6fc6a4b672b5..8b0be9612f20 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
[ ... ]
> @@ -800,14 +801,20 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
> raw_res_spin_lock_init(&smap->buckets[i].lock);
> }
>
> - smap->elem_size = offsetof(struct bpf_local_storage_elem,
> - sdata.data[attr->value_size]);
> + smap->elem_size = offsetof(
> + struct bpf_local_storage_elem,
> + sdata.data[attr->value_size]);
> + err = bpf_map_attr_ref_kptr_aux_size(attr);
> + if (err < 0)
> + goto free_buckets;
> + smap->elem_size += err;
Could this cause a heap out-of-bounds access if attr->value_size is not a
multiple of 8?
The offset calculation here does not round up attr->value_size to an 8-byte
boundary. However, btf_parse_fields() computes aux_off as
round_up(value_size, 8). This mismatch might shift the auxiliary structure
into padding that was never allocated.
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 3dd9b4924ae4..c3ad371948c3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
[ ... ]
> +static void htab_nmi_free_irq_work(struct irq_work *work)
> +{
> + struct bpf_htab *htab =
> + container_of(work, struct bpf_htab, nmi_free_irq_work);
> + struct llist_node *node, *tmp, *list;
> +
> + list = llist_del_all(&htab->nmi_free_elems);
> + if (!list)
> + return;
> +
> + list = llist_reverse_order(list);
> + llist_for_each_safe(node, tmp, list) {
> + struct htab_elem *elem;
> +
> + elem = container_of((struct pcpu_freelist_node *)node,
> + struct htab_elem, fnode);
> + if (htab_is_prealloc(htab)) {
> + if (htab_is_lru(htab))
> + bpf_lru_push_free(&htab->lru, &elem->lru_node);
> + else
> + pcpu_freelist_push(&htab->freelist,
> + &elem->fnode);
> + } else {
> + htab_elem_free_nofields(htab, elem);
> + }
> }
> }
Could this lead to a use-after-free or lockless list corruption?
Element destruction in NMI is split into two independent irq_work items:
field->kptr.irq_work and htab->nmi_free_irq_work. Since they are scheduled
independently, htab->nmi_free_irq_work might execute first and return the
element memory to the allocator.
When field->kptr.irq_work subsequently runs, it iterates over aux->node
which is embedded inside the now-freed element memory.
> @@ -1082,12 +1189,23 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
>
> if (prealloc) {
> if (old_elem) {
> - /* if we're updating the existing element,
> - * use per-cpu extra elems to avoid freelist_pop/push
> - */
> - pl_new = this_cpu_ptr(htab->extra_elems);
> - l_new = *pl_new;
> - *pl_new = old_elem;
> + if (in_nmi() && htab->kptr_ref_aux_size) {
> + struct pcpu_freelist_node *l;
> +
> + l = __pcpu_freelist_pop(&htab->freelist);
> + if (!l)
> + return ERR_PTR(-E2BIG);
> + l_new = container_of(l, struct htab_elem,
> + fnode);
> + } else {
Does this leak preallocated map elements during NMI updates?
When updating an existing element from NMI context, the code pops a
replacement from the freelist but skips stashing old_elem in
htab->extra_elems. The htab_map_update_elem() logic assumes the old
element was stashed and deliberately skips freeing it, so the element is
permanently lost from the preallocated pool.
> @@ -1634,6 +1769,7 @@ static void htab_map_free(struct bpf_map *map)
> * underneath and is responsible for waiting for callbacks to finish
> * during bpf_mem_alloc_destroy().
> */
> + irq_work_sync(&htab->nmi_free_irq_work);
> if (!htab_is_prealloc(htab)) {
> delete_all_elements(htab);
> } else {
Can this cause a use-after-free during map teardown?
This function synchronizes htab->nmi_free_irq_work and frees all map elements
via delete_all_elements() or prealloc_destroy(), but btf_record_free() is
called later in bpf_map_free_record().
If an element was deleted in NMI context shortly before map destruction, its
aux->node might still be linked in field->kptr.irq_work_items. Flushing the
deferred kptr destruction after the elements have already been freed could
dereference freed memory.
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 2caafce00f24..f26c8ed81690 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
[ ... ]
> @@ -751,6 +832,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
> }
> new_rec->cnt++;
> }
> + bpf_kptr_record_init(new_rec);
> return new_rec;
> free:
> btf_record_free(new_rec);
Could this error path cause list corruption and double frees?
Since kmemdup() creates a bitwise copy of the btf_record, the copied
irq_work_items llist_head points to the exact same nodes as the original
record. If an error occurs (such as try_module_get() failing), calling
btf_record_free(new_rec) will flush these uninitialized bitwise-copied lists,
potentially executing destructors prematurely for the original map's elements.
> @@ -792,14 +874,79 @@ bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *r
[ ... ]
> +int bpf_map_attr_ref_kptr_aux_size(const union bpf_attr *attr)
> +{
[ ... ]
> + rec = btf_parse_fields(btf, value_type, BPF_KPTR_REF, attr->value_size);
> + if (IS_ERR(rec)) {
> + aux_size = 0;
> + goto out;
> + }
> + aux_size = rec ? rec->kptr_ref_aux_size : 0;
> + btf_record_free(rec);
> +out:
> + btf_put(btf);
> + return aux_size;
> +}
Could suppressing the btf_parse_fields() error here cause out-of-bounds
writes later?
If btf_parse_fields() fails, aux_size is returned as 0 and the map allocation
proceeds without auxiliary padding. If map_check_btf() later succeeds,
element initialization via bpf_kptr_aux_init_value() might write the 16-byte
structure past the allocated size of the element.
Also, could btf_record_free(rec) use uninitialized synchronization primitives?
Since bpf_kptr_record_init() is not called on the dummy record parsed here,
the irq_work structures remain zeroed. When btf_record_free() calls
bpf_kptr_record_flush(), it invokes irq_work_sync() on this uninitialized
memory, which might be unsafe on PREEMPT_RT kernels.
> @@ -857,14 +1003,35 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> if (!xchgd_field)
> break;
>
> - if (!btf_is_kernel(field->kptr.btf)) {
> - pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
> - field->kptr.btf_id);
> - __bpf_obj_drop_impl(xchgd_field, pointee_struct_meta ?
> - pointee_struct_meta->record : NULL,
> - fields[i].type == BPF_KPTR_PERCPU);
> + if (field->type == BPF_KPTR_REF && in_nmi()) {
> + struct bpf_kptr_dtor_aux *aux;
> +
> + aux = bpf_kptr_ref_aux(field, obj);
> + WARN_ON_ONCE(READ_ONCE(aux->ptr));
> + WRITE_ONCE(aux->ptr, xchgd_field);
> + if (llist_add(&aux->node,
> + &field->kptr.irq_work_items))
> + irq_work_queue(&field->kptr.irq_work);
> + break;
> + }
Could this cause an out-of-bounds write for dynamically allocated BPF local
objects?
While map allocators were updated to include kptr_ref_aux_size, dynamically
allocated objects created via bpf_obj_new() might still only allocate the
base BTF struct size.
When dropping such an object in NMI context, bpf_kptr_ref_aux() might
calculate an offset past the allocated bounds, writing the auxiliary data
into adjacent heap memory.
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260428201422.1518903-1-utilityemal77@gmail.com?part=3
Thread overview: 11+ messages
2026-04-28 20:14 [PATCH bpf-next 0/4] bpf: Fix NMI deadlock in referenced kptr destructors Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 1/4] bpf: Limit fields used in btf_record_equal comparisons Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 2/4] bpf: Use rcu_work in BTF teardown Justin Suess
2026-04-29 1:49 ` sashiko-bot
2026-04-28 20:14 ` [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-04-29 2:29 ` sashiko-bot [this message]
2026-04-29 9:37 ` Alexei Starovoitov
2026-04-29 16:21 ` Justin Suess
2026-05-02 14:33 ` Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 4/4] selftests/bpf: Add kptr nmi deadlock reproducer Justin Suess
2026-04-29 3:39 ` sashiko-bot