BPF List
From: sashiko-bot@kernel.org
To: "Justin Suess" <utilityemal77@gmail.com>
Cc: bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi
Date: Wed, 29 Apr 2026 02:29:55 +0000	[thread overview]
Message-ID: <20260429022955.D0D0AC2BCB7@smtp.kernel.org> (raw)
In-Reply-To: <20260428201422.1518903-4-utilityemal77@gmail.com>

Thank you for your contribution! Sashiko AI review found 10 potential issues to consider:
- [Critical] Use-After-Free and lockless list corruption due to independent irq_work execution.
- [Critical] Kernel pointer leak and arbitrary code execution via mmapable array maps.
- [Critical] Heap out-of-bounds write for dynamically allocated BPF local objects.
- [Critical] Use-after-free during map teardown due to premature element destruction.
- [High] Heap out-of-bounds access in local storage maps due to alignment mismatch.
- [High] List corruption and double free in `btf_record_dup` error path.
- [High] Out-of-bounds write during map creation due to suppressed `btf_parse_fields` errors.
- [Medium] Preallocated map elements are permanently leaked during NMI updates.
- [Medium] Use of uninitialized synchronization primitives in dummy BTF record cleanup paths.
- [Low] The `free_list` member in `struct btf_field_kptr` is unused dead code.
--

commit 94e8c46e991deb52c20e3b7e90fd68827ad56fd4
Author: Justin Suess <utilityemal77@gmail.com>

bpf: Fix deadlock in kptr dtor in nmi

This commit defers freeing of referenced kptrs using an irq_work queue
to prevent deadlocks in BPF tracing programs running under NMI. It adds
auxiliary storage to map elements to queue kptrs for deletion outside
the NMI context.

Link: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 715b6df9c403..037bdadbed96 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
[ ... ]
> @@ -234,6 +236,10 @@ struct btf_field_kptr {
>  	 * program-allocated, dtor is NULL,  and __bpf_obj_drop_impl is used
>  	 */
>  	btf_dtor_kfunc_t dtor;
> +	struct irq_work irq_work;
> +	struct llist_head irq_work_items;
> +	struct llist_head free_list;

Is the free_list member used anywhere?

It appears to be initialized and cleared in the patch, but no elements are
ever added or removed from it, leaving it as dead code.

> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 5e25e0353509..919861b553c2 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
[ ... ]
> @@ -74,8 +75,12 @@ int array_map_alloc_check(union bpf_attr *attr)
>  	/* avoid overflow on round_up(map->value_size) */
>  	if (attr->value_size > INT_MAX)
>  		return -E2BIG;
> +	aux_size = bpf_map_attr_ref_kptr_aux_size(attr);
> +	if (aux_size < 0)
> +		return aux_size;
>  	/* percpu map value size is bound by PCPU_MIN_UNIT_SIZE */
> -	if (percpu && round_up(attr->value_size, 8) > PCPU_MIN_UNIT_SIZE)
> +	if (percpu &&
> +	    round_up(attr->value_size, 8) + aux_size > PCPU_MIN_UNIT_SIZE)
>  		return -E2BIG;

Does this expose kernel pointers to userspace if the map is created with
the BPF_F_MMAPABLE flag?

Since aux_size is added to the element size of array maps without restricting
BPF_F_MMAPABLE, the internal struct bpf_kptr_dtor_aux could be mapped
directly into userspace.

Could this allow a malicious userspace application to overwrite the
llist_node.next pointer and potentially alter kernel control flow when
the irq_work processes the list?

> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 6fc6a4b672b5..8b0be9612f20 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
[ ... ]
> @@ -800,14 +801,20 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
>  		raw_res_spin_lock_init(&smap->buckets[i].lock);
>  	}
>  
> -	smap->elem_size = offsetof(struct bpf_local_storage_elem,
> -				   sdata.data[attr->value_size]);
> +	smap->elem_size = offsetof(
> +		struct bpf_local_storage_elem,
> +		sdata.data[attr->value_size]);
> +	err = bpf_map_attr_ref_kptr_aux_size(attr);
> +	if (err < 0)
> +		goto free_buckets;
> +	smap->elem_size += err;

Could this cause a heap out-of-bounds access if attr->value_size is not a
multiple of 8?

The offset calculation here does not round up attr->value_size to an 8-byte
boundary, but btf_parse_fields() computes aux_off as round_up(value_size, 8).
With a value_size that is not a multiple of 8, the auxiliary structure starts
up to 7 bytes past where elem_size accounts for it, so its tail lands beyond
the end of the allocation.

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 3dd9b4924ae4..c3ad371948c3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
[ ... ]
> +static void htab_nmi_free_irq_work(struct irq_work *work)
> +{
> +	struct bpf_htab *htab =
> +		container_of(work, struct bpf_htab, nmi_free_irq_work);
> +	struct llist_node *node, *tmp, *list;
> +
> +	list = llist_del_all(&htab->nmi_free_elems);
> +	if (!list)
> +		return;
> +
> +	list = llist_reverse_order(list);
> +	llist_for_each_safe(node, tmp, list) {
> +		struct htab_elem *elem;
> +
> +		elem = container_of((struct pcpu_freelist_node *)node,
> +				    struct htab_elem, fnode);
> +		if (htab_is_prealloc(htab)) {
> +			if (htab_is_lru(htab))
> +				bpf_lru_push_free(&htab->lru, &elem->lru_node);
> +			else
> +				pcpu_freelist_push(&htab->freelist,
> +						   &elem->fnode);
> +		} else {
> +			htab_elem_free_nofields(htab, elem);
> +		}
>  	}
>  }

Could this lead to a use-after-free or lockless list corruption?

Element destruction in NMI is split into two independent irq_work items:
field->kptr.irq_work and htab->nmi_free_irq_work. Since they are scheduled
independently, htab->nmi_free_irq_work might execute first and return the
element memory to the allocator.

When field->kptr.irq_work subsequently runs, it iterates over aux->node
which is embedded inside the now-freed element memory.

> @@ -1082,12 +1189,23 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
>  
>  	if (prealloc) {
>  		if (old_elem) {
> -			/* if we're updating the existing element,
> -			 * use per-cpu extra elems to avoid freelist_pop/push
> -			 */
> -			pl_new = this_cpu_ptr(htab->extra_elems);
> -			l_new = *pl_new;
> -			*pl_new = old_elem;
> +			if (in_nmi() && htab->kptr_ref_aux_size) {
> +				struct pcpu_freelist_node *l;
> +
> +				l = __pcpu_freelist_pop(&htab->freelist);
> +				if (!l)
> +					return ERR_PTR(-E2BIG);
> +				l_new = container_of(l, struct htab_elem,
> +						     fnode);
> +			} else {

Does this leak preallocated map elements during NMI updates?

When updating an existing element from an NMI context, the code allocates
a new element from the freelist but skips stashing the old_elem in
htab->extra_elems. The htab_map_update_elem() logic assumes the old element
was stashed and therefore skips freeing it, so each such update could
permanently leak one preallocated element from the map.

> @@ -1634,6 +1769,7 @@ static void htab_map_free(struct bpf_map *map)
>  	 * underneath and is responsible for waiting for callbacks to finish
>  	 * during bpf_mem_alloc_destroy().
>  	 */
> +	irq_work_sync(&htab->nmi_free_irq_work);
>  	if (!htab_is_prealloc(htab)) {
>  		delete_all_elements(htab);
>  	} else {

Can this cause a use-after-free during map teardown?

This function synchronizes htab->nmi_free_irq_work and frees all map elements
via delete_all_elements() or prealloc_destroy(), but btf_record_free() is
called later in bpf_map_free_record().

If an element was deleted in NMI shortly before map destruction, its aux->node
might still be linked in field->kptr.irq_work_items. Flushing the kptr
deferred destruction after the elements are already freed could dereference
freed memory.

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 2caafce00f24..f26c8ed81690 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
[ ... ]
> @@ -751,6 +832,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
>  		}
>  		new_rec->cnt++;
>  	}
> +	bpf_kptr_record_init(new_rec);
>  	return new_rec;
>  free:
>  	btf_record_free(new_rec);

Could this error path cause list corruption and double frees?

Since kmemdup() creates a bitwise copy of the btf_record, the copied
irq_work_items llist_head points to the exact same nodes as the original
record. If an error occurs (such as try_module_get() failing), calling
btf_record_free(new_rec) will flush these uninitialized bitwise-copied lists,
potentially executing destructors prematurely for the original map's elements.

> @@ -792,14 +874,79 @@ bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *r
[ ... ]
> +int bpf_map_attr_ref_kptr_aux_size(const union bpf_attr *attr)
> +{
[ ... ]
> +	rec = btf_parse_fields(btf, value_type, BPF_KPTR_REF, attr->value_size);
> +	if (IS_ERR(rec)) {
> +		aux_size = 0;
> +		goto out;
> +	}
> +	aux_size = rec ? rec->kptr_ref_aux_size : 0;
> +	btf_record_free(rec);
> +out:
> +	btf_put(btf);
> +	return aux_size;
> +}

Could suppressing the btf_parse_fields() error here cause out-of-bounds
writes later?

If btf_parse_fields() fails, aux_size is returned as 0 and the map allocation
proceeds without auxiliary padding. If map_check_btf() later succeeds,
element initialization via bpf_kptr_aux_init_value() might write the 16-byte
structure past the allocated size of the element.

Also, could btf_record_free(rec) use uninitialized synchronization primitives?

Since bpf_kptr_record_init() is not called on the dummy record parsed here,
the irq_work structures remain zeroed. When btf_record_free() calls
bpf_kptr_record_flush(), it invokes irq_work_sync() on this uninitialized
memory, which might be unsafe on PREEMPT_RT kernels.

> @@ -857,14 +1003,35 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  			if (!xchgd_field)
>  				break;
>  
> -			if (!btf_is_kernel(field->kptr.btf)) {
> -				pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
> -									   field->kptr.btf_id);
> -				__bpf_obj_drop_impl(xchgd_field, pointee_struct_meta ?
> -								 pointee_struct_meta->record : NULL,
> -								 fields[i].type == BPF_KPTR_PERCPU);
> +			if (field->type == BPF_KPTR_REF && in_nmi()) {
> +				struct bpf_kptr_dtor_aux *aux;
> +
> +				aux = bpf_kptr_ref_aux(field, obj);
> +				WARN_ON_ONCE(READ_ONCE(aux->ptr));
> +				WRITE_ONCE(aux->ptr, xchgd_field);
> +				if (llist_add(&aux->node,
> +					      &field->kptr.irq_work_items))
> +					irq_work_queue(&field->kptr.irq_work);
> +				break;
> +			}

Could this cause an out-of-bounds write for dynamically allocated BPF local
objects?

While map allocators were updated to include kptr_ref_aux_size, dynamically
allocated objects created via bpf_obj_new() might still only allocate the
base BTF struct size.

When dropping such an object in NMI context, bpf_kptr_ref_aux() might
calculate an offset past the allocated bounds, writing the auxiliary data
into adjacent heap memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260428201422.1518903-1-utilityemal77@gmail.com?part=3

Thread overview: 11+ messages
2026-04-28 20:14 [PATCH bpf-next 0/4] bpf: Fix NMI deadlock in referenced kptr destructors Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 1/4] bpf: Limit fields used in btf_record_equal comparisons Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 2/4] bpf: Use rcu_work in BTF teardown Justin Suess
2026-04-29  1:49   ` sashiko-bot
2026-04-28 20:14 ` [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-04-29  2:29   ` sashiko-bot [this message]
2026-04-29  9:37   ` Alexei Starovoitov
2026-04-29 16:21     ` Justin Suess
2026-05-02 14:33       ` Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 4/4] selftests/bpf: Add kptr nmi deadlock reproducer Justin Suess
2026-04-29  3:39   ` sashiko-bot
