[PATCH bpf-next 0/1] bpf: Fix deadlock in freeing of special fields in NMI

BPF List
 help / color / mirror / Atom feed

* [PATCH bpf-next 0/1] bpf: Fix deadlock in freeing of special fields in NMI
@ 2026-05-19  1:14 Justin Suess
  2026-05-19  1:14 ` [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction " Justin Suess
  0 siblings, 1 reply; 11+ messages in thread
From: Justin Suess @ 2026-05-19  1:14 UTC (permalink / raw)
  To: ast, daniel, andrii, eddyz87, memxor
  Cc: martin.lau, song, yonghong.song, jolsa, bpf, mic, Justin Suess

Hello,

While following up on a Sashiko report [1], I found that referenced kptr
destructors can run from NMI context. One way to trigger this is from a
tracing program attached to tp_btf/nmi_handler while a map element is
being torn down.

This is a discards any approach from the previous series [2], thus starting
fresh with a new one.

This changes the existing eager freeing behavior of bpf_obj_free_fields to only
freeing certain fields in irqs_disabled contexts.

These fields not safe to free in irqs_disabled contexts like nmi include
BPF_KPTR_REF, BPF_KPTR_PERCPU, BPF_UPTR, BPF_LIST_HEAD, and BPF_RB_ROOT.

The freeing of the fields under those conditions will now be done by the bpf
memory allocator, which has a safe destructor mechanism that guarantees the
fields are freed in a non-irqs_disabled context.

Some changes were required to the hashtab map specifically. When the above listed
fields are present, the map is converted at initialization time to the non-prealloc
variant. Additionally, the in-place field updates used by LRU and percpu maps were
disabled when these special fields are present, to avoid leaking references and
overwriting those fields when we can't free them immediately.

The relevant discussion for this change is here [2]

Kind regards,
Justin Suess

[1] https://lore.kernel.org/bpf/20260421010536.17FB1C19425@smtp.kernel.org/
[2] https://lore.kernel.org/bpf/20260507175453.1140400-1-utilityemal77@gmail.com/
[3] https://lore.kernel.org/bpf/agCXEJKJh-JGMhjG@zenbox/

Justin Suess (1):
  bpf: fix deadlock in special field destruction in NMI

 kernel/bpf/hashtab.c | 93 +++++++++++++++++++++++++++++++++++++++-----
 kernel/bpf/syscall.c |  8 +++-
 2 files changed, 89 insertions(+), 12 deletions(-)

base-commit: 576482b55c19e7ec00e162a0fde4c4f1a95128c7
-- 
2.53.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19  1:14 [PATCH bpf-next 0/1] bpf: Fix deadlock in freeing of special fields in NMI Justin Suess
@ 2026-05-19  1:14 ` Justin Suess
  2026-05-19  1:51   ` sashiko-bot
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Justin Suess @ 2026-05-19  1:14 UTC (permalink / raw)
  To: ast, daniel, andrii, eddyz87, memxor
  Cc: martin.lau, song, yonghong.song, jolsa, bpf, mic, Justin Suess,
	Alexei Starovoitov, Mykyta Yatsenko

Relax bpf_obj_free_fields to only cancel/free async work in irq_disabled
contexts and defer unsafe free operations such as kptr dtors, list head
and rb root destruction to a later non-irq_disabled call driven by the
allocator or map free.

Detect fields that are unsafe to free under irqs_disabled at htab
creation time. When creating a hashtab with these fields, forcibly set
BPF_F_NO_PREALLOC and use the bpf memory allocator instead.

This must happen after the fields are checked, so convert the map to a
non-prealloc one if the special fields are present, but before the map
has been fully initialized.

Enable this fix for regular, percpu, and lru hashtabs.

This fixes a deadlock caused by updating/deleting map elements in an
NMI context by shifting the responsibility of freeing fields that are
unsafe to free in contexts like NMI to the memory allocators existing
mechanism for handling this.

Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
Reported-by: Justin Suess <utilityemal77@gmail.com>
Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Link: https://lore.kernel.org/bpf/DIG0ONMVOP0L.3QFYUPWFSKWI4@gmail.com/
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
---
 kernel/bpf/hashtab.c | 93 +++++++++++++++++++++++++++++++++++++++-----
 kernel/bpf/syscall.c |  8 +++-
 2 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 3dd9b4924ae4..0db1dc8ae0be 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -130,6 +130,16 @@ struct htab_btf_record {
 	u32 key_size;
 };
 
+static inline bool htab_has_nmi_special_fields(const struct bpf_htab *htab)
+{
+	const struct btf_record *rec = htab->map.record;
+
+	if (IS_ERR_OR_NULL(rec))
+		return false;
+	return rec->field_mask & (BPF_KPTR_REF | BPF_KPTR_PERCPU | BPF_UPTR |
+				  BPF_LIST_HEAD | BPF_RB_ROOT);
+}
+
 static inline bool htab_is_prealloc(const struct bpf_htab *htab)
 {
 	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
@@ -522,13 +532,53 @@ static int htab_set_dtor(struct bpf_htab *htab, void (*dtor)(void *, void *))
 	return 0;
 }
 
+static int htab_convert_to_non_prealloc(struct bpf_htab *htab)
+{
+	bool percpu = htab_is_percpu(htab);
+	int err;
+
+	htab_free_prealloced_fields(htab);
+	free_percpu(htab->extra_elems);
+	htab->extra_elems = NULL;
+	prealloc_destroy(htab);
+	htab->map.map_flags |= BPF_F_NO_PREALLOC;
+
+	err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
+	if (err)
+		return err;
+	if (percpu) {
+		err = bpf_mem_alloc_init(&htab->pcpu_ma,
+					 round_up(htab->map.value_size, 8),
+					 true);
+		if (err) {
+			bpf_mem_alloc_destroy(&htab->ma);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
 static int htab_map_check_btf(struct bpf_map *map, const struct btf *btf,
 			      const struct btf_type *key_type, const struct btf_type *value_type)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	int err;
+
+	if (!htab_has_nmi_special_fields(htab)) {
+		if (htab_is_prealloc(htab))
+			return 0;
+	} else {
+		if (htab_is_lru(htab))
+			return 0;
+
+		if (htab_is_prealloc(htab)) {
+			err = htab_convert_to_non_prealloc(htab);
+			if (err)
+				return err;
+		}
+	}
 
-	if (htab_is_prealloc(htab))
-		return 0;
 	/*
 	 * We must set the dtor using this callback, as map's BTF record is not
 	 * populated in htab_map_alloc(), so it will always appear as NULL.
@@ -1355,7 +1405,7 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
 					  bool percpu, bool onallcpus)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
-	struct htab_elem *l_new, *l_old;
+	struct htab_elem *l_new = NULL, *l_old;
 	struct hlist_nulls_head *head;
 	void *old_map_ptr = NULL;
 	unsigned long flags;
@@ -1387,8 +1437,17 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
 		goto err;
 
 	if (l_old) {
-		/* Update value in-place */
-		if (percpu) {
+		if (htab_has_nmi_special_fields(htab)) {
+			l_new = alloc_htab_elem(htab, key, value, key_size,
+						hash, percpu, onallcpus,
+						l_old, map_flags);
+			if (IS_ERR(l_new)) {
+				ret = PTR_ERR(l_new);
+				goto err;
+			}
+			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+			hlist_nulls_del_rcu(&l_old->hash_node);
+		} else if (percpu) {
 			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
 					value, onallcpus, map_flags);
 		} else {
@@ -1408,6 +1467,8 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
 	}
 err:
 	htab_unlock_bucket(b, flags);
+	if (l_old && htab_has_nmi_special_fields(htab) && !ret)
+		free_htab_elem(htab, l_old);
 	if (old_map_ptr)
 		map->ops->map_fd_put_ptr(map, old_map_ptr, true);
 	return ret;
@@ -1443,7 +1504,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 	 * to remove older elem from htab and this removal
 	 * operation will need a bucket lock.
 	 */
-	if (map_flags != BPF_EXIST) {
+	if (map_flags != BPF_EXIST || htab_has_nmi_special_fields(htab)) {
 		l_new = prealloc_lru_pop(htab, key, hash);
 		if (!l_new)
 			return -ENOMEM;
@@ -1460,11 +1521,21 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 		goto err;
 
 	if (l_old) {
-		bpf_lru_node_set_ref(&l_old->lru_node);
+		if (htab_has_nmi_special_fields(htab)) {
+			pcpu_init_value(htab,
+					htab_elem_get_ptr(l_new, key_size),
+					value, onallcpus, map_flags);
+			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+			hlist_nulls_del_rcu(&l_old->hash_node);
+			bpf_lru_node_set_ref(&l_new->lru_node);
+			l_new = NULL;
+		} else {
+			bpf_lru_node_set_ref(&l_old->lru_node);
 
-		/* per-cpu hash map can update value in-place */
-		pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
-				value, onallcpus, map_flags);
+			/* per-cpu hash map can update value in-place */
+			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
+					value, onallcpus, map_flags);
+		}
 	} else {
 		pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
 				value, onallcpus, map_flags);
@@ -1479,6 +1550,8 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 		bpf_map_dec_elem_count(&htab->map);
 		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
 	}
+	if (l_old && !ret && htab_has_nmi_special_fields(htab))
+		htab_lru_push_free(htab, l_old);
 	return ret;
 }
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6600e126fbfb..1f52453d5a2f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -839,6 +839,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 			break;
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+			if (irqs_disabled())
+				break;
 			xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
 			if (!xchgd_field)
 				break;
@@ -854,16 +856,18 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 			}
 			break;
 		case BPF_UPTR:
+			if (irqs_disabled())
+				break;
 			/* The caller ensured that no one is using the uptr */
 			unpin_uptr_kaddr(*(void **)field_ptr);
 			break;
 		case BPF_LIST_HEAD:
-			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
+			if (irqs_disabled() || WARN_ON_ONCE(rec->spin_lock_off < 0))
 				continue;
 			bpf_list_head_free(field, field_ptr, obj + rec->spin_lock_off);
 			break;
 		case BPF_RB_ROOT:
-			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
+			if (irqs_disabled() || WARN_ON_ONCE(rec->spin_lock_off < 0))
 				continue;
 			bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
 			break;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19  1:14 ` [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction " Justin Suess
@ 2026-05-19  1:51   ` sashiko-bot
  2026-05-19 14:42     ` Justin Suess
  2026-05-19 10:25   ` kernel test robot
  2026-05-19 12:31   ` Mykyta Yatsenko
  2 siblings, 1 reply; 11+ messages in thread
From: sashiko-bot @ 2026-05-19  1:51 UTC (permalink / raw)
  To: Justin Suess; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] LRU maps containing NMI-unsafe fields (like BPF_KPTR_REF or BPF_LIST_HEAD) will permanently leak these fields and corrupt map state when updated or deleted in an IRQ-disabled context.
- [High] Modifying the generic `bpf_obj_free_fields()` to silently skip destruction in IRQ-disabled contexts causes permanent memory leaks for callers lacking deferred destructors (like `bpf_obj_drop` and Array maps).
- [Medium] Forcibly mutating the user-requested `map_flags` to include `BPF_F_NO_PREALLOC` silently breaks performance expectations and UAPI transparency.
--

commit 3fbec1aa4107fa1f73249893a4ed21fd7d2213fa
Author: Justin Suess <utilityemal77@gmail.com>

bpf: fix deadlock in special field destruction in NMI

This commit relaxes bpf_obj_free_fields() to skip async work cancellation
and unsafe free operations in IRQ-disabled contexts. It converts hash maps
with NMI-unsafe fields to non-preallocated maps using bpf_mem_alloc to
defer the cleanup. LRU maps and preallocated hash maps are adjusted to
support this mechanism.

Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
Link: https://lore.kernel.org/bpf/DIG0ONMVOP0L.3QFYUPWFSKWI4@gmail.com/

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 3dd9b4924ae4f..0db1dc8ae0bec 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
[ ... ]
> @@ -522,13 +532,53 @@ static int htab_set_dtor(struct bpf_htab *htab, void (*dtor)(void *, void *))
>  	return 0;
>  }
>  
> +static int htab_convert_to_non_prealloc(struct bpf_htab *htab)
> +{
> +	bool percpu = htab_is_percpu(htab);
> +	int err;
> +
> +	htab_free_prealloced_fields(htab);
> +	free_percpu(htab->extra_elems);
> +	htab->extra_elems = NULL;
> +	prealloc_destroy(htab);
> +	htab->map.map_flags |= BPF_F_NO_PREALLOC;

Does mutating the map flags post-creation to include BPF_F_NO_PREALLOC alter
the memory model and user-visible flags without warning? If userspace queries
the map info via bpf_map_get_info_by_fd(), will it see flags it never
requested?

Would it be better to explicitly reject this configuration and return -EINVAL
rather than secretly altering the underlying memory model?

[ ... ]
>  static int htab_map_check_btf(struct bpf_map *map, const struct btf *btf,
>  			      const struct btf_type *key_type, const struct btf_type *value_type)
>  {
>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +	int err;
> +
> +	if (!htab_has_nmi_special_fields(htab)) {
> +		if (htab_is_prealloc(htab))
> +			return 0;
> +	} else {
> +		if (htab_is_lru(htab))
> +			return 0;

Since LRU maps are exempted from conversion to non-preallocated mode here,
they do not use bpf_mem_alloc and completely lack the deferred destructor
mechanism relied upon by this patch.

When an LRU element is replaced or deleted, htab_lru_push_free() is called
and synchronously invokes check_and_free_fields() which eventually calls
bpf_obj_free_fields().

If this executes in an IRQ-disabled context, will the new if
(irqs_disabled()) checks in bpf_obj_free_fields() silently skip freeing
the special fields?

Does this cause a permanent memory leak since the element is then
immediately pushed back to the LRU free list? When the element is
subsequently popped and reused, could it inherit stale pointers and
corrupt map state?

[ ... ]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 6600e126fbfb4..1f52453d5a2f1 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -839,6 +839,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  			break;
>  		case BPF_KPTR_REF:
>  		case BPF_KPTR_PERCPU:
> +			if (irqs_disabled())
> +				break;
>  			xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>  			if (!xchgd_field)
>  				break;

Because bpf_obj_free_fields() is a generic helper used throughout the BPF
subsystem, does silently skipping destruction in IRQ-disabled contexts
cause leaks for callers that lack deferred destructors?

For example, if a BPF program running in an NMI or kprobe drops an object
containing nested kptrs via __bpf_obj_drop_impl(), will the nested fields
be skipped and permanently leaked when the outer object is queued for
freeing via bpf_global_ma?

Does this similarly affect in-place updates to Array maps which rely on
bpf_obj_free_fields() to clear old special fields?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260519011450.1144935-2-utilityemal77@gmail.com?part=1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19  1:14 ` [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction " Justin Suess
  2026-05-19  1:51   ` sashiko-bot
@ 2026-05-19 10:25   ` kernel test robot
  2026-05-19 12:31   ` Mykyta Yatsenko
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2026-05-19 10:25 UTC (permalink / raw)
  To: Justin Suess, ast, daniel, andrii, eddyz87, memxor
  Cc: llvm, oe-kbuild-all, martin.lau, song, yonghong.song, jolsa, bpf,
	mic, Justin Suess, Alexei Starovoitov, Mykyta Yatsenko

Hi Justin,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 576482b55c19e7ec00e162a0fde4c4f1a95128c7]

url:    https://github.com/intel-lab-lkp/linux/commits/Justin-Suess/bpf-fix-deadlock-in-special-field-destruction-in-NMI/20260519-123808
base:   576482b55c19e7ec00e162a0fde4c4f1a95128c7
patch link:    https://lore.kernel.org/r/20260519011450.1144935-2-utilityemal77%40gmail.com
patch subject: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
config: s390-randconfig-001-20260519 (https://download.01.org/0day-ci/archive/20260519/202605191848.XKKrABbz-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 5bac06718f502014fade905512f1d26d578a18f3)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260519/202605191848.XKKrABbz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605191848.XKKrABbz-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/bpf/hashtab.c:1514:6: warning: variable 'l_old' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    1514 |         if (ret)
         |             ^~~
   kernel/bpf/hashtab.c:1553:6: note: uninitialized use occurs here
    1553 |         if (l_old && !ret && htab_has_nmi_special_fields(htab))
         |             ^~~~~
   kernel/bpf/hashtab.c:1514:2: note: remove the 'if' if its condition is always false
    1514 |         if (ret)
         |         ^~~~~~~~
    1515 |                 goto err_lock_bucket;
         |                 ~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/hashtab.c:1482:40: note: initialize the variable 'l_old' to silence this warning
    1482 |         struct htab_elem *l_new = NULL, *l_old;
         |                                               ^
         |                                                = NULL
   1 warning generated.


vim +1514 kernel/bpf/hashtab.c

824bd0ce6c7c43a Alexei Starovoitov      2016-02-01  1476  
d7ba4cc900bf1ee JP Kobryn               2023-03-22  1477  static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
8f8449384ec364b Martin KaFai Lau        2016-11-11  1478  					      void *value, u64 map_flags,
8f8449384ec364b Martin KaFai Lau        2016-11-11  1479  					      bool onallcpus)
8f8449384ec364b Martin KaFai Lau        2016-11-11  1480  {
8f8449384ec364b Martin KaFai Lau        2016-11-11  1481  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1482  	struct htab_elem *l_new = NULL, *l_old;
4fe8435909fddc9 Alexei Starovoitov      2017-03-07  1483  	struct hlist_nulls_head *head;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1484  	unsigned long flags;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1485  	struct bucket *b;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1486  	u32 key_size, hash;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1487  	int ret;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1488  
c6936161fd55d0c Leon Hwang              2026-01-07  1489  	ret = htab_map_check_update_flags(onallcpus, map_flags);
c6936161fd55d0c Leon Hwang              2026-01-07  1490  	if (unlikely(ret))
c6936161fd55d0c Leon Hwang              2026-01-07  1491  		return ret;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1492  
48a97ffc6c82664 Andrii Nakryiko         2025-10-14  1493  	WARN_ON_ONCE(!bpf_rcu_lock_held());
8f8449384ec364b Martin KaFai Lau        2016-11-11  1494  
8f8449384ec364b Martin KaFai Lau        2016-11-11  1495  	key_size = map->key_size;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1496  
c0203475765f827 Daniel Borkmann         2018-08-22  1497  	hash = htab_map_hash(key, key_size, htab->hashrnd);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1498  
8f8449384ec364b Martin KaFai Lau        2016-11-11  1499  	b = __select_bucket(htab, hash);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1500  	head = &b->head;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1501  
8f8449384ec364b Martin KaFai Lau        2016-11-11  1502  	/* For LRU, we need to alloc before taking bucket's
8f8449384ec364b Martin KaFai Lau        2016-11-11  1503  	 * spinlock because LRU's elem alloc may need
8f8449384ec364b Martin KaFai Lau        2016-11-11  1504  	 * to remove older elem from htab and this removal
8f8449384ec364b Martin KaFai Lau        2016-11-11  1505  	 * operation will need a bucket lock.
8f8449384ec364b Martin KaFai Lau        2016-11-11  1506  	 */
d05cda57f3acd92 Justin Suess            2026-05-18  1507  	if (map_flags != BPF_EXIST || htab_has_nmi_special_fields(htab)) {
8f8449384ec364b Martin KaFai Lau        2016-11-11  1508  		l_new = prealloc_lru_pop(htab, key, hash);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1509  		if (!l_new)
8f8449384ec364b Martin KaFai Lau        2016-11-11  1510  			return -ENOMEM;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1511  	}
8f8449384ec364b Martin KaFai Lau        2016-11-11  1512  
4fa8d68aa53e6d7 Kumar Kartikeya Dwivedi 2025-03-15  1513  	ret = htab_lock_bucket(b, &flags);
20b6cc34ea74b6a Song Liu                2020-10-29 @1514  	if (ret)
b34ffb0c6d23583 Anton Protopopov        2023-05-22  1515  		goto err_lock_bucket;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1516  
8f8449384ec364b Martin KaFai Lau        2016-11-11  1517  	l_old = lookup_elem_raw(head, hash, key, key_size);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1518  
8f8449384ec364b Martin KaFai Lau        2016-11-11  1519  	ret = check_flags(htab, l_old, map_flags);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1520  	if (ret)
8f8449384ec364b Martin KaFai Lau        2016-11-11  1521  		goto err;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1522  
8f8449384ec364b Martin KaFai Lau        2016-11-11  1523  	if (l_old) {
d05cda57f3acd92 Justin Suess            2026-05-18  1524  		if (htab_has_nmi_special_fields(htab)) {
d05cda57f3acd92 Justin Suess            2026-05-18  1525  			pcpu_init_value(htab,
d05cda57f3acd92 Justin Suess            2026-05-18  1526  					htab_elem_get_ptr(l_new, key_size),
d05cda57f3acd92 Justin Suess            2026-05-18  1527  					value, onallcpus, map_flags);
d05cda57f3acd92 Justin Suess            2026-05-18  1528  			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
d05cda57f3acd92 Justin Suess            2026-05-18  1529  			hlist_nulls_del_rcu(&l_old->hash_node);
d05cda57f3acd92 Justin Suess            2026-05-18  1530  			bpf_lru_node_set_ref(&l_new->lru_node);
d05cda57f3acd92 Justin Suess            2026-05-18  1531  			l_new = NULL;
d05cda57f3acd92 Justin Suess            2026-05-18  1532  		} else {
8f8449384ec364b Martin KaFai Lau        2016-11-11  1533  			bpf_lru_node_set_ref(&l_old->lru_node);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1534  
8f8449384ec364b Martin KaFai Lau        2016-11-11  1535  			/* per-cpu hash map can update value in-place */
8f8449384ec364b Martin KaFai Lau        2016-11-11  1536  			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
c6936161fd55d0c Leon Hwang              2026-01-07  1537  					value, onallcpus, map_flags);
d05cda57f3acd92 Justin Suess            2026-05-18  1538  		}
8f8449384ec364b Martin KaFai Lau        2016-11-11  1539  	} else {
d3bec0138bfbe58 David Verbeiren         2020-11-04  1540  		pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
c6936161fd55d0c Leon Hwang              2026-01-07  1541  				value, onallcpus, map_flags);
4fe8435909fddc9 Alexei Starovoitov      2017-03-07  1542  		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1543  		l_new = NULL;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1544  	}
8f8449384ec364b Martin KaFai Lau        2016-11-11  1545  	ret = 0;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1546  err:
4fa8d68aa53e6d7 Kumar Kartikeya Dwivedi 2025-03-15  1547  	htab_unlock_bucket(b, flags);
b34ffb0c6d23583 Anton Protopopov        2023-05-22  1548  err_lock_bucket:
9bc421b6be955c0 Anton Protopopov        2023-07-06  1549  	if (l_new) {
9bc421b6be955c0 Anton Protopopov        2023-07-06  1550  		bpf_map_dec_elem_count(&htab->map);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1551  		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
9bc421b6be955c0 Anton Protopopov        2023-07-06  1552  	}
d05cda57f3acd92 Justin Suess            2026-05-18  1553  	if (l_old && !ret && htab_has_nmi_special_fields(htab))
d05cda57f3acd92 Justin Suess            2026-05-18  1554  		htab_lru_push_free(htab, l_old);
8f8449384ec364b Martin KaFai Lau        2016-11-11  1555  	return ret;
8f8449384ec364b Martin KaFai Lau        2016-11-11  1556  }
8f8449384ec364b Martin KaFai Lau        2016-11-11  1557  

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19  1:14 ` [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction " Justin Suess
  2026-05-19  1:51   ` sashiko-bot
  2026-05-19 10:25   ` kernel test robot
@ 2026-05-19 12:31   ` Mykyta Yatsenko
  2026-05-19 13:22     ` Justin Suess
  2 siblings, 1 reply; 11+ messages in thread
From: Mykyta Yatsenko @ 2026-05-19 12:31 UTC (permalink / raw)
  To: Justin Suess, ast, daniel, andrii, eddyz87, memxor
  Cc: martin.lau, song, yonghong.song, jolsa, bpf, mic,
	Alexei Starovoitov



On 5/19/26 2:14 AM, Justin Suess wrote:
> Relax bpf_obj_free_fields to only cancel/free async work in irq_disabled
> contexts and defer unsafe free operations such as kptr dtors, list head
> and rb root destruction to a later non-irq_disabled call driven by the
> allocator or map free.
> 
> Detect fields that are unsafe to free under irqs_disabled at htab
> creation time. When creating a hashtab with these fields, forcibly set
> BPF_F_NO_PREALLOC and use the bpf memory allocator instead.
> 
> This must happen after the fields are checked, so convert the map to a
> non-prealloc one if the special fields are present, but before the map
> has been fully initialized.
> 
> Enable this fix for regular, percpu, and lru hashtabs.
> 
> This fixes a deadlock caused by updating/deleting map elements in an
> NMI context by shifting the responsibility of freeing fields that are
> unsafe to free in contexts like NMI to the memory allocators existing
> mechanism for handling this.
> 
> Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
> Reported-by: Justin Suess <utilityemal77@gmail.com>
> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
> Link: https://lore.kernel.org/bpf/DIG0ONMVOP0L.3QFYUPWFSKWI4@gmail.com/
> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> ---

This is a v3 of the patch, it would be nice to include it in the subject
(see how other people do that).

>  kernel/bpf/hashtab.c | 93 +++++++++++++++++++++++++++++++++++++++-----
>  kernel/bpf/syscall.c |  8 +++-
>  2 files changed, 89 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 3dd9b4924ae4..0db1dc8ae0be 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -130,6 +130,16 @@ struct htab_btf_record {
>  	u32 key_size;
>  };
>  
> +static inline bool htab_has_nmi_special_fields(const struct bpf_htab *htab)
> +{
> +	const struct btf_record *rec = htab->map.record;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return false;
> +	return rec->field_mask & (BPF_KPTR_REF | BPF_KPTR_PERCPU | BPF_UPTR |
> +				  BPF_LIST_HEAD | BPF_RB_ROOT);
> +}
> +
>  static inline bool htab_is_prealloc(const struct bpf_htab *htab)
>  {
>  	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
> @@ -522,13 +532,53 @@ static int htab_set_dtor(struct bpf_htab *htab, void (*dtor)(void *, void *))
>  	return 0;
>  }
>  
> +static int htab_convert_to_non_prealloc(struct bpf_htab *htab)

This conversion is not ideal: deallocating/deinitializeing then
constructing new map again is something we should try to avoid.

Is it possible to restructure map_create to parse btf and patch map_flags
before allocation/initialization?

Or what sashiko suggests: explicitly reject this configuration 
and return -EINVAL.

> +{
> +	bool percpu = htab_is_percpu(htab);
> +	int err;
> +
> +	htab_free_prealloced_fields(htab);
> +	free_percpu(htab->extra_elems);
> +	htab->extra_elems = NULL;
> +	prealloc_destroy(htab);
> +	htab->map.map_flags |= BPF_F_NO_PREALLOC;
> +
> +	err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
> +	if (err)
> +		return err;
> +	if (percpu) {
> +		err = bpf_mem_alloc_init(&htab->pcpu_ma,
> +					 round_up(htab->map.value_size, 8),
> +					 true);
> +		if (err) {
> +			bpf_mem_alloc_destroy(&htab->ma);
> +			return err;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int htab_map_check_btf(struct bpf_map *map, const struct btf *btf,
>  			      const struct btf_type *key_type, const struct btf_type *value_type)
>  {
>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +	int err;
> +
> +	if (!htab_has_nmi_special_fields(htab)) {
> +		if (htab_is_prealloc(htab))
> +			return 0;
> +	} else {
> +		if (htab_is_lru(htab))
> +			return 0;
> +
> +		if (htab_is_prealloc(htab)) {
> +			err = htab_convert_to_non_prealloc(htab);
> +			if (err)
> +				return err;
> +		}
> +	}
>  
> -	if (htab_is_prealloc(htab))
> -		return 0;
>  	/*
>  	 * We must set the dtor using this callback, as map's BTF record is not
>  	 * populated in htab_map_alloc(), so it will always appear as NULL.
> @@ -1355,7 +1405,7 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
>  					  bool percpu, bool onallcpus)
>  {
>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> -	struct htab_elem *l_new, *l_old;
> +	struct htab_elem *l_new = NULL, *l_old;
>  	struct hlist_nulls_head *head;
>  	void *old_map_ptr = NULL;
>  	unsigned long flags;
> @@ -1387,8 +1437,17 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
>  		goto err;
>  
>  	if (l_old) {
> -		/* Update value in-place */
> -		if (percpu) {
> +		if (htab_has_nmi_special_fields(htab)) {
> +			l_new = alloc_htab_elem(htab, key, value, key_size,
> +						hash, percpu, onallcpus,
> +						l_old, map_flags);
> +			if (IS_ERR(l_new)) {
> +				ret = PTR_ERR(l_new);
> +				goto err;
> +			}
> +			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> +			hlist_nulls_del_rcu(&l_old->hash_node);
> +		} else if (percpu) {
>  			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
>  					value, onallcpus, map_flags);
>  		} else {
> @@ -1408,6 +1467,8 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
>  	}
>  err:
>  	htab_unlock_bucket(b, flags);
> +	if (l_old && htab_has_nmi_special_fields(htab) && !ret)
> +		free_htab_elem(htab, l_old);
>  	if (old_map_ptr)
>  		map->ops->map_fd_put_ptr(map, old_map_ptr, true);
>  	return ret;
> @@ -1443,7 +1504,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
>  	 * to remove older elem from htab and this removal
>  	 * operation will need a bucket lock.
>  	 */
> -	if (map_flags != BPF_EXIST) {
> +	if (map_flags != BPF_EXIST || htab_has_nmi_special_fields(htab)) {
>  		l_new = prealloc_lru_pop(htab, key, hash);
>  		if (!l_new)
>  			return -ENOMEM;
> @@ -1460,11 +1521,21 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
>  		goto err;
>  
>  	if (l_old) {
> -		bpf_lru_node_set_ref(&l_old->lru_node);
> +		if (htab_has_nmi_special_fields(htab)) {
> +			pcpu_init_value(htab,
> +					htab_elem_get_ptr(l_new, key_size),
> +					value, onallcpus, map_flags);
> +			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> +			hlist_nulls_del_rcu(&l_old->hash_node);
> +			bpf_lru_node_set_ref(&l_new->lru_node);
> +			l_new = NULL;
> +		} else {
> +			bpf_lru_node_set_ref(&l_old->lru_node);
>  
> -		/* per-cpu hash map can update value in-place */
> -		pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
> -				value, onallcpus, map_flags);
> +			/* per-cpu hash map can update value in-place */
> +			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
> +					value, onallcpus, map_flags);
> +		}
>  	} else {
>  		pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
>  				value, onallcpus, map_flags);
> @@ -1479,6 +1550,8 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
>  		bpf_map_dec_elem_count(&htab->map);
>  		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
>  	}
> +	if (l_old && !ret && htab_has_nmi_special_fields(htab))
> +		htab_lru_push_free(htab, l_old);
>  	return ret;
>  }
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 6600e126fbfb..1f52453d5a2f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -839,6 +839,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  			break;
>  		case BPF_KPTR_REF:
>  		case BPF_KPTR_PERCPU:
> +			if (irqs_disabled())
> +				break;
>  			xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>  			if (!xchgd_field)
>  				break;
> @@ -854,16 +856,18 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  			}
>  			break;
>  		case BPF_UPTR:
> +			if (irqs_disabled())
> +				break;
>  			/* The caller ensured that no one is using the uptr */
>  			unpin_uptr_kaddr(*(void **)field_ptr);
>  			break;
>  		case BPF_LIST_HEAD:
> -			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> +			if (irqs_disabled() || WARN_ON_ONCE(rec->spin_lock_off < 0))
>  				continue;
>  			bpf_list_head_free(field, field_ptr, obj + rec->spin_lock_off);
>  			break;
>  		case BPF_RB_ROOT:
> -			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> +			if (irqs_disabled() || WARN_ON_ONCE(rec->spin_lock_off < 0))
>  				continue;
>  			bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
>  			break;


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19 12:31   ` Mykyta Yatsenko
@ 2026-05-19 13:22     ` Justin Suess
  2026-05-19 13:55       ` Mykyta Yatsenko
  0 siblings, 1 reply; 11+ messages in thread
From: Justin Suess @ 2026-05-19 13:22 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
	yonghong.song, jolsa, bpf, mic, Alexei Starovoitov

On Tue, May 19, 2026 at 01:31:58PM +0100, Mykyta Yatsenko wrote:
> 
> 
> On 5/19/26 2:14 AM, Justin Suess wrote:
> > Relax bpf_obj_free_fields to only cancel/free async work in irq_disabled
> > contexts and defer unsafe free operations such as kptr dtors, list head
> > and rb root destruction to a later non-irq_disabled call driven by the
> > allocator or map free.
> > 
> > Detect fields that are unsafe to free under irqs_disabled at htab
> > creation time. When creating a hashtab with these fields, forcibly set
> > BPF_F_NO_PREALLOC and use the bpf memory allocator instead.
> > 
> > This must happen after the fields are checked, so convert the map to a
> > non-prealloc one if the special fields are present, but before the map
> > has been fully initialized.
> > 
> > Enable this fix for regular, percpu, and lru hashtabs.
> > 
> > This fixes a deadlock caused by updating/deleting map elements in an
> > NMI context by shifting the responsibility of freeing fields that are
> > unsafe to free in contexts like NMI to the memory allocators existing
> > mechanism for handling this.
> > 
> > Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
> > Reported-by: Justin Suess <utilityemal77@gmail.com>
> > Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> > Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > Cc: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
> > Link: https://lore.kernel.org/bpf/DIG0ONMVOP0L.3QFYUPWFSKWI4@gmail.com/
> > Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> > ---
> 
> This is a v3 of the patch, it would be nice to include it in the subject
> (see how other people do that).
>
I did mention this in the cover letter and link to the patch.

I was debating whether to make this a v4 of the last series (I already did a v3 [1]),
but decided against it since this approach really doesn't share a single common line
of code with the previous series.

But I'm not familiar with the norms on this.

[1] https://lore.kernel.org/bpf/20260507175453.1140400-1-utilityemal77@gmail.com/


> >  kernel/bpf/hashtab.c | 93 +++++++++++++++++++++++++++++++++++++++-----
> >  kernel/bpf/syscall.c |  8 +++-
> >  2 files changed, 89 insertions(+), 12 deletions(-)
> > 
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 3dd9b4924ae4..0db1dc8ae0be 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -130,6 +130,16 @@ struct htab_btf_record {
> >  	u32 key_size;
> >  };
> >  
> > +static inline bool htab_has_nmi_special_fields(const struct bpf_htab *htab)
> > +{
> > +	const struct btf_record *rec = htab->map.record;
> > +
> > +	if (IS_ERR_OR_NULL(rec))
> > +		return false;
> > +	return rec->field_mask & (BPF_KPTR_REF | BPF_KPTR_PERCPU | BPF_UPTR |
> > +				  BPF_LIST_HEAD | BPF_RB_ROOT);
> > +}
> > +
> >  static inline bool htab_is_prealloc(const struct bpf_htab *htab)
> >  {
> >  	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
> > @@ -522,13 +532,53 @@ static int htab_set_dtor(struct bpf_htab *htab, void (*dtor)(void *, void *))
> >  	return 0;
> >  }
> >  
> > +static int htab_convert_to_non_prealloc(struct bpf_htab *htab)
> 
> This conversion is not ideal: deallocating/deinitializeing then
> constructing new map again is something we should try to avoid.
> 
> Is it possible to restructure map_create to parse btf and patch map_flags
> before allocation/initialization?
Agreed it's annoying.

The problem is map_check_bpf expects a bpf_map pointer.

So we must allocate the map before then with the existing map ops
definition as far as I can tell.

So we could do it but it would need to have a special case in the
__create_map function itself. Doable enough.
> 
> Or what sashiko suggests: explicitly reject this configuration 
> and return -EINVAL.
>
Yeah but that would just break userspace progs that use referenced kptrs
without the BPF_F_NO_PREALLOC flag.

That would be nice but it's too late to reject such a broad
configuration like that with an error.
> > +{
> > +	bool percpu = htab_is_percpu(htab);
> > +	int err;
> > +
> > +	htab_free_prealloced_fields(htab);
> > +	free_percpu(htab->extra_elems);
> > +	htab->extra_elems = NULL;
> > +	prealloc_destroy(htab);
> > +	htab->map.map_flags |= BPF_F_NO_PREALLOC;
> > +
> > +	err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
> > +	if (err)
> > +		return err;
> > +	if (percpu) {
> > +		err = bpf_mem_alloc_init(&htab->pcpu_ma,
> > +					 round_up(htab->map.value_size, 8),
> > +					 true);
> > +		if (err) {
> > +			bpf_mem_alloc_destroy(&htab->ma);
> > +			return err;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int htab_map_check_btf(struct bpf_map *map, const struct btf *btf,
> >  			      const struct btf_type *key_type, const struct btf_type *value_type)
> >  {
> >  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > +	int err;
> > +
> > +	if (!htab_has_nmi_special_fields(htab)) {
> > +		if (htab_is_prealloc(htab))
> > +			return 0;
> > +	} else {
> > +		if (htab_is_lru(htab))
> > +			return 0;
> > +
> > +		if (htab_is_prealloc(htab)) {
> > +			err = htab_convert_to_non_prealloc(htab);
> > +			if (err)
> > +				return err;
> > +		}
> > +	}
> >  
> > -	if (htab_is_prealloc(htab))
> > -		return 0;
> >  	/*
> >  	 * We must set the dtor using this callback, as map's BTF record is not
> >  	 * populated in htab_map_alloc(), so it will always appear as NULL.
> > @@ -1355,7 +1405,7 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
> >  					  bool percpu, bool onallcpus)
> >  {
> >  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > -	struct htab_elem *l_new, *l_old;
> > +	struct htab_elem *l_new = NULL, *l_old;
> >  	struct hlist_nulls_head *head;
> >  	void *old_map_ptr = NULL;
> >  	unsigned long flags;
> > @@ -1387,8 +1437,17 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
> >  		goto err;
> >  
> >  	if (l_old) {
> > -		/* Update value in-place */
> > -		if (percpu) {
> > +		if (htab_has_nmi_special_fields(htab)) {
> > +			l_new = alloc_htab_elem(htab, key, value, key_size,
> > +						hash, percpu, onallcpus,
> > +						l_old, map_flags);
> > +			if (IS_ERR(l_new)) {
> > +				ret = PTR_ERR(l_new);
> > +				goto err;
> > +			}
> > +			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> > +			hlist_nulls_del_rcu(&l_old->hash_node);
> > +		} else if (percpu) {
> >  			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
> >  					value, onallcpus, map_flags);
> >  		} else {
> > @@ -1408,6 +1467,8 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
> >  	}
> >  err:
> >  	htab_unlock_bucket(b, flags);
> > +	if (l_old && htab_has_nmi_special_fields(htab) && !ret)
> > +		free_htab_elem(htab, l_old);
> >  	if (old_map_ptr)
> >  		map->ops->map_fd_put_ptr(map, old_map_ptr, true);
> >  	return ret;
> > @@ -1443,7 +1504,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
> >  	 * to remove older elem from htab and this removal
> >  	 * operation will need a bucket lock.
> >  	 */
> > -	if (map_flags != BPF_EXIST) {
> > +	if (map_flags != BPF_EXIST || htab_has_nmi_special_fields(htab)) {
> >  		l_new = prealloc_lru_pop(htab, key, hash);
> >  		if (!l_new)
> >  			return -ENOMEM;
> > @@ -1460,11 +1521,21 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
> >  		goto err;
> >  
> >  	if (l_old) {
> > -		bpf_lru_node_set_ref(&l_old->lru_node);
> > +		if (htab_has_nmi_special_fields(htab)) {
> > +			pcpu_init_value(htab,
> > +					htab_elem_get_ptr(l_new, key_size),
> > +					value, onallcpus, map_flags);
> > +			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> > +			hlist_nulls_del_rcu(&l_old->hash_node);
> > +			bpf_lru_node_set_ref(&l_new->lru_node);
> > +			l_new = NULL;
> > +		} else {
> > +			bpf_lru_node_set_ref(&l_old->lru_node);
> >  
> > -		/* per-cpu hash map can update value in-place */
> > -		pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
> > -				value, onallcpus, map_flags);
> > +			/* per-cpu hash map can update value in-place */
> > +			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
> > +					value, onallcpus, map_flags);
> > +		}
> >  	} else {
> >  		pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
> >  				value, onallcpus, map_flags);
> > @@ -1479,6 +1550,8 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
> >  		bpf_map_dec_elem_count(&htab->map);
> >  		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
> >  	}
> > +	if (l_old && !ret && htab_has_nmi_special_fields(htab))
> > +		htab_lru_push_free(htab, l_old);
> >  	return ret;
> >  }
> >  
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 6600e126fbfb..1f52453d5a2f 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -839,6 +839,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> >  			break;
> >  		case BPF_KPTR_REF:
> >  		case BPF_KPTR_PERCPU:
> > +			if (irqs_disabled())
> > +				break;
> >  			xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
> >  			if (!xchgd_field)
> >  				break;
> > @@ -854,16 +856,18 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> >  			}
> >  			break;
> >  		case BPF_UPTR:
> > +			if (irqs_disabled())
> > +				break;
> >  			/* The caller ensured that no one is using the uptr */
> >  			unpin_uptr_kaddr(*(void **)field_ptr);
> >  			break;
> >  		case BPF_LIST_HEAD:
> > -			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> > +			if (irqs_disabled() || WARN_ON_ONCE(rec->spin_lock_off < 0))
> >  				continue;
> >  			bpf_list_head_free(field, field_ptr, obj + rec->spin_lock_off);
> >  			break;
> >  		case BPF_RB_ROOT:
> > -			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> > +			if (irqs_disabled() || WARN_ON_ONCE(rec->spin_lock_off < 0))
> >  				continue;
> >  			bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
> >  			break;
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19 13:22     ` Justin Suess
@ 2026-05-19 13:55       ` Mykyta Yatsenko
  0 siblings, 0 replies; 11+ messages in thread
From: Mykyta Yatsenko @ 2026-05-19 13:55 UTC (permalink / raw)
  To: Justin Suess
  Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
	yonghong.song, jolsa, bpf, mic, Alexei Starovoitov

On 5/19/26 2:22 PM, Justin Suess wrote:
> On Tue, May 19, 2026 at 01:31:58PM +0100, Mykyta Yatsenko wrote:
>>
>>
>> On 5/19/26 2:14 AM, Justin Suess wrote:
>>> Relax bpf_obj_free_fields to only cancel/free async work in irq_disabled
>>> contexts and defer unsafe free operations such as kptr dtors, list head
>>> and rb root destruction to a later non-irq_disabled call driven by the
>>> allocator or map free.
>>>
>>> Detect fields that are unsafe to free under irqs_disabled at htab
>>> creation time. When creating a hashtab with these fields, forcibly set
>>> BPF_F_NO_PREALLOC and use the bpf memory allocator instead.
>>>
>>> This must happen after the fields are checked, so convert the map to a
>>> non-prealloc one if the special fields are present, but before the map
>>> has been fully initialized.
>>>
>>> Enable this fix for regular, percpu, and lru hashtabs.
>>>
>>> This fixes a deadlock caused by updating/deleting map elements in an
>>> NMI context by shifting the responsibility of freeing fields that are
>>> unsafe to free in contexts like NMI to the memory allocators existing
>>> mechanism for handling this.
>>>
>>> Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
>>> Reported-by: Justin Suess <utilityemal77@gmail.com>
>>> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
>>> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
>>> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>> Cc: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
>>> Link: https://lore.kernel.org/bpf/DIG0ONMVOP0L.3QFYUPWFSKWI4@gmail.com/
>>> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
>>> ---
>>
>> This is a v3 of the patch, it would be nice to include it in the subject
>> (see how other people do that).
>>
> I did mention this in the cover letter and link to the patch.
> 
> I was debating whether to make this a v4 of the last series (I already did a v3 [1]),
> but decided against it since this approach really doesn't share a single common line
> of code with the previous series.
> 
> But I'm not familiar with the norms on this.

I'd say carry the versioning, even if the code is completely changed, the problem you
are solving is the same, so it's the same patch series.

> 
> [1] https://lore.kernel.org/bpf/20260507175453.1140400-1-utilityemal77@gmail.com/
> 
> 
>>>  kernel/bpf/hashtab.c | 93 +++++++++++++++++++++++++++++++++++++++-----
>>>  kernel/bpf/syscall.c |  8 +++-
>>>  2 files changed, 89 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index 3dd9b4924ae4..0db1dc8ae0be 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -130,6 +130,16 @@ struct htab_btf_record {
>>>  	u32 key_size;
>>>  };
>>>  
>>> +static inline bool htab_has_nmi_special_fields(const struct bpf_htab *htab)
>>> +{
>>> +	const struct btf_record *rec = htab->map.record;
>>> +
>>> +	if (IS_ERR_OR_NULL(rec))
>>> +		return false;
>>> +	return rec->field_mask & (BPF_KPTR_REF | BPF_KPTR_PERCPU | BPF_UPTR |
>>> +				  BPF_LIST_HEAD | BPF_RB_ROOT);
>>> +}
>>> +
>>>  static inline bool htab_is_prealloc(const struct bpf_htab *htab)
>>>  {
>>>  	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
>>> @@ -522,13 +532,53 @@ static int htab_set_dtor(struct bpf_htab *htab, void (*dtor)(void *, void *))
>>>  	return 0;
>>>  }
>>>  
>>> +static int htab_convert_to_non_prealloc(struct bpf_htab *htab)
>>
>> This conversion is not ideal: deallocating/deinitializeing then
>> constructing new map again is something we should try to avoid.
>>
>> Is it possible to restructure map_create to parse btf and patch map_flags
>> before allocation/initialization?
> Agreed it's annoying.
> 
> The problem is map_check_bpf expects a bpf_map pointer.
> 
> So we must allocate the map before then with the existing map ops
> definition as far as I can tell.
> 
> So we could do it but it would need to have a special case in the
> __create_map function itself. Doable enough.

It's worth trying to find a refactoring that going to work.
Maybe moving map_flags patching logic out of map_check_btf.

This approach with reinitialization looks like the worst choice:
you are destroying/allocating map in the map_check_btf() (which sounds
like a function with no side effects), this is
very non-obvious. I think there is no precedent of map reinitialization.
Imagine if in other place we needed to change configuration again, are you
going to reinitialize it there too?
How would you design this if you were writing this from scratch?

>>
>> Or what sashiko suggests: explicitly reject this configuration 
>> and return -EINVAL.
>>
> Yeah but that would just break userspace progs that use referenced kptrs
> without the BPF_F_NO_PREALLOC flag.
> 
> That would be nice but it's too late to reject such a broad
> configuration like that with an error.
>>> +{
>>> +	bool percpu = htab_is_percpu(htab);
>>> +	int err;
>>> +
>>> +	htab_free_prealloced_fields(htab);
>>> +	free_percpu(htab->extra_elems);
>>> +	htab->extra_elems = NULL;
>>> +	prealloc_destroy(htab);
>>> +	htab->map.map_flags |= BPF_F_NO_PREALLOC;
>>> +
>>> +	err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
>>> +	if (err)
>>> +		return err;
>>> +	if (percpu) {
>>> +		err = bpf_mem_alloc_init(&htab->pcpu_ma,
>>> +					 round_up(htab->map.value_size, 8),
>>> +					 true);
>>> +		if (err) {
>>> +			bpf_mem_alloc_destroy(&htab->ma);
>>> +			return err;
>>> +		}
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>  static int htab_map_check_btf(struct bpf_map *map, const struct btf *btf,
>>>  			      const struct btf_type *key_type, const struct btf_type *value_type)
>>>  {
>>>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>>> +	int err;
>>> +
>>> +	if (!htab_has_nmi_special_fields(htab)) {
>>> +		if (htab_is_prealloc(htab))
>>> +			return 0;
>>> +	} else {
>>> +		if (htab_is_lru(htab))
>>> +			return 0;
>>> +
>>> +		if (htab_is_prealloc(htab)) {
>>> +			err = htab_convert_to_non_prealloc(htab);
>>> +			if (err)
>>> +				return err;
>>> +		}
>>> +	}
>>>  
>>> -	if (htab_is_prealloc(htab))
>>> -		return 0;
>>>  	/*
>>>  	 * We must set the dtor using this callback, as map's BTF record is not
>>>  	 * populated in htab_map_alloc(), so it will always appear as NULL.
>>> @@ -1355,7 +1405,7 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
>>>  					  bool percpu, bool onallcpus)
>>>  {
>>>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>>> -	struct htab_elem *l_new, *l_old;
>>> +	struct htab_elem *l_new = NULL, *l_old;
>>>  	struct hlist_nulls_head *head;
>>>  	void *old_map_ptr = NULL;
>>>  	unsigned long flags;
>>> @@ -1387,8 +1437,17 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
>>>  		goto err;
>>>  
>>>  	if (l_old) {
>>> -		/* Update value in-place */
>>> -		if (percpu) {
>>> +		if (htab_has_nmi_special_fields(htab)) {
>>> +			l_new = alloc_htab_elem(htab, key, value, key_size,
>>> +						hash, percpu, onallcpus,
>>> +						l_old, map_flags);
>>> +			if (IS_ERR(l_new)) {
>>> +				ret = PTR_ERR(l_new);
>>> +				goto err;
>>> +			}
>>> +			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>> +			hlist_nulls_del_rcu(&l_old->hash_node);
>>> +		} else if (percpu) {
>>>  			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
>>>  					value, onallcpus, map_flags);
>>>  		} else {
>>> @@ -1408,6 +1467,8 @@ static long htab_map_update_elem_in_place(struct bpf_map *map, void *key,
>>>  	}
>>>  err:
>>>  	htab_unlock_bucket(b, flags);
>>> +	if (l_old && htab_has_nmi_special_fields(htab) && !ret)
>>> +		free_htab_elem(htab, l_old);
>>>  	if (old_map_ptr)
>>>  		map->ops->map_fd_put_ptr(map, old_map_ptr, true);
>>>  	return ret;
>>> @@ -1443,7 +1504,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
>>>  	 * to remove older elem from htab and this removal
>>>  	 * operation will need a bucket lock.
>>>  	 */
>>> -	if (map_flags != BPF_EXIST) {
>>> +	if (map_flags != BPF_EXIST || htab_has_nmi_special_fields(htab)) {
>>>  		l_new = prealloc_lru_pop(htab, key, hash);
>>>  		if (!l_new)
>>>  			return -ENOMEM;
>>> @@ -1460,11 +1521,21 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
>>>  		goto err;
>>>  
>>>  	if (l_old) {
>>> -		bpf_lru_node_set_ref(&l_old->lru_node);
>>> +		if (htab_has_nmi_special_fields(htab)) {
>>> +			pcpu_init_value(htab,
>>> +					htab_elem_get_ptr(l_new, key_size),
>>> +					value, onallcpus, map_flags);
>>> +			hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>> +			hlist_nulls_del_rcu(&l_old->hash_node);
>>> +			bpf_lru_node_set_ref(&l_new->lru_node);
>>> +			l_new = NULL;
>>> +		} else {
>>> +			bpf_lru_node_set_ref(&l_old->lru_node);
>>>  
>>> -		/* per-cpu hash map can update value in-place */
>>> -		pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
>>> -				value, onallcpus, map_flags);
>>> +			/* per-cpu hash map can update value in-place */
>>> +			pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
>>> +					value, onallcpus, map_flags);
>>> +		}
>>>  	} else {
>>>  		pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
>>>  				value, onallcpus, map_flags);
>>> @@ -1479,6 +1550,8 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
>>>  		bpf_map_dec_elem_count(&htab->map);
>>>  		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
>>>  	}
>>> +	if (l_old && !ret && htab_has_nmi_special_fields(htab))
>>> +		htab_lru_push_free(htab, l_old);
>>>  	return ret;
>>>  }
>>>  
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 6600e126fbfb..1f52453d5a2f 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -839,6 +839,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>>  			break;
>>>  		case BPF_KPTR_REF:
>>>  		case BPF_KPTR_PERCPU:
>>> +			if (irqs_disabled())
>>> +				break;
>>>  			xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>>>  			if (!xchgd_field)
>>>  				break;
>>> @@ -854,16 +856,18 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>>  			}
>>>  			break;
>>>  		case BPF_UPTR:
>>> +			if (irqs_disabled())
>>> +				break;
>>>  			/* The caller ensured that no one is using the uptr */
>>>  			unpin_uptr_kaddr(*(void **)field_ptr);
>>>  			break;
>>>  		case BPF_LIST_HEAD:
>>> -			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>>> +			if (irqs_disabled() || WARN_ON_ONCE(rec->spin_lock_off < 0))
>>>  				continue;
>>>  			bpf_list_head_free(field, field_ptr, obj + rec->spin_lock_off);
>>>  			break;
>>>  		case BPF_RB_ROOT:
>>> -			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>>> +			if (irqs_disabled() || WARN_ON_ONCE(rec->spin_lock_off < 0))
>>>  				continue;
>>>  			bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
>>>  			break;
>>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19  1:51   ` sashiko-bot
@ 2026-05-19 14:42     ` Justin Suess
  2026-05-19 15:59       ` Justin Suess
  0 siblings, 1 reply; 11+ messages in thread
From: Justin Suess @ 2026-05-19 14:42 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf

On Tue, May 19, 2026 at 01:51:22AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [Critical] LRU maps containing NMI-unsafe fields (like BPF_KPTR_REF or BPF_LIST_HEAD) will permanently leak these fields and corrupt map state when updated or deleted in an IRQ-disabled context.
> - [High] Modifying the generic `bpf_obj_free_fields()` to silently skip destruction in IRQ-disabled contexts causes permanent memory leaks for callers lacking deferred destructors (like `bpf_obj_drop` and Array maps).
> - [Medium] Forcibly mutating the user-requested `map_flags` to include `BPF_F_NO_PREALLOC` silently breaks performance expectations and UAPI transparency.
> --
> 
> commit 3fbec1aa4107fa1f73249893a4ed21fd7d2213fa
> Author: Justin Suess <utilityemal77@gmail.com>
> 
> bpf: fix deadlock in special field destruction in NMI
> 
> This commit relaxes bpf_obj_free_fields() to skip async work cancellation
> and unsafe free operations in IRQ-disabled contexts. It converts hash maps
> with NMI-unsafe fields to non-preallocated maps using bpf_mem_alloc to
> defer the cleanup. LRU maps and preallocated hash maps are adjusted to
> support this mechanism.
> 
> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> Link: https://lore.kernel.org/bpf/DIG0ONMVOP0L.3QFYUPWFSKWI4@gmail.com/
> 
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 3dd9b4924ae4f..0db1dc8ae0bec 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> [ ... ]
> > @@ -522,13 +532,53 @@ static int htab_set_dtor(struct bpf_htab *htab, void (*dtor)(void *, void *))
> >  	return 0;
> >  }
> >  
> > +static int htab_convert_to_non_prealloc(struct bpf_htab *htab)
> > +{
> > +	bool percpu = htab_is_percpu(htab);
> > +	int err;
> > +
> > +	htab_free_prealloced_fields(htab);
> > +	free_percpu(htab->extra_elems);
> > +	htab->extra_elems = NULL;
> > +	prealloc_destroy(htab);
> > +	htab->map.map_flags |= BPF_F_NO_PREALLOC;
> 
> Does mutating the map flags post-creation to include BPF_F_NO_PREALLOC alter
> the memory model and user-visible flags without warning? If userspace queries
> the map info via bpf_map_get_info_by_fd(), will it see flags it never
> requested?
Changing the userspace visible flag was done to provide visibility into
the change.
> 
> Would it be better to explicitly reject this configuration and return -EINVAL
> rather than secretly altering the underlying memory model?
>
That would break existing progs unacceptably, including the common case
of a normal hashtab with ref kptrs and no BPF_F_NO_PREALLOC flag.
> [ ... ]
> >  static int htab_map_check_btf(struct bpf_map *map, const struct btf *btf,
> >  			      const struct btf_type *key_type, const struct btf_type *value_type)
> >  {
> >  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > +	int err;
> > +
> > +	if (!htab_has_nmi_special_fields(htab)) {
> > +		if (htab_is_prealloc(htab))
> > +			return 0;
> > +	} else {
> > +		if (htab_is_lru(htab))
> > +			return 0;
> 
> Since LRU maps are exempted from conversion to non-preallocated mode here,
> they do not use bpf_mem_alloc and completely lack the deferred destructor
> mechanism relied upon by this patch.
> 
> When an LRU element is replaced or deleted, htab_lru_push_free() is called
> and synchronously invokes check_and_free_fields() which eventually calls
> bpf_obj_free_fields().
> 
> If this executes in an IRQ-disabled context, will the new if
> (irqs_disabled()) checks in bpf_obj_free_fields() silently skip freeing
> the special fields?
> 
> Does this cause a permanent memory leak since the element is then
> immediately pushed back to the LRU free list? When the element is
> subsequently popped and reused, could it inherit stale pointers and
> corrupt map state?
> 
It doesn't cause a memory leak.

It does have the potential to reuse slots with existing pointers, but they
aren't "stale" they are valid pointers that are recycled until bpf_obj_free_fields
gets called on them in a non-irqs_disabled context which can happen on
another delete / LRU trim / an xchg out and release / map
destruction.

> [ ... ]
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 6600e126fbfb4..1f52453d5a2f1 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -839,6 +839,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> >  			break;
> >  		case BPF_KPTR_REF:
> >  		case BPF_KPTR_PERCPU:
> > +			if (irqs_disabled())
> > +				break;
> >  			xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
> >  			if (!xchgd_field)
> >  				break;
> 
> Because bpf_obj_free_fields() is a generic helper used throughout the BPF
> subsystem, does silently skipping destruction in IRQ-disabled contexts
> cause leaks for callers that lack deferred destructors?
> 
> For example, if a BPF program running in an NMI or kprobe drops an object
> containing nested kptrs via __bpf_obj_drop_impl(), will the nested fields
> be skipped and permanently leaked when the outer object is queued for
> freeing via bpf_global_ma?
>
Looks like this kfunc also tries to eagerly free fields and is callable
from tp_btf/nmi_handler with no recovery mechanism.

I don't see a good way to handle this at all... it also is effectively
already bugged in the existing code unless I'm missing something

Since you can use it to call abritrary dtors in NMI.

1. 

In some context where task_acquire is valid:

struct blah {
  struct task_struct __kptr* task;
}

b = bpf_task_acquire(...)
a = bpf_obj_new(typeof(*blah_ptr));
bpf_kptr_xchg(a->task, b);

2. 

Then xchg a into map
bpf_kptr_xchg([map slot], a);
...

3. 

In tp_btf/nmi_handler we exchange the obj out of the map
and then drop it.

c = bpf_kptr_xchg([slot where a is stored], NULL);
bpf_obj_drop(c); 

And because this can be called on an object not in a map, we don't have
a good recovery mechanism.

Because tp_btf/nmi_handler is always in_nmi and hence irqs_disabled, in
the code before this patch the task_struct dtor will be invoked and
cause a deadlock.

This patch will not fix the issue and just free a struct blah holding the
reference without actually dropping it.

Unless I'm missing something then this whole approach taken by this
patch can't handle this case since the bpf_obj is detached from the map
entirely and has no allocator recovery mechanism at all.

Then we are back to square 1 needing to defer the freeing with irq work.

Kumar / Alexei do you have anything on this?


> Does this similarly affect in-place updates to Array maps which rely on
> bpf_obj_free_fields() to clear old special fields?
> 
Basically existing behavior (undocumented) is that bpf_map_update_elem
won't copy kptr fields timers or spinlocks. (see copy_map_value, and
bpf_obj_memcpy).

So those fields can never be overwritten from bpf_map_update_elem on an
array.

The only thing that really changes is that now instead of trying to free
the existing map slot value on update we just leave it alone if
irqs_disabled().

It's definitely a behavioral change either way. And it may be worth
extending the bpf_obj_memcpy/copy_map_value exceptions to the other
types covered in this patch.

This was discussed here. [1]

[1] https://lore.kernel.org/bpf/CAP01T75942V+wKJD0yhs8yb7U5vj+y9s6efJ+UdpTMfzP31CWg@mail.gmail.com/
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260519011450.1144935-2-utilityemal77@gmail.com?part=1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19 14:42     ` Justin Suess
@ 2026-05-19 15:59       ` Justin Suess
  2026-05-19 19:27         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Justin Suess @ 2026-05-19 15:59 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf

On Tue, May 19, 2026 at 10:42:44AM -0400, Justin Suess wrote:
> Looks like this kfunc also tries to eagerly free fields and is callable
> from tp_btf/nmi_handler with no recovery mechanism.
> 
> I don't see a good way to handle this at all... it also is effectively
> already bugged in the existing code unless I'm missing something
> 
> Since you can use it to call abritrary dtors in NMI.
> 
> 1. 
> 
> In some context where task_acquire is valid:
> 
> struct blah {
>   struct task_struct __kptr* task;
> }
> 
> b = bpf_task_acquire(...)
> a = bpf_obj_new(typeof(*blah_ptr));
> bpf_kptr_xchg(a->task, b);
> 
> 2. 
> 
> Then xchg a into map
> bpf_kptr_xchg([map slot], a);
> ...
> 
> 3. 
> 
> In tp_btf/nmi_handler we exchange the obj out of the map
> and then drop it.
> 
> c = bpf_kptr_xchg([slot where a is stored], NULL);
> bpf_obj_drop(c); 
> 
> And because this can be called on an object not in a map, we don't have
> a good recovery mechanism.
> 
> Because tp_btf/nmi_handler is always in_nmi and hence irqs_disabled, in
> the code before this patch the task_struct dtor will be invoked and
> cause a deadlock.
> 
> This patch will not fix the issue and just free a struct blah holding the
> reference without actually dropping it.
> 
> Unless I'm missing something then this whole approach taken by this
> patch can't handle this case since the bpf_obj is detached from the map
> entirely and has no allocator recovery mechanism at all.
> 
> Then we are back to square 1 needing to defer the freeing with irq work.
> 
> Kumar / Alexei do you have anything on this?

I've confirmed this is a viable mechanism to deadlock as well

Using the above approach, just exchanging out of a map and dropping the
bpf_obj containing a task_struct kptr can be used to cause a deadlock
outside any map path:

On 576482b55c19e7ec00e162a0fde4c4f1a95128c7 (bpf-next/master) w/ RCU_NOCB_CPU/RCU_EXPERT:

  [   10.773819] ================================
  [   10.773820] WARNING: inconsistent lock state
  [   10.773821] 7.1.0-rc2-g576482b55c19 #150 Not tainted
  [   10.773822] --------------------------------
  [   10.773822] inconsistent {INITIAL USE} -> {IN-NMI} usage.
  [   10.773823] test_progs/471 [HC1[1]:SC0[0]:HE0:SE1] takes:
  [   10.773825] ffff98623d66f0e8 (&rdp->nocb_lock){-.-.}-{2:2}, at: __call_rcu_common.constprop.0+0x54a/0x730
  [   10.773832] {INITIAL USE} state was registered at:
  [   10.773832]   lock_acquire+0xbf/0x2e0
  [   10.773835]   _raw_spin_lock+0x30/0x40
  [   10.773838]   rcu_nocb_gp_kthread+0x13f/0xb90
  [   10.773840]   kthread+0xf4/0x130
  [   10.773845]   ret_from_fork+0x264/0x330
  [   10.773848]   ret_from_fork_asm+0x1a/0x30
  [   10.773851] irq event stamp: 1127908
  [   10.773851] hardirqs last  enabled at (1127907): [<ffffffff87104484>] _raw_spin_unlock_irqrestore+0x44/0x60
  [   10.773853] hardirqs last disabled at (1127908): [<ffffffff871041a1>] _raw_spin_lock_irqsave+0x51/0x60
  [   10.773854] softirqs last  enabled at (1127782): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
  [   10.773857] softirqs last disabled at (1127777): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
  [   10.773858]
                 other info that might help us debug this:
  [   10.773859]  Possible unsafe locking scenario:
  
  [   10.773859]        CPU0
  [   10.773859]        ----
  [   10.773860]   lock(&rdp->nocb_lock);
  [   10.773861]   <Interrupt>
  [   10.773861]     lock(&rdp->nocb_lock);
  [   10.773862]
                  *** DEADLOCK ***
  
  [   10.773862] 4 locks held by test_progs/471:
  [   10.773863]  #0: ffffffff885ccaf0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1e39/0x2b80
  [   10.773866]  #1: ffff9861c33a2ff8 (&mm->mmap_lock){++++}-{4:4}, at: dup_mmap+0x83/0x990
  [   10.773871]  #2: ffff9861c1152ff8 (&mm->mmap_lock/1){+.+.}-{4:4}, at: dup_mmap+0xc8/0x990
  [   10.773874]  #3: ffffffff885f43b8 (kmemleak_lock){-.-.}-{2:2}, at: __create_object+0x3a/0xa0
  [   10.773877]
                 stack backtrace:
  [   10.773879] CPU: 1 UID: 0 PID: 471 Comm: test_progs Not tainted 7.1.0-rc2-g576482b55c19 #150 PREEMPT(full)
  [   10.773881] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
  [   10.773882] Call Trace:
  [   10.773883]  <NMI>
  [   10.773884]  dump_stack_lvl+0x5d/0x80
  [   10.773887]  print_usage_bug.part.0+0x22b/0x2c0
  [   10.773891]  lock_acquire+0x295/0x2e0
  [   10.773893]  ? __call_rcu_common.constprop.0+0x54a/0x730
  [   10.773896]  _raw_spin_lock+0x30/0x40
  [   10.773898]  ? __call_rcu_common.constprop.0+0x54a/0x730
  [   10.773899]  __call_rcu_common.constprop.0+0x54a/0x730
  [   10.773903]  bpf_obj_free_fields+0x118/0x250
  [   10.773907]  bpf_obj_drop+0x4f/0x80
  [   10.773910]  bpf_prog_b7ff9c2e6c583824_clear_task_kptrs_from_nmi+0x174/0x1f4
  [   10.773912]  bpf_trace_run3+0x126/0x430
  [   10.773916]  ? __pfx_perf_event_nmi_handler+0x10/0x10
  [   10.773919]  nmi_handle.part.0+0x15b/0x250
  [   10.773922]  ? __pfx_perf_event_nmi_handler+0x10/0x10
  [   10.773924]  default_do_nmi+0x120/0x180
  [   10.773927]  exc_nmi+0xe3/0x110
  [   10.773928]  end_repeat_nmi+0xf/0x53
  [   10.773931] RIP: 0010:__link_object+0xb4/0x270
  [   10.773932] Code: 00 00 00 48 c7 c6 80 36 bb 89 48 c7 43 70 00 00 00 00 48 c7 43 78 00 00 00 00 48 89 3d 35 62 50 03 eb 66 48 89 c8 48 8b 48 30 <48> 8d 70 10 48 39 d1 73 11 48 03 48 38 48 39 cf 0f 82 2f 01 00 00
  [   10.773933] RSP: 0018:ffffb516013b7a60 EFLAGS: 00000082
  [   10.773935] RAX: ffff9861c37b5c80 RBX: ffff9861df869878 RCX: ffff9861c3776420
  [   10.773936] RDX: ffff9861c3769e00 RSI: ffff9861c43564f8 RDI: ffff9861c3769d00
  [   10.773936] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
  [   10.773937] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9861c3769d00
  [   10.773937] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000001
  [   10.773943]  ? __link_object+0xb4/0x270
  [   10.773945]  ? __link_object+0xb4/0x270
  [   10.773947]  </NMI>
  [   10.773947]  <TASK>
  [   10.773948]  __create_object+0x51/0xa0
  [   10.773951]  kmem_cache_alloc_from_sheaf_noprof+0x1e2/0x290
  [   10.773956]  mas_dup_build+0x41d/0x7f0
  [   10.773960]  __mt_dup+0x78/0xf0
  [   10.773967]  dup_mmap+0x12e/0x990
  [   10.773972]  ? srso_alias_return_thunk+0x5/0xfbef5
  [   10.773973]  ? lock_acquire+0xbf/0x2e0
  [   10.773979]  copy_process+0x1e45/0x2b80
  [   10.773985]  kernel_clone+0xbb/0x3d0
  [   10.773986]  ? srso_alias_return_thunk+0x5/0xfbef5
  [   10.773987]  ? __lock_acquire+0x481/0x2590
  [   10.773993]  __do_sys_clone+0x65/0x90
  [   10.773998]  do_syscall_64+0xa1/0x5f0
  [   10.774002]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [   10.774003] RIP: 0033:0x7f9da26e8f2f
  [   10.774005] Code: 89 e7 e8 c4 0e f6 ff 45 31 c0 31 d2 31 f6 64 48 8b 04 25 10 00 00 00 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 89 c5 85 c0 75 31 64 48 8b 04 25 10 00 00
  [   10.774005] RSP: 002b:00007ffc278b4540 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
  [   10.774007] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9da26e8f2f
  [   10.774007] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
  [   10.774008] RBP: 00007ffc278b4740 R08: 0000000000000000 R09: 0000000000000001
  [   10.774008] R10: 00007f9da2acaf10 R11: 0000000000000246 R12: 0000000000000003
  [   10.774009] R13: 0000000000000001 R14: 0000000000000000 R15: 00005625de8b38d0
  [   10.774014]  </TASK>
  [   10.866584] perf: interrupt took too long (6193 > 6190), lowering kernel.perf_event_max_sample_rate to 32000
  [   16.630663] kauditd_printk_skb: 30 callbacks suppressed

I can give the reproducer if wished.

You can see the path involves bpf_obj_drop and not any map methods at
all.

But basically this makes the whole mechanism of relying on map allocators to
recycle and free objects in irqs_disabled contexts unsound because kptrs
can live and be released outside map contexts.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19 15:59       ` Justin Suess
@ 2026-05-19 19:27         ` Kumar Kartikeya Dwivedi
  2026-05-19 23:19           ` Justin Suess
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-19 19:27 UTC (permalink / raw)
  To: Justin Suess, sashiko-reviews; +Cc: bpf

On Tue May 19, 2026 at 5:59 PM CEST, Justin Suess wrote:
> On Tue, May 19, 2026 at 10:42:44AM -0400, Justin Suess wrote:
>> Looks like this kfunc also tries to eagerly free fields and is callable
>> from tp_btf/nmi_handler with no recovery mechanism.
>>
>> I don't see a good way to handle this at all... it also is effectively
>> already bugged in the existing code unless I'm missing something
>>
>> Since you can use it to call abritrary dtors in NMI.
>>
>> 1.
>>
>> In some context where task_acquire is valid:
>>
>> struct blah {
>>   struct task_struct __kptr* task;
>> }
>>
>> b = bpf_task_acquire(...)
>> a = bpf_obj_new(typeof(*blah_ptr));
>> bpf_kptr_xchg(a->task, b);
>>
>> 2.
>>
>> Then xchg a into map
>> bpf_kptr_xchg([map slot], a);
>> ...
>>
>> 3.
>>
>> In tp_btf/nmi_handler we exchange the obj out of the map
>> and then drop it.
>>
>> c = bpf_kptr_xchg([slot where a is stored], NULL);
>> bpf_obj_drop(c);
>>
>> And because this can be called on an object not in a map, we don't have
>> a good recovery mechanism.
>>
>> Because tp_btf/nmi_handler is always in_nmi and hence irqs_disabled, in
>> the code before this patch the task_struct dtor will be invoked and
>> cause a deadlock.
>>
>> This patch will not fix the issue and just free a struct blah holding the
>> reference without actually dropping it.
>>
>> Unless I'm missing something then this whole approach taken by this
>> patch can't handle this case since the bpf_obj is detached from the map
>> entirely and has no allocator recovery mechanism at all.
>>
>> Then we are back to square 1 needing to defer the freeing with irq work.
>>
>> Kumar / Alexei do you have anything on this?
>
> I've confirmed this is a viable mechanism to deadlock as well
>
> Using the above approach, just exchanging out of a map and dropping the
> bpf_obj containing a task_struct kptr can be used to cause a deadlock
> outside any map path:
>
> On 576482b55c19e7ec00e162a0fde4c4f1a95128c7 (bpf-next/master) w/ RCU_NOCB_CPU/RCU_EXPERT:
>
>   [   10.773819] ================================
>   [   10.773820] WARNING: inconsistent lock state
>   [   10.773821] 7.1.0-rc2-g576482b55c19 #150 Not tainted
>   [   10.773822] --------------------------------
>   [   10.773822] inconsistent {INITIAL USE} -> {IN-NMI} usage.
>   [   10.773823] test_progs/471 [HC1[1]:SC0[0]:HE0:SE1] takes:
>   [   10.773825] ffff98623d66f0e8 (&rdp->nocb_lock){-.-.}-{2:2}, at: __call_rcu_common.constprop.0+0x54a/0x730
>   [   10.773832] {INITIAL USE} state was registered at:
>   [   10.773832]   lock_acquire+0xbf/0x2e0
>   [   10.773835]   _raw_spin_lock+0x30/0x40
>   [   10.773838]   rcu_nocb_gp_kthread+0x13f/0xb90
>   [   10.773840]   kthread+0xf4/0x130
>   [   10.773845]   ret_from_fork+0x264/0x330
>   [   10.773848]   ret_from_fork_asm+0x1a/0x30
>   [   10.773851] irq event stamp: 1127908
>   [   10.773851] hardirqs last  enabled at (1127907): [<ffffffff87104484>] _raw_spin_unlock_irqrestore+0x44/0x60
>   [   10.773853] hardirqs last disabled at (1127908): [<ffffffff871041a1>] _raw_spin_lock_irqsave+0x51/0x60
>   [   10.773854] softirqs last  enabled at (1127782): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
>   [   10.773857] softirqs last disabled at (1127777): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
>   [   10.773858]
>                  other info that might help us debug this:
>   [   10.773859]  Possible unsafe locking scenario:
>
>   [   10.773859]        CPU0
>   [   10.773859]        ----
>   [   10.773860]   lock(&rdp->nocb_lock);
>   [   10.773861]   <Interrupt>
>   [   10.773861]     lock(&rdp->nocb_lock);
>   [   10.773862]
>                   *** DEADLOCK ***
>
>   [   10.773862] 4 locks held by test_progs/471:
>   [   10.773863]  #0: ffffffff885ccaf0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1e39/0x2b80
>   [   10.773866]  #1: ffff9861c33a2ff8 (&mm->mmap_lock){++++}-{4:4}, at: dup_mmap+0x83/0x990
>   [   10.773871]  #2: ffff9861c1152ff8 (&mm->mmap_lock/1){+.+.}-{4:4}, at: dup_mmap+0xc8/0x990
>   [   10.773874]  #3: ffffffff885f43b8 (kmemleak_lock){-.-.}-{2:2}, at: __create_object+0x3a/0xa0
>   [   10.773877]
>                  stack backtrace:
>   [   10.773879] CPU: 1 UID: 0 PID: 471 Comm: test_progs Not tainted 7.1.0-rc2-g576482b55c19 #150 PREEMPT(full)
>   [   10.773881] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
>   [   10.773882] Call Trace:
>   [   10.773883]  <NMI>
>   [   10.773884]  dump_stack_lvl+0x5d/0x80
>   [   10.773887]  print_usage_bug.part.0+0x22b/0x2c0
>   [   10.773891]  lock_acquire+0x295/0x2e0
>   [   10.773893]  ? __call_rcu_common.constprop.0+0x54a/0x730
>   [   10.773896]  _raw_spin_lock+0x30/0x40
>   [   10.773898]  ? __call_rcu_common.constprop.0+0x54a/0x730
>   [   10.773899]  __call_rcu_common.constprop.0+0x54a/0x730
>   [   10.773903]  bpf_obj_free_fields+0x118/0x250
>   [   10.773907]  bpf_obj_drop+0x4f/0x80
>   [   10.773910]  bpf_prog_b7ff9c2e6c583824_clear_task_kptrs_from_nmi+0x174/0x1f4
>   [   10.773912]  bpf_trace_run3+0x126/0x430
>   [   10.773916]  ? __pfx_perf_event_nmi_handler+0x10/0x10
>   [   10.773919]  nmi_handle.part.0+0x15b/0x250
>   [   10.773922]  ? __pfx_perf_event_nmi_handler+0x10/0x10
>   [   10.773924]  default_do_nmi+0x120/0x180
>   [   10.773927]  exc_nmi+0xe3/0x110
>   [   10.773928]  end_repeat_nmi+0xf/0x53
>   [   10.773931] RIP: 0010:__link_object+0xb4/0x270
>   [   10.773932] Code: 00 00 00 48 c7 c6 80 36 bb 89 48 c7 43 70 00 00 00 00 48 c7 43 78 00 00 00 00 48 89 3d 35 62 50 03 eb 66 48 89 c8 48 8b 48 30 <48> 8d 70 10 48 39 d1 73 11 48 03 48 38 48 39 cf 0f 82 2f 01 00 00
>   [   10.773933] RSP: 0018:ffffb516013b7a60 EFLAGS: 00000082
>   [   10.773935] RAX: ffff9861c37b5c80 RBX: ffff9861df869878 RCX: ffff9861c3776420
>   [   10.773936] RDX: ffff9861c3769e00 RSI: ffff9861c43564f8 RDI: ffff9861c3769d00
>   [   10.773936] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
>   [   10.773937] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9861c3769d00
>   [   10.773937] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000001
>   [   10.773943]  ? __link_object+0xb4/0x270
>   [   10.773945]  ? __link_object+0xb4/0x270
>   [   10.773947]  </NMI>
>   [   10.773947]  <TASK>
>   [   10.773948]  __create_object+0x51/0xa0
>   [   10.773951]  kmem_cache_alloc_from_sheaf_noprof+0x1e2/0x290
>   [   10.773956]  mas_dup_build+0x41d/0x7f0
>   [   10.773960]  __mt_dup+0x78/0xf0
>   [   10.773967]  dup_mmap+0x12e/0x990
>   [   10.773972]  ? srso_alias_return_thunk+0x5/0xfbef5
>   [   10.773973]  ? lock_acquire+0xbf/0x2e0
>   [   10.773979]  copy_process+0x1e45/0x2b80
>   [   10.773985]  kernel_clone+0xbb/0x3d0
>   [   10.773986]  ? srso_alias_return_thunk+0x5/0xfbef5
>   [   10.773987]  ? __lock_acquire+0x481/0x2590
>   [   10.773993]  __do_sys_clone+0x65/0x90
>   [   10.773998]  do_syscall_64+0xa1/0x5f0
>   [   10.774002]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   [   10.774003] RIP: 0033:0x7f9da26e8f2f
>   [   10.774005] Code: 89 e7 e8 c4 0e f6 ff 45 31 c0 31 d2 31 f6 64 48 8b 04 25 10 00 00 00 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 89 c5 85 c0 75 31 64 48 8b 04 25 10 00 00
>   [   10.774005] RSP: 002b:00007ffc278b4540 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
>   [   10.774007] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9da26e8f2f
>   [   10.774007] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
>   [   10.774008] RBP: 00007ffc278b4740 R08: 0000000000000000 R09: 0000000000000001
>   [   10.774008] R10: 00007f9da2acaf10 R11: 0000000000000246 R12: 0000000000000003
>   [   10.774009] R13: 0000000000000001 R14: 0000000000000000 R15: 00005625de8b38d0
>   [   10.774014]  </TASK>
>   [   10.866584] perf: interrupt took too long (6193 > 6190), lowering kernel.perf_event_max_sample_rate to 32000
>   [   16.630663] kauditd_printk_skb: 30 callbacks suppressed
>
> I can give the reproducer if wished.
>
> You can see the path involves bpf_obj_drop and not any map methods at
> all.
>
> But basically this makes the whole mechanism of relying on map allocators to
> recycle and free objects in irqs_disabled contexts unsound because kptrs
> can live and be released outside map contexts.

The approach in this version is still different than what I had proposed. For
the particular case you highlighted, we should just reject bpf_obj_drop() in
is_tracing_prog_type() once the object has kptr fields etc. If someone still
wants it they should just use bpf_wq to defer manually.

The whole prealloc related change to the map is unnecessary, as I said before.
It's an orthogonal change. We should also not use irqs_disabled() and just do
the skipping unconditionally.

Let me work on a patch, since the nuance is obviously getting lost in email.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
  2026-05-19 19:27         ` Kumar Kartikeya Dwivedi
@ 2026-05-19 23:19           ` Justin Suess
  0 siblings, 0 replies; 11+ messages in thread
From: Justin Suess @ 2026-05-19 23:19 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: sashiko-reviews, bpf

On Tue, May 19, 2026 at 09:27:58PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Tue May 19, 2026 at 5:59 PM CEST, Justin Suess wrote:
> > On Tue, May 19, 2026 at 10:42:44AM -0400, Justin Suess wrote:
> >> Looks like this kfunc also tries to eagerly free fields and is callable
> >> from tp_btf/nmi_handler with no recovery mechanism.
> >>
> >> I don't see a good way to handle this at all... it also is effectively
> >> already bugged in the existing code unless I'm missing something
> >>
> >> Since you can use it to call abritrary dtors in NMI.
> >>
> >> 1.
> >>
> >> In some context where task_acquire is valid:
> >>
> >> struct blah {
> >>   struct task_struct __kptr* task;
> >> }
> >>
> >> b = bpf_task_acquire(...)
> >> a = bpf_obj_new(typeof(*blah_ptr));
> >> bpf_kptr_xchg(a->task, b);
> >>
> >> 2.
> >>
> >> Then xchg a into map
> >> bpf_kptr_xchg([map slot], a);
> >> ...
> >>
> >> 3.
> >>
> >> In tp_btf/nmi_handler we exchange the obj out of the map
> >> and then drop it.
> >>
> >> c = bpf_kptr_xchg([slot where a is stored], NULL);
> >> bpf_obj_drop(c);
> >>
> >> And because this can be called on an object not in a map, we don't have
> >> a good recovery mechanism.
> >>
> >> Because tp_btf/nmi_handler is always in_nmi and hence irqs_disabled, in
> >> the code before this patch the task_struct dtor will be invoked and
> >> cause a deadlock.
> >>
> >> This patch will not fix the issue and just free a struct blah holding the
> >> reference without actually dropping it.
> >>
> >> Unless I'm missing something then this whole approach taken by this
> >> patch can't handle this case since the bpf_obj is detached from the map
> >> entirely and has no allocator recovery mechanism at all.
> >>
> >> Then we are back to square 1 needing to defer the freeing with irq work.
> >>
> >> Kumar / Alexei do you have anything on this?
> >
> > I've confirmed this is a viable mechanism to deadlock as well
> >
> > Using the above approach, just exchanging out of a map and dropping the
> > bpf_obj containing a task_struct kptr can be used to cause a deadlock
> > outside any map path:
> >
> > On 576482b55c19e7ec00e162a0fde4c4f1a95128c7 (bpf-next/master) w/ RCU_NOCB_CPU/RCU_EXPERT:
> >
> >   [   10.773819] ================================
> >   [   10.773820] WARNING: inconsistent lock state
> >   [   10.773821] 7.1.0-rc2-g576482b55c19 #150 Not tainted
> >   [   10.773822] --------------------------------
> >   [   10.773822] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> >   [   10.773823] test_progs/471 [HC1[1]:SC0[0]:HE0:SE1] takes:
> >   [   10.773825] ffff98623d66f0e8 (&rdp->nocb_lock){-.-.}-{2:2}, at: __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773832] {INITIAL USE} state was registered at:
> >   [   10.773832]   lock_acquire+0xbf/0x2e0
> >   [   10.773835]   _raw_spin_lock+0x30/0x40
> >   [   10.773838]   rcu_nocb_gp_kthread+0x13f/0xb90
> >   [   10.773840]   kthread+0xf4/0x130
> >   [   10.773845]   ret_from_fork+0x264/0x330
> >   [   10.773848]   ret_from_fork_asm+0x1a/0x30
> >   [   10.773851] irq event stamp: 1127908
> >   [   10.773851] hardirqs last  enabled at (1127907): [<ffffffff87104484>] _raw_spin_unlock_irqrestore+0x44/0x60
> >   [   10.773853] hardirqs last disabled at (1127908): [<ffffffff871041a1>] _raw_spin_lock_irqsave+0x51/0x60
> >   [   10.773854] softirqs last  enabled at (1127782): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
> >   [   10.773857] softirqs last disabled at (1127777): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
> >   [   10.773858]
> >                  other info that might help us debug this:
> >   [   10.773859]  Possible unsafe locking scenario:
> >
> >   [   10.773859]        CPU0
> >   [   10.773859]        ----
> >   [   10.773860]   lock(&rdp->nocb_lock);
> >   [   10.773861]   <Interrupt>
> >   [   10.773861]     lock(&rdp->nocb_lock);
> >   [   10.773862]
> >                   *** DEADLOCK ***
> >
> >   [   10.773862] 4 locks held by test_progs/471:
> >   [   10.773863]  #0: ffffffff885ccaf0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1e39/0x2b80
> >   [   10.773866]  #1: ffff9861c33a2ff8 (&mm->mmap_lock){++++}-{4:4}, at: dup_mmap+0x83/0x990
> >   [   10.773871]  #2: ffff9861c1152ff8 (&mm->mmap_lock/1){+.+.}-{4:4}, at: dup_mmap+0xc8/0x990
> >   [   10.773874]  #3: ffffffff885f43b8 (kmemleak_lock){-.-.}-{2:2}, at: __create_object+0x3a/0xa0
> >   [   10.773877]
> >                  stack backtrace:
> >   [   10.773879] CPU: 1 UID: 0 PID: 471 Comm: test_progs Not tainted 7.1.0-rc2-g576482b55c19 #150 PREEMPT(full)
> >   [   10.773881] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> >   [   10.773882] Call Trace:
> >   [   10.773883]  <NMI>
> >   [   10.773884]  dump_stack_lvl+0x5d/0x80
> >   [   10.773887]  print_usage_bug.part.0+0x22b/0x2c0
> >   [   10.773891]  lock_acquire+0x295/0x2e0
> >   [   10.773893]  ? __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773896]  _raw_spin_lock+0x30/0x40
> >   [   10.773898]  ? __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773899]  __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773903]  bpf_obj_free_fields+0x118/0x250
> >   [   10.773907]  bpf_obj_drop+0x4f/0x80
> >   [   10.773910]  bpf_prog_b7ff9c2e6c583824_clear_task_kptrs_from_nmi+0x174/0x1f4
> >   [   10.773912]  bpf_trace_run3+0x126/0x430
> >   [   10.773916]  ? __pfx_perf_event_nmi_handler+0x10/0x10
> >   [   10.773919]  nmi_handle.part.0+0x15b/0x250
> >   [   10.773922]  ? __pfx_perf_event_nmi_handler+0x10/0x10
> >   [   10.773924]  default_do_nmi+0x120/0x180
> >   [   10.773927]  exc_nmi+0xe3/0x110
> >   [   10.773928]  end_repeat_nmi+0xf/0x53
> >   [   10.773931] RIP: 0010:__link_object+0xb4/0x270
> >   [   10.773932] Code: 00 00 00 48 c7 c6 80 36 bb 89 48 c7 43 70 00 00 00 00 48 c7 43 78 00 00 00 00 48 89 3d 35 62 50 03 eb 66 48 89 c8 48 8b 48 30 <48> 8d 70 10 48 39 d1 73 11 48 03 48 38 48 39 cf 0f 82 2f 01 00 00
> >   [   10.773933] RSP: 0018:ffffb516013b7a60 EFLAGS: 00000082
> >   [   10.773935] RAX: ffff9861c37b5c80 RBX: ffff9861df869878 RCX: ffff9861c3776420
> >   [   10.773936] RDX: ffff9861c3769e00 RSI: ffff9861c43564f8 RDI: ffff9861c3769d00
> >   [   10.773936] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
> >   [   10.773937] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9861c3769d00
> >   [   10.773937] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000001
> >   [   10.773943]  ? __link_object+0xb4/0x270
> >   [   10.773945]  ? __link_object+0xb4/0x270
> >   [   10.773947]  </NMI>
> >   [   10.773947]  <TASK>
> >   [   10.773948]  __create_object+0x51/0xa0
> >   [   10.773951]  kmem_cache_alloc_from_sheaf_noprof+0x1e2/0x290
> >   [   10.773956]  mas_dup_build+0x41d/0x7f0
> >   [   10.773960]  __mt_dup+0x78/0xf0
> >   [   10.773967]  dup_mmap+0x12e/0x990
> >   [   10.773972]  ? srso_alias_return_thunk+0x5/0xfbef5
> >   [   10.773973]  ? lock_acquire+0xbf/0x2e0
> >   [   10.773979]  copy_process+0x1e45/0x2b80
> >   [   10.773985]  kernel_clone+0xbb/0x3d0
> >   [   10.773986]  ? srso_alias_return_thunk+0x5/0xfbef5
> >   [   10.773987]  ? __lock_acquire+0x481/0x2590
> >   [   10.773993]  __do_sys_clone+0x65/0x90
> >   [   10.773998]  do_syscall_64+0xa1/0x5f0
> >   [   10.774002]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >   [   10.774003] RIP: 0033:0x7f9da26e8f2f
> >   [   10.774005] Code: 89 e7 e8 c4 0e f6 ff 45 31 c0 31 d2 31 f6 64 48 8b 04 25 10 00 00 00 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 89 c5 85 c0 75 31 64 48 8b 04 25 10 00 00
> >   [   10.774005] RSP: 002b:00007ffc278b4540 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
> >   [   10.774007] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9da26e8f2f
> >   [   10.774007] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
> >   [   10.774008] RBP: 00007ffc278b4740 R08: 0000000000000000 R09: 0000000000000001
> >   [   10.774008] R10: 00007f9da2acaf10 R11: 0000000000000246 R12: 0000000000000003
> >   [   10.774009] R13: 0000000000000001 R14: 0000000000000000 R15: 00005625de8b38d0
> >   [   10.774014]  </TASK>
> >   [   10.866584] perf: interrupt took too long (6193 > 6190), lowering kernel.perf_event_max_sample_rate to 32000
> >   [   16.630663] kauditd_printk_skb: 30 callbacks suppressed
> >
> > I can give the reproducer if wished.
> >
> > You can see the path involves bpf_obj_drop and not any map methods at
> > all.
> >
> > But basically this makes the whole mechanism of relying on map allocators to
> > recycle and free objects in irqs_disabled contexts unsound because kptrs
> > can live and be released outside map contexts.
> 
> The approach in this version is still different than what I had proposed. For
> the particular case you highlighted, we should just reject bpf_obj_drop() in
> is_tracing_prog_type() once the object has kptr fields etc. If someone still
tp_btf is BPF_PROG_TYPE_TRACING.

And ironically enough, is_tracing_prog_type doesn't even include
BPF_PROG_TYPE_TRACING. Not sure if that was intentional?

(Also same would apply to bpf_percpu_obj_drop_impl)
> wants it they should just use bpf_wq to defer manually.
Agreed.

Nobody really should ever be doing this tomfoolery anyway of freeing
cgroups and task structs from NMI but that's beside the point.
> 
> The whole prealloc related change to the map is unnecessary, as I said before.
> It's an orthogonal change. We should also not use irqs_disabled() and just do
> the skipping unconditionally.
I had trouble getting test_percpu_hash_refcounted_kptr_refcount_leak to
pass without the prealloc changes, but it may have been an unrelated issue.
> Let me work on a patch, since the nuance is obviously getting lost in email.
Please have a go!

Thanks,
Justin

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-05-19 23:19 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-19  1:14 [PATCH bpf-next 0/1] bpf: Fix deadlock in freeing of special fields in NMI Justin Suess
2026-05-19  1:14 ` [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction " Justin Suess
2026-05-19  1:51   ` sashiko-bot
2026-05-19 14:42     ` Justin Suess
2026-05-19 15:59       ` Justin Suess
2026-05-19 19:27         ` Kumar Kartikeya Dwivedi
2026-05-19 23:19           ` Justin Suess
2026-05-19 10:25   ` kernel test robot
2026-05-19 12:31   ` Mykyta Yatsenko
2026-05-19 13:22     ` Justin Suess
2026-05-19 13:55       ` Mykyta Yatsenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox