* [bpf-next v2 0/2] bpf: Fix deadlock in kptr dtor in nmi
@ 2026-05-05 15:08 Justin Suess
2026-05-05 15:08 ` [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-05 15:08 ` [bpf-next v2 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
0 siblings, 2 replies; 12+ messages in thread
From: Justin Suess @ 2026-05-05 15:08 UTC (permalink / raw)
To: ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, mic, Justin Suess
Hello,
While following up on a Sashiko report [1], I found that referenced kptr
destructors can run from NMI context. One way to trigger this is from a
tracing program attached to tp_btf/nmi_handler while a map element is
being torn down.
That is problematic because referenced kptr destructor paths are not
universally NMI-safe. In particular, they may rely on operations such as
call_rcu(), which can deadlock when reached from NMI context.
This is v2 of the series. The approach from v1 was effectively discarded:
instead of teaching the broader teardown path about deferred destruction,
this revision fixes the problem at the referenced-kptr exchange and
teardown points directly.
This revision is also different from the problematic solution I proposed
in [2]. The freelist hijacking approach was too brittle and touched
allocator internals in an unsafe way.
This revision uses per-CPU irq_work and llist queues for pending destructor
jobs, backed by a global idle pool of preallocated job slots.
The core change is to offload destructor execution that originates from
NMI context. It preallocates offload jobs from the global BPF memory
allocator, tracks the number of live destructor-backed references so the
pool stays ahead of NMI frees, and runs the actual destructor from
irq_work after NMI exits. Kptr exchanges without a destructor keep the fast
path, while destructor-backed exchanges pay the extra bookkeeping needed to
guarantee that teardown can be deferred safely.
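For quick orientation, the core of the deferral in patch 1 boils down to
the shape below (simplified from the actual patch; the pool accounting,
error handling and fallback paths are omitted):

  /* NMI side: take a preallocated job and hand it to irq_work. */
  static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
  {
          struct bpf_dtor_kptr_work *job = bpf_dtor_kptr_pop_idle();

          job->obj = obj;
          job->dtor = dtor;
          if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
                  irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
  }

  /* irq_work side: run the destructors once the CPU is out of NMI. */
  static void bpf_dtor_kptr_worker(struct irq_work *work)
  {
          struct llist_node *jobs, *node, *next;

          jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
          llist_for_each_safe(node, next, jobs) {
                  struct bpf_dtor_kptr_work *job;

                  job = llist_entry(node, struct bpf_dtor_kptr_work, node);
                  job->dtor(job->obj);
                  bpf_dtor_kptr_push_idle(job);
          }
  }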
The second patch adds a dedicated selftest based on the reproducer from
the report [3]. It uses the bpf_cpumask kptr instead to simplify the
test harness. It exercises both hash and array map cases by creating
cpumask kptrs and dropping them from an NMI handler, then verifies that
objects queued from NMI do get their destructors run.
1. bpf: Offload kptr destructors that run from NMI
2. selftests/bpf: Add kptr destructor NMI exerciser
Kind regards,
Justin Suess
[1] https://lore.kernel.org/bpf/20260421010536.17FB1C19425@smtp.kernel.org/
[2] https://lore.kernel.org/bpf/afYLJAT9brXkWxz2@zenbox/
[3] https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
Justin Suess (2):
bpf: Offload kptr destructors that run from NMI
selftests/bpf: Add kptr destructor NMI exerciser
include/linux/bpf.h | 16 +
include/linux/bpf_verifier.h | 1 +
kernel/bpf/fixups.c | 33 +-
kernel/bpf/helpers.c | 24 +-
kernel/bpf/syscall.c | 181 ++++++++
kernel/bpf/verifier.c | 2 +
.../selftests/bpf/prog_tests/kptr_dtor_nmi.c | 259 +++++++++++
.../selftests/bpf/progs/kptr_dtor_nmi.c | 414 ++++++++++++++++++
8 files changed, 915 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
create mode 100644 tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
base-commit: 2ca6723a5f7b68c739dba47b2639e3eaa7884b09
--
2.53.0
* [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-05 15:08 [bpf-next v2 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
@ 2026-05-05 15:08 ` Justin Suess
2026-05-05 16:06 ` bot+bpf-ci
` (2 more replies)
2026-05-05 15:08 ` [bpf-next v2 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
1 sibling, 3 replies; 12+ messages in thread
From: Justin Suess @ 2026-05-05 15:08 UTC (permalink / raw)
To: ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, mic, Justin Suess,
Alexei Starovoitov
A BPF program attached to tp_btf/nmi_handler can delete map entries or
swap out referenced kptrs from NMI context. Today that runs the kptr
destructor inline. Destructors such as bpf_cpumask_release() can take
RCU-related locks, so running them from NMI can deadlock the system.
Preallocate offload jobs from the global BPF memory allocator, track the
number of live destructor-backed references so the pool stays ahead of
NMI frees, and let an irq_work worker invoke the destructor after NMI
exits.
The preallocation algorithm is simple. The invariant is total >=
refs + active, where refs is the number of destructor-backed kptrs
installed in maps, active is the number of jobs queued for or being
executed by the irq_work worker, and total is the number of job
structures allocated. To avoid excessive preallocation calls while
maintaining the invariant, we reserve the needed slots plus a small
amount of headroom, min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.
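As a worked example (the numbers are hypothetical): with active = 3 and
refs incremented to 100 by a new install, needed = 103, headroom =
min(103, 64) = 64, and the reserve loop grows the pool until total >= 167.
A free followed by a re-install can then usually be absorbed by the
existing pool without another allocation.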
A small but harmless ordering subtlety: the active atomic is read before
refs. This can result in a small amount of over-allocation, but the
excess is not leaked and is later reclaimed by the trim stage.
The trim stage is simple. It uses a CAS loop to free excess leftover
idle job slots: it snapshots total, refs, and active, pops an idle job if
the pool is larger than needed, and attempts a cmpxchg to decrement total
atomically. On failure it pushes the job back onto the idle list and
retries.
Several best-effort mitigations preserve integrity under memory
pressure, the unlikely scenario in which the pool can still run short.
If reserving another offload slot fails while installing a new
destructor-backed kptr through bpf_kptr_xchg(), leave the destination
unchanged and return the incoming pointer so the caller keeps ownership.
This is better than leaking the pointer, and should only happen when the
allocator cannot grow the reserve pool. Moreover, this is a condition the
caller can check for and recover from.
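To illustrate that last point, here is a BPF-side sketch of how a program
could detect the refusal. It is modelled on the cpumask kptr used in
patch 2 and is not taken from the patch; 'val' is assumed to be a
looked-up map value with a 'struct bpf_cpumask __kptr *mask' field, and
the pointer comparison assumes a program type privileged enough for the
verifier to accept it:

  struct bpf_cpumask *new, *old;

  new = bpf_cpumask_create();
  if (!new)
          return 0;
  old = bpf_kptr_xchg(&val->mask, new);
  if (old == new) {
          /* Exchange refused: the map slot is unchanged and we got our
           * own pointer back. Release it and retry later.
           */
          bpf_cpumask_release(old);
          return 0;
  }
  if (old)
          bpf_cpumask_release(old);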
If NMI teardown still fails to grab an idle offload job despite that
reserve accounting, warn once and run the destructor inline rather than
leak the object permanently. Attempt to repair the counter safely with
another CAS loop in that case, preserving concurrent increments.
This fix does come with a small performance tradeoff for safety: the
xchg can no longer be inlined for destructor-backed kptrs, as inlining
would bypass the reference accounting. The inlining optimization is
preserved for kptrs with no destructor defined.
This keeps refcounted kptr teardown out of NMI context without slowing
down raw kptr exchanges that never need destructor handling.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Justin Suess <utilityemal77@gmail.com>
Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
---
include/linux/bpf.h | 16 ++++
include/linux/bpf_verifier.h | 1 +
kernel/bpf/fixups.c | 33 ++++---
kernel/bpf/helpers.c | 24 ++++-
kernel/bpf/syscall.c | 181 +++++++++++++++++++++++++++++++++++
kernel/bpf/verifier.c | 2 +
6 files changed, 242 insertions(+), 15 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 715b6df9c403..307de5caa646 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
void __bpf_free_used_maps(struct bpf_prog_aux *aux,
struct bpf_map **used_maps, u32 len);
+/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
+u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+
+#ifdef CONFIG_BPF_SYSCALL
+int bpf_kptr_offload_inc(void);
+void bpf_kptr_offload_dec(void);
+#else
+static inline int bpf_kptr_offload_inc(void)
+{
+ return 0;
+}
+
+static inline void bpf_kptr_offload_dec(void)
+{
+}
+#endif
bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 976e2b2f40e8..8e39ff92dd2c 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -672,6 +672,7 @@ struct bpf_insn_aux_data {
bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
+ bool kptr_has_dtor;
u8 alu_state; /* used in combination with alu_limit */
/* true if STX or LDX instruction is a part of a spill/fill
* pattern for a bpf_fastcall call.
diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
index fba9e8c00878..459e855e86a5 100644
--- a/kernel/bpf/fixups.c
+++ b/kernel/bpf/fixups.c
@@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
goto next_insn;
}
- /* Implement bpf_kptr_xchg inline */
- if (prog->jit_requested && BITS_PER_LONG == 64 &&
- insn->imm == BPF_FUNC_kptr_xchg &&
- bpf_jit_supports_ptr_xchg()) {
- insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
- insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
- cnt = 2;
+ /* Implement bpf_kptr_xchg inline. */
+ if (insn->imm == BPF_FUNC_kptr_xchg &&
+ !env->insn_aux_data[i + delta].kptr_has_dtor) {
+ if (prog->jit_requested && BITS_PER_LONG == 64 &&
+ bpf_jit_supports_ptr_xchg()) {
+ insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
+ insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
+ BPF_REG_1, BPF_REG_0, 0);
+ cnt = 2;
- new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
- if (!new_prog)
- return -ENOMEM;
+ new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+ if (!new_prog)
+ return -ENOMEM;
- delta += cnt - 1;
- env->prog = prog = new_prog;
- insn = new_prog->insnsi + i + delta;
+ delta += cnt - 1;
+ env->prog = prog = new_prog;
+ insn = new_prog->insnsi + i + delta;
+ goto next_insn;
+ }
+
+ insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
goto next_insn;
}
+
patch_call_imm:
fn = env->ops->get_func_proto(insn->imm, env->prog);
/* all functions that have prototype and verifier allowed
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index baa12b24bb64..cdc64ab83ef6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
bpf_async_cancel_and_free(val);
}
-BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
+BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
{
unsigned long *kptr = dst;
@@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
return xchg(kptr, (unsigned long)ptr);
}
+BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
+{
+ unsigned long *kptr = dst;
+ void *old;
+
+ /*
+ * If the incoming pointer cannot be torn down safely from NMI later on,
+ * leave the destination untouched and return ptr so the caller keeps
+ * ownership.
+ */
+ if (ptr && bpf_kptr_offload_inc())
+ return (unsigned long)ptr;
+
+ old = (void *)xchg(kptr, (unsigned long)ptr);
+ if (old)
+ bpf_kptr_offload_dec();
+ return (unsigned long)old;
+}
+
/* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
* helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
* denote type that verifier will determine.
+ * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
*/
static const struct bpf_func_proto bpf_kptr_xchg_proto = {
- .func = bpf_kptr_xchg,
+ .func = bpf_ref_kptr_xchg,
.gpl_only = false,
.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
.ret_btf_id = BPF_PTR_POISON,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3b1f0ba02f61..162bfd4796ea 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -7,6 +7,7 @@
#include <linux/bpf_trace.h>
#include <linux/bpf_lirc.h>
#include <linux/bpf_verifier.h>
+#include <linux/bpf_mem_alloc.h>
#include <linux/bsearch.h>
#include <linux/btf.h>
#include <linux/hex.h>
@@ -19,6 +20,8 @@
#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/irq_work.h>
+#include <linux/llist.h>
#include <linux/license.h>
#include <linux/filter.h>
#include <linux/kernel.h>
@@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
static DEFINE_IDR(link_idr);
static DEFINE_SPINLOCK(link_idr_lock);
+struct bpf_dtor_kptr_work {
+ struct llist_node node;
+ void *obj;
+ btf_dtor_kfunc_t dtor;
+};
+
+/* Queue pending dtors per CPU; the idle pool stays global. */
+static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
+static LLIST_HEAD(bpf_dtor_kptr_idle);
+/* Keep total >= refs + active so NMI frees never need to allocate. */
+static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
+static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
+static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
+
+/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
+#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
+
+static void bpf_dtor_kptr_worker(struct irq_work *work);
+static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
+ IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
+
+static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
+{
+ llist_add(&job->node, &bpf_dtor_kptr_idle);
+}
+
+static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
+{
+ struct llist_node *node;
+
+ node = llist_del_first(&bpf_dtor_kptr_idle);
+ if (!node)
+ return NULL;
+
+ return llist_entry(node, struct bpf_dtor_kptr_work, node);
+}
+
+static void bpf_dtor_kptr_trim(void)
+{
+ struct bpf_dtor_kptr_work *job;
+ long total;
+ long needed;
+
+ for (;;) {
+ total = atomic_long_read(&bpf_dtor_kptr_total);
+ needed = atomic_long_read(&bpf_dtor_kptr_refs) +
+ atomic_long_read(&bpf_dtor_kptr_active);
+ if (total <= needed)
+ return;
+
+ job = bpf_dtor_kptr_pop_idle();
+ if (!job)
+ return;
+
+ if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
+ bpf_dtor_kptr_push_idle(job);
+ continue;
+ }
+
+ bpf_mem_free(&bpf_global_ma, job);
+ }
+}
+
+static int bpf_dtor_kptr_reserve(long needed)
+{
+ struct bpf_dtor_kptr_work *job;
+ long headroom;
+ long target;
+
+ headroom = min_t(long, needed, BPF_DTOR_KPTR_RESERVE_HEADROOM);
+ if (check_add_overflow(needed, headroom, &target))
+ target = needed;
+
+ while (atomic_long_read(&bpf_dtor_kptr_total) < target) {
+ job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
+ if (!job)
+ return -ENOMEM;
+ atomic_long_inc(&bpf_dtor_kptr_total);
+ bpf_dtor_kptr_push_idle(job);
+ }
+
+ return 0;
+}
+
+int bpf_kptr_offload_inc(void)
+{
+ long needed;
+ int err;
+
+ if (unlikely(!bpf_global_ma_set))
+ return -ENOMEM;
+
+ /*
+ * Read active before incrementing refs so a free path moving one slot from
+ * refs to active cannot shrink the reservation snapshot below the steady
+ * state we need to cover. Racing results worst case in a larger reservation.
+ */
+ needed = atomic_long_read(&bpf_dtor_kptr_active);
+ needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
+ err = bpf_dtor_kptr_reserve(needed);
+ if (err)
+ atomic_long_dec(&bpf_dtor_kptr_refs);
+
+ return err;
+}
+
+void bpf_kptr_offload_dec(void)
+{
+ long val;
+
+ val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
+ if (!WARN_ON_ONCE(val < 0))
+ return;
+
+ /*
+ * Clamp a mismatched decrement back to zero without overwriting a
+ * concurrent increment that already repaired the counter.
+ */
+ do {
+ val = atomic_long_read(&bpf_dtor_kptr_refs);
+ if (val >= 0)
+ break;
+ } while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
+}
+
int sysctl_unprivileged_bpf_disabled __read_mostly =
IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
@@ -807,6 +935,46 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
bpf_task_work_cancel_and_free(obj + rec->task_work_off);
}
+static void bpf_dtor_kptr_worker(struct irq_work *work)
+{
+ struct llist_node *jobs, *node, *next;
+
+ jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
+ llist_for_each_safe(node, next, jobs) {
+ struct bpf_dtor_kptr_work *job;
+
+ job = llist_entry(node, struct bpf_dtor_kptr_work, node);
+ job->dtor(job->obj);
+ atomic_long_dec(&bpf_dtor_kptr_active);
+ bpf_dtor_kptr_push_idle(job);
+ }
+
+ bpf_dtor_kptr_trim();
+}
+
+static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
+{
+ struct bpf_dtor_kptr_work *job;
+
+ atomic_long_inc(&bpf_dtor_kptr_active);
+ job = bpf_dtor_kptr_pop_idle();
+ if (WARN_ON_ONCE(!job)) {
+ atomic_long_dec(&bpf_dtor_kptr_active);
+ /*
+ * This should stay unreachable if reserve accounting is correct. If it
+ * ever breaks, running the destructor unsafely is still better than
+ * leaking the object permanently.
+ */
+ dtor(obj);
+ return;
+ }
+
+ job->obj = obj;
+ job->dtor = dtor;
+ if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
+ irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
+}
+
void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
{
const struct btf_field *fields;
@@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
if (!xchgd_field)
break;
+ if (in_nmi() && field->kptr.dtor) {
+ bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
+ bpf_kptr_offload_dec();
+ break;
+ }
+ if (field->kptr.dtor)
+ /*
+ * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
+ * pairs installation with bpf_kptr_offload_inc(). Drop that
+ * reservation on non-NMI teardown once no active transition is
+ * needed.
+ */
+ bpf_kptr_offload_dec();
if (!btf_is_kernel(field->kptr.btf)) {
pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 11054ad89c14..2c7b21bda666 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
if (err)
return err;
}
+ env->insn_aux_data[insn_idx].kptr_has_dtor =
+ func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
err = record_func_map(env, &meta, func_id, insn_idx);
if (err)
--
2.53.0
* [bpf-next v2 2/2] selftests/bpf: Add kptr destructor NMI exerciser
2026-05-05 15:08 [bpf-next v2 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-05 15:08 ` [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
@ 2026-05-05 15:08 ` Justin Suess
2026-05-05 20:15 ` sashiko-bot
1 sibling, 1 reply; 12+ messages in thread
From: Justin Suess @ 2026-05-05 15:08 UTC (permalink / raw)
To: ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, mic, Justin Suess,
Alexei Starovoitov
Programs attached to tp_btf/nmi_handler can drop refcounted kptrs from
NMI context by deleting map entries or clearing map values. Add a
dedicated BPF-side selftest program that populates hash and array maps
with cpumask kptrs and clears them again from the NMI handler.
The test fails on current upstream with a lockdep warning, but passes
once NMI dtors are offloaded by the previous commit.
The test asserts that every object queued for destruction from NMI has
its dtor called. The irq_work, which carries the IRQ_WORK_HARD_IRQ flag
and therefore runs in hardirq context, is drained via kern_sync_rcu() so
the counters are consistent before they are checked.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
---
.../selftests/bpf/prog_tests/kptr_dtor_nmi.c | 259 +++++++++++
.../selftests/bpf/progs/kptr_dtor_nmi.c | 414 ++++++++++++++++++
2 files changed, 673 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
create mode 100644 tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
diff --git a/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
new file mode 100644
index 000000000000..a1aa8b6646b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <linux/perf_event.h>
+#include <sched.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <test_progs.h>
+
+#include "kptr_dtor_nmi.skel.h"
+
+#define KPTR_DTOR_NMI_MAX_SLOTS 8
+#define KPTR_DTOR_NMI_ROUNDS 256
+#define DELETE_TIMEOUT_NS (5ULL * 1000 * 1000 * 1000)
+
+enum kptr_dtor_nmi_map_type {
+ KPTR_DTOR_NMI_MAP_HASH = 1,
+ KPTR_DTOR_NMI_MAP_ARRAY,
+};
+
+struct kptr_dtor_nmi_case {
+ const char *name;
+ __u32 map_type;
+};
+
+__maybe_unused
+static int find_test_cpu(void)
+{
+ cpu_set_t cpuset;
+ int cpu, err;
+
+ err = sched_getaffinity(0, sizeof(cpuset), &cpuset);
+ if (!ASSERT_OK(err, "sched_getaffinity"))
+ return -1;
+
+ for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+ if (CPU_ISSET(cpu, &cpuset))
+ return cpu;
+ }
+
+ ASSERT_TRUE(false, "cpu_available");
+ return -1;
+}
+
+__maybe_unused
+static int open_nmi_pmu_event_on_cpu(int cpu)
+{
+ struct perf_event_attr attr = {
+ .size = sizeof(attr),
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_CPU_CYCLES,
+ .freq = 1,
+ .sample_freq = 1000,
+ };
+ int pmu_fd;
+
+ pmu_fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1,
+ PERF_FLAG_FD_CLOEXEC);
+ if (pmu_fd == -1) {
+ if (errno == ENOENT || errno == EOPNOTSUPP) {
+ printf("SKIP:no PERF_COUNT_HW_CPU_CYCLES\n");
+ test__skip();
+ }
+ return -1;
+ }
+
+ return pmu_fd;
+}
+
+__maybe_unused
+static bool pin_to_cpu(int cpu, cpu_set_t *old_cpuset)
+{
+ cpu_set_t cpuset;
+ int err;
+
+ err = sched_getaffinity(0, sizeof(*old_cpuset), old_cpuset);
+ if (!ASSERT_OK(err, "sched_getaffinity"))
+ return false;
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpu, &cpuset);
+ err = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+ if (!ASSERT_OK(err, "sched_setaffinity"))
+ return false;
+
+ return true;
+}
+
+__maybe_unused
+static void restore_affinity(const cpu_set_t *old_cpuset)
+{
+ ASSERT_OK(sched_setaffinity(0, sizeof(*old_cpuset), old_cpuset),
+ "restore_affinity");
+}
+
+__maybe_unused
+static bool run_syscall_prog(struct bpf_program *prog, const char *name)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts);
+ int err;
+
+ err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
+ if (!ASSERT_OK(err, name))
+ return false;
+ if (!ASSERT_EQ(opts.retval, 0, name))
+ return false;
+
+ return true;
+}
+
+__maybe_unused
+static bool wait_for_nmi_drain(struct kptr_dtor_nmi *skel,
+ __u32 expected_deleted,
+ __u32 expected_release_calls)
+{
+ u64 now_ns, timeout_time_ns;
+
+ now_ns = get_time_ns();
+ timeout_time_ns = now_ns + DELETE_TIMEOUT_NS;
+ while (skel->bss->kptr_dtor_nmi_deleted < expected_deleted ||
+ skel->bss->kptr_dtor_nmi_release_calls < expected_release_calls) {
+ if (skel->bss->kptr_dtor_nmi_setup_err ||
+ skel->bss->kptr_dtor_nmi_nmi_err ||
+ skel->bss->kptr_dtor_nmi_cleanup_err)
+ break;
+ now_ns = get_time_ns();
+ if (now_ns >= timeout_time_ns)
+ break;
+ sched_yield();
+ }
+
+ if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_setup_err, 0,
+ "kptr_dtor_nmi_setup_err"))
+ return false;
+ if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_nmi_err, 0,
+ "kptr_dtor_nmi_nmi_err"))
+ return false;
+ if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_cleanup_err, 0,
+ "kptr_dtor_nmi_cleanup_err"))
+ return false;
+ if (!ASSERT_GE(skel->bss->kptr_dtor_nmi_deleted, expected_deleted,
+ "kptr_dtor_nmi_deleted"))
+ return false;
+ if (!ASSERT_GE(skel->bss->kptr_dtor_nmi_release_calls,
+ expected_release_calls,
+ "kptr_dtor_nmi_release_calls"))
+ return false;
+ if (!ASSERT_LT(now_ns, timeout_time_ns, "kptr_dtor_nmi_timeout"))
+ return false;
+
+ return true;
+}
+
+__maybe_unused
+static void run_kptr_dtor_nmi_case(const struct kptr_dtor_nmi_case *test)
+{
+ struct kptr_dtor_nmi *skel;
+ cpu_set_t old_cpuset;
+ bool pinned = false;
+ int cpu = -1;
+ int pmu_fd = -1;
+ int err, round;
+
+ cpu = find_test_cpu();
+ if (cpu < 0)
+ return;
+
+ skel = kptr_dtor_nmi__open();
+ if (!ASSERT_OK_PTR(skel, "kptr_dtor_nmi__open"))
+ return;
+
+ skel->bss->kptr_dtor_nmi_map_type = test->map_type;
+ bpf_program__set_autoattach(skel->progs.clear_kptrs_from_nmi, false);
+
+ err = kptr_dtor_nmi__load(skel);
+ if (!ASSERT_OK(err, "kptr_dtor_nmi__load"))
+ goto cleanup;
+
+ err = kptr_dtor_nmi__attach(skel);
+ if (!ASSERT_OK(err, "kptr_dtor_nmi__attach"))
+ goto cleanup;
+
+ skel->links.clear_kptrs_from_nmi =
+ bpf_program__attach_trace(skel->progs.clear_kptrs_from_nmi);
+ if (!ASSERT_OK_PTR(skel->links.clear_kptrs_from_nmi,
+ "attach_tp_btf_nmi_handler"))
+ goto cleanup;
+
+ pinned = pin_to_cpu(cpu, &old_cpuset);
+ if (!pinned)
+ goto cleanup;
+
+ pmu_fd = open_nmi_pmu_event_on_cpu(cpu);
+ if (pmu_fd < 0)
+ goto cleanup;
+
+ for (round = 0; round < KPTR_DTOR_NMI_ROUNDS; round++) {
+ __u32 expected_total;
+
+ if (!run_syscall_prog(skel->progs.populate_kptrs, "populate_kptrs"))
+ goto cleanup;
+
+ expected_total = (round + 1) * KPTR_DTOR_NMI_MAX_SLOTS;
+ if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_setup_created,
+ expected_total,
+ "kptr_dtor_nmi_setup_created"))
+ goto cleanup;
+
+ if (!wait_for_nmi_drain(skel, expected_total, expected_total))
+ goto cleanup;
+
+ kern_sync_rcu();
+ }
+
+ if (!run_syscall_prog(skel->progs.cleanup_kptrs, "cleanup_kptrs"))
+ goto cleanup;
+ /*
+ * The grace period for rcu cannot complete until the CPU that ran the
+ * hard irq_work has passed through a quiescent state after running
+ * our dtor work. This effectively flushes our pending work and allows
+ * the test to verify the dtor was called the expected number of times.
+ */
+ kern_sync_rcu();
+ ASSERT_EQ(skel->bss->kptr_dtor_nmi_cleanup_deleted, 0,
+ "kptr_dtor_nmi_cleanup_deleted");
+
+cleanup:
+ if (pmu_fd >= 0)
+ close(pmu_fd);
+ if (pinned)
+ restore_affinity(&old_cpuset);
+ kptr_dtor_nmi__destroy(skel);
+}
+
+void serial_test_kptr_dtor_nmi(void)
+{
+/*
+ * nmi_handler isn't supported for these architectures.
+ */
+#if defined(__aarch64__) || defined(__s390x__)
+ test__skip();
+ return;
+#else
+ static const struct kptr_dtor_nmi_case tests[] = {
+ { "hash", KPTR_DTOR_NMI_MAP_HASH },
+ { "array", KPTR_DTOR_NMI_MAP_ARRAY },
+ };
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tests); i++) {
+ if (!test__start_subtest(tests[i].name))
+ continue;
+ run_kptr_dtor_nmi_case(&tests[i]);
+ }
+#endif
+}
diff --git a/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
new file mode 100644
index 000000000000..1ab5951a7a22
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
@@ -0,0 +1,414 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <linux/errno.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define KPTR_DTOR_NMI_MAX_SLOTS 8
+
+enum kptr_dtor_nmi_map_type {
+ KPTR_DTOR_NMI_MAP_HASH = 1,
+ KPTR_DTOR_NMI_MAP_ARRAY,
+};
+
+enum kptr_dtor_nmi_err {
+ KPTR_DTOR_NMI_SETUP_CREATE_ERR = 1,
+ KPTR_DTOR_NMI_SETUP_LOOKUP_ERR,
+ KPTR_DTOR_NMI_SETUP_STALE_ERR,
+ KPTR_DTOR_NMI_SETUP_MAP_ERR,
+ KPTR_DTOR_NMI_DELETE_ERR,
+ KPTR_DTOR_NMI_CLEANUP_ERR,
+};
+
+struct kptr_dtor_nmi_value {
+ struct bpf_cpumask __kptr * mask;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, __u32);
+ __type(value, struct kptr_dtor_nmi_value);
+ __uint(max_entries, KPTR_DTOR_NMI_MAX_SLOTS);
+} kptr_dtor_nmi_hash SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, struct kptr_dtor_nmi_value);
+ __uint(max_entries, KPTR_DTOR_NMI_MAX_SLOTS);
+} kptr_dtor_nmi_array SEC(".maps");
+
+struct bpf_cpumask *bpf_cpumask_create(void) __ksym __weak;
+void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym __weak;
+
+__u32 kptr_dtor_nmi_live_mask;
+__u32 kptr_dtor_nmi_map_type;
+__u32 kptr_dtor_nmi_setup_created;
+__u32 kptr_dtor_nmi_deleted;
+__u32 kptr_dtor_nmi_cleanup_deleted;
+__u32 kptr_dtor_nmi_release_calls;
+__u32 kptr_dtor_nmi_setup_err;
+__u32 kptr_dtor_nmi_nmi_err;
+__u32 kptr_dtor_nmi_cleanup_err;
+int kptr_dtor_nmi_setup_errno;
+int kptr_dtor_nmi_nmi_errno;
+int kptr_dtor_nmi_cleanup_errno;
+
+static void set_err(__u32 *err_dst, int *errno_dst, __u32 err, int err_no)
+{
+ if (!*err_dst) {
+ *err_dst = err;
+ *errno_dst = err_no;
+ }
+}
+
+static bool slot_is_live(__u32 slot)
+{
+ return kptr_dtor_nmi_live_mask & (1U << slot);
+}
+
+static void mark_slot_live(__u32 slot)
+{
+ kptr_dtor_nmi_live_mask |= 1U << slot;
+}
+
+static void clear_slot_live(__u32 slot)
+{
+ kptr_dtor_nmi_live_mask &= ~(1U << slot);
+}
+
+static struct kptr_dtor_nmi_value *lookup_hash_value(__u32 slot)
+{
+ return bpf_map_lookup_elem(&kptr_dtor_nmi_hash, &slot);
+}
+
+static struct kptr_dtor_nmi_value *lookup_array_value(__u32 slot)
+{
+ return bpf_map_lookup_elem(&kptr_dtor_nmi_array, &slot);
+}
+
+static int stash_mask(struct kptr_dtor_nmi_value *value, __u32 slot)
+{
+ struct bpf_cpumask *mask, *old;
+
+ mask = bpf_cpumask_create();
+ if (!mask)
+ return -ENOMEM;
+
+ old = bpf_kptr_xchg(&value->mask, mask);
+ if (old) {
+ bpf_cpumask_release(old);
+ return -EEXIST;
+ }
+
+ mark_slot_live(slot);
+ kptr_dtor_nmi_setup_created++;
+ return 0;
+}
+
+static bool populate_hash_slot(__u32 slot)
+{
+ struct kptr_dtor_nmi_value init = {};
+ struct kptr_dtor_nmi_value *value;
+ int err;
+
+ err = bpf_map_update_elem(&kptr_dtor_nmi_hash, &slot, &init, BPF_NOEXIST);
+ if (err) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_CREATE_ERR, err);
+ return false;
+ }
+
+ value = lookup_hash_value(slot);
+ if (!value) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_LOOKUP_ERR, -ENOENT);
+ return false;
+ }
+
+ err = stash_mask(value, slot);
+ if (err) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_STALE_ERR, err);
+ return false;
+ }
+
+ return true;
+}
+
+static bool populate_array_slot(__u32 slot)
+{
+ struct kptr_dtor_nmi_value *value;
+ int err;
+
+ value = lookup_array_value(slot);
+ if (!value) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_LOOKUP_ERR, -ENOENT);
+ return false;
+ }
+
+ err = stash_mask(value, slot);
+ if (err) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_STALE_ERR, err);
+ return false;
+ }
+
+ return true;
+}
+
+static bool clear_hash_slot_from_nmi(__u32 slot)
+{
+ struct kptr_dtor_nmi_value *value;
+ int err;
+
+ if (!slot_is_live(slot))
+ return true;
+
+ err = bpf_map_delete_elem(&kptr_dtor_nmi_hash, &slot);
+ if (!err) {
+ clear_slot_live(slot);
+ kptr_dtor_nmi_deleted++;
+ return true;
+ }
+
+ /*
+ * Hash deletes take rqspinlock-backed bucket locks. NMI reentry can lose
+ * those acquisitions with -EDEADLK or -ETIMEDOUT even though the slot is
+ * still valid, so leave it live and retry on a later NMI.
+ */
+ if (err == -EDEADLK || err == -ETIMEDOUT)
+ return true;
+
+ value = lookup_hash_value(slot);
+ if (value)
+ set_err(&kptr_dtor_nmi_nmi_err,
+ &kptr_dtor_nmi_nmi_errno,
+ KPTR_DTOR_NMI_DELETE_ERR, err);
+
+ return false;
+}
+
+static bool clear_array_slot_from_nmi(__u32 slot)
+{
+ struct kptr_dtor_nmi_value init = {};
+ int err;
+
+ if (!slot_is_live(slot))
+ return true;
+
+ err = bpf_map_update_elem(&kptr_dtor_nmi_array, &slot, &init, BPF_EXIST);
+ if (err) {
+ set_err(&kptr_dtor_nmi_nmi_err,
+ &kptr_dtor_nmi_nmi_errno,
+ KPTR_DTOR_NMI_DELETE_ERR, err);
+ return false;
+ }
+
+ clear_slot_live(slot);
+ kptr_dtor_nmi_deleted++;
+ return true;
+}
+
+static bool cleanup_hash_slot(__u32 slot)
+{
+ struct kptr_dtor_nmi_value *value;
+ struct bpf_cpumask *old = NULL;
+
+ value = lookup_hash_value(slot);
+ if (!value) {
+ clear_slot_live(slot);
+ return true;
+ }
+
+ old = bpf_kptr_xchg(&value->mask, old);
+ if (old) {
+ bpf_cpumask_release(old);
+ kptr_dtor_nmi_cleanup_deleted++;
+ }
+
+ if (bpf_map_delete_elem(&kptr_dtor_nmi_hash, &slot) &&
+ lookup_hash_value(slot)) {
+ set_err(&kptr_dtor_nmi_cleanup_err,
+ &kptr_dtor_nmi_cleanup_errno,
+ KPTR_DTOR_NMI_CLEANUP_ERR, -EIO);
+ return false;
+ }
+
+ clear_slot_live(slot);
+ return true;
+}
+
+static bool cleanup_array_slot(__u32 slot)
+{
+ struct kptr_dtor_nmi_value *value;
+ struct bpf_cpumask *old = NULL;
+
+ value = lookup_array_value(slot);
+ if (!value) {
+ set_err(&kptr_dtor_nmi_cleanup_err,
+ &kptr_dtor_nmi_cleanup_errno,
+ KPTR_DTOR_NMI_CLEANUP_ERR, -ENOENT);
+ return false;
+ }
+
+ old = bpf_kptr_xchg(&value->mask, old);
+ if (old) {
+ bpf_cpumask_release(old);
+ kptr_dtor_nmi_cleanup_deleted++;
+ }
+
+ clear_slot_live(slot);
+ return true;
+}
+
+static void populate_hash_masks(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!populate_hash_slot(slot))
+ return;
+ }
+}
+
+static void populate_array_masks(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!populate_array_slot(slot))
+ return;
+ }
+}
+
+static void clear_hash_masks_from_nmi(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!clear_hash_slot_from_nmi(slot))
+ return;
+ }
+}
+
+static void clear_array_masks_from_nmi(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!clear_array_slot_from_nmi(slot))
+ return;
+ }
+}
+
+static void cleanup_hash_masks(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!cleanup_hash_slot(slot))
+ return;
+ }
+}
+
+static void cleanup_array_masks(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!cleanup_array_slot(slot))
+ return;
+ }
+}
+
+SEC("syscall")
+int populate_kptrs(void *ctx)
+{
+ (void)ctx;
+
+ switch (kptr_dtor_nmi_map_type) {
+ case KPTR_DTOR_NMI_MAP_HASH:
+ populate_hash_masks();
+ break;
+ case KPTR_DTOR_NMI_MAP_ARRAY:
+ populate_array_masks();
+ break;
+ default:
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_MAP_ERR, -EINVAL);
+ break;
+ }
+
+ return 0;
+}
+
+SEC("syscall")
+int cleanup_kptrs(void *ctx)
+{
+ (void)ctx;
+
+ switch (kptr_dtor_nmi_map_type) {
+ case KPTR_DTOR_NMI_MAP_HASH:
+ cleanup_hash_masks();
+ break;
+ case KPTR_DTOR_NMI_MAP_ARRAY:
+ cleanup_array_masks();
+ break;
+ default:
+ set_err(&kptr_dtor_nmi_cleanup_err,
+ &kptr_dtor_nmi_cleanup_errno,
+ KPTR_DTOR_NMI_CLEANUP_ERR, -EINVAL);
+ break;
+ }
+
+ return 0;
+}
+
+SEC("tp_btf/nmi_handler")
+int BPF_PROG(clear_kptrs_from_nmi, void *handler, void *regs, s64 delta_ns,
+ int handled)
+{
+ (void)handler;
+ (void)regs;
+ (void)delta_ns;
+ (void)handled;
+
+ if (kptr_dtor_nmi_deleted >= kptr_dtor_nmi_setup_created)
+ return 0;
+
+ switch (kptr_dtor_nmi_map_type) {
+ case KPTR_DTOR_NMI_MAP_HASH:
+ clear_hash_masks_from_nmi();
+ break;
+ case KPTR_DTOR_NMI_MAP_ARRAY:
+ clear_array_masks_from_nmi();
+ break;
+ default:
+ set_err(&kptr_dtor_nmi_nmi_err,
+ &kptr_dtor_nmi_nmi_errno,
+ KPTR_DTOR_NMI_DELETE_ERR, -EINVAL);
+ break;
+ }
+
+ return 0;
+}
+
+SEC("fentry/bpf_cpumask_release")
+int BPF_PROG(count_cpumask_release, struct bpf_cpumask *mask)
+{
+ (void)mask;
+ kptr_dtor_nmi_release_calls++;
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.53.0
* Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-05 15:08 ` [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
@ 2026-05-05 16:06 ` bot+bpf-ci
2026-05-05 19:48 ` Justin Suess
2026-05-05 19:49 ` sashiko-bot
2026-05-06 16:43 ` Mykyta Yatsenko
2 siblings, 1 reply; 12+ messages in thread
From: bot+bpf-ci @ 2026-05-05 16:06 UTC (permalink / raw)
To: utilityemal77, ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, mic, utilityemal77,
alexei.starovoitov, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3b1f0ba02f61..162bfd4796ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
[ ... ]
> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
> static DEFINE_IDR(link_idr);
> static DEFINE_SPINLOCK(link_idr_lock);
>
> +struct bpf_dtor_kptr_work {
> + struct llist_node node;
> + void *obj;
> + btf_dtor_kfunc_t dtor;
> +};
> +
> +/* Queue pending dtors per CPU; the idle pool stays global. */
> +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
> +static LLIST_HEAD(bpf_dtor_kptr_idle);
> +/* Keep total >= refs + active so NMI frees never need to allocate. */
> +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
> +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
> +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
> +
> +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
> +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
> +
> +static void bpf_dtor_kptr_worker(struct irq_work *work);
> +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
> + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
> +
> +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
> +{
> + llist_add(&job->node, &bpf_dtor_kptr_idle);
> +}
> +
> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> +{
> + struct llist_node *node;
> +
> + node = llist_del_first(&bpf_dtor_kptr_idle);
> + if (!node)
> + return NULL;
> +
> + return llist_entry(node, struct bpf_dtor_kptr_work, node);
> +}
Does bpf_dtor_kptr_pop_idle() handle multiple concurrent consumers
safely?
The documentation for llist_del_first() in lib/llist.c explicitly
states:
Only one llist_del_first user can be used simultaneously with multiple
llist_add users without lock. Because otherwise llist_del_first,
llist_add, llist_add (or llist_del_all, llist_add, llist_add)
sequence in another user may change @head->first->next, but keep
@head->first.
This is the classic ABA race that llist_del_first cannot defend against.
In this patch, bpf_dtor_kptr_pop_idle() has multiple concurrent
consumers:
1. bpf_dtor_kptr_offload() called from NMI context on any CPU
(the function installed below, called from bpf_obj_free_fields
when in_nmi()).
2. bpf_dtor_kptr_trim() called from the per-CPU irq_work worker on
any CPU (called from bpf_dtor_kptr_worker below).
Both paths call bpf_dtor_kptr_pop_idle() which calls
llist_del_first(&bpf_dtor_kptr_idle) on the global idle list. Two or
more CPUs can be popping simultaneously with no external lock and no
single-consumer guarantee.
Can the following race timeline produce list corruption or
double-ownership?
Initial state: idle list = A -> B -> C -> NULL
CPU0 (NMI): enters bpf_dtor_kptr_pop_idle
entry = head->first = A
next = A->next = B
(stalls before cmpxchg)
CPU1 (NMI/worker): runs bpf_dtor_kptr_pop_idle
entry = A, next = B, cmpxchg(head->first, A, B) succeeds.
Now idle = B -> C -> NULL. CPU1 uses A (for offload or mem_free).
CPU1: pops again, takes B. idle = C -> NULL.
CPU1 eventually runs worker -> push_idle(A). Now idle = A -> C -> NULL.
A->next = C (not B anymore).
CPU0 resumes:
Its cached next variable is still B, cmpxchg(&head->first, A, B)
finds head->first is currently A, so the cmpxchg succeeds.
Result: head->first = B, but B has been popped and is in CPU1's
per-CPU bpf_dtor_kptr_jobs list already (not free).
Node B is now simultaneously:
(a) considered removed from idle (CPU1 popped it and is using it)
(b) referenced as idle->first by the head pointer
Another pop_idle() may grab B again as if it were idle, while it is
active in a per-CPU jobs list. Then bpf_dtor_kptr_worker will execute
job->dtor() on already-freed/reused job->obj, or the same job is pushed
into two lists, corrupting llist_node chains.
Concurrency is not hypothetical here. A BPF tp_btf/nmi_handler program
calling bpf_map_delete_elem can fire on multiple CPUs simultaneously.
The irq_work worker also runs on any CPU that queued work. Thus both
bpf_dtor_kptr_offload and bpf_dtor_kptr_trim can run concurrently on
different CPUs, giving at least 2 concurrent llist_del_first consumers.
Typical fixes require either:
- using a locked list (spinlock + list_head), or
- using llist_del_all() with batch processing, or
- a cmpxchg-based stack that properly handles ABA (e.g., using a
tagged pointer).
> +
> +static void bpf_dtor_kptr_trim(void)
> +{
> + struct bpf_dtor_kptr_work *job;
> + long total;
> + long needed;
> +
> + for (;;) {
> + total = atomic_long_read(&bpf_dtor_kptr_total);
> + needed = atomic_long_read(&bpf_dtor_kptr_refs) +
> + atomic_long_read(&bpf_dtor_kptr_active);
> + if (total <= needed)
> + return;
> +
> + job = bpf_dtor_kptr_pop_idle();
> + if (!job)
> + return;
> +
> + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
> + bpf_dtor_kptr_push_idle(job);
> + continue;
> + }
> +
> + bpf_mem_free(&bpf_global_ma, job);
> + }
> +}
[ ... ]
> +static void bpf_dtor_kptr_worker(struct irq_work *work)
> +{
> + struct llist_node *jobs, *node, *next;
> +
> + jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
> + llist_for_each_safe(node, next, jobs) {
> + struct bpf_dtor_kptr_work *job;
> +
> + job = llist_entry(node, struct bpf_dtor_kptr_work, node);
> + job->dtor(job->obj);
> + atomic_long_dec(&bpf_dtor_kptr_active);
> + bpf_dtor_kptr_push_idle(job);
> + }
> +
> + bpf_dtor_kptr_trim();
> +}
> +
> +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
> +{
> + struct bpf_dtor_kptr_work *job;
> +
> + atomic_long_inc(&bpf_dtor_kptr_active);
> + job = bpf_dtor_kptr_pop_idle();
> + if (WARN_ON_ONCE(!job)) {
> + atomic_long_dec(&bpf_dtor_kptr_active);
> + /*
> + * This should stay unreachable if reserve accounting is correct. If it
> + * ever breaks, running the destructor unsafely is still better than
> + * leaking the object permanently.
> + */
> + dtor(obj);
> + return;
> + }
> +
> + job->obj = obj;
> + job->dtor = dtor;
> + if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
> + irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
> +}
> +
> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> {
> const struct btf_field *fields;
> @@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
> if (!xchgd_field)
> break;
> + if (in_nmi() && field->kptr.dtor) {
> + bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
> + bpf_kptr_offload_dec();
> + break;
> + }
> + if (field->kptr.dtor)
> + /*
> + * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
> + * pairs installation with bpf_kptr_offload_inc(). Drop that
> + * reservation on non-NMI teardown once no active transition is
> + * needed.
> + */
> + bpf_kptr_offload_dec();
>
> if (!btf_is_kernel(field->kptr.btf)) {
> pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25385531271
* Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-05 16:06 ` bot+bpf-ci
@ 2026-05-05 19:48 ` Justin Suess
0 siblings, 0 replies; 12+ messages in thread
From: Justin Suess @ 2026-05-05 19:48 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
yonghong.song, jolsa, bpf, mic, alexei.starovoitov, martin.lau,
clm, ihor.solodrai
On Tue, May 05, 2026 at 04:06:23PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 3b1f0ba02f61..162bfd4796ea 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
>
> [ ... ]
>
> > @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
> > static DEFINE_IDR(link_idr);
> > static DEFINE_SPINLOCK(link_idr_lock);
> >
> > +struct bpf_dtor_kptr_work {
> > + struct llist_node node;
> > + void *obj;
> > + btf_dtor_kfunc_t dtor;
> > +};
> > +
> > +/* Queue pending dtors per CPU; the idle pool stays global. */
> > +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
> > +static LLIST_HEAD(bpf_dtor_kptr_idle);
> > +/* Keep total >= refs + active so NMI frees never need to allocate. */
> > +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
> > +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
> > +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
> > +
> > +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
> > +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
> > +
> > +static void bpf_dtor_kptr_worker(struct irq_work *work);
> > +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
> > + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
> > +
> > +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
> > +{
> > + llist_add(&job->node, &bpf_dtor_kptr_idle);
> > +}
> > +
> > +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> > +{
> > + struct llist_node *node;
> > +
> > + node = llist_del_first(&bpf_dtor_kptr_idle);
> > + if (!node)
> > + return NULL;
> > +
> > + return llist_entry(node, struct bpf_dtor_kptr_work, node);
> > +}
>
> Does bpf_dtor_kptr_pop_idle() handle multiple concurrent consumers
> safely?
>
> The documentation for llist_del_first() in lib/llist.c explicitly
> states:
>
> Only one llist_del_first user can be used simultaneously with multiple
> llist_add users without lock. Because otherwise llist_del_first,
> llist_add, llist_add (or llist_del_all, llist_add, llist_add)
> sequence in another user may change @head->first->next, but keep
> @head->first.
>
> This is the classic ABA race that llist_del_first cannot defend against.
>
> In this patch, bpf_dtor_kptr_pop_idle() has multiple concurrent
> consumers:
>
> 1. bpf_dtor_kptr_offload() called from NMI context on any CPU
> (the function installed below, called from bpf_obj_free_fields
> when in_nmi()).
>
> 2. bpf_dtor_kptr_trim() called from the per-CPU irq_work worker on
> any CPU (called from bpf_dtor_kptr_worker below).
>
> Both paths call bpf_dtor_kptr_pop_idle() which calls
> llist_del_first(&bpf_dtor_kptr_idle) on the global idle list. Two or
> more CPUs can be popping simultaneously with no external lock and no
> single-consumer guarantee.
>
> Can the following race timeline produce list corruption or
> double-ownership?
>
> Initial state: idle list = A -> B -> C -> NULL
>
> CPU0 (NMI): enters bpf_dtor_kptr_pop_idle
> entry = head->first = A
> next = A->next = B
> (stalls before cmpxchg)
>
> CPU1 (NMI/worker): runs bpf_dtor_kptr_pop_idle
> entry = A, next = B, cmpxchg(head->first, A, B) succeeds.
> Now idle = B -> C -> NULL. CPU1 uses A (for offload or mem_free).
>
> CPU1: pops again, takes B. idle = C -> NULL.
>
> CPU1 eventually runs worker -> push_idle(A). Now idle = A -> C -> NULL.
> A->next = C (not B anymore).
>
> CPU0 resumes:
> Its cached next variable is still B, cmpxchg(&head->first, A, B)
> finds head->first is currently A, so the cmpxchg succeeds.
> Result: head->first = B, but B has been popped and is in CPU1's
> per-CPU bpf_dtor_kptr_jobs list already (not free).
>
> Node B is now simultaneously:
> (a) considered removed from idle (CPU1 popped it and is using it)
> (b) referenced as idle->first by the head pointer
>
> Another pop_idle() may grab B again as if it were idle, while it is
> active in a per-CPU jobs list. Then bpf_dtor_kptr_worker will execute
> job->dtor() on already-freed/reused job->obj, or the same job is pushed
> into two lists, corrupting llist_node chains.
>
> Concurrency is not hypothetical here. A BPF tp_btf/nmi_handler program
> calling bpf_map_delete_elem can fire on multiple CPUs simultaneously.
> The irq_work worker also runs on any CPU that queued work. Thus both
> bpf_dtor_kptr_offload and bpf_dtor_kptr_trim can run concurrently on
> different CPUs, giving at least 2 concurrent llist_del_first consumers.
>
> Typical fixes require either:
> - using a locked list (spinlock + list_head), or
> - using llist_del_all() with batch processing, or
> - a cmpxchg-based stack that properly handles ABA (e.g., using a
> tagged pointer).
>
This is indeed ABA.
Switching the idle list to a percpu freelist should fix this. That's a
better fit for this model anyway, since all we need is push and pop.
(The bpf_dtor_kptr_jobs llist is fine since it only uses llist_del_all,
never llist_del_first.)
Something like:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 162bfd4796ea..f1e7dc9ef4fe 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -45,6 +45,8 @@
#include <linux/cookie.h>
#include <linux/verification.h>
+#include "percpu_freelist.h"
+
#include <net/netfilter/nf_bpf_link.h>
#include <net/netkit.h>
#include <net/tcx.h>
@@ -69,14 +71,17 @@ static DEFINE_IDR(link_idr);
static DEFINE_SPINLOCK(link_idr_lock);
struct bpf_dtor_kptr_work {
- struct llist_node node;
+ union {
+ struct llist_node node;
+ struct pcpu_freelist_node fnode;
+ };
void *obj;
btf_dtor_kfunc_t dtor;
};
-/* Queue pending dtors per CPU; the idle pool stays global. */
+/* Queue pending dtors per CPU; the idle pool uses a global pcpu_freelist. */
static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
-static LLIST_HEAD(bpf_dtor_kptr_idle);
+static struct pcpu_freelist bpf_dtor_kptr_idle;
/* Keep total >= refs + active so NMI frees never need to allocate. */
static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
@@ -91,18 +96,18 @@ static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
{
- llist_add(&job->node, &bpf_dtor_kptr_idle);
+ pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
}
static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
{
- struct llist_node *node;
+ struct pcpu_freelist_node *node;
- node = llist_del_first(&bpf_dtor_kptr_idle);
+ node = pcpu_freelist_pop(&bpf_dtor_kptr_idle);
if (!node)
return NULL;
- return llist_entry(node, struct bpf_dtor_kptr_work, node);
+ return container_of(node, struct bpf_dtor_kptr_work, fnode);
}
static void bpf_dtor_kptr_trim(void)
@@ -157,7 +162,7 @@ int bpf_kptr_offload_inc(void)
long needed;
int err;
- if (unlikely(!bpf_global_ma_set))
+ if (unlikely(!bpf_global_ma_set || !READ_ONCE(bpf_dtor_kptr_idle.freelist)))
return -ENOMEM;
/*
@@ -193,6 +198,18 @@ void bpf_kptr_offload_dec(void)
} while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
}
+static int __init bpf_dtor_kptr_init(void)
+{
+ int err;
+
+ err = pcpu_freelist_init(&bpf_dtor_kptr_idle);
+ if (err)
+ return err;
+
+ return 0;
+}
+late_initcall(bpf_dtor_kptr_init);
+
int sysctl_unprivileged_bpf_disabled __read_mostly =
IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
> [...]
* Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-05 15:08 ` [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-05 16:06 ` bot+bpf-ci
@ 2026-05-05 19:49 ` sashiko-bot
2026-05-06 16:43 ` Mykyta Yatsenko
2 siblings, 0 replies; 12+ messages in thread
From: sashiko-bot @ 2026-05-05 19:49 UTC (permalink / raw)
To: Justin Suess; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [Critical] Multiple concurrent consumers on the lockless `bpf_dtor_kptr_idle` list cause ABA memory corruption.
- [High] Unrestricted assignment to `env->insn_aux_data[insn_idx].kptr_has_dtor` causes mismatched JIT patching for subprograms.
- [High] Time-of-Check to Time-of-Use (TOCTOU) race in `bpf_kptr_offload_inc` under-allocates the job pool.
- [High] TOCTOU race in `bpf_dtor_kptr_trim` steals jobs from concurrent insertions, leading to NMI deadlocks.
- [Medium] Preallocated offload jobs are never trimmed during non-NMI teardown, causing unbounded memory retention.
--
commit 274934ceea98ba86682bc3758a94297eee420620
Author: Justin Suess <utilityemal77@gmail.com>
bpf: Offload kptr destructors that run from NMI
A BPF program attached to tp_btf/nmi_handler can delete map entries or
swap out referenced kptrs from NMI context. This patch introduces an
offload pool to let a worker invoke destructors after NMI exits.
Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3b1f0ba02f613..162bfd4796ea0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
[ ... ]
> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
[ ... ]
> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> +{
> + struct llist_node *node;
> +
> + node = llist_del_first(&bpf_dtor_kptr_idle);
> + if (!node)
> + return NULL;
> +
> + return llist_entry(node, struct bpf_dtor_kptr_work, node);
> +}
Is this susceptible to an ABA issue resulting in list corruption?
The llist_del_first() function is strictly single-consumer because it lacks
ABA protection. Here, bpf_dtor_kptr_pop_idle() is accessed concurrently by
multiple consumers: bpf_dtor_kptr_offload() running in NMI context across
multiple CPUs, and bpf_dtor_kptr_trim() running in irq_work context.
If CPU 0 begins popping a node, is interrupted, and CPU 1 pops and re-pushes
nodes, could CPU 0's cmpxchg succeed but set the list head to a stale or
freed next pointer?
[ ... ]
> +static void bpf_dtor_kptr_trim(void)
> +{
> + struct bpf_dtor_kptr_work *job;
> + long total;
> + long needed;
> +
> + for (;;) {
> + total = atomic_long_read(&bpf_dtor_kptr_total);
> + needed = atomic_long_read(&bpf_dtor_kptr_refs) +
> + atomic_long_read(&bpf_dtor_kptr_active);
> + if (total <= needed)
> + return;
> +
> + job = bpf_dtor_kptr_pop_idle();
> + if (!job)
> + return;
> +
> + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
> + bpf_dtor_kptr_push_idle(job);
> + continue;
> + }
Can a concurrent bpf_kptr_offload_inc() cause this to violate the
total >= refs + active invariant?
If bpf_dtor_kptr_trim() reads needed, and then another CPU increments refs
in bpf_kptr_offload_inc():
CPU1
bpf_dtor_kptr_trim()
needed = refs + active;
CPU2
bpf_kptr_offload_inc()
atomic_long_inc_return(&bpf_dtor_kptr_refs);
bpf_dtor_kptr_reserve() observes the old high total and skips allocation
CPU1 would see the cmpxchg succeed because total hasn't changed, successfully
freeing the job. Does this lead to an empty idle list when the kptr is later
freed in an NMI?
[ ... ]
> +int bpf_kptr_offload_inc(void)
> +{
> + long needed;
> + int err;
> +
> + if (unlikely(!bpf_global_ma_set))
> + return -ENOMEM;
> +
> + /*
> + * Read active before incrementing refs so a free path moving one slot from
> + * refs to active cannot shrink the reservation snapshot below the steady
> + * state we need to cover. Racing results worst case in a larger reservation.
> + */
> + needed = atomic_long_read(&bpf_dtor_kptr_active);
> + needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
> + err = bpf_dtor_kptr_reserve(needed);
Could an NMI interrupting this sequence result in under-allocating the job
pool?
If an NMI free occurs precisely after active is read but before refs is
incremented:
CPU1
bpf_kptr_offload_inc()
needed = atomic_long_read(&bpf_dtor_kptr_active);
// NMI interrupts CPU1
bpf_dtor_kptr_offload()
atomic_long_inc(&bpf_dtor_kptr_active);
// pops job
bpf_kptr_offload_dec()
atomic_long_dec(&bpf_dtor_kptr_refs);
// NMI exits
needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
In this scenario, needed is calculated with the old active value and the new
refs value, making it 1 less than the true requirement. Would a subsequent
NMI free encounter an empty job list?
[ ... ]
> +void bpf_kptr_offload_dec(void)
> +{
> + long val;
> +
> + val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
> + if (!WARN_ON_ONCE(val < 0))
> + return;
> +
> + /*
> + * Clamp a mismatched decrement back to zero without overwriting a
> + * concurrent increment that already repaired the counter.
> + */
> + do {
> + val = atomic_long_read(&bpf_dtor_kptr_refs);
> + if (val >= 0)
> + break;
> + } while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
> +}
When kptrs are deleted from normal process or softirq contexts, does this code
cause unbounded memory retention?
Since bpf_dtor_kptr_trim() is exclusively invoked by the irq_work worker
during NMI teardown, the global idle list might never be trimmed for non-NMI
teardowns. This could cause the pool memory to stay at the high-water mark of
concurrent kptrs indefinitely.
[ ... ]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 11054ad89c14d..2c7b21bda666a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[ ... ]
> @@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> if (err)
> return err;
> }
> + env->insn_aux_data[insn_idx].kptr_has_dtor =
> + func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
>
> err = record_func_map(env, &meta, func_id, insn_idx);
> if (err)
Does this unrestricted assignment cause mismatched JIT patching for subprograms?
If a BPF static subprogram contains a bpf_kptr_xchg instruction and is called
twice by the main program - once with a map containing a destructor kptr,
and once without - the verifier will simulate the subprogram twice. The
second pass will blindly overwrite kptr_has_dtor for that instruction.
If the last simulated path had no destructor, the JIT patches the instruction
to bpf_kptr_xchg_nodtor. At runtime, would the path with a destructor execute
this un-instrumented variant, skipping bpf_kptr_offload_inc() and causing
an NMI deadlock when freed?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260505150851.3090688-1-utilityemal77@gmail.com?part=1
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bpf-next v2 2/2] selftests/bpf: Add kptr destructor NMI exerciser
2026-05-05 15:08 ` [bpf-next v2 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
@ 2026-05-05 20:15 ` sashiko-bot
0 siblings, 0 replies; 12+ messages in thread
From: sashiko-bot @ 2026-05-05 20:15 UTC (permalink / raw)
To: Justin Suess; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] Read-modify-write data race on `kptr_dtor_nmi_live_mask` between syscall and NMI contexts.
- [High] Non-atomic increment of `kptr_dtor_nmi_release_calls` in a global fentry hook leads to lost updates.
- [Medium] API Contract Violation: Incorrect BPF program signature for the `tp_btf/nmi_handler` tracepoint.
- [Medium] Test flakiness due to reliance on natural RCU grace periods and a misplaced `kern_sync_rcu()` call.
--
commit cb04c2d31ab28ab2b4ff2bbc2cd6c029cabdc851
Author: Justin Suess <utilityemal77@gmail.com>
selftests/bpf: Add kptr destructor NMI exerciser
Programs attached to tp_btf/nmi_handler can drop refcounted kptrs from
NMI context by deleting map entries or clearing map values. Add a
dedicated BPF-side selftest program that populates hash and array maps
with cpumask kptrs and clears them again from the NMI handler.
> diff --git a/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
> index 0000000000000..a1aa8b6646b36
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
[ ... ]
> @@ -193,12 +193,12 @@ static void run_kptr_dtor_nmi_case(const struct kptr_dtor_nmi_case *test)
> "kptr_dtor_nmi_setup_created"))
> goto cleanup;
>
> if (!wait_for_nmi_drain(skel, expected_total, expected_total))
> goto cleanup;
>
> kern_sync_rcu();
Is it possible for the test to flake here due to relying on the natural
background RCU grace period?
For hash maps, deleted elements are freed and their kptrs are destructed via
RCU callbacks, which increment the release calls counter. The
wait_for_nmi_drain() loop waits for up to 5 seconds for this counter to hit
the target, but we only force an RCU grace period with kern_sync_rcu() after
the wait loop completes.
Under heavy system load, could the natural grace period take longer than
5 seconds and cause the wait loop to time out?
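As a hedged sketch, the wait loop could nudge grace periods itself instead of
relying on background RCU; the loop shape below is hypothetical and not the
selftest's actual wait_for_nmi_drain(), and it assumes the release counter is a
global visible through the skeleton:

        /* hypothetical poll loop: force a grace period on every iteration */
        for (i = 0; i < max_polls; i++) {
                kern_sync_rcu();
                if (skel->bss->kptr_dtor_nmi_release_calls >= expected_total)
                        return true;
                usleep(10 * 1000);
        }
        return false;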
> }
>
> if (!run_syscall_prog(skel->progs.cleanup_kptrs, "cleanup_kptrs"))
> goto cleanup;
[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
> index 0000000000000..1ab5951a7a223
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
[ ... ]
> @@ -66,15 +66,15 @@ static void set_err(__u32 *err_dst, int *errno_dst, __u32 err, int err_no)
> static bool slot_is_live(__u32 slot)
> {
> return kptr_dtor_nmi_live_mask & (1U << slot);
> }
>
> static void mark_slot_live(__u32 slot)
> {
> kptr_dtor_nmi_live_mask |= 1U << slot;
> }
>
> static void clear_slot_live(__u32 slot)
> {
> kptr_dtor_nmi_live_mask &= ~(1U << slot);
> }
Can this sequence cause updates to the live mask to be lost?
Since kptr_dtor_nmi_live_mask is updated using non-atomic bitwise operations,
if an NMI interrupts the syscall program exactly between the load and store
instructions of mark_slot_live():
syscall context (mark_slot_live)
loads kptr_dtor_nmi_live_mask
<NMI interrupts>
NMI context (clear_slot_live)
loads, clears bit, stores kptr_dtor_nmi_live_mask
syscall context resumes
stores modified mask, overwriting the NMI's cleared bit
If this happens, the NMI handler on subsequent runs will encounter an
already-deleted slot, fail with -ENOENT, and permanently abort the loop.
Should these updates use atomic builtins like __sync_fetch_and_or()?
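For reference, a minimal sketch of the atomic variants (illustrative only;
assumes the selftest builds with support for BPF atomic or/and instructions):

        static void mark_slot_live(__u32 slot)
        {
                __sync_fetch_and_or(&kptr_dtor_nmi_live_mask, 1U << slot);
        }

        static void clear_slot_live(__u32 slot)
        {
                __sync_fetch_and_and(&kptr_dtor_nmi_live_mask, ~(1U << slot));
        }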
>
> static struct kptr_dtor_nmi_value *lookup_hash_value(__u32 slot)
[ ... ]
> @@ -375,12 +375,12 @@ int cleanup_kptrs(void *ctx)
> break;
> }
>
> return 0;
> }
>
> SEC("tp_btf/nmi_handler")
> int BPF_PROG(clear_kptrs_from_nmi, void *handler, void *regs, s64 delta_ns,
> int handled)
> {
> (void)handler;
Does this signature match the kernel tracepoint for nmi_handler?
The nmi_handler tracepoint in the kernel has exactly 3 arguments: handler,
delta_ns, and handled. By including a regs parameter here, the BPF_PROG
macro maps regs to ctx[1] (the real delta_ns), delta_ns to ctx[2] (the real
handled), and handled to ctx[3].
Since the actual context size is 3 elements, accessing handled would read
out-of-bounds. Should the regs parameter be removed to match the tracepoint?
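A sketch of the signature the question suggests, assuming the 3-argument
tracepoint prototype described above:

        SEC("tp_btf/nmi_handler")
        int BPF_PROG(clear_kptrs_from_nmi, void *handler, s64 delta_ns,
                     int handled)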
> (void)regs;
> (void)delta_ns;
[ ... ]
> @@ -402,10 +402,10 @@ int BPF_PROG(clear_kptrs_from_nmi, void *handler, void *regs, s64 delta_ns,
> break;
> }
>
> return 0;
> }
>
> SEC("fentry/bpf_cpumask_release")
> int BPF_PROG(count_cpumask_release, struct bpf_cpumask *mask)
> {
> (void)mask;
> kptr_dtor_nmi_release_calls++;
> return 0;
> }
Could concurrent global calls to bpf_cpumask_release() cause lost updates
to this counter?
This fentry hook traces the function globally across all CPUs. The counter
is incremented non-atomically, which translates to a load-add-store
sequence. If bpf_cpumask_release is executed concurrently on multiple CPUs
(e.g., when running selftests in parallel), these increments can race.
If an increment generated by the test itself is lost, the test will fall
short of expected_release_calls, causing wait_for_nmi_drain() to hang and
time out. Should this use __sync_fetch_and_add() instead?
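A minimal sketch of the atomic variant (illustrative only):

        SEC("fentry/bpf_cpumask_release")
        int BPF_PROG(count_cpumask_release, struct bpf_cpumask *mask)
        {
                (void)mask;
                __sync_fetch_and_add(&kptr_dtor_nmi_release_calls, 1);
                return 0;
        }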
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260505150851.3090688-1-utilityemal77@gmail.com?part=2
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-05 15:08 ` [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-05 16:06 ` bot+bpf-ci
2026-05-05 19:49 ` sashiko-bot
@ 2026-05-06 16:43 ` Mykyta Yatsenko
2026-05-06 19:52 ` Justin Suess
2 siblings, 1 reply; 12+ messages in thread
From: Mykyta Yatsenko @ 2026-05-06 16:43 UTC (permalink / raw)
To: Justin Suess, ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, mic,
Alexei Starovoitov
On 5/5/26 4:08 PM, Justin Suess wrote:
> A BPF program attached to tp_btf/nmi_handler can delete map entries or
> swap out referenced kptrs from NMI context. Today that runs the kptr
> destructor inline. Destructors such as bpf_cpumask_release() can take
> RCU-related locks, so running them from NMI can deadlock the system.
>
> Preallocate offload jobs from the global BPF memory allocator, track the
> number of live destructor-backed references so the pool stays ahead of
> NMI frees, and let the worker invoke the destructor after NMI exits.
>
> The algorithm for preallocation is simple: The invariant is total >=
> refs + active, where refs = the ref kptrs installed in maps, active =
> jobs being executed in the irq_work worker, and total is the number of
> job structures allocated. To avoid excessive pre-allocation calls while
> maintaining the invariant, we allocate the needed slots, plus a small
> amount of extra, min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
> BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.
>
> A small but harmless ordering subtlety: the active atomic is read before
> refs. This can result in a small amount of over allocation, but this
> won't be leaked and will properly be carried into the trim stage.
>
> The trim stage is simple. It uses a CAS loop to free excessive leftover
> idle job slots. It snapshots total refs and active, pops an idle job if
> the pool is excessively large, and attempts a cmpxchg to decrement it
> atomically. On a failure case, it will just push the job back into the
> idle list and retry.
>
> There are several best-effort mitigation methods to tackle the memory
> pressure problem, preserving integrity under this unlikely scenario.
>
> If reserving another offload slot fails while installing a new
> destructor-backed kptr through bpf_kptr_xchg(), leave the destination
> unchanged and return the incoming pointer so the caller keeps ownership.
>
> This is superior to leaking the pointer, and should only happen if the
> accounting is incorrect. Moreover, this is a condition the caller can
> check for and recover from.
>
> If NMI teardown still fails to grab an idle offload job despite that
> reserve accounting, warn once and run the destructor inline rather than
> leak the object permanently. Attempt to repair the counter safely with
> another CAS loop in that case, preserving concurrent increments.
>
> This fix does come with small performance tradeoffs for safety. xchg can
> no longer be inlined for referenced kptrs, as inlining would break the
> reference counting. The inlining fix is preserved for kptrs with no
> destructor defined.
>
> This keeps refcounted kptr teardown out of NMI context without slowing
> down raw kptr exchanges that never need destructor handling.
>
> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Reported-by: Justin Suess <utilityemal77@gmail.com>
> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> ---
> include/linux/bpf.h | 16 ++++
> include/linux/bpf_verifier.h | 1 +
> kernel/bpf/fixups.c | 33 ++++---
> kernel/bpf/helpers.c | 24 ++++-
> kernel/bpf/syscall.c | 181 +++++++++++++++++++++++++++++++++++
> kernel/bpf/verifier.c | 2 +
> 6 files changed, 242 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 715b6df9c403..307de5caa646 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
>
> void __bpf_free_used_maps(struct bpf_prog_aux *aux,
> struct bpf_map **used_maps, u32 len);
> +/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
> +u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +int bpf_kptr_offload_inc(void);
> +void bpf_kptr_offload_dec(void);
> +#else
> +static inline int bpf_kptr_offload_inc(void)
> +{
> + return 0;
> +}
> +
> +static inline void bpf_kptr_offload_dec(void)
> +{
> +}
> +#endif
>
> bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 976e2b2f40e8..8e39ff92dd2c 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -672,6 +672,7 @@ struct bpf_insn_aux_data {
> bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
> bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
> bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
> + bool kptr_has_dtor;
> u8 alu_state; /* used in combination with alu_limit */
> /* true if STX or LDX instruction is a part of a spill/fill
> * pattern for a bpf_fastcall call.
> diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
> index fba9e8c00878..459e855e86a5 100644
> --- a/kernel/bpf/fixups.c
> +++ b/kernel/bpf/fixups.c
> @@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
> goto next_insn;
> }
>
> - /* Implement bpf_kptr_xchg inline */
> - if (prog->jit_requested && BITS_PER_LONG == 64 &&
> - insn->imm == BPF_FUNC_kptr_xchg &&
> - bpf_jit_supports_ptr_xchg()) {
> - insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
> - insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
> - cnt = 2;
> + /* Implement bpf_kptr_xchg inline. */
> + if (insn->imm == BPF_FUNC_kptr_xchg &&
> + !env->insn_aux_data[i + delta].kptr_has_dtor) {
> + if (prog->jit_requested && BITS_PER_LONG == 64 &&
> + bpf_jit_supports_ptr_xchg()) {
> + insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
> + insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
> + BPF_REG_1, BPF_REG_0, 0);
> + cnt = 2;
>
> - new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> - if (!new_prog)
> - return -ENOMEM;
> + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> + if (!new_prog)
> + return -ENOMEM;
>
> - delta += cnt - 1;
> - env->prog = prog = new_prog;
> - insn = new_prog->insnsi + i + delta;
> + delta += cnt - 1;
> + env->prog = prog = new_prog;
> + insn = new_prog->insnsi + i + delta;
> + goto next_insn;
> + }
> +
> + insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
> goto next_insn;
> }
> +
> patch_call_imm:
> fn = env->ops->get_func_proto(insn->imm, env->prog);
> /* all functions that have prototype and verifier allowed
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index baa12b24bb64..cdc64ab83ef6 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
> bpf_async_cancel_and_free(val);
> }
>
> -BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
> +BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
> {
> unsigned long *kptr = dst;
>
> @@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
> return xchg(kptr, (unsigned long)ptr);
> }
>
> +BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
> +{
> + unsigned long *kptr = dst;
> + void *old;
> +
> + /*
> + * If the incoming pointer cannot be torn down safely from NMI later on,
> + * leave the destination untouched and return ptr so the caller keeps
> + * ownership.
> + */
> + if (ptr && bpf_kptr_offload_inc())
> + return (unsigned long)ptr;
> +
> + old = (void *)xchg(kptr, (unsigned long)ptr);
> + if (old)
> + bpf_kptr_offload_dec();
> + return (unsigned long)old;
> +}
> +
> /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
> * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
> * denote type that verifier will determine.
> + * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
> */
> static const struct bpf_func_proto bpf_kptr_xchg_proto = {
> - .func = bpf_kptr_xchg,
> + .func = bpf_ref_kptr_xchg,
> .gpl_only = false,
> .ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
> .ret_btf_id = BPF_PTR_POISON,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3b1f0ba02f61..162bfd4796ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -7,6 +7,7 @@
> #include <linux/bpf_trace.h>
> #include <linux/bpf_lirc.h>
> #include <linux/bpf_verifier.h>
> +#include <linux/bpf_mem_alloc.h>
> #include <linux/bsearch.h>
> #include <linux/btf.h>
> #include <linux/hex.h>
> @@ -19,6 +20,8 @@
> #include <linux/fdtable.h>
> #include <linux/file.h>
> #include <linux/fs.h>
> +#include <linux/irq_work.h>
> +#include <linux/llist.h>
> #include <linux/license.h>
> #include <linux/filter.h>
> #include <linux/kernel.h>
> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
> static DEFINE_IDR(link_idr);
> static DEFINE_SPINLOCK(link_idr_lock);
>
> +struct bpf_dtor_kptr_work {
> + struct llist_node node;
> + void *obj;
> + btf_dtor_kfunc_t dtor;
> +};
> +
> +/* Queue pending dtors per CPU; the idle pool stays global. */
> +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
> +static LLIST_HEAD(bpf_dtor_kptr_idle);
> +/* Keep total >= refs + active so NMI frees never need to allocate. */
> +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
> +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
> +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
> +
> +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
> +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
> +
> +static void bpf_dtor_kptr_worker(struct irq_work *work);
> +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
> + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
> +
I think this still looks too complex:
* 2 lists - idle list and armed list
* 3 atomics, controlling demand/supply
* headroom/trimming management
The complexity is introduced for performance reasons, but
I'm not sure if the tradeoff is worth it.
What about the following design:
Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot itself.
Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first
xchg just once per map value, then reuse it across xchg in/out.
Detach: When map value is deleted, atomically set kptr map field storing
bpf_dtor_kptr_work to NULL (so the next xchg-in allocates new
bpf_dtor_kptr_work.)
After detaching insert bpf_dtor_kptr_work to the global list and run irq_work.
Free bpf_dtor_kptr_work in call_rcu_tasks_trace().
This is based on the bpf_timer and bpf_task_work
implementations.
> +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
> +{
> + llist_add(&job->node, &bpf_dtor_kptr_idle);
> +}
> +
> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> +{
> + struct llist_node *node;
> +
> + node = llist_del_first(&bpf_dtor_kptr_idle);
> + if (!node)
> + return NULL;
> +
> + return llist_entry(node, struct bpf_dtor_kptr_work, node);
> +}
> +
> +static void bpf_dtor_kptr_trim(void)
> +{
> + struct bpf_dtor_kptr_work *job;
> + long total;
> + long needed;
> +
> + for (;;) {
> + total = atomic_long_read(&bpf_dtor_kptr_total);
> + needed = atomic_long_read(&bpf_dtor_kptr_refs) +
> + atomic_long_read(&bpf_dtor_kptr_active);
> + if (total <= needed)
> + return;
> +
> + job = bpf_dtor_kptr_pop_idle();
> + if (!job)
> + return;
> +
> + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
> + bpf_dtor_kptr_push_idle(job);
> + continue;
> + }
> +
> + bpf_mem_free(&bpf_global_ma, job);
> + }
> +}
> +
> +static int bpf_dtor_kptr_reserve(long needed)
> +{
> + struct bpf_dtor_kptr_work *job;
> + long headroom;
> + long target;
> +
> + headroom = min_t(long, needed, BPF_DTOR_KPTR_RESERVE_HEADROOM);
> + if (check_add_overflow(needed, headroom, &target))
> + target = needed;
> +
> + while (atomic_long_read(&bpf_dtor_kptr_total) < target) {
> + job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
> + if (!job)
> + return -ENOMEM;
> + atomic_long_inc(&bpf_dtor_kptr_total);
> + bpf_dtor_kptr_push_idle(job);
> + }
> +
> + return 0;
> +}
> +
> +int bpf_kptr_offload_inc(void)
> +{
> + long needed;
> + int err;
> +
> + if (unlikely(!bpf_global_ma_set))
> + return -ENOMEM;
> +
> + /*
> + * Read active before incrementing refs so a free path moving one slot from
> + * refs to active cannot shrink the reservation snapshot below the steady
> + * state we need to cover. Racing results worst case in a larger reservation.
> + */
> + needed = atomic_long_read(&bpf_dtor_kptr_active);
> + needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
> + err = bpf_dtor_kptr_reserve(needed);
> + if (err)
> + atomic_long_dec(&bpf_dtor_kptr_refs);
> +
> + return err;
> +}
> +
> +void bpf_kptr_offload_dec(void)
> +{
> + long val;
> +
> + val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
> + if (!WARN_ON_ONCE(val < 0))
> + return;
> +
> + /*
> + * Clamp a mismatched decrement back to zero without overwriting a
> + * concurrent increment that already repaired the counter.
> + */
> + do {
> + val = atomic_long_read(&bpf_dtor_kptr_refs);
> + if (val >= 0)
> + break;
> + } while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
> +}
> +
> int sysctl_unprivileged_bpf_disabled __read_mostly =
> IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
>
> @@ -807,6 +935,46 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
> bpf_task_work_cancel_and_free(obj + rec->task_work_off);
> }
>
> +static void bpf_dtor_kptr_worker(struct irq_work *work)
> +{
> + struct llist_node *jobs, *node, *next;
> +
> + jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
> + llist_for_each_safe(node, next, jobs) {
> + struct bpf_dtor_kptr_work *job;
> +
> + job = llist_entry(node, struct bpf_dtor_kptr_work, node);
> + job->dtor(job->obj);
> + atomic_long_dec(&bpf_dtor_kptr_active);
> + bpf_dtor_kptr_push_idle(job);
> + }
> +
> + bpf_dtor_kptr_trim();
> +}
> +
> +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
> +{
> + struct bpf_dtor_kptr_work *job;
> +
> + atomic_long_inc(&bpf_dtor_kptr_active);
> + job = bpf_dtor_kptr_pop_idle();
> + if (WARN_ON_ONCE(!job)) {
> + atomic_long_dec(&bpf_dtor_kptr_active);
> + /*
> + * This should stay unreachable if reserve accounting is correct. If it
> + * ever breaks, running the destructor unsafely is still better than
> + * leaking the object permanently.
> + */
> + dtor(obj);
> + return;
> + }
> +
> + job->obj = obj;
> + job->dtor = dtor;
> + if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
> + irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
> +}
> +
> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> {
> const struct btf_field *fields;
> @@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
> if (!xchgd_field)
> break;
> + if (in_nmi() && field->kptr.dtor) {
> + bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
> + bpf_kptr_offload_dec();
> + break;
> + }
> + if (field->kptr.dtor)
> + /*
> + * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
> + * pairs installation with bpf_kptr_offload_inc(). Drop that
> + * reservation on non-NMI teardown once no active transition is
> + * needed.
> + */
> + bpf_kptr_offload_dec();
>
> if (!btf_is_kernel(field->kptr.btf)) {
> pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 11054ad89c14..2c7b21bda666 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> if (err)
> return err;
> }
> + env->insn_aux_data[insn_idx].kptr_has_dtor =
> + func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
>
> err = record_func_map(env, &meta, func_id, insn_idx);
> if (err)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-06 16:43 ` Mykyta Yatsenko
@ 2026-05-06 19:52 ` Justin Suess
2026-05-07 14:59 ` Mykyta Yatsenko
0 siblings, 1 reply; 12+ messages in thread
From: Justin Suess @ 2026-05-06 19:52 UTC (permalink / raw)
To: Mykyta Yatsenko
Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
yonghong.song, jolsa, bpf, mic, Alexei Starovoitov
On Wed, May 06, 2026 at 05:43:51PM +0100, Mykyta Yatsenko wrote:
> On 5/5/26 4:08 PM, Justin Suess wrote:
> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
> > swap out referenced kptrs from NMI context. Today that runs the kptr
> > destructor inline. Destructors such as bpf_cpumask_release() can take
> > RCU-related locks, so running them from NMI can deadlock the system.
> >
> > Preallocate offload jobs from the global BPF memory allocator, track the
> > number of live destructor-backed references so the pool stays ahead of
> > NMI frees, and let the worker invoke the destructor after NMI exits.
> >
> > The algorithm for preallocation is simple: The invariant is total >=
> > refs + active, where refs = the ref kptrs installed in maps, active =
> > jobs being executed in the irq_work worker, and total is the number of
> > job structures allocated. To avoid excessive pre-allocation calls while
> > maintaining the invariant, we allocate the needed slots, plus a small
> > amount of extra, min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
> > BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.
> >
> > A small but harmless ordering subtlety: the active atomic is read before
> > refs. This can result in a small amount of over allocation, but this
> > won't be leaked and will properly be carried into the trim stage.
> >
> > The trim stage is simple. It uses a CAS loop to free excessive leftover
> > idle job slots. It snapshots total refs and active, pops an idle job if
> > the pool is excessively large, and attempts a cmpxchg to decrement it
> > atomically. On a failure case, it will just push the job back into the
> > idle list and retry.
> >
> > There are several best-effort mitigation methods to tackle the memory
> > pressure problem, preserving integrity under this unlikely scenario.
> >
> > If reserving another offload slot fails while installing a new
> > destructor-backed kptr through bpf_kptr_xchg(), leave the destination
> > unchanged and return the incoming pointer so the caller keeps ownership.
> >
> > This is superior to leaking the pointer, and should only happen if the
> > accounting is incorrect. Moreover, this is a condition the caller can
> > check for and recover from.
> >
> > If NMI teardown still fails to grab an idle offload job despite that
> > reserve accounting, warn once and run the destructor inline rather than
> > leak the object permanently. Attempt to repair the counter safely with
> > another CAS loop in that case, preserving concurrent increments.
> >
> > This fix does come with small performance tradeoffs for safety. xchg can
> > no longer be inlined for referenced kptrs, as inlining would break the
> > reference counting. The inlining fix is preserved for kptrs with no
> > destructor defined.
> >
> > This keeps refcounted kptr teardown out of NMI context without slowing
> > down raw kptr exchanges that never need destructor handling.
> >
> > Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > Reported-by: Justin Suess <utilityemal77@gmail.com>
> > Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> > Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> > ---
> > include/linux/bpf.h | 16 ++++
> > include/linux/bpf_verifier.h | 1 +
> > kernel/bpf/fixups.c | 33 ++++---
> > kernel/bpf/helpers.c | 24 ++++-
> > kernel/bpf/syscall.c | 181 +++++++++++++++++++++++++++++++++++
> > kernel/bpf/verifier.c | 2 +
> > 6 files changed, 242 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 715b6df9c403..307de5caa646 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
> >
> > void __bpf_free_used_maps(struct bpf_prog_aux *aux,
> > struct bpf_map **used_maps, u32 len);
> > +/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
> > +u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> > +
> > +#ifdef CONFIG_BPF_SYSCALL
> > +int bpf_kptr_offload_inc(void);
> > +void bpf_kptr_offload_dec(void);
> > +#else
> > +static inline int bpf_kptr_offload_inc(void)
> > +{
> > + return 0;
> > +}
> > +
> > +static inline void bpf_kptr_offload_dec(void)
> > +{
> > +}
> > +#endif
> >
> > bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
> >
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index 976e2b2f40e8..8e39ff92dd2c 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -672,6 +672,7 @@ struct bpf_insn_aux_data {
> > bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
> > bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
> > bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
> > + bool kptr_has_dtor;
> > u8 alu_state; /* used in combination with alu_limit */
> > /* true if STX or LDX instruction is a part of a spill/fill
> > * pattern for a bpf_fastcall call.
> > diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
> > index fba9e8c00878..459e855e86a5 100644
> > --- a/kernel/bpf/fixups.c
> > +++ b/kernel/bpf/fixups.c
> > @@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
> > goto next_insn;
> > }
> >
> > - /* Implement bpf_kptr_xchg inline */
> > - if (prog->jit_requested && BITS_PER_LONG == 64 &&
> > - insn->imm == BPF_FUNC_kptr_xchg &&
> > - bpf_jit_supports_ptr_xchg()) {
> > - insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
> > - insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
> > - cnt = 2;
> > + /* Implement bpf_kptr_xchg inline. */
> > + if (insn->imm == BPF_FUNC_kptr_xchg &&
> > + !env->insn_aux_data[i + delta].kptr_has_dtor) {
> > + if (prog->jit_requested && BITS_PER_LONG == 64 &&
> > + bpf_jit_supports_ptr_xchg()) {
> > + insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
> > + insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
> > + BPF_REG_1, BPF_REG_0, 0);
> > + cnt = 2;
> >
> > - new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > - if (!new_prog)
> > - return -ENOMEM;
> > + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > + if (!new_prog)
> > + return -ENOMEM;
> >
> > - delta += cnt - 1;
> > - env->prog = prog = new_prog;
> > - insn = new_prog->insnsi + i + delta;
> > + delta += cnt - 1;
> > + env->prog = prog = new_prog;
> > + insn = new_prog->insnsi + i + delta;
> > + goto next_insn;
> > + }
> > +
> > + insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
> > goto next_insn;
> > }
> > +
> > patch_call_imm:
> > fn = env->ops->get_func_proto(insn->imm, env->prog);
> > /* all functions that have prototype and verifier allowed
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index baa12b24bb64..cdc64ab83ef6 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
> > bpf_async_cancel_and_free(val);
> > }
> >
> > -BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
> > +BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
> > {
> > unsigned long *kptr = dst;
> >
> > @@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
> > return xchg(kptr, (unsigned long)ptr);
> > }
> >
> > +BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
> > +{
> > + unsigned long *kptr = dst;
> > + void *old;
> > +
> > + /*
> > + * If the incoming pointer cannot be torn down safely from NMI later on,
> > + * leave the destination untouched and return ptr so the caller keeps
> > + * ownership.
> > + */
> > + if (ptr && bpf_kptr_offload_inc())
> > + return (unsigned long)ptr;
> > +
> > + old = (void *)xchg(kptr, (unsigned long)ptr);
> > + if (old)
> > + bpf_kptr_offload_dec();
> > + return (unsigned long)old;
> > +}
> > +
> > /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
> > * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
> > * denote type that verifier will determine.
> > + * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
> > */
> > static const struct bpf_func_proto bpf_kptr_xchg_proto = {
> > - .func = bpf_kptr_xchg,
> > + .func = bpf_ref_kptr_xchg,
> > .gpl_only = false,
> > .ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
> > .ret_btf_id = BPF_PTR_POISON,
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 3b1f0ba02f61..162bfd4796ea 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -7,6 +7,7 @@
> > #include <linux/bpf_trace.h>
> > #include <linux/bpf_lirc.h>
> > #include <linux/bpf_verifier.h>
> > +#include <linux/bpf_mem_alloc.h>
> > #include <linux/bsearch.h>
> > #include <linux/btf.h>
> > #include <linux/hex.h>
> > @@ -19,6 +20,8 @@
> > #include <linux/fdtable.h>
> > #include <linux/file.h>
> > #include <linux/fs.h>
> > +#include <linux/irq_work.h>
> > +#include <linux/llist.h>
> > #include <linux/license.h>
> > #include <linux/filter.h>
> > #include <linux/kernel.h>
> > @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
> > static DEFINE_IDR(link_idr);
> > static DEFINE_SPINLOCK(link_idr_lock);
> >
> > +struct bpf_dtor_kptr_work {
> > + struct llist_node node;
> > + void *obj;
> > + btf_dtor_kfunc_t dtor;
> > +};
> > +
> > +/* Queue pending dtors per CPU; the idle pool stays global. */
> > +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
> > +static LLIST_HEAD(bpf_dtor_kptr_idle);
> > +/* Keep total >= refs + active so NMI frees never need to allocate. */
> > +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
> > +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
> > +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
> > +
> > +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
> > +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
> > +
> > +static void bpf_dtor_kptr_worker(struct irq_work *work);
> > +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
> > + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
> > +
>
> I think this still looks too complex:
> * 2 lists - idle list and armed list
> * 3 atomics, controlling demand/supply
> * headroom/trimming management
>
> The complexity is introduced for performance reasons, but
> I'm not sure if the tradeoff is worth it.
>
> What about the following design:
>
> Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot itself.
> Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first
> xchg just once per map value, then reuse it across xchg in/out.
>
Hello,
Thanks for the feedback.
I tried a different but related approach in [1], my v1, which also
stored the work in the map slot (struct bpf_kptr_dtor_aux).
> Detach: When map value is deleted, atomically set kptr map field storing
> bpf_dtor_kptr_work to NULL (so the next xchg-in allocates new
> bpf_dtor_kptr_work.)
> After detaching insert bpf_dtor_kptr_work to the global list and run irq_work.
> Free bpf_dtor_kptr_work in call_rcu_tasks_trace().
This would be nice to do, but there's a sequence where you can delete and
insert simultaneously. Basically an ABBA problem that is inconvenient to
deal with, especially in NMI.
Let's say
CPU0 = Deleting value from the map. xchg w/ null.
CPU1 = Exchanging new value into the map. xchg w/ something
So I give the two orders we can do it in, zeroing the map slot first or
zeroing the work first.
1. Set obj to null, then null out work
CPU0
sets map obj field to null (dtor work is still != null)
CPU1
sets map obj field to something
CPU1 (again)
sees that dtor work is still != null (does nothing, assumes a valid
slot was still allocated)
CPU0
sets dtor work to null and enqueues it in the irq_work.
Then a new dtor work is never allocated for the new slot.
2. And doing in the other order (nulling out work before the obj)
CPU0
sets work to null and enqueues it.
CPU1
sets work to something (allocates new work)
CPU1
sets obj to something
CPU0
sets obj to null.
Then you end up leaking the object set by CPU1.
So whichever order we do it in, 1 or 2, this is an ABBA problem, and it would
require serialization and/or some sort of lock state for every map slot.
Not to mention the complexity: adding extra data to the map slots
requires modifications to every map type. You can see the problem this
introduced in my v1, which took a related approach adding auxiliary data
to map slots.
....
FWIW I do have a much less complex implementation for my v3
revision, which brings it down to a single atomic and two
pcpu_freelists, cutting it to ~215 lines net for the whole fix.
Let me know if I missed anything or am misunderstanding something. I
really appreciate the feedback, though; this is a tougher problem than it
seemed, and your thought was very similar to my initial approach :)
>
> This is based on the bpf_timer and bpf_task_work
> implementations.
>
Those were some of my initial inspiration as well!
Thanks again,
Justin
[1] https://lore.kernel.org/bpf/20260428201422.1518903-1-utilityemal77@gmail.com/
> > +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
> > +{
> > + llist_add(&job->node, &bpf_dtor_kptr_idle);
> > +}
> > +
> > +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> > +{
> > + struct llist_node *node;
> > +
> > + node = llist_del_first(&bpf_dtor_kptr_idle);
> > + if (!node)
> > + return NULL;
> > +
> > + return llist_entry(node, struct bpf_dtor_kptr_work, node);
> > +}
> > +
> > +static void bpf_dtor_kptr_trim(void)
> > +{
> > + struct bpf_dtor_kptr_work *job;
> > + long total;
> > + long needed;
> > +
> > + for (;;) {
> > + total = atomic_long_read(&bpf_dtor_kptr_total);
> > + needed = atomic_long_read(&bpf_dtor_kptr_refs) +
> > + atomic_long_read(&bpf_dtor_kptr_active);
> > + if (total <= needed)
> > + return;
> > +
> > + job = bpf_dtor_kptr_pop_idle();
> > + if (!job)
> > + return;
> > +
> > + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
> > + bpf_dtor_kptr_push_idle(job);
> > + continue;
> > + }
> > +
> > + bpf_mem_free(&bpf_global_ma, job);
> > + }
> > +}
> > +
> > +static int bpf_dtor_kptr_reserve(long needed)
> > +{
> > + struct bpf_dtor_kptr_work *job;
> > + long headroom;
> > + long target;
> > +
> > + headroom = min_t(long, needed, BPF_DTOR_KPTR_RESERVE_HEADROOM);
> > + if (check_add_overflow(needed, headroom, &target))
> > + target = needed;
> > +
> > + while (atomic_long_read(&bpf_dtor_kptr_total) < target) {
> > + job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
> > + if (!job)
> > + return -ENOMEM;
> > + atomic_long_inc(&bpf_dtor_kptr_total);
> > + bpf_dtor_kptr_push_idle(job);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +int bpf_kptr_offload_inc(void)
> > +{
> > + long needed;
> > + int err;
> > +
> > + if (unlikely(!bpf_global_ma_set))
> > + return -ENOMEM;
> > +
> > + /*
> > + * Read active before incrementing refs so a free path moving one slot from
> > + * refs to active cannot shrink the reservation snapshot below the steady
> > + * state we need to cover. Racing results worst case in a larger reservation.
> > + */
> > + needed = atomic_long_read(&bpf_dtor_kptr_active);
> > + needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
> > + err = bpf_dtor_kptr_reserve(needed);
> > + if (err)
> > + atomic_long_dec(&bpf_dtor_kptr_refs);
> > +
> > + return err;
> > +}
> > +
> > +void bpf_kptr_offload_dec(void)
> > +{
> > + long val;
> > +
> > + val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
> > + if (!WARN_ON_ONCE(val < 0))
> > + return;
> > +
> > + /*
> > + * Clamp a mismatched decrement back to zero without overwriting a
> > + * concurrent increment that already repaired the counter.
> > + */
> > + do {
> > + val = atomic_long_read(&bpf_dtor_kptr_refs);
> > + if (val >= 0)
> > + break;
> > + } while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
> > +}
> > +
> > int sysctl_unprivileged_bpf_disabled __read_mostly =
> > IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
> >
> > @@ -807,6 +935,46 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
> > bpf_task_work_cancel_and_free(obj + rec->task_work_off);
> > }
> >
> > +static void bpf_dtor_kptr_worker(struct irq_work *work)
> > +{
> > + struct llist_node *jobs, *node, *next;
> > +
> > + jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
> > + llist_for_each_safe(node, next, jobs) {
> > + struct bpf_dtor_kptr_work *job;
> > +
> > + job = llist_entry(node, struct bpf_dtor_kptr_work, node);
> > + job->dtor(job->obj);
> > + atomic_long_dec(&bpf_dtor_kptr_active);
> > + bpf_dtor_kptr_push_idle(job);
> > + }
> > +
> > + bpf_dtor_kptr_trim();
> > +}
> > +
> > +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
> > +{
> > + struct bpf_dtor_kptr_work *job;
> > +
> > + atomic_long_inc(&bpf_dtor_kptr_active);
> > + job = bpf_dtor_kptr_pop_idle();
> > + if (WARN_ON_ONCE(!job)) {
> > + atomic_long_dec(&bpf_dtor_kptr_active);
> > + /*
> > + * This should stay unreachable if reserve accounting is correct. If it
> > + * ever breaks, running the destructor unsafely is still better than
> > + * leaking the object permanently.
> > + */
> > + dtor(obj);
> > + return;
> > + }
> > +
> > + job->obj = obj;
> > + job->dtor = dtor;
> > + if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
> > + irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
> > +}
> > +
> > void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> > {
> > const struct btf_field *fields;
> > @@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> > xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
> > if (!xchgd_field)
> > break;
> > + if (in_nmi() && field->kptr.dtor) {
> > + bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
> > + bpf_kptr_offload_dec();
> > + break;
> > + }
> > + if (field->kptr.dtor)
> > + /*
> > + * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
> > + * pairs installation with bpf_kptr_offload_inc(). Drop that
> > + * reservation on non-NMI teardown once no active transition is
> > + * needed.
> > + */
> > + bpf_kptr_offload_dec();
> >
> > if (!btf_is_kernel(field->kptr.btf)) {
> > pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 11054ad89c14..2c7b21bda666 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > if (err)
> > return err;
> > }
> > + env->insn_aux_data[insn_idx].kptr_has_dtor =
> > + func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
> >
> > err = record_func_map(env, &meta, func_id, insn_idx);
> > if (err)
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-06 19:52 ` Justin Suess
@ 2026-05-07 14:59 ` Mykyta Yatsenko
2026-05-07 16:41 ` Justin Suess
0 siblings, 1 reply; 12+ messages in thread
From: Mykyta Yatsenko @ 2026-05-07 14:59 UTC (permalink / raw)
To: Justin Suess
Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
yonghong.song, jolsa, bpf, mic, Alexei Starovoitov
On 5/6/26 8:52 PM, Justin Suess wrote:
> On Wed, May 06, 2026 at 05:43:51PM +0100, Mykyta Yatsenko wrote:
>> On 5/5/26 4:08 PM, Justin Suess wrote:
>>> A BPF program attached to tp_btf/nmi_handler can delete map entries or
>>> swap out referenced kptrs from NMI context. Today that runs the kptr
>>> destructor inline. Destructors such as bpf_cpumask_release() can take
>>> RCU-related locks, so running them from NMI can deadlock the system.
>>>
>>> Preallocate offload jobs from the global BPF memory allocator, track the
>>> number of live destructor-backed references so the pool stays ahead of
>>> NMI frees, and let the worker invoke the destructor after NMI exits.
>>>
>>> The algorithm for preallocation is simple: The invariant is total >=
>>> refs + active, where refs = the ref kptrs installed in maps, active =
>>> jobs being executed in the irq_work worker, and total is the number of
>>> job structures allocated. To avoid excessive pre-allocation calls while
>>> maintaining the invariant, we allocate the needed slots, plus a small
>>> amount of extra, min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
>>> BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.
>>>
>>> A small but harmless ordering subtlety: the active atomic is read before
>>> refs. This can result in a small amount of over allocation, but this
>>> won't be leaked and will properly be carried into the trim stage.
>>>
>>> The trim stage is simple. It uses a CAS loop to free excessive leftover
>>> idle job slots. It snapshots total refs and active, pops an idle job if
>>> the pool is excessively large, and attempts a cmpxchg to decrement it
>>> atomically. On a failure case, it will just push the job back into the
>>> idle list and retry.
>>>
>>> There are several best-effort mitigation methods to tackle the memory
>>> pressure problem, preserving integrity under this unlikely scenario.
>>>
>>> If reserving another offload slot fails while installing a new
>>> destructor-backed kptr through bpf_kptr_xchg(), leave the destination
>>> unchanged and return the incoming pointer so the caller keeps ownership.
>>>
>>> This is superior to leaking the pointer, and should only happen if the
>>> accounting is incorrect. Moreover, this is a condition the caller can
>>> check for and recover from.
>>>
>>> If NMI teardown still fails to grab an idle offload job despite that
>>> reserve accounting, warn once and run the destructor inline rather than
>>> leak the object permanently. Attempt to repair the counter safely with
>>> another CAS loop in that case, preserving concurrent increments.
>>>
>>> This fix does come with small performance tradeoffs for safety. xchg can
>>> no longer be inlined for referenced kptrs, as inlining would break the
>>> reference counting. The inlining fix is preserved for kptrs with no
>>> destructor defined.
>>>
>>> This keeps refcounted kptr teardown out of NMI context without slowing
>>> down raw kptr exchanges that never need destructor handling.
>>>
>>> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
>>> Reported-by: Justin Suess <utilityemal77@gmail.com>
>>> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
>>> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
>>> ---
>>> include/linux/bpf.h | 16 ++++
>>> include/linux/bpf_verifier.h | 1 +
>>> kernel/bpf/fixups.c | 33 ++++---
>>> kernel/bpf/helpers.c | 24 ++++-
>>> kernel/bpf/syscall.c | 181 +++++++++++++++++++++++++++++++++++
>>> kernel/bpf/verifier.c | 2 +
>>> 6 files changed, 242 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 715b6df9c403..307de5caa646 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
>>>
>>> void __bpf_free_used_maps(struct bpf_prog_aux *aux,
>>> struct bpf_map **used_maps, u32 len);
>>> +/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
>>> +u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>> +
>>> +#ifdef CONFIG_BPF_SYSCALL
>>> +int bpf_kptr_offload_inc(void);
>>> +void bpf_kptr_offload_dec(void);
>>> +#else
>>> +static inline int bpf_kptr_offload_inc(void)
>>> +{
>>> + return 0;
>>> +}
>>> +
>>> +static inline void bpf_kptr_offload_dec(void)
>>> +{
>>> +}
>>> +#endif
>>>
>>> bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
>>>
>>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>>> index 976e2b2f40e8..8e39ff92dd2c 100644
>>> --- a/include/linux/bpf_verifier.h
>>> +++ b/include/linux/bpf_verifier.h
>>> @@ -672,6 +672,7 @@ struct bpf_insn_aux_data {
>>> bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
>>> bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
>>> bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
>>> + bool kptr_has_dtor;
>>> u8 alu_state; /* used in combination with alu_limit */
>>> /* true if STX or LDX instruction is a part of a spill/fill
>>> * pattern for a bpf_fastcall call.
>>> diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
>>> index fba9e8c00878..459e855e86a5 100644
>>> --- a/kernel/bpf/fixups.c
>>> +++ b/kernel/bpf/fixups.c
>>> @@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
>>> goto next_insn;
>>> }
>>>
>>> - /* Implement bpf_kptr_xchg inline */
>>> - if (prog->jit_requested && BITS_PER_LONG == 64 &&
>>> - insn->imm == BPF_FUNC_kptr_xchg &&
>>> - bpf_jit_supports_ptr_xchg()) {
>>> - insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
>>> - insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
>>> - cnt = 2;
>>> + /* Implement bpf_kptr_xchg inline. */
>>> + if (insn->imm == BPF_FUNC_kptr_xchg &&
>>> + !env->insn_aux_data[i + delta].kptr_has_dtor) {
>>> + if (prog->jit_requested && BITS_PER_LONG == 64 &&
>>> + bpf_jit_supports_ptr_xchg()) {
>>> + insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
>>> + insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
>>> + BPF_REG_1, BPF_REG_0, 0);
>>> + cnt = 2;
>>>
>>> - new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>>> - if (!new_prog)
>>> - return -ENOMEM;
>>> + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>>> + if (!new_prog)
>>> + return -ENOMEM;
>>>
>>> - delta += cnt - 1;
>>> - env->prog = prog = new_prog;
>>> - insn = new_prog->insnsi + i + delta;
>>> + delta += cnt - 1;
>>> + env->prog = prog = new_prog;
>>> + insn = new_prog->insnsi + i + delta;
>>> + goto next_insn;
>>> + }
>>> +
>>> + insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
>>> goto next_insn;
>>> }
>>> +
>>> patch_call_imm:
>>> fn = env->ops->get_func_proto(insn->imm, env->prog);
>>> /* all functions that have prototype and verifier allowed
>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>> index baa12b24bb64..cdc64ab83ef6 100644
>>> --- a/kernel/bpf/helpers.c
>>> +++ b/kernel/bpf/helpers.c
>>> @@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
>>> bpf_async_cancel_and_free(val);
>>> }
>>>
>>> -BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
>>> +BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
>>> {
>>> unsigned long *kptr = dst;
>>>
>>> @@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
>>> return xchg(kptr, (unsigned long)ptr);
>>> }
>>>
>>> +BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
>>> +{
>>> + unsigned long *kptr = dst;
>>> + void *old;
>>> +
>>> + /*
>>> + * If the incoming pointer cannot be torn down safely from NMI later on,
>>> + * leave the destination untouched and return ptr so the caller keeps
>>> + * ownership.
>>> + */
>>> + if (ptr && bpf_kptr_offload_inc())
>>> + return (unsigned long)ptr;
>>> +
>>> + old = (void *)xchg(kptr, (unsigned long)ptr);
>>> + if (old)
>>> + bpf_kptr_offload_dec();
>>> + return (unsigned long)old;
>>> +}
>>> +
>>> /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
>>> * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
>>> * denote type that verifier will determine.
>>> + * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
>>> */
>>> static const struct bpf_func_proto bpf_kptr_xchg_proto = {
>>> - .func = bpf_kptr_xchg,
>>> + .func = bpf_ref_kptr_xchg,
>>> .gpl_only = false,
>>> .ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
>>> .ret_btf_id = BPF_PTR_POISON,
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 3b1f0ba02f61..162bfd4796ea 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -7,6 +7,7 @@
>>> #include <linux/bpf_trace.h>
>>> #include <linux/bpf_lirc.h>
>>> #include <linux/bpf_verifier.h>
>>> +#include <linux/bpf_mem_alloc.h>
>>> #include <linux/bsearch.h>
>>> #include <linux/btf.h>
>>> #include <linux/hex.h>
>>> @@ -19,6 +20,8 @@
>>> #include <linux/fdtable.h>
>>> #include <linux/file.h>
>>> #include <linux/fs.h>
>>> +#include <linux/irq_work.h>
>>> +#include <linux/llist.h>
>>> #include <linux/license.h>
>>> #include <linux/filter.h>
>>> #include <linux/kernel.h>
>>> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
>>> static DEFINE_IDR(link_idr);
>>> static DEFINE_SPINLOCK(link_idr_lock);
>>>
>>> +struct bpf_dtor_kptr_work {
>>> + struct llist_node node;
>>> + void *obj;
>>> + btf_dtor_kfunc_t dtor;
>>> +};
>>> +
>>> +/* Queue pending dtors per CPU; the idle pool stays global. */
>>> +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
>>> +static LLIST_HEAD(bpf_dtor_kptr_idle);
>>> +/* Keep total >= refs + active so NMI frees never need to allocate. */
>>> +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
>>> +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
>>> +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
>>> +
>>> +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
>>> +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
>>> +
>>> +static void bpf_dtor_kptr_worker(struct irq_work *work);
>>> +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
>>> + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
>>> +
>>
>> I think this still looks too complex:
>> * 2 lists - idle list and armed list
>> * 3 atomics, controlling demand/supply
>> * headroom/trimming management
>>
>> The complexity is introduced for performance reasons, but
>> I'm not sure if the tradeoff is worth it.
>>
>> What about the following design:
>>
>> Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot itself.
>> Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first
>> xchg just once per map value, then reuse it across xchg in/out.
>>
> Hello,
>
> Thanks for the feedback.
>
> I tried a different but related approach in [1], my v1, which also
> stored the work in the map slot (struct bpf_kptr_dtor_aux).
>
>> Detach: When map value is deleted, atomically set kptr map field storing
>> bpf_dtor_kptr_work to NULL (so the next xchg-in allocates new
>> bpf_dtor_kptr_work.)
>> After detaching insert bpf_dtor_kptr_work to the global list and run irq_work.
>> Free bpf_dtor_kptr_work in call_rcu_tasks_trace().
> This would be nice to do, but there's a sequence where you can delete and
> insert simultaneously. Basically an ABBA problem that is inconvenient to
> deal with, especially in NMI.
>
> Let's say
>
> CPU0 = Deleting value from the map. xchg w/ null.
> CPU1 = Exchanging new value into the map. xchg w/ something
>
> So I give the two orders we can do it in, zeroing the map slot first or
> zeroing the work first.
>
> 1. Set obj to null, then null out work
>
> CPU0
> sets map obj field to null (dtor work is still != null)
> CPU1
> sets map obj field to something
> CPU1 (again)
> sees that dtor work is still != null (does nothing, assumes a valid
> slot was still allocated)
> CPU0
> sets dtor work to null and enqueues it in the irq_work.
>
> Then a new dtor work is never allocated for the new slot.
>
> 2. And doing in the other order (nulling out work before the obj)
>
> CPU0
> sets work to null and enqueues it.
> CPU1
> sets work to something (allocates new work)
> CPU1
> sets obj to something
> CPU0
> sets obj to null.
>
> Then you end up leaking the object set by CPU1.
I'm not sure I'm following. The idea I suggest is this: you place the kptr
and the work struct in the same structure:
struct bpf_kptr_box {
        void *kptr;
        struct bpf_dtor_kptr_work work;
        struct rcu_head rcu;
};
Then you can atomically detach/attach it.
Very schematic implementation:
unsigned long bpf_ref_kptr_xchg(void *dst, void *ptr)
{
        struct bpf_kptr_box **slot = dst;
        struct bpf_kptr_box *box;

        box = READ_ONCE(*slot);
        if (box)
                /* Box can be detached already, but its lifetime and kptr
                 * destructor are RCU protected, so it's safe to access it here.
                 */
                return (unsigned long)xchg(&box->kptr, ptr);
        if (!ptr)
                return 0; /* xchg-out from empty: nothing to do */

        /* slow path, when map value is not initialized */
        box = kmalloc_nolock();
        if (!box)
                return (unsigned long)ptr;
        box->kptr = ptr;
        if (cmpxchg(slot, NULL, box) != NULL) {
                /* someone installed first: free our box and let the caller
                 * keep ownership of ptr (or retry)
                 */
                kfree(box);
                return (unsigned long)ptr;
        }
        return 0;
}
Run the destructor in an RCU tasks-trace callback, to make sure no concurrent
BPF program is installing a kptr in the meantime:
static void box_rcu_free(struct rcu_head *rcu)
{
        struct bpf_kptr_box *box = container_of(rcu, struct bpf_kptr_box, rcu);

        if (box->kptr)
                dtor(box->kptr); /* the kptr's destructor */
        kfree(box);
}
static void box_kill_irq_work(struct irq_work *irq_work)
{
        /* guarantee no BPF program sees the box */
        call_rcu_tasks_trace(&box->rcu, box_rcu_free);
}
The detach path is simply: box = xchg(slot, NULL).
You can still use a single list where you link detached boxes and the
irq_work, similarly to what you have now; just make sure the dtor and free
are called after an RCU GP.
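For completeness, a rough sketch of that detach side, assuming the per-CPU
jobs list and irq_work from the patch are reused for detached boxes (purely
illustrative, not a worked-out implementation):

        /* e.g. in bpf_obj_free_fields(): detach the whole box, not the kptr */
        box = (struct bpf_kptr_box *)xchg((unsigned long *)field_ptr, 0);
        if (box &&
            llist_add(&box->work.node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
                irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));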
I may have missed some race conditions, though. Does the idea
make sense?
>
> So whichever order we do it in, 1 or 2, this is an ABBA problem, and it would
> require serialization and/or some sort of lock state for every map slot.
>
> Not to mention the complexity: adding extra data to the map slots
> requires modifications to every map type. You can see the problem this
> introduced in my v1, which took a related approach adding auxiliary data
> to map slots.
>
> ....
>
> FWIW I do have a much less complex implementation for my v3
> revision, which brings it down to a single atomic and two
> pcpu_freelists, cutting it to ~215 lines net for the whole fix.
>
> Let me know if I missed anything or am misunderstanding something. I
> really appreciate the feedback, though; this is a tougher problem than it
> seemed, and your thought was very similar to my initial approach :)
>>
>> This is based on the bpf_timer and bpf_task_work
>> implementations.
>>
> Those were some of my initial inspiration as well!
>
> Thanks again,
> Justin
>
> [1] https://lore.kernel.org/bpf/20260428201422.1518903-1-utilityemal77@gmail.com/
>>> +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
>>> +{
>>> + llist_add(&job->node, &bpf_dtor_kptr_idle);
>>> +}
>>> +
>>> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
>>> +{
>>> + struct llist_node *node;
>>> +
>>> + node = llist_del_first(&bpf_dtor_kptr_idle);
>>> + if (!node)
>>> + return NULL;
>>> +
>>> + return llist_entry(node, struct bpf_dtor_kptr_work, node);
>>> +}
>>> +
>>> +static void bpf_dtor_kptr_trim(void)
>>> +{
>>> + struct bpf_dtor_kptr_work *job;
>>> + long total;
>>> + long needed;
>>> +
>>> + for (;;) {
>>> + total = atomic_long_read(&bpf_dtor_kptr_total);
>>> + needed = atomic_long_read(&bpf_dtor_kptr_refs) +
>>> + atomic_long_read(&bpf_dtor_kptr_active);
>>> + if (total <= needed)
>>> + return;
>>> +
>>> + job = bpf_dtor_kptr_pop_idle();
>>> + if (!job)
>>> + return;
>>> +
>>> + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
>>> + bpf_dtor_kptr_push_idle(job);
>>> + continue;
>>> + }
>>> +
>>> + bpf_mem_free(&bpf_global_ma, job);
>>> + }
>>> +}
>>> +
>>> +static int bpf_dtor_kptr_reserve(long needed)
>>> +{
>>> + struct bpf_dtor_kptr_work *job;
>>> + long headroom;
>>> + long target;
>>> +
>>> + headroom = min_t(long, needed, BPF_DTOR_KPTR_RESERVE_HEADROOM);
>>> + if (check_add_overflow(needed, headroom, &target))
>>> + target = needed;
>>> +
>>> + while (atomic_long_read(&bpf_dtor_kptr_total) < target) {
>>> + job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
>>> + if (!job)
>>> + return -ENOMEM;
>>> + atomic_long_inc(&bpf_dtor_kptr_total);
>>> + bpf_dtor_kptr_push_idle(job);
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +int bpf_kptr_offload_inc(void)
>>> +{
>>> + long needed;
>>> + int err;
>>> +
>>> + if (unlikely(!bpf_global_ma_set))
>>> + return -ENOMEM;
>>> +
>>> + /*
>>> + * Read active before incrementing refs so a free path moving one slot from
>>> + * refs to active cannot shrink the reservation snapshot below the steady
>>> + * state we need to cover. Racing results worst case in a larger reservation.
>>> + */
>>> + needed = atomic_long_read(&bpf_dtor_kptr_active);
>>> + needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
>>> + err = bpf_dtor_kptr_reserve(needed);
>>> + if (err)
>>> + atomic_long_dec(&bpf_dtor_kptr_refs);
>>> +
>>> + return err;
>>> +}
>>> +
>>> +void bpf_kptr_offload_dec(void)
>>> +{
>>> + long val;
>>> +
>>> + val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
>>> + if (!WARN_ON_ONCE(val < 0))
>>> + return;
>>> +
>>> + /*
>>> + * Clamp a mismatched decrement back to zero without overwriting a
>>> + * concurrent increment that already repaired the counter.
>>> + */
>>> + do {
>>> + val = atomic_long_read(&bpf_dtor_kptr_refs);
>>> + if (val >= 0)
>>> + break;
>>> + } while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
>>> +}
>>> +
>>> int sysctl_unprivileged_bpf_disabled __read_mostly =
>>> IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
>>>
>>> @@ -807,6 +935,46 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
>>> bpf_task_work_cancel_and_free(obj + rec->task_work_off);
>>> }
>>>
>>> +static void bpf_dtor_kptr_worker(struct irq_work *work)
>>> +{
>>> + struct llist_node *jobs, *node, *next;
>>> +
>>> + jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
>>> + llist_for_each_safe(node, next, jobs) {
>>> + struct bpf_dtor_kptr_work *job;
>>> +
>>> + job = llist_entry(node, struct bpf_dtor_kptr_work, node);
>>> + job->dtor(job->obj);
>>> + atomic_long_dec(&bpf_dtor_kptr_active);
>>> + bpf_dtor_kptr_push_idle(job);
>>> + }
>>> +
>>> + bpf_dtor_kptr_trim();
>>> +}
>>> +
>>> +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
>>> +{
>>> + struct bpf_dtor_kptr_work *job;
>>> +
>>> + atomic_long_inc(&bpf_dtor_kptr_active);
>>> + job = bpf_dtor_kptr_pop_idle();
>>> + if (WARN_ON_ONCE(!job)) {
>>> + atomic_long_dec(&bpf_dtor_kptr_active);
>>> + /*
>>> + * This should stay unreachable if reserve accounting is correct. If it
>>> + * ever breaks, running the destructor unsafely is still better than
>>> + * leaking the object permanently.
>>> + */
>>> + dtor(obj);
>>> + return;
>>> + }
>>> +
>>> + job->obj = obj;
>>> + job->dtor = dtor;
>>> + if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
>>> + irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
>>> +}
>>> +
>>> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>> {
>>> const struct btf_field *fields;
>>> @@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>>> if (!xchgd_field)
>>> break;
>>> + if (in_nmi() && field->kptr.dtor) {
>>> + bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
>>> + bpf_kptr_offload_dec();
>>> + break;
>>> + }
>>> + if (field->kptr.dtor)
>>> + /*
>>> + * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
>>> + * pairs installation with bpf_kptr_offload_inc(). Drop that
>>> + * reservation on non-NMI teardown once no active transition is
>>> + * needed.
>>> + */
>>> + bpf_kptr_offload_dec();
>>>
>>> if (!btf_is_kernel(field->kptr.btf)) {
>>> pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index 11054ad89c14..2c7b21bda666 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>>> if (err)
>>> return err;
>>> }
>>> + env->insn_aux_data[insn_idx].kptr_has_dtor =
>>> + func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
>>>
>>> err = record_func_map(env, &meta, func_id, insn_idx);
>>> if (err)
>>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-07 14:59 ` Mykyta Yatsenko
@ 2026-05-07 16:41 ` Justin Suess
2026-05-07 17:19 ` Mykyta Yatsenko
0 siblings, 1 reply; 12+ messages in thread
From: Justin Suess @ 2026-05-07 16:41 UTC (permalink / raw)
To: Mykyta Yatsenko
Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
yonghong.song, jolsa, bpf, mic, Alexei Starovoitov
On Thu, May 07, 2026 at 03:59:04PM +0100, Mykyta Yatsenko wrote:
>
>
> On 5/6/26 8:52 PM, Justin Suess wrote:
> > On Wed, May 06, 2026 at 05:43:51PM +0100, Mykyta Yatsenko wrote:
> >> On 5/5/26 4:08 PM, Justin Suess wrote:
> >>> A BPF program attached to tp_btf/nmi_handler can delete map entries or
> >>> swap out referenced kptrs from NMI context. Today that runs the kptr
> >>> destructor inline. Destructors such as bpf_cpumask_release() can take
> >>> RCU-related locks, so running them from NMI can deadlock the system.
> >>>
> >>> Preallocate offload jobs from the global BPF memory allocator, track the
> >>> number of live destructor-backed references so the pool stays ahead of
> >>> NMI frees, and let the worker invoke the destructor after NMI exits.
> >>>
> >>> The algorithm for preallocation is simple: The invariant is total >=
> >>> refs + active, where refs = the ref kptrs installed in maps, active =
> >>> jobs being executed in the irq_work worker, and total is the number of
> >>> job structures allocated. To avoid excessive pre-allocation calls while
> >>> maintaining the invariant, we allocate the needed slots, plus a small
> >>> amount of extra, min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
> >>> BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.
> >>>
> >>> A small but harmless ordering subtlety: the active atomic is read before
> >>> refs. This can result in a small amount of over allocation, but this
> >>> won't be leaked and will properly be carried into the trim stage.
> >>>
> >>> The trim stage is simple. It uses a CAS loop to free excessive leftover
> >>> idle job slots. It snapshots total refs and active, pops an idle job if
> >>> the pool is excessively large, and attempts a cmpxhg to decrement it
> >>> atomically. On a failure case, it will just push the job back into the
> >>> idle list and retry.
> >>>
> >>> There are several best-effort mitigation methods to tackle the memory
> >>> pressure problem, preserving integrity under this unlikely scenario.
> >>>
> >>> If reserving another offload slot fails while installing a new
> >>> destructor-backed kptr through bpf_kptr_xchg(), leave the destination
> >>> unchanged and return the incoming pointer so the caller keeps ownership.
> >>>
> >>> This is superior to leaking the pointer, and should only happen if the
> >>> accounting is incorrect. Moreover, this is a condition the caller can
> >>> check for and recover from.
> >>>
> >>> If NMI teardown still fails to grab an idle offload job despite that
> >>> reserve accounting, warn once and run the destructor inline rather than
> >>> leak the object permanently. Attempt to repair the counter safely with
> >>> another CAS loop in that case, preserving concurrent increments.
> >>>
> >>> This fix does come with small performance tradeoffs for safety. xchg can
> >>> no longer be inlined for referenced kptrs, as inlining would break the
> >>> reference counting. The inlining fix is preserved for kptrs with no
> >>> destructor defined.
> >>>
> >>> This keeps refcounted kptr teardown out of NMI context without slowing
> >>> down raw kptr exchanges that never need destructor handling.
> >>>
> >>> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> >>> Reported-by: Justin Suess <utilityemal77@gmail.com>
> >>> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> >>> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> >>> ---
> >>> include/linux/bpf.h | 16 ++++
> >>> include/linux/bpf_verifier.h | 1 +
> >>> kernel/bpf/fixups.c | 33 ++++---
> >>> kernel/bpf/helpers.c | 24 ++++-
> >>> kernel/bpf/syscall.c | 181 +++++++++++++++++++++++++++++++++++
> >>> kernel/bpf/verifier.c | 2 +
> >>> 6 files changed, 242 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >>> index 715b6df9c403..307de5caa646 100644
> >>> --- a/include/linux/bpf.h
> >>> +++ b/include/linux/bpf.h
> >>> @@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
> >>>
> >>> void __bpf_free_used_maps(struct bpf_prog_aux *aux,
> >>> struct bpf_map **used_maps, u32 len);
> >>> +/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
> >>> +u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> >>> +
> >>> +#ifdef CONFIG_BPF_SYSCALL
> >>> +int bpf_kptr_offload_inc(void);
> >>> +void bpf_kptr_offload_dec(void);
> >>> +#else
> >>> +static inline int bpf_kptr_offload_inc(void)
> >>> +{
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static inline void bpf_kptr_offload_dec(void)
> >>> +{
> >>> +}
> >>> +#endif
> >>>
> >>> bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
> >>>
> >>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> >>> index 976e2b2f40e8..8e39ff92dd2c 100644
> >>> --- a/include/linux/bpf_verifier.h
> >>> +++ b/include/linux/bpf_verifier.h
> >>> @@ -672,6 +672,7 @@ struct bpf_insn_aux_data {
> >>> bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
> >>> bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
> >>> bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
> >>> + bool kptr_has_dtor;
> >>> u8 alu_state; /* used in combination with alu_limit */
> >>> /* true if STX or LDX instruction is a part of a spill/fill
> >>> * pattern for a bpf_fastcall call.
> >>> diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
> >>> index fba9e8c00878..459e855e86a5 100644
> >>> --- a/kernel/bpf/fixups.c
> >>> +++ b/kernel/bpf/fixups.c
> >>> @@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
> >>> goto next_insn;
> >>> }
> >>>
> >>> - /* Implement bpf_kptr_xchg inline */
> >>> - if (prog->jit_requested && BITS_PER_LONG == 64 &&
> >>> - insn->imm == BPF_FUNC_kptr_xchg &&
> >>> - bpf_jit_supports_ptr_xchg()) {
> >>> - insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
> >>> - insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
> >>> - cnt = 2;
> >>> + /* Implement bpf_kptr_xchg inline. */
> >>> + if (insn->imm == BPF_FUNC_kptr_xchg &&
> >>> + !env->insn_aux_data[i + delta].kptr_has_dtor) {
> >>> + if (prog->jit_requested && BITS_PER_LONG == 64 &&
> >>> + bpf_jit_supports_ptr_xchg()) {
> >>> + insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
> >>> + insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
> >>> + BPF_REG_1, BPF_REG_0, 0);
> >>> + cnt = 2;
> >>>
> >>> - new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> >>> - if (!new_prog)
> >>> - return -ENOMEM;
> >>> + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> >>> + if (!new_prog)
> >>> + return -ENOMEM;
> >>>
> >>> - delta += cnt - 1;
> >>> - env->prog = prog = new_prog;
> >>> - insn = new_prog->insnsi + i + delta;
> >>> + delta += cnt - 1;
> >>> + env->prog = prog = new_prog;
> >>> + insn = new_prog->insnsi + i + delta;
> >>> + goto next_insn;
> >>> + }
> >>> +
> >>> + insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
> >>> goto next_insn;
> >>> }
> >>> +
> >>> patch_call_imm:
> >>> fn = env->ops->get_func_proto(insn->imm, env->prog);
> >>> /* all functions that have prototype and verifier allowed
> >>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >>> index baa12b24bb64..cdc64ab83ef6 100644
> >>> --- a/kernel/bpf/helpers.c
> >>> +++ b/kernel/bpf/helpers.c
> >>> @@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
> >>> bpf_async_cancel_and_free(val);
> >>> }
> >>>
> >>> -BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
> >>> +BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
> >>> {
> >>> unsigned long *kptr = dst;
> >>>
> >>> @@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
> >>> return xchg(kptr, (unsigned long)ptr);
> >>> }
> >>>
> >>> +BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
> >>> +{
> >>> + unsigned long *kptr = dst;
> >>> + void *old;
> >>> +
> >>> + /*
> >>> + * If the incoming pointer cannot be torn down safely from NMI later on,
> >>> + * leave the destination untouched and return ptr so the caller keeps
> >>> + * ownership.
> >>> + */
> >>> + if (ptr && bpf_kptr_offload_inc())
> >>> + return (unsigned long)ptr;
> >>> +
> >>> + old = (void *)xchg(kptr, (unsigned long)ptr);
> >>> + if (old)
> >>> + bpf_kptr_offload_dec();
> >>> + return (unsigned long)old;
> >>> +}
> >>> +
> >>> /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
> >>> * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
> >>> * denote type that verifier will determine.
> >>> + * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
> >>> */
> >>> static const struct bpf_func_proto bpf_kptr_xchg_proto = {
> >>> - .func = bpf_kptr_xchg,
> >>> + .func = bpf_ref_kptr_xchg,
> >>> .gpl_only = false,
> >>> .ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
> >>> .ret_btf_id = BPF_PTR_POISON,
> >>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >>> index 3b1f0ba02f61..162bfd4796ea 100644
> >>> --- a/kernel/bpf/syscall.c
> >>> +++ b/kernel/bpf/syscall.c
> >>> @@ -7,6 +7,7 @@
> >>> #include <linux/bpf_trace.h>
> >>> #include <linux/bpf_lirc.h>
> >>> #include <linux/bpf_verifier.h>
> >>> +#include <linux/bpf_mem_alloc.h>
> >>> #include <linux/bsearch.h>
> >>> #include <linux/btf.h>
> >>> #include <linux/hex.h>
> >>> @@ -19,6 +20,8 @@
> >>> #include <linux/fdtable.h>
> >>> #include <linux/file.h>
> >>> #include <linux/fs.h>
> >>> +#include <linux/irq_work.h>
> >>> +#include <linux/llist.h>
> >>> #include <linux/license.h>
> >>> #include <linux/filter.h>
> >>> #include <linux/kernel.h>
> >>> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
> >>> static DEFINE_IDR(link_idr);
> >>> static DEFINE_SPINLOCK(link_idr_lock);
> >>>
> >>> +struct bpf_dtor_kptr_work {
> >>> + struct llist_node node;
> >>> + void *obj;
> >>> + btf_dtor_kfunc_t dtor;
> >>> +};
> >>> +
> >>> +/* Queue pending dtors per CPU; the idle pool stays global. */
> >>> +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
> >>> +static LLIST_HEAD(bpf_dtor_kptr_idle);
> >>> +/* Keep total >= refs + active so NMI frees never need to allocate. */
> >>> +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
> >>> +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
> >>> +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
> >>> +
> >>> +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
> >>> +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
> >>> +
> >>> +static void bpf_dtor_kptr_worker(struct irq_work *work);
> >>> +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
> >>> + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
> >>> +
> >>
> >> I think this still looks too complex:
> >> * 2 lists - idle list and armed list
> >> * 3 atomics, controlling demand/supply
> >> * headroom/trimming management
> >>
> >> The complexity introduced for performance reasons, but
> >> I'm not sure if the tradeoff is worth it.
> >>
> >> What about the next design:
> >>
> >> Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot itself.
> >> Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first
> >> xchg just once per map value, then reuse it across xchg in/out.
> >>
> > Hello,
> >
> > Thanks for the feedback.
> >
> > I tried a different but related approach in [1], my v1, which also
> > stored the work in the map slot (struct bpf_kptr_dtor_aux).
> >
> >> Detach: When map value is deleted, atomically set kptr map field storing
> >> bpf_dtor_kptr_work to NULL (so the next xchg-in allocates new
> >> bpf_dtor_kptr_work.)
> >> After detaching insert bpf_dtor_kptr_work to the global list and run irq_work.
> >> Free bpf_dtor_kptr_work in call_rcu_tasks_trace().
> > This would be nice to do but there's a sequence where you can delete and
> > insert simultaneously. Basically an ABBA problem that is inconvenient to
> > deal with, especially in NMI.
> >
> > Let's say
> >
> > CPU0 = Deleting value from the map. xchg w/ null.
> > CPU1 = Exchanging new value into the map. xchg w/ something
> >
> > So I give the two orders we can do it in, zeroing the map slot first or
> > zeroing the work first.
> >
> > 1. Set obj to null, then null out work
> >
> > CPU0
> > sets map obj field to null (dtor work is still != null)
> > CPU1
> > sets map obj field to something
> > CPU1 (again)
> > sees that dtor work is still != null, (does nothing, assumes a valid
> > slot was still allocated)
> > CPU0
> > sets dtor work to null and enqueues it in the irq_work.
> >
> > Then a new dtor work is never allocated for the new slot.
> >
> > 2. And doing in the other order (nulling out work before the obj)
> >
> > CPU0
> > sets work to null and enqueues it.
> > CPU1
> > sets work to something (allocates new work)
> > CPU1
> > sets obj to something
> > CPU0
> > sets obj to null.
> >
> > Then you end up leaking the object set by CPU1.
>
> I'm not sure I'm following,
> the idea I suggest is next: you place kptr and work struct in the
> same structure:
>
> struct bpf_kptr_box {
> void *kptr;
> struct bpf_dtor_kptr_work work;
> struct rcu_head rcu;
> };
>
Ah I see. Sorry for misinterpreting that. This is less similar to my
initial approach then, and doesn't have this race condition.
There's a separate problem though:
This essentially makes it so the map stores a pointer to the
bpf_kptr_box object instead of the kptr directly. BPF_KPTR_REF fields
in map values can be accessed directly, and when you do that,
you're now going to be grabbing our box instead of the object.
In bpf.h 731:
/* PTR is not trusted. This is only used with PTR_TO_BTF_ID, to mark
* unreferenced and referenced kptr loaded from map value using a load
* instruction, so that they can only be dereferenced but not escape the
* BPF program into the kernel (i.e. cannot be passed as arguments to
* kfunc or bpf helpers).
*/
So while putting a referenced kptr field into a map requires xchg, the
inverse isn't true: you can directly access values in maps as
PTR_UNTRUSTED.
This box struct would make it so whenever we do a BPF_LDX directly
on a BPF_KPTR_REF field, we're accessing our box struct rather than
the actual kptr. So this would break dereferencing and require every
dereference be made into two dereferences slot-> our box -> our kptr obj.
It would also break the case where we want to compare kptrs directly,
since even though the kptrs might be identical addresses, our box
structs will have separate ones. So the consequences of changing that
ripple very far.
(Also, this would break KF_RCU kfuncs, which can accept directly loaded
kptrs inside an RCU read-side critical section for RCU-safe kptrs, and
would require the same indirection there.)
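To make the direct load point concrete, this is roughly the kind of
program I have in mind. Just an illustration, the map layout and names
are made up:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct val {
        struct bpf_cpumask __kptr *mask; /* referenced kptr with a dtor */
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 2);
        __type(key, int);
        __type(value, struct val);
} arr SEC(".maps");

SEC("tp_btf/nmi_handler")
int direct_load(void *ctx)
{
        int k0 = 0, k1 = 1;
        struct val *a = bpf_map_lookup_elem(&arr, &k0);
        struct val *b = bpf_map_lookup_elem(&arr, &k1);

        if (!a || !b)
                return 0;

        /* Plain loads, no bpf_kptr_xchg(): today these yield the cpumask
         * objects themselves. With a box stored in the slot they would
         * yield boxes, so the comparison below and any dereference of
         * a->mask would silently change meaning.
         */
        if (a->mask && a->mask == b->mask)
                bpf_printk("both slots hold the same cpumask");

        return 0;
}

char LICENSE[] SEC("license") = "GPL";

None of that goes through the helper, so there is no point where we
could transparently unwrap the box without patching the program.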
> Then you can atomically detach/attach it.
> Very schematic implementation:
>
> unsigned long bpf_ref_kptr_xchg(void *dst, void *ptr)
> {
> struct bpf_kptr_box **slot = dst;
> struct bpf_kptr_box *box;
>
> box = READ_ONCE(*slot);
> if (box)
> /* Box can be detached already, but its lifetime and kptr destructor are
> * RCU protected, so it's safe to access it here.
> */
> return xchg(&box->kptr, ptr);
>
> if (!ptr)
> return 0; /* xchg-out from empty: nothing to do */
>
> /* slow path, when map value is not initialized */
> box = kmalloc_nolock();
> if (!box)
> return (unsigned long)ptr;
>
> box->kptr = ptr;
> if (cmpxchg(slot, NULL, box) != NULL) {
> kfree(new_box);
> /* someone installed first, retry or error out */
> }
>
> return NULL;
> }
>
> Run destructor in rcu tt callback, to make sure no concurrent BPF program
> is installing kptr in the meantime:
>
> void box_rcu_free()
> {
> dtor();
> kfree(box);
> }
>
> static void box_kill_irq_work(struct irq_work* irq_work)
> {
> /* guarantee no BPF program sees the box */
> call_rcu_tasks_trace(&box->rcu, box_rcu_free);
> }
>
> Detach path is simply: box = xchg(slot, NULL).
>
> You can still use a single list where you link detached boxes and
> irq_work similarly as you have now, just make sure dtor and free
> are called after RCU GP.
>
> I may have missed some race conditions, though, does the idea
> make sense?
>
It does make sense. That does solve the race condition issue, but
basically existing code assumes that the kptr in the map points to a
real live kernel object, so pointing it to a box breaks a lot of
assumptions.
Thanks for the feedback again and appreciate the follow up! This is
interesting to dig into.
I have v3 about ready, and will send in a little bit.
Thanks,
Justin
> >
> > So either order we do it in 1 or 2, this is an ABBA problem.
> > And would require serialization and/or some sort of lock state
> > for every map slot.
> >
> > Not to mention the complexity of adding extra data to the map slots
> > requires modifications to every map type. You can see the problem this
> > introduced in my v1, which took a related approach adding auxiliary data
> > to map slots.
> >
> > ....
> >
> > FWIW I do have a much less complex implementation for my v3
> > revision which brings it down to a single atomic and two
> > pcpu_freelists, cutting it to 215~ lines net for the whole fix.
> >
> > Let me know if I missed anything or am misunderstanding something. I
> > really appreciate the feedback though this is a tougher problem than it
> > seemed and your thought was very similar to my initial approach :)
> >>
> >> This is based on the bpf_timer and bpf_task_work
> >> implementations.
> >>
> > Those were some of my initial inspiration as well!
> >
> > Thanks again,
> > Justin
> >
> > [1] https://lore.kernel.org/bpf/20260428201422.1518903-1-utilityemal77@gmail.com/
> >>> +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
> >>> +{
> >>> + llist_add(&job->node, &bpf_dtor_kptr_idle);
> >>> +}
> >>> +
> >>> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> >>> +{
> >>> + struct llist_node *node;
> >>> +
> >>> + node = llist_del_first(&bpf_dtor_kptr_idle);
> >>> + if (!node)
> >>> + return NULL;
> >>> +
> >>> + return llist_entry(node, struct bpf_dtor_kptr_work, node);
> >>> +}
> >>> +
> >>> +static void bpf_dtor_kptr_trim(void)
> >>> +{
> >>> + struct bpf_dtor_kptr_work *job;
> >>> + long total;
> >>> + long needed;
> >>> +
> >>> + for (;;) {
> >>> + total = atomic_long_read(&bpf_dtor_kptr_total);
> >>> + needed = atomic_long_read(&bpf_dtor_kptr_refs) +
> >>> + atomic_long_read(&bpf_dtor_kptr_active);
> >>> + if (total <= needed)
> >>> + return;
> >>> +
> >>> + job = bpf_dtor_kptr_pop_idle();
> >>> + if (!job)
> >>> + return;
> >>> +
> >>> + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
> >>> + bpf_dtor_kptr_push_idle(job);
> >>> + continue;
> >>> + }
> >>> +
> >>> + bpf_mem_free(&bpf_global_ma, job);
> >>> + }
> >>> +}
> >>> +
> >>> +static int bpf_dtor_kptr_reserve(long needed)
> >>> +{
> >>> + struct bpf_dtor_kptr_work *job;
> >>> + long headroom;
> >>> + long target;
> >>> +
> >>> + headroom = min_t(long, needed, BPF_DTOR_KPTR_RESERVE_HEADROOM);
> >>> + if (check_add_overflow(needed, headroom, &target))
> >>> + target = needed;
> >>> +
> >>> + while (atomic_long_read(&bpf_dtor_kptr_total) < target) {
> >>> + job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
> >>> + if (!job)
> >>> + return -ENOMEM;
> >>> + atomic_long_inc(&bpf_dtor_kptr_total);
> >>> + bpf_dtor_kptr_push_idle(job);
> >>> + }
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +int bpf_kptr_offload_inc(void)
> >>> +{
> >>> + long needed;
> >>> + int err;
> >>> +
> >>> + if (unlikely(!bpf_global_ma_set))
> >>> + return -ENOMEM;
> >>> +
> >>> + /*
> >>> + * Read active before incrementing refs so a free path moving one slot from
> >>> + * refs to active cannot shrink the reservation snapshot below the steady
> >>> + * state we need to cover. Racing results worst case in a larger reservation.
> >>> + */
> >>> + needed = atomic_long_read(&bpf_dtor_kptr_active);
> >>> + needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
> >>> + err = bpf_dtor_kptr_reserve(needed);
> >>> + if (err)
> >>> + atomic_long_dec(&bpf_dtor_kptr_refs);
> >>> +
> >>> + return err;
> >>> +}
> >>> +
> >>> +void bpf_kptr_offload_dec(void)
> >>> +{
> >>> + long val;
> >>> +
> >>> + val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
> >>> + if (!WARN_ON_ONCE(val < 0))
> >>> + return;
> >>> +
> >>> + /*
> >>> + * Clamp a mismatched decrement back to zero without overwriting a
> >>> + * concurrent increment that already repaired the counter.
> >>> + */
> >>> + do {
> >>> + val = atomic_long_read(&bpf_dtor_kptr_refs);
> >>> + if (val >= 0)
> >>> + break;
> >>> + } while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
> >>> +}
> >>> +
> >>> int sysctl_unprivileged_bpf_disabled __read_mostly =
> >>> IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
> >>>
> >>> @@ -807,6 +935,46 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
> >>> bpf_task_work_cancel_and_free(obj + rec->task_work_off);
> >>> }
> >>>
> >>> +static void bpf_dtor_kptr_worker(struct irq_work *work)
> >>> +{
> >>> + struct llist_node *jobs, *node, *next;
> >>> +
> >>> + jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
> >>> + llist_for_each_safe(node, next, jobs) {
> >>> + struct bpf_dtor_kptr_work *job;
> >>> +
> >>> + job = llist_entry(node, struct bpf_dtor_kptr_work, node);
> >>> + job->dtor(job->obj);
> >>> + atomic_long_dec(&bpf_dtor_kptr_active);
> >>> + bpf_dtor_kptr_push_idle(job);
> >>> + }
> >>> +
> >>> + bpf_dtor_kptr_trim();
> >>> +}
> >>> +
> >>> +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
> >>> +{
> >>> + struct bpf_dtor_kptr_work *job;
> >>> +
> >>> + atomic_long_inc(&bpf_dtor_kptr_active);
> >>> + job = bpf_dtor_kptr_pop_idle();
> >>> + if (WARN_ON_ONCE(!job)) {
> >>> + atomic_long_dec(&bpf_dtor_kptr_active);
> >>> + /*
> >>> + * This should stay unreachable if reserve accounting is correct. If it
> >>> + * ever breaks, running the destructor unsafely is still better than
> >>> + * leaking the object permanently.
> >>> + */
> >>> + dtor(obj);
> >>> + return;
> >>> + }
> >>> +
> >>> + job->obj = obj;
> >>> + job->dtor = dtor;
> >>> + if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
> >>> + irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
> >>> +}
> >>> +
> >>> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> >>> {
> >>> const struct btf_field *fields;
> >>> @@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> >>> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
> >>> if (!xchgd_field)
> >>> break;
> >>> + if (in_nmi() && field->kptr.dtor) {
> >>> + bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
> >>> + bpf_kptr_offload_dec();
> >>> + break;
> >>> + }
> >>> + if (field->kptr.dtor)
> >>> + /*
> >>> + * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
> >>> + * pairs installation with bpf_kptr_offload_inc(). Drop that
> >>> + * reservation on non-NMI teardown once no active transition is
> >>> + * needed.
> >>> + */
> >>> + bpf_kptr_offload_dec();
> >>>
> >>> if (!btf_is_kernel(field->kptr.btf)) {
> >>> pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>> index 11054ad89c14..2c7b21bda666 100644
> >>> --- a/kernel/bpf/verifier.c
> >>> +++ b/kernel/bpf/verifier.c
> >>> @@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >>> if (err)
> >>> return err;
> >>> }
> >>> + env->insn_aux_data[insn_idx].kptr_has_dtor =
> >>> + func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
> >>>
> >>> err = record_func_map(env, &meta, func_id, insn_idx);
> >>> if (err)
> >>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-07 16:41 ` Justin Suess
@ 2026-05-07 17:19 ` Mykyta Yatsenko
0 siblings, 0 replies; 12+ messages in thread
From: Mykyta Yatsenko @ 2026-05-07 17:19 UTC (permalink / raw)
To: Justin Suess
Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
yonghong.song, jolsa, bpf, mic, Alexei Starovoitov
On 5/7/26 5:41 PM, Justin Suess wrote:
> On Thu, May 07, 2026 at 03:59:04PM +0100, Mykyta Yatsenko wrote:
>>
>>
>> On 5/6/26 8:52 PM, Justin Suess wrote:
>>> On Wed, May 06, 2026 at 05:43:51PM +0100, Mykyta Yatsenko wrote:
>>>> On 5/5/26 4:08 PM, Justin Suess wrote:
>>>>> A BPF program attached to tp_btf/nmi_handler can delete map entries or
>>>>> swap out referenced kptrs from NMI context. Today that runs the kptr
>>>>> destructor inline. Destructors such as bpf_cpumask_release() can take
>>>>> RCU-related locks, so running them from NMI can deadlock the system.
>>>>>
>>>>> Preallocate offload jobs from the global BPF memory allocator, track the
>>>>> number of live destructor-backed references so the pool stays ahead of
>>>>> NMI frees, and let the worker invoke the destructor after NMI exits.
>>>>>
>>>>> The algorithm for preallocation is simple: The invariant is total >=
>>>>> refs + active, where refs = the ref kptrs installed in maps, active =
>>>>> jobs being executed in the irq_work worker, and total is the number of
>>>>> job structures allocated. To avoid excessive pre-allocation calls while
>>>>> maintaining the invariant, we allocate the needed slots, plus a small
>>>>> amount of extra, min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
>>>>> BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.
>>>>>
>>>>> A small but harmless ordering subtlety: the active atomic is read before
>>>>> refs. This can result in a small amount of over allocation, but this
>>>>> won't be leaked and will properly be carried into the trim stage.
>>>>>
>>>>> The trim stage is simple. It uses a CAS loop to free excessive leftover
>>>>> idle job slots. It snapshots total refs and active, pops an idle job if
>>>>> the pool is excessively large, and attempts a cmpxhg to decrement it
>>>>> atomically. On a failure case, it will just push the job back into the
>>>>> idle list and retry.
>>>>>
>>>>> There are several best-effort mitigation methods to tackle the memory
>>>>> pressure problem, preserving integrity under this unlikely scenario.
>>>>>
>>>>> If reserving another offload slot fails while installing a new
>>>>> destructor-backed kptr through bpf_kptr_xchg(), leave the destination
>>>>> unchanged and return the incoming pointer so the caller keeps ownership.
>>>>>
>>>>> This is superior to leaking the pointer, and should only happen if the
>>>>> accounting is incorrect. Moreover, this is a condition the caller can
>>>>> check for and recover from.
>>>>>
>>>>> If NMI teardown still fails to grab an idle offload job despite that
>>>>> reserve accounting, warn once and run the destructor inline rather than
>>>>> leak the object permanently. Attempt to repair the counter safely with
>>>>> another CAS loop in that case, preserving concurrent increments.
>>>>>
>>>>> This fix does come with small performance tradeoffs for safety. xchg can
>>>>> no longer be inlined for referenced kptrs, as inlining would break the
>>>>> reference counting. The inlining fix is preserved for kptrs with no
>>>>> destructor defined.
>>>>>
>>>>> This keeps refcounted kptr teardown out of NMI context without slowing
>>>>> down raw kptr exchanges that never need destructor handling.
>>>>>
>>>>> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
>>>>> Reported-by: Justin Suess <utilityemal77@gmail.com>
>>>>> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
>>>>> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
>>>>> ---
>>>>> include/linux/bpf.h | 16 ++++
>>>>> include/linux/bpf_verifier.h | 1 +
>>>>> kernel/bpf/fixups.c | 33 ++++---
>>>>> kernel/bpf/helpers.c | 24 ++++-
>>>>> kernel/bpf/syscall.c | 181 +++++++++++++++++++++++++++++++++++
>>>>> kernel/bpf/verifier.c | 2 +
>>>>> 6 files changed, 242 insertions(+), 15 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>> index 715b6df9c403..307de5caa646 100644
>>>>> --- a/include/linux/bpf.h
>>>>> +++ b/include/linux/bpf.h
>>>>> @@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
>>>>>
>>>>> void __bpf_free_used_maps(struct bpf_prog_aux *aux,
>>>>> struct bpf_map **used_maps, u32 len);
>>>>> +/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
>>>>> +u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>>>> +
>>>>> +#ifdef CONFIG_BPF_SYSCALL
>>>>> +int bpf_kptr_offload_inc(void);
>>>>> +void bpf_kptr_offload_dec(void);
>>>>> +#else
>>>>> +static inline int bpf_kptr_offload_inc(void)
>>>>> +{
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static inline void bpf_kptr_offload_dec(void)
>>>>> +{
>>>>> +}
>>>>> +#endif
>>>>>
>>>>> bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
>>>>>
>>>>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>>>>> index 976e2b2f40e8..8e39ff92dd2c 100644
>>>>> --- a/include/linux/bpf_verifier.h
>>>>> +++ b/include/linux/bpf_verifier.h
>>>>> @@ -672,6 +672,7 @@ struct bpf_insn_aux_data {
>>>>> bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
>>>>> bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
>>>>> bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
>>>>> + bool kptr_has_dtor;
>>>>> u8 alu_state; /* used in combination with alu_limit */
>>>>> /* true if STX or LDX instruction is a part of a spill/fill
>>>>> * pattern for a bpf_fastcall call.
>>>>> diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
>>>>> index fba9e8c00878..459e855e86a5 100644
>>>>> --- a/kernel/bpf/fixups.c
>>>>> +++ b/kernel/bpf/fixups.c
>>>>> @@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
>>>>> goto next_insn;
>>>>> }
>>>>>
>>>>> - /* Implement bpf_kptr_xchg inline */
>>>>> - if (prog->jit_requested && BITS_PER_LONG == 64 &&
>>>>> - insn->imm == BPF_FUNC_kptr_xchg &&
>>>>> - bpf_jit_supports_ptr_xchg()) {
>>>>> - insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
>>>>> - insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
>>>>> - cnt = 2;
>>>>> + /* Implement bpf_kptr_xchg inline. */
>>>>> + if (insn->imm == BPF_FUNC_kptr_xchg &&
>>>>> + !env->insn_aux_data[i + delta].kptr_has_dtor) {
>>>>> + if (prog->jit_requested && BITS_PER_LONG == 64 &&
>>>>> + bpf_jit_supports_ptr_xchg()) {
>>>>> + insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
>>>>> + insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
>>>>> + BPF_REG_1, BPF_REG_0, 0);
>>>>> + cnt = 2;
>>>>>
>>>>> - new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>>>>> - if (!new_prog)
>>>>> - return -ENOMEM;
>>>>> + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>>>>> + if (!new_prog)
>>>>> + return -ENOMEM;
>>>>>
>>>>> - delta += cnt - 1;
>>>>> - env->prog = prog = new_prog;
>>>>> - insn = new_prog->insnsi + i + delta;
>>>>> + delta += cnt - 1;
>>>>> + env->prog = prog = new_prog;
>>>>> + insn = new_prog->insnsi + i + delta;
>>>>> + goto next_insn;
>>>>> + }
>>>>> +
>>>>> + insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
>>>>> goto next_insn;
>>>>> }
>>>>> +
>>>>> patch_call_imm:
>>>>> fn = env->ops->get_func_proto(insn->imm, env->prog);
>>>>> /* all functions that have prototype and verifier allowed
>>>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>>>> index baa12b24bb64..cdc64ab83ef6 100644
>>>>> --- a/kernel/bpf/helpers.c
>>>>> +++ b/kernel/bpf/helpers.c
>>>>> @@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
>>>>> bpf_async_cancel_and_free(val);
>>>>> }
>>>>>
>>>>> -BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
>>>>> +BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
>>>>> {
>>>>> unsigned long *kptr = dst;
>>>>>
>>>>> @@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
>>>>> return xchg(kptr, (unsigned long)ptr);
>>>>> }
>>>>>
>>>>> +BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
>>>>> +{
>>>>> + unsigned long *kptr = dst;
>>>>> + void *old;
>>>>> +
>>>>> + /*
>>>>> + * If the incoming pointer cannot be torn down safely from NMI later on,
>>>>> + * leave the destination untouched and return ptr so the caller keeps
>>>>> + * ownership.
>>>>> + */
>>>>> + if (ptr && bpf_kptr_offload_inc())
>>>>> + return (unsigned long)ptr;
>>>>> +
>>>>> + old = (void *)xchg(kptr, (unsigned long)ptr);
>>>>> + if (old)
>>>>> + bpf_kptr_offload_dec();
>>>>> + return (unsigned long)old;
>>>>> +}
>>>>> +
>>>>> /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
>>>>> * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
>>>>> * denote type that verifier will determine.
>>>>> + * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
>>>>> */
>>>>> static const struct bpf_func_proto bpf_kptr_xchg_proto = {
>>>>> - .func = bpf_kptr_xchg,
>>>>> + .func = bpf_ref_kptr_xchg,
>>>>> .gpl_only = false,
>>>>> .ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
>>>>> .ret_btf_id = BPF_PTR_POISON,
>>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>> index 3b1f0ba02f61..162bfd4796ea 100644
>>>>> --- a/kernel/bpf/syscall.c
>>>>> +++ b/kernel/bpf/syscall.c
>>>>> @@ -7,6 +7,7 @@
>>>>> #include <linux/bpf_trace.h>
>>>>> #include <linux/bpf_lirc.h>
>>>>> #include <linux/bpf_verifier.h>
>>>>> +#include <linux/bpf_mem_alloc.h>
>>>>> #include <linux/bsearch.h>
>>>>> #include <linux/btf.h>
>>>>> #include <linux/hex.h>
>>>>> @@ -19,6 +20,8 @@
>>>>> #include <linux/fdtable.h>
>>>>> #include <linux/file.h>
>>>>> #include <linux/fs.h>
>>>>> +#include <linux/irq_work.h>
>>>>> +#include <linux/llist.h>
>>>>> #include <linux/license.h>
>>>>> #include <linux/filter.h>
>>>>> #include <linux/kernel.h>
>>>>> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
>>>>> static DEFINE_IDR(link_idr);
>>>>> static DEFINE_SPINLOCK(link_idr_lock);
>>>>>
>>>>> +struct bpf_dtor_kptr_work {
>>>>> + struct llist_node node;
>>>>> + void *obj;
>>>>> + btf_dtor_kfunc_t dtor;
>>>>> +};
>>>>> +
>>>>> +/* Queue pending dtors per CPU; the idle pool stays global. */
>>>>> +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
>>>>> +static LLIST_HEAD(bpf_dtor_kptr_idle);
>>>>> +/* Keep total >= refs + active so NMI frees never need to allocate. */
>>>>> +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
>>>>> +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
>>>>> +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
>>>>> +
>>>>> +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
>>>>> +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
>>>>> +
>>>>> +static void bpf_dtor_kptr_worker(struct irq_work *work);
>>>>> +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
>>>>> + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
>>>>> +
>>>>
>>>> I think this still looks too complex:
>>>> * 2 lists - idle list and armed list
>>>> * 3 atomics, controlling demand/supply
>>>> * headroom/trimming management
>>>>
>>>> The complexity introduced for performance reasons, but
>>>> I'm not sure if the tradeoff is worth it.
>>>>
>>>> What about the next design:
>>>>
>>>> Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot itself.
>>>> Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first
>>>> xchg just once per map value, then reuse it across xchg in/out.
>>>>
>>> Hello,
>>>
>>> Thanks for the feedback.
>>>
>>> I tried a different but related approach in [1], my v1, which also
>>> stored the work in the map slot (struct bpf_kptr_dtor_aux).
>>>
>>>> Detach: When map value is deleted, atomically set kptr map field storing
>>>> bpf_dtor_kptr_work to NULL (so the next xchg-in allocates new
>>>> bpf_dtor_kptr_work.)
>>>> After detaching insert bpf_dtor_kptr_work to the global list and run irq_work.
>>>> Free bpf_dtor_kptr_work in call_rcu_tasks_trace().
>>> This would be nice to do but there's a sequence where you can delete and
>>> insert simultaneously. Basically an ABBA problem that is inconvenient to
>>> deal with, especially in NMI.
>>>
>>> Let's say
>>>
>>> CPU0 = Deleting value from the map. xchg w/ null.
>>> CPU1 = Exchanging new value into the map. xchg w/ something
>>>
>>> So I give the two orders we can do it in, zeroing the map slot first or
>>> zeroing the work first.
>>>
>>> 1. Set obj to null, then null out work
>>>
>>> CPU0
>>> sets map obj field to null (dtor work is still != null)
>>> CPU1
>>> sets map obj field to something
>>> CPU1 (again)
>>> sees that dtor work is still != null, (does nothing, assumes a valid
>>> slot was still allocated)
>>> CPU0
>>> sets dtor work to null and enqueues it in the irq_work.
>>>
>>> Then a new dtor work is never allocated for the new slot.
>>>
>>> 2. And doing in the other order (nulling out work before the obj)
>>>
>>> CPU0
>>> sets work to null and enqueues it.
>>> CPU1
>>> sets work to something (allocates new work)
>>> CPU1
>>> sets obj to something
>>> CPU0
>>> sets obj to null.
>>>
>>> Then you end up leaking the object set by CPU1.
>>
>> I'm not sure I'm following,
>> the idea I suggest is next: you place kptr and work struct in the
>> same structure:
>>
>> struct bpf_kptr_box {
>> void *kptr;
>> struct bpf_dtor_kptr_work work;
>> struct rcu_head rcu;
>> };
>>
> Ah I see. Sorry for misinterpreting that. This is less similar to my
> initial approach then, and doesn't have this race condition.
>
> There's a separate problem though:
>
> This essentially makes it so the map stores a pointer to the
> bpf_kptr_box object instead of the kptr directly. BPF_KPTR_REF fields
> in map values can be accessed directly, and when you do that,
> you're now going to be grabbing our box instead of the object.
>
> In bpf.h 731:
>
> /* PTR is not trusted. This is only used with PTR_TO_BTF_ID, to mark
> * unreferenced and referenced kptr loaded from map value using a load
> * instruction, so that they can only be dereferenced but not escape the
> * BPF program into the kernel (i.e. cannot be passed as arguments to
> * kfunc or bpf helpers).
> */
>
> So while putting a referenced kptr field into a map requires xchg, the
> inverse isn't true: you can directly access values in maps as
> PTR_UNTRUSTED.
>
> This box struct would make it so whenever we do a BPF_LDX directly
> on a BPF_KPTR_REF field, we're accessing our box struct rather than
> the actual kptr. So this would break dereferencing and require every
> dereference be made into two dereferences slot-> our box -> our kptr obj.
> It would also break the case where we want to compare kptrs directly,
> since even though the kptrs might be identical addresses, our box
> structs will have separate ones. So the consequences of changing that
> ripple very far.
>
> (Also, this would break KF_RCU kfuncs, which can accept directly loaded
> kptrs inside an RCU read-side critical section for RCU-safe kptrs, and
> would require the same indirection there.)
>
Right, I missed the direct load part. That would require patching the
program, which is not something we should do here. Thanks!
>> Then you can atomically detach/attach it.
>> Very schematic implementation:
>>
>> unsigned long bpf_ref_kptr_xchg(void *dst, void *ptr)
>> {
>> struct bpf_kptr_box **slot = dst;
>> struct bpf_kptr_box *box;
>>
>> box = READ_ONCE(*slot);
>> if (box)
>> /* Box can be detached already, but its lifetime and kptr destructor are
>> * RCU protected, so it's safe to access it here.
>> */
>> return xchg(&box->kptr, ptr);
>>
>> if (!ptr)
>> return 0; /* xchg-out from empty: nothing to do */
>>
>> /* slow path, when map value is not initialized */
>> box = kmalloc_nolock();
>> if (!box)
>> return (unsigned long)ptr;
>>
>> box->kptr = ptr;
>> if (cmpxchg(slot, NULL, box) != NULL) {
>> kfree(new_box);
>> /* someone installed first, retry or error out */
>> }
>>
>> return NULL;
>> }
>>
>> Run destructor in rcu tt callback, to make sure no concurrent BPF program
>> is installing kptr in the meantime:
>>
>> void box_rcu_free()
>> {
>> dtor();
>> kfree(box);
>> }
>>
>> static void box_kill_irq_work(struct irq_work* irq_work)
>> {
>> /* guarantee no BPF program sees the box */
>> call_rcu_tasks_trace(&box->rcu, box_rcu_free);
>> }
>>
>> Detach path is simply: box = xchg(slot, NULL).
>>
>> You can still use a single list where you link detached boxes and
>> irq_work similarly as you have now, just make sure dtor and free
>> are called after RCU GP.
>>
>> I may have missed some race conditions, though, does the idea
>> make sense?
>>
> It does make sense. That does solve the race condition issue, but
> basically existing code assumes that the kptr in the map points to a
> real live kernel object, so pointing it to a box breaks a lot of
> assumptions.
>
> Thanks for the feedback again and appreciate the follow up! This is
> interesting to dig into.
>
> I have v3 about ready, and will send in a little bit.
>
> Thanks,
> Justin
>>>
>>> So either order we do it in 1 or 2, this is an ABBA problem.
>>> And would require serialization and/or some sort of lock state
>>> for every map slot.
>>>
>>> Not to mention the complexity of adding extra data to the map slots
>>> requires modifications to every map type. You can see the problem this
>>> introduced in my v1, which took a related approach adding auxiliary data
>>> to map slots.
>>>
>>> ....
>>>
>>> FWIW I do have a much less complex implementation for my v3
>>> revision which brings it down to a single atomic and two
>>> pcpu_freelists, cutting it to 215~ lines net for the whole fix.
>>>
>>> Let me know if I missed anything or am misunderstanding something. I
>>> really appreciate the feedback though this is a tougher problem than it
>>> seemed and your thought was very similar to my initial approach :)
>>>>
>>>> This is based on the bpf_timer and bpf_task_work
>>>> implementations.
>>>>
>>> Those were some of my initial inspiration as well!
>>>
>>> Thanks again,
>>> Justin
>>>
>>> [1] https://lore.kernel.org/bpf/20260428201422.1518903-1-utilityemal77@gmail.com/
>>>>> +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
>>>>> +{
>>>>> + llist_add(&job->node, &bpf_dtor_kptr_idle);
>>>>> +}
>>>>> +
>>>>> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
>>>>> +{
>>>>> + struct llist_node *node;
>>>>> +
>>>>> + node = llist_del_first(&bpf_dtor_kptr_idle);
>>>>> + if (!node)
>>>>> + return NULL;
>>>>> +
>>>>> + return llist_entry(node, struct bpf_dtor_kptr_work, node);
>>>>> +}
>>>>> +
>>>>> +static void bpf_dtor_kptr_trim(void)
>>>>> +{
>>>>> + struct bpf_dtor_kptr_work *job;
>>>>> + long total;
>>>>> + long needed;
>>>>> +
>>>>> + for (;;) {
>>>>> + total = atomic_long_read(&bpf_dtor_kptr_total);
>>>>> + needed = atomic_long_read(&bpf_dtor_kptr_refs) +
>>>>> + atomic_long_read(&bpf_dtor_kptr_active);
>>>>> + if (total <= needed)
>>>>> + return;
>>>>> +
>>>>> + job = bpf_dtor_kptr_pop_idle();
>>>>> + if (!job)
>>>>> + return;
>>>>> +
>>>>> + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
>>>>> + bpf_dtor_kptr_push_idle(job);
>>>>> + continue;
>>>>> + }
>>>>> +
>>>>> + bpf_mem_free(&bpf_global_ma, job);
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +static int bpf_dtor_kptr_reserve(long needed)
>>>>> +{
>>>>> + struct bpf_dtor_kptr_work *job;
>>>>> + long headroom;
>>>>> + long target;
>>>>> +
>>>>> + headroom = min_t(long, needed, BPF_DTOR_KPTR_RESERVE_HEADROOM);
>>>>> + if (check_add_overflow(needed, headroom, &target))
>>>>> + target = needed;
>>>>> +
>>>>> + while (atomic_long_read(&bpf_dtor_kptr_total) < target) {
>>>>> + job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
>>>>> + if (!job)
>>>>> + return -ENOMEM;
>>>>> + atomic_long_inc(&bpf_dtor_kptr_total);
>>>>> + bpf_dtor_kptr_push_idle(job);
>>>>> + }
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +int bpf_kptr_offload_inc(void)
>>>>> +{
>>>>> + long needed;
>>>>> + int err;
>>>>> +
>>>>> + if (unlikely(!bpf_global_ma_set))
>>>>> + return -ENOMEM;
>>>>> +
>>>>> + /*
>>>>> + * Read active before incrementing refs so a free path moving one slot from
>>>>> + * refs to active cannot shrink the reservation snapshot below the steady
>>>>> + * state we need to cover. Racing results worst case in a larger reservation.
>>>>> + */
>>>>> + needed = atomic_long_read(&bpf_dtor_kptr_active);
>>>>> + needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
>>>>> + err = bpf_dtor_kptr_reserve(needed);
>>>>> + if (err)
>>>>> + atomic_long_dec(&bpf_dtor_kptr_refs);
>>>>> +
>>>>> + return err;
>>>>> +}
>>>>> +
>>>>> +void bpf_kptr_offload_dec(void)
>>>>> +{
>>>>> + long val;
>>>>> +
>>>>> + val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
>>>>> + if (!WARN_ON_ONCE(val < 0))
>>>>> + return;
>>>>> +
>>>>> + /*
>>>>> + * Clamp a mismatched decrement back to zero without overwriting a
>>>>> + * concurrent increment that already repaired the counter.
>>>>> + */
>>>>> + do {
>>>>> + val = atomic_long_read(&bpf_dtor_kptr_refs);
>>>>> + if (val >= 0)
>>>>> + break;
>>>>> + } while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
>>>>> +}
>>>>> +
>>>>> int sysctl_unprivileged_bpf_disabled __read_mostly =
>>>>> IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
>>>>>
>>>>> @@ -807,6 +935,46 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
>>>>> bpf_task_work_cancel_and_free(obj + rec->task_work_off);
>>>>> }
>>>>>
>>>>> +static void bpf_dtor_kptr_worker(struct irq_work *work)
>>>>> +{
>>>>> + struct llist_node *jobs, *node, *next;
>>>>> +
>>>>> + jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
>>>>> + llist_for_each_safe(node, next, jobs) {
>>>>> + struct bpf_dtor_kptr_work *job;
>>>>> +
>>>>> + job = llist_entry(node, struct bpf_dtor_kptr_work, node);
>>>>> + job->dtor(job->obj);
>>>>> + atomic_long_dec(&bpf_dtor_kptr_active);
>>>>> + bpf_dtor_kptr_push_idle(job);
>>>>> + }
>>>>> +
>>>>> + bpf_dtor_kptr_trim();
>>>>> +}
>>>>> +
>>>>> +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
>>>>> +{
>>>>> + struct bpf_dtor_kptr_work *job;
>>>>> +
>>>>> + atomic_long_inc(&bpf_dtor_kptr_active);
>>>>> + job = bpf_dtor_kptr_pop_idle();
>>>>> + if (WARN_ON_ONCE(!job)) {
>>>>> + atomic_long_dec(&bpf_dtor_kptr_active);
>>>>> + /*
>>>>> + * This should stay unreachable if reserve accounting is correct. If it
>>>>> + * ever breaks, running the destructor unsafely is still better than
>>>>> + * leaking the object permanently.
>>>>> + */
>>>>> + dtor(obj);
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + job->obj = obj;
>>>>> + job->dtor = dtor;
>>>>> + if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
>>>>> + irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
>>>>> +}
>>>>> +
>>>>> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>>>> {
>>>>> const struct btf_field *fields;
>>>>> @@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>>>> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>>>>> if (!xchgd_field)
>>>>> break;
>>>>> + if (in_nmi() && field->kptr.dtor) {
>>>>> + bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
>>>>> + bpf_kptr_offload_dec();
>>>>> + break;
>>>>> + }
>>>>> + if (field->kptr.dtor)
>>>>> + /*
>>>>> + * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
>>>>> + * pairs installation with bpf_kptr_offload_inc(). Drop that
>>>>> + * reservation on non-NMI teardown once no active transition is
>>>>> + * needed.
>>>>> + */
>>>>> + bpf_kptr_offload_dec();
>>>>>
>>>>> if (!btf_is_kernel(field->kptr.btf)) {
>>>>> pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
>>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>>> index 11054ad89c14..2c7b21bda666 100644
>>>>> --- a/kernel/bpf/verifier.c
>>>>> +++ b/kernel/bpf/verifier.c
>>>>> @@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>>>>> if (err)
>>>>> return err;
>>>>> }
>>>>> + env->insn_aux_data[insn_idx].kptr_has_dtor =
>>>>> + func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
>>>>>
>>>>> err = record_func_map(env, &meta, func_id, insn_idx);
>>>>> if (err)
>>>>
>>
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread
Thread overview: 12+ messages
2026-05-05 15:08 [bpf-next v2 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-05 15:08 ` [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-05 16:06 ` bot+bpf-ci
2026-05-05 19:48 ` Justin Suess
2026-05-05 19:49 ` sashiko-bot
2026-05-06 16:43 ` Mykyta Yatsenko
2026-05-06 19:52 ` Justin Suess
2026-05-07 14:59 ` Mykyta Yatsenko
2026-05-07 16:41 ` Justin Suess
2026-05-07 17:19 ` Mykyta Yatsenko
2026-05-05 15:08 ` [bpf-next v2 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
2026-05-05 20:15 ` sashiko-bot