* [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi
@ 2026-05-07 17:54 Justin Suess
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
0 siblings, 2 replies; 23+ messages in thread
From: Justin Suess @ 2026-05-07 17:54 UTC (permalink / raw)
To: ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, Justin Suess
Hello,
While following up on a Sashiko report [1], I found that referenced kptr
destructors can run from NMI context. One way to trigger this is from a
tracing program attached to tp_btf/nmi_handler while a map element is
being torn down.
That is problematic because referenced kptr destructor paths are not
universally NMI-safe. In particular, they may rely on operations such as
call_rcu(), which can deadlock when reached from NMI context.
This is v3 of the series.
Changes since v2:
The previous version of the series used three atomics and had an ABA race
condition. This version moves both the idle and active job queues to
pcpu_freelists, which are designed for push/pop operations and handle NMI
context safely.
The number of atomics was reduced from three to one: a single counter that
tracks only demand, reducing complexity significantly.
See patch one's commit message for full details on the new surplus
accounting mechanism.
The verifier changes were also fixed to account for a case found by
Sashiko: because we now inline only the non-dtor case, the patch
introduced a bug where a single bpf_kptr_xchg() call insn could be
polymorphic with respect to dtor-backed and plain kptrs. This is
fixed with a new verifier check.
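For illustration, a hypothetical sketch of the program shape the new check rejects (field and type names are made up for the example; this fragment will not load as-is):

```
/* One shared bpf_kptr_xchg() call insn reached with both a dtor-backed
 * kptr field (struct bpf_cpumask) and a plain kernel kptr field. The
 * fixup pass can rewrite the insn only one way, so the verifier now
 * refuses this shape. */
struct val {
	struct bpf_cpumask __kptr *mask;   /* has a destructor */
	struct task_struct __kptr *task;   /* no destructor */
};

static __noinline void *do_xchg(void **dst, void *p)
{
	return bpf_kptr_xchg(dst, p);      /* same insn for both fields */
}
```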
Finally, the selftests had some small adjustments. The counters were
moved from non-atomic u32s to atomic u64s to reduce potential test
flakiness, and the point at which kern_sync_rcu() is called was
adjusted slightly.
1. bpf: Offload kptr destructors that run from NMI
2. selftests/bpf: Add kptr destructor NMI exerciser
Kind regards,
Justin Suess
[1] https://lore.kernel.org/bpf/20260421010536.17FB1C19425@smtp.kernel.org/
[2] https://lore.kernel.org/bpf/afYLJAT9brXkWxz2@zenbox/
[3] https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
v2: https://lore.kernel.org/bpf/20260505150851.3090688-1-utilityemal77@gmail.com/
v1: https://lore.kernel.org/bpf/20260428201422.1518903-1-utilityemal77@gmail.com/
Justin Suess (2):
bpf: Offload kptr destructors that run from NMI
selftests/bpf: Add kptr destructor NMI exerciser
include/linux/bpf.h | 16 +
include/linux/bpf_verifier.h | 2 +
kernel/bpf/fixups.c | 33 +-
kernel/bpf/helpers.c | 24 +-
kernel/bpf/syscall.c | 159 +++++++
kernel/bpf/verifier.c | 13 +
.../selftests/bpf/prog_tests/kptr_dtor_nmi.c | 258 +++++++++++
.../selftests/bpf/progs/kptr_dtor_nmi.c | 412 ++++++++++++++++++
8 files changed, 902 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
create mode 100644 tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
base-commit: 2ca6723a5f7b68c739dba47b2639e3eaa7884b09
--
2.53.0
* [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
@ 2026-05-07 17:54 ` Justin Suess
2026-05-07 18:43 ` bot+bpf-ci
2026-05-07 23:45 ` sashiko-bot
2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
1 sibling, 2 replies; 23+ messages in thread
From: Justin Suess @ 2026-05-07 17:54 UTC (permalink / raw)
To: ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, Justin Suess,
Mykyta Yatsenko, Alexei Starovoitov
A BPF program attached to tp_btf/nmi_handler can delete map entries or
swap out referenced kptrs from NMI context. Today that runs the kptr
destructor inline. Destructors such as bpf_cpumask_release() can take
RCU-related locks, so running them from NMI can deadlock the system.
Queue destructor-backed teardown to irq_work and track the preallocated
capacity with an idle-slot surplus counter. Two pcpu_freelists are
maintained: one for the idle slots and one for active jobs.
pcpu_freelist is built for NMI context, using raw_res_spin_lock and
per-CPU list heads.
The surplus counter is positive when the number of idle work slots
exceeds the number of referenced kptrs exchanged into maps, and negative
when the idle slots fall short of that demand.
The counter always trends toward zero, keeping exactly enough work slots
ready for the worst case: excess surplus is freed, and new surplus is
allocated only when needed.
This keeps NMI teardown on the fast path: it only has to pop an idle job
and never allocates. When a job is consumed, its memory is returned to
the idle queue for reuse.
Each successful install can consume at most one future offload slot, so
bpf_kptr_offload_slot_acquire() decrements surplus and provisions one
additional idle job only when the pool falls short. Teardown and
irq_work completion return that slot with
bpf_kptr_offload_slot_release(), which rechecks the surplus before
freeing an extra idle job.
There are two main safety nets against memory exhaustion:
If reserving another offload slot fails while installing a new
destructor-backed kptr through bpf_kptr_xchg(), leave the destination
unchanged and return the incoming pointer so the caller keeps ownership.
This should only happen under severe memory pressure: a work item is
only 24 bytes on 64-bit architectures, and slots are reused whenever
possible.
If NMI teardown still fails to grab an idle offload job despite that
accounting, warn once and run the destructor inline rather than leak the
object permanently. This warn condition is a canary for accounting bugs,
since the xchg safety net shouldn't let this state be reached. The
justification for falling back to running the dtor in NMI is that a
silent memory leak is less debuggable and less visible to users than an
explicit warning and a possible deadlock the watchdog can recover from.
This fix does come with small performance tradeoffs for safety.
bpf_kptr_xchg() can no longer be inlined for kptrs with destructors, as
inlining would break the slot accounting. The inlining fix is preserved
for kptrs with no destructor defined.
The xchg path for kptrs with dtors pays a small price: a potential
allocation and a few atomic ops. The NMI dtor path adds a few atomic ops
plus an IPI for the irq_work overhead.
This keeps refcounted kptr teardown out of NMI context without slowing
down raw kptr exchanges that never need destructor handling.
This change ensures that dtors can be written for hardirq context
instead of NMI.
Cc: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Justin Suess <utilityemal77@gmail.com>
Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
---
include/linux/bpf.h | 16 ++++
include/linux/bpf_verifier.h | 2 +
kernel/bpf/fixups.c | 33 +++++---
kernel/bpf/helpers.c | 24 +++++-
kernel/bpf/syscall.c | 159 +++++++++++++++++++++++++++++++++++
kernel/bpf/verifier.c | 13 +++
6 files changed, 232 insertions(+), 15 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 715b6df9c403..583e8551d162 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
void __bpf_free_used_maps(struct bpf_prog_aux *aux,
struct bpf_map **used_maps, u32 len);
+/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
+u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+
+#ifdef CONFIG_BPF_SYSCALL
+int bpf_kptr_offload_slot_acquire(void);
+void bpf_kptr_offload_slot_release(void);
+#else
+static inline int bpf_kptr_offload_slot_acquire(void)
+{
+ return 0;
+}
+
+static inline void bpf_kptr_offload_slot_release(void)
+{
+}
+#endif
bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 976e2b2f40e8..26bc8b5c9030 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -672,6 +672,8 @@ struct bpf_insn_aux_data {
bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
+ u8 kptr_has_dtor:1;
+ u8 kptr_has_dtor_seen:1;
u8 alu_state; /* used in combination with alu_limit */
/* true if STX or LDX instruction is a part of a spill/fill
* pattern for a bpf_fastcall call.
diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
index fba9e8c00878..459e855e86a5 100644
--- a/kernel/bpf/fixups.c
+++ b/kernel/bpf/fixups.c
@@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
goto next_insn;
}
- /* Implement bpf_kptr_xchg inline */
- if (prog->jit_requested && BITS_PER_LONG == 64 &&
- insn->imm == BPF_FUNC_kptr_xchg &&
- bpf_jit_supports_ptr_xchg()) {
- insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
- insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
- cnt = 2;
+ /* Implement bpf_kptr_xchg inline. */
+ if (insn->imm == BPF_FUNC_kptr_xchg &&
+ !env->insn_aux_data[i + delta].kptr_has_dtor) {
+ if (prog->jit_requested && BITS_PER_LONG == 64 &&
+ bpf_jit_supports_ptr_xchg()) {
+ insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
+ insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
+ BPF_REG_1, BPF_REG_0, 0);
+ cnt = 2;
- new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
- if (!new_prog)
- return -ENOMEM;
+ new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+ if (!new_prog)
+ return -ENOMEM;
- delta += cnt - 1;
- env->prog = prog = new_prog;
- insn = new_prog->insnsi + i + delta;
+ delta += cnt - 1;
+ env->prog = prog = new_prog;
+ insn = new_prog->insnsi + i + delta;
+ goto next_insn;
+ }
+
+ insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
goto next_insn;
}
+
patch_call_imm:
fn = env->ops->get_func_proto(insn->imm, env->prog);
/* all functions that have prototype and verifier allowed
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index baa12b24bb64..51717a88f627 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
bpf_async_cancel_and_free(val);
}
-BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
+BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
{
unsigned long *kptr = dst;
@@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
return xchg(kptr, (unsigned long)ptr);
}
+BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
+{
+ unsigned long *kptr = dst;
+ void *old;
+
+ /*
+ * If the incoming pointer cannot be torn down safely from NMI later on,
+ * leave the destination untouched and return ptr so the caller keeps
+ * ownership.
+ */
+ if (ptr && bpf_kptr_offload_slot_acquire())
+ return (unsigned long)ptr;
+
+ old = (void *)xchg(kptr, (unsigned long)ptr);
+ if (old)
+ bpf_kptr_offload_slot_release();
+ return (unsigned long)old;
+}
+
/* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
* helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
* denote type that verifier will determine.
+ * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
*/
static const struct bpf_func_proto bpf_kptr_xchg_proto = {
- .func = bpf_kptr_xchg,
+ .func = bpf_ref_kptr_xchg,
.gpl_only = false,
.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
.ret_btf_id = BPF_PTR_POISON,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3b1f0ba02f61..d34fdb99eb8a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -7,6 +7,7 @@
#include <linux/bpf_trace.h>
#include <linux/bpf_lirc.h>
#include <linux/bpf_verifier.h>
+#include <linux/bpf_mem_alloc.h>
#include <linux/bsearch.h>
#include <linux/btf.h>
#include <linux/hex.h>
@@ -19,6 +20,7 @@
#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/irq_work.h>
#include <linux/license.h>
#include <linux/filter.h>
#include <linux/kernel.h>
@@ -42,6 +44,8 @@
#include <linux/cookie.h>
#include <linux/verification.h>
+#include "percpu_freelist.h"
+
#include <net/netfilter/nf_bpf_link.h>
#include <net/netkit.h>
#include <net/tcx.h>
@@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock);
static DEFINE_IDR(link_idr);
static DEFINE_SPINLOCK(link_idr_lock);
+struct bpf_dtor_kptr_work {
+ struct pcpu_freelist_node fnode;
+ void *obj;
+ btf_dtor_kfunc_t dtor;
+};
+
+/* Queue pending dtors; the idle pool uses a global pcpu_freelist. */
+static struct pcpu_freelist bpf_dtor_kptr_jobs;
+static struct pcpu_freelist bpf_dtor_kptr_idle;
+/* Keep surplus = total - needed = idle - refs >= 0 so NMI frees never need to allocate. */
+static atomic_long_t bpf_dtor_kptr_surplus = ATOMIC_LONG_INIT(0);
+
+static void bpf_dtor_kptr_worker(struct irq_work *work);
+static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
+ IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
+
+static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
+{
+ struct pcpu_freelist_node *node;
+
+ node = pcpu_freelist_pop(&bpf_dtor_kptr_idle);
+ if (!node)
+ return NULL;
+
+ return container_of(node, struct bpf_dtor_kptr_work, fnode);
+}
+
+static void bpf_dtor_kptr_release_one(void)
+{
+ struct bpf_dtor_kptr_work *job;
+ long surplus;
+
+ for (;;) {
+ surplus = atomic_long_read(&bpf_dtor_kptr_surplus);
+ if (surplus <= 0)
+ return;
+
+ job = bpf_dtor_kptr_pop_idle();
+ if (!job)
+ return;
+
+ if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_surplus, &surplus,
+ surplus - 1)) {
+ pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
+ continue;
+ }
+
+ bpf_mem_free(&bpf_global_ma, job);
+ return;
+ }
+}
+
+void bpf_kptr_offload_slot_release(void)
+{
+ if (atomic_long_inc_return(&bpf_dtor_kptr_surplus) > 0)
+ bpf_dtor_kptr_release_one();
+}
+
+int bpf_kptr_offload_slot_acquire(void)
+{
+ struct bpf_dtor_kptr_work *job;
+ long surplus;
+
+ if (unlikely(!bpf_global_ma_set ||
+ !READ_ONCE(bpf_dtor_kptr_idle.freelist) ||
+ !READ_ONCE(bpf_dtor_kptr_jobs.freelist)))
+ return -ENOMEM;
+
+ /*
+ * Each successful install can decrease the surplus by at most one, so it only
+ * ever needs to provision one additional idle job.
+ */
+ surplus = atomic_long_dec_return(&bpf_dtor_kptr_surplus);
+ if (surplus >= 0)
+ return 0;
+
+ job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
+ if (!job) {
+ atomic_long_inc(&bpf_dtor_kptr_surplus);
+ return -ENOMEM;
+ }
+
+ pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
+ /* A racing teardown may have already removed the demand that forced this. */
+ bpf_kptr_offload_slot_release();
+
+ return 0;
+}
+
+static int __init bpf_dtor_kptr_init(void)
+{
+ int err;
+
+ err = pcpu_freelist_init(&bpf_dtor_kptr_idle);
+ if (err)
+ return err;
+
+ err = pcpu_freelist_init(&bpf_dtor_kptr_jobs);
+ if (err)
+ return err;
+
+ return 0;
+}
+late_initcall(bpf_dtor_kptr_init);
+
int sysctl_unprivileged_bpf_disabled __read_mostly =
IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
@@ -807,6 +916,43 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
bpf_task_work_cancel_and_free(obj + rec->task_work_off);
}
+static void bpf_dtor_kptr_worker(struct irq_work *work)
+{
+ struct pcpu_freelist_node *fnode;
+ struct bpf_dtor_kptr_work *job;
+
+ while ((fnode = pcpu_freelist_pop(&bpf_dtor_kptr_jobs))) {
+ job = container_of(fnode, struct bpf_dtor_kptr_work, fnode);
+ job->dtor(job->obj);
+ pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
+ bpf_kptr_offload_slot_release();
+ }
+}
+
+static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
+{
+ struct bpf_dtor_kptr_work *job;
+
+ /* Handing storage teardown off to irq_work consumes one idle slot. */
+ atomic_long_dec(&bpf_dtor_kptr_surplus);
+ job = bpf_dtor_kptr_pop_idle();
+ if (WARN_ON_ONCE(!job)) {
+ atomic_long_inc(&bpf_dtor_kptr_surplus);
+ /*
+ * This should stay unreachable if reserve accounting is correct. If it
+ * ever breaks, running the destructor unsafely is still better than
+ * leaking the object permanently.
+ */
+ dtor(obj);
+ return;
+ }
+
+ job->obj = obj;
+ job->dtor = dtor;
+ pcpu_freelist_push(&bpf_dtor_kptr_jobs, &job->fnode);
+ irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
+}
+
void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
{
const struct btf_field *fields;
@@ -842,6 +988,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
if (!xchgd_field)
break;
+ if (in_nmi() && field->kptr.dtor) {
+ bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
+ bpf_kptr_offload_slot_release();
+ break;
+ }
+ if (field->kptr.dtor)
+ /*
+ * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
+ * pairs installation with bpf_kptr_offload_slot_acquire(). Return
+ * that slot on non-NMI teardown once no active transition is
+ * needed.
+ */
+ bpf_kptr_offload_slot_release();
if (!btf_is_kernel(field->kptr.btf)) {
pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 11054ad89c14..d042be8ed789 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9891,6 +9891,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
int insn_idx = *insn_idx_p;
bool changes_data;
int i, err, func_id;
+ bool kptr_has_dtor;
/* find function prototype */
func_id = insn->imm;
@@ -9950,6 +9951,18 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
if (err)
return err;
}
+ if (func_id == BPF_FUNC_kptr_xchg) {
+ kptr_has_dtor = !!meta.kptr_field->kptr.dtor;
+ if (env->insn_aux_data[insn_idx].kptr_has_dtor_seen &&
+ env->insn_aux_data[insn_idx].kptr_has_dtor != kptr_has_dtor) {
+ verbose(env,
+ "same insn cannot call bpf_kptr_xchg() on both dtor and non-dtor kptrs\n");
+ return -EINVAL;
+ }
+
+ env->insn_aux_data[insn_idx].kptr_has_dtor_seen = true;
+ env->insn_aux_data[insn_idx].kptr_has_dtor = kptr_has_dtor;
+ }
err = record_func_map(env, &meta, func_id, insn_idx);
if (err)
--
2.53.0
* [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser
2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
@ 2026-05-07 17:54 ` Justin Suess
2026-05-08 0:03 ` sashiko-bot
1 sibling, 1 reply; 23+ messages in thread
From: Justin Suess @ 2026-05-07 17:54 UTC (permalink / raw)
To: ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, Justin Suess,
Alexei Starovoitov
Programs attached to tp_btf/nmi_handler can drop refcounted kptrs from
NMI context by deleting map entries or clearing map values. Add a
dedicated BPF-side selftest program that populates hash and array maps
with cpumask kptrs and clears them again from the NMI handler.
This test fails on upstream, producing a lockdep warning, but passes
once NMI dtors are properly offloaded by the previous commit.
The test asserts that every object queued from NMI for destruction in
hardirq had its dtor called. The irq_work, which carries the
IRQ_WORK_HARD_IRQ flag, is drained via kern_sync_rcu() to ensure
consistency.
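Assuming the usual selftests workflow, the new test should be runnable with the test_progs harness (command sketch; requires root and a PMU supporting PERF_COUNT_HW_CPU_CYCLES, otherwise the test skips itself):

```
cd tools/testing/selftests/bpf
make
sudo ./test_progs -t kptr_dtor_nmi
```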
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
---
.../selftests/bpf/prog_tests/kptr_dtor_nmi.c | 258 +++++++++++
.../selftests/bpf/progs/kptr_dtor_nmi.c | 412 ++++++++++++++++++
2 files changed, 670 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
create mode 100644 tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
diff --git a/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
new file mode 100644
index 000000000000..21452b3cf9eb
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
@@ -0,0 +1,258 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <linux/perf_event.h>
+#include <sched.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <test_progs.h>
+
+#include "kptr_dtor_nmi.skel.h"
+
+#define KPTR_DTOR_NMI_MAX_SLOTS 8
+#define KPTR_DTOR_NMI_ROUNDS 256
+#define DELETE_TIMEOUT_NS (5ULL * 1000 * 1000 * 1000)
+
+enum kptr_dtor_nmi_map_type {
+ KPTR_DTOR_NMI_MAP_HASH = 1,
+ KPTR_DTOR_NMI_MAP_ARRAY,
+};
+
+struct kptr_dtor_nmi_case {
+ const char *name;
+ __u32 map_type;
+};
+
+__maybe_unused
+static int find_test_cpu(void)
+{
+ cpu_set_t cpuset;
+ int cpu, err;
+
+ err = sched_getaffinity(0, sizeof(cpuset), &cpuset);
+ if (!ASSERT_OK(err, "sched_getaffinity"))
+ return -1;
+
+ for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+ if (CPU_ISSET(cpu, &cpuset))
+ return cpu;
+ }
+
+ ASSERT_TRUE(false, "cpu_available");
+ return -1;
+}
+
+__maybe_unused
+static int open_nmi_pmu_event_on_cpu(int cpu)
+{
+ struct perf_event_attr attr = {
+ .size = sizeof(attr),
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_CPU_CYCLES,
+ .freq = 1,
+ .sample_freq = 1000,
+ };
+ int pmu_fd;
+
+ pmu_fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1,
+ PERF_FLAG_FD_CLOEXEC);
+ if (pmu_fd == -1) {
+ if (errno == ENOENT || errno == EOPNOTSUPP) {
+ printf("SKIP:no PERF_COUNT_HW_CPU_CYCLES\n");
+ test__skip();
+ }
+ return -1;
+ }
+
+ return pmu_fd;
+}
+
+__maybe_unused
+static bool pin_to_cpu(int cpu, cpu_set_t *old_cpuset)
+{
+ cpu_set_t cpuset;
+ int err;
+
+ err = sched_getaffinity(0, sizeof(*old_cpuset), old_cpuset);
+ if (!ASSERT_OK(err, "sched_getaffinity"))
+ return false;
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpu, &cpuset);
+ err = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+ if (!ASSERT_OK(err, "sched_setaffinity"))
+ return false;
+
+ return true;
+}
+
+__maybe_unused
+static void restore_affinity(const cpu_set_t *old_cpuset)
+{
+ ASSERT_OK(sched_setaffinity(0, sizeof(*old_cpuset), old_cpuset),
+ "restore_affinity");
+}
+
+__maybe_unused
+static bool run_syscall_prog(struct bpf_program *prog, const char *name)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts);
+ int err;
+
+ err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
+ if (!ASSERT_OK(err, name))
+ return false;
+ if (!ASSERT_EQ(opts.retval, 0, name))
+ return false;
+
+ return true;
+}
+
+__maybe_unused
+static bool wait_for_nmi_drain(struct kptr_dtor_nmi *skel,
+ __u64 expected_deleted,
+ __u64 expected_release_calls)
+{
+	__u64 now_ns, timeout_time_ns;
+
+ now_ns = get_time_ns();
+ timeout_time_ns = now_ns + DELETE_TIMEOUT_NS;
+ while (skel->bss->kptr_dtor_nmi_deleted < expected_deleted) {
+ if (skel->bss->kptr_dtor_nmi_setup_err ||
+ skel->bss->kptr_dtor_nmi_nmi_err ||
+ skel->bss->kptr_dtor_nmi_cleanup_err)
+ break;
+ now_ns = get_time_ns();
+ if (now_ns >= timeout_time_ns)
+ break;
+ sched_yield();
+ }
+
+ if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_setup_err, 0,
+ "kptr_dtor_nmi_setup_err"))
+ return false;
+ if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_nmi_err, 0,
+ "kptr_dtor_nmi_nmi_err"))
+ return false;
+ if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_cleanup_err, 0,
+ "kptr_dtor_nmi_cleanup_err"))
+ return false;
+ if (!ASSERT_GE(skel->bss->kptr_dtor_nmi_deleted, expected_deleted,
+ "kptr_dtor_nmi_deleted"))
+ return false;
+ if (!ASSERT_OK(kern_sync_rcu(), "kern_sync_rcu"))
+ return false;
+ if (!ASSERT_GE(skel->bss->kptr_dtor_nmi_release_calls,
+ expected_release_calls,
+ "kptr_dtor_nmi_release_calls"))
+ return false;
+ if (!ASSERT_LT(now_ns, timeout_time_ns, "kptr_dtor_nmi_timeout"))
+ return false;
+
+ return true;
+}
+
+__maybe_unused
+static void run_kptr_dtor_nmi_case(const struct kptr_dtor_nmi_case *test)
+{
+ struct kptr_dtor_nmi *skel;
+ cpu_set_t old_cpuset;
+ bool pinned = false;
+ int cpu = -1;
+ int pmu_fd = -1;
+ int err, round;
+
+ cpu = find_test_cpu();
+ if (cpu < 0)
+ return;
+
+ skel = kptr_dtor_nmi__open();
+ if (!ASSERT_OK_PTR(skel, "kptr_dtor_nmi__open"))
+ return;
+
+ skel->bss->kptr_dtor_nmi_map_type = test->map_type;
+ bpf_program__set_autoattach(skel->progs.clear_kptrs_from_nmi, false);
+
+ err = kptr_dtor_nmi__load(skel);
+ if (!ASSERT_OK(err, "kptr_dtor_nmi__load"))
+ goto cleanup;
+
+ err = kptr_dtor_nmi__attach(skel);
+ if (!ASSERT_OK(err, "kptr_dtor_nmi__attach"))
+ goto cleanup;
+
+ skel->links.clear_kptrs_from_nmi =
+ bpf_program__attach_trace(skel->progs.clear_kptrs_from_nmi);
+ if (!ASSERT_OK_PTR(skel->links.clear_kptrs_from_nmi,
+ "attach_tp_btf_nmi_handler"))
+ goto cleanup;
+
+ pinned = pin_to_cpu(cpu, &old_cpuset);
+ if (!pinned)
+ goto cleanup;
+
+ pmu_fd = open_nmi_pmu_event_on_cpu(cpu);
+ if (pmu_fd < 0)
+ goto cleanup;
+
+ for (round = 0; round < KPTR_DTOR_NMI_ROUNDS; round++) {
+ __u64 expected_total;
+
+ if (!run_syscall_prog(skel->progs.populate_kptrs, "populate_kptrs"))
+ goto cleanup;
+
+ expected_total = (round + 1) * KPTR_DTOR_NMI_MAX_SLOTS;
+ if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_setup_created,
+ expected_total,
+ "kptr_dtor_nmi_setup_created"))
+ goto cleanup;
+
+ if (!wait_for_nmi_drain(skel, expected_total, expected_total))
+ goto cleanup;
+ }
+
+ if (!run_syscall_prog(skel->progs.cleanup_kptrs, "cleanup_kptrs"))
+ goto cleanup;
+ /*
+ * The grace period for rcu cannot complete until the CPU that ran the
+ * hard irq_work has passed through a quiescent state after running
+ * our dtor work. This effectively flushes our pending work and allows
+ * the test to verify the dtor was called the expected number of times.
+ */
+ kern_sync_rcu();
+ ASSERT_EQ(skel->bss->kptr_dtor_nmi_cleanup_deleted, 0,
+ "kptr_dtor_nmi_cleanup_deleted");
+
+cleanup:
+ if (pmu_fd >= 0)
+ close(pmu_fd);
+ if (pinned)
+ restore_affinity(&old_cpuset);
+ kptr_dtor_nmi__destroy(skel);
+}
+
+void serial_test_kptr_dtor_nmi(void)
+{
+/*
+ * nmi_handler isn't supported for these architectures.
+ */
+#if defined(__aarch64__) || defined(__s390x__)
+ test__skip();
+ return;
+#else
+ static const struct kptr_dtor_nmi_case tests[] = {
+ { "hash", KPTR_DTOR_NMI_MAP_HASH },
+ { "array", KPTR_DTOR_NMI_MAP_ARRAY },
+ };
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tests); i++) {
+ if (!test__start_subtest(tests[i].name))
+ continue;
+ run_kptr_dtor_nmi_case(&tests[i]);
+ }
+#endif
+}
diff --git a/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
new file mode 100644
index 000000000000..693b07215006
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
@@ -0,0 +1,412 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <linux/errno.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define KPTR_DTOR_NMI_MAX_SLOTS 8
+
+enum kptr_dtor_nmi_map_type {
+ KPTR_DTOR_NMI_MAP_HASH = 1,
+ KPTR_DTOR_NMI_MAP_ARRAY,
+};
+
+enum kptr_dtor_nmi_err {
+ KPTR_DTOR_NMI_SETUP_CREATE_ERR = 1,
+ KPTR_DTOR_NMI_SETUP_LOOKUP_ERR,
+ KPTR_DTOR_NMI_SETUP_STALE_ERR,
+ KPTR_DTOR_NMI_SETUP_MAP_ERR,
+ KPTR_DTOR_NMI_DELETE_ERR,
+ KPTR_DTOR_NMI_CLEANUP_ERR,
+};
+
+struct kptr_dtor_nmi_value {
+	struct bpf_cpumask __kptr *mask;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, __u32);
+ __type(value, struct kptr_dtor_nmi_value);
+ __uint(max_entries, KPTR_DTOR_NMI_MAX_SLOTS);
+} kptr_dtor_nmi_hash SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, struct kptr_dtor_nmi_value);
+ __uint(max_entries, KPTR_DTOR_NMI_MAX_SLOTS);
+} kptr_dtor_nmi_array SEC(".maps");
+
+struct bpf_cpumask *bpf_cpumask_create(void) __ksym __weak;
+void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym __weak;
+
+__u64 kptr_dtor_nmi_live_mask;
+__u32 kptr_dtor_nmi_map_type;
+__u64 kptr_dtor_nmi_setup_created;
+__u64 kptr_dtor_nmi_deleted;
+__u64 kptr_dtor_nmi_cleanup_deleted;
+__u64 kptr_dtor_nmi_release_calls;
+__u32 kptr_dtor_nmi_setup_err;
+__u32 kptr_dtor_nmi_nmi_err;
+__u32 kptr_dtor_nmi_cleanup_err;
+int kptr_dtor_nmi_setup_errno;
+int kptr_dtor_nmi_nmi_errno;
+int kptr_dtor_nmi_cleanup_errno;
+
+static void set_err(__u32 *err_dst, int *errno_dst, __u32 err, int err_no)
+{
+ if (!*err_dst) {
+ *err_dst = err;
+ *errno_dst = err_no;
+ }
+}
+
+static bool slot_is_live(__u32 slot)
+{
+ return kptr_dtor_nmi_live_mask & (1ULL << slot);
+}
+
+static void mark_slot_live(__u32 slot)
+{
+ __sync_fetch_and_or(&kptr_dtor_nmi_live_mask, 1ULL << slot);
+}
+
+static void clear_slot_live(__u32 slot)
+{
+ __sync_fetch_and_and(&kptr_dtor_nmi_live_mask, ~(1ULL << slot));
+}
+
+static struct kptr_dtor_nmi_value *lookup_hash_value(__u32 slot)
+{
+ return bpf_map_lookup_elem(&kptr_dtor_nmi_hash, &slot);
+}
+
+static struct kptr_dtor_nmi_value *lookup_array_value(__u32 slot)
+{
+ return bpf_map_lookup_elem(&kptr_dtor_nmi_array, &slot);
+}
+
+static int stash_mask(struct kptr_dtor_nmi_value *value, __u32 slot)
+{
+ struct bpf_cpumask *mask, *old;
+
+ mask = bpf_cpumask_create();
+ if (!mask)
+ return -ENOMEM;
+
+ old = bpf_kptr_xchg(&value->mask, mask);
+ if (old) {
+ bpf_cpumask_release(old);
+ return -EEXIST;
+ }
+
+ mark_slot_live(slot);
+ __sync_fetch_and_add(&kptr_dtor_nmi_setup_created, 1);
+ return 0;
+}
+
+static bool populate_hash_slot(__u32 slot)
+{
+ struct kptr_dtor_nmi_value init = {};
+ struct kptr_dtor_nmi_value *value;
+ int err;
+
+ err = bpf_map_update_elem(&kptr_dtor_nmi_hash, &slot, &init, BPF_NOEXIST);
+ if (err) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_CREATE_ERR, err);
+ return false;
+ }
+
+ value = lookup_hash_value(slot);
+ if (!value) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_LOOKUP_ERR, -ENOENT);
+ return false;
+ }
+
+ err = stash_mask(value, slot);
+ if (err) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_STALE_ERR, err);
+ return false;
+ }
+
+ return true;
+}
+
+static bool populate_array_slot(__u32 slot)
+{
+ struct kptr_dtor_nmi_value *value;
+ int err;
+
+ value = lookup_array_value(slot);
+ if (!value) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_LOOKUP_ERR, -ENOENT);
+ return false;
+ }
+
+ err = stash_mask(value, slot);
+ if (err) {
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_STALE_ERR, err);
+ return false;
+ }
+
+ return true;
+}
+
+static bool clear_hash_slot_from_nmi(__u32 slot)
+{
+ struct kptr_dtor_nmi_value *value;
+ int err;
+
+ if (!slot_is_live(slot))
+ return true;
+
+ err = bpf_map_delete_elem(&kptr_dtor_nmi_hash, &slot);
+ if (!err) {
+ clear_slot_live(slot);
+ __sync_fetch_and_add(&kptr_dtor_nmi_deleted, 1);
+ return true;
+ }
+
+ /*
+ * Hash deletes take rqspinlock-backed bucket locks. NMI reentry can lose
+ * those acquisitions with -EDEADLK or -ETIMEDOUT even though the slot is
+ * still valid, so leave it live and retry on a later NMI.
+ */
+ if (err == -EDEADLK || err == -ETIMEDOUT)
+ return true;
+
+ value = lookup_hash_value(slot);
+ if (value)
+ set_err(&kptr_dtor_nmi_nmi_err,
+ &kptr_dtor_nmi_nmi_errno,
+ KPTR_DTOR_NMI_DELETE_ERR, err);
+
+ return false;
+}
+
+static bool clear_array_slot_from_nmi(__u32 slot)
+{
+ struct kptr_dtor_nmi_value init = {};
+ int err;
+
+ if (!slot_is_live(slot))
+ return true;
+
+ err = bpf_map_update_elem(&kptr_dtor_nmi_array, &slot, &init, BPF_EXIST);
+ if (err) {
+ set_err(&kptr_dtor_nmi_nmi_err,
+ &kptr_dtor_nmi_nmi_errno,
+ KPTR_DTOR_NMI_DELETE_ERR, err);
+ return false;
+ }
+
+ clear_slot_live(slot);
+ __sync_fetch_and_add(&kptr_dtor_nmi_deleted, 1);
+ return true;
+}
+
+static bool cleanup_hash_slot(__u32 slot)
+{
+ struct kptr_dtor_nmi_value *value;
+ struct bpf_cpumask *old = NULL;
+
+ value = lookup_hash_value(slot);
+ if (!value) {
+ clear_slot_live(slot);
+ return true;
+ }
+
+ old = bpf_kptr_xchg(&value->mask, old);
+ if (old) {
+ bpf_cpumask_release(old);
+ __sync_fetch_and_add(&kptr_dtor_nmi_cleanup_deleted, 1);
+ }
+
+ if (bpf_map_delete_elem(&kptr_dtor_nmi_hash, &slot) &&
+ lookup_hash_value(slot)) {
+ set_err(&kptr_dtor_nmi_cleanup_err,
+ &kptr_dtor_nmi_cleanup_errno,
+ KPTR_DTOR_NMI_CLEANUP_ERR, -EIO);
+ return false;
+ }
+
+ clear_slot_live(slot);
+ return true;
+}
+
+static bool cleanup_array_slot(__u32 slot)
+{
+ struct kptr_dtor_nmi_value *value;
+ struct bpf_cpumask *old = NULL;
+
+ value = lookup_array_value(slot);
+ if (!value) {
+ set_err(&kptr_dtor_nmi_cleanup_err,
+ &kptr_dtor_nmi_cleanup_errno,
+ KPTR_DTOR_NMI_CLEANUP_ERR, -ENOENT);
+ return false;
+ }
+
+ old = bpf_kptr_xchg(&value->mask, old);
+ if (old) {
+ bpf_cpumask_release(old);
+ __sync_fetch_and_add(&kptr_dtor_nmi_cleanup_deleted, 1);
+ }
+
+ clear_slot_live(slot);
+ return true;
+}
+
+static void populate_hash_masks(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!populate_hash_slot(slot))
+ return;
+ }
+}
+
+static void populate_array_masks(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!populate_array_slot(slot))
+ return;
+ }
+}
+
+static void clear_hash_masks_from_nmi(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!clear_hash_slot_from_nmi(slot))
+ return;
+ }
+}
+
+static void clear_array_masks_from_nmi(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!clear_array_slot_from_nmi(slot))
+ return;
+ }
+}
+
+static void cleanup_hash_masks(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!cleanup_hash_slot(slot))
+ return;
+ }
+}
+
+static void cleanup_array_masks(void)
+{
+ __u32 slot;
+
+ for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) {
+ if (!cleanup_array_slot(slot))
+ return;
+ }
+}
+
+SEC("syscall")
+int populate_kptrs(void *ctx)
+{
+ (void)ctx;
+
+ switch (kptr_dtor_nmi_map_type) {
+ case KPTR_DTOR_NMI_MAP_HASH:
+ populate_hash_masks();
+ break;
+ case KPTR_DTOR_NMI_MAP_ARRAY:
+ populate_array_masks();
+ break;
+ default:
+ set_err(&kptr_dtor_nmi_setup_err,
+ &kptr_dtor_nmi_setup_errno,
+ KPTR_DTOR_NMI_SETUP_MAP_ERR, -EINVAL);
+ break;
+ }
+
+ return 0;
+}
+
+SEC("syscall")
+int cleanup_kptrs(void *ctx)
+{
+ (void)ctx;
+
+ switch (kptr_dtor_nmi_map_type) {
+ case KPTR_DTOR_NMI_MAP_HASH:
+ cleanup_hash_masks();
+ break;
+ case KPTR_DTOR_NMI_MAP_ARRAY:
+ cleanup_array_masks();
+ break;
+ default:
+ set_err(&kptr_dtor_nmi_cleanup_err,
+ &kptr_dtor_nmi_cleanup_errno,
+ KPTR_DTOR_NMI_CLEANUP_ERR, -EINVAL);
+ break;
+ }
+
+ return 0;
+}
+
+SEC("tp_btf/nmi_handler")
+int BPF_PROG(clear_kptrs_from_nmi, void *handler, s64 delta_ns, int handled)
+{
+ (void)handler;
+ (void)delta_ns;
+ (void)handled;
+
+ if (kptr_dtor_nmi_deleted >= kptr_dtor_nmi_setup_created)
+ return 0;
+
+ switch (kptr_dtor_nmi_map_type) {
+ case KPTR_DTOR_NMI_MAP_HASH:
+ clear_hash_masks_from_nmi();
+ break;
+ case KPTR_DTOR_NMI_MAP_ARRAY:
+ clear_array_masks_from_nmi();
+ break;
+ default:
+ set_err(&kptr_dtor_nmi_nmi_err,
+ &kptr_dtor_nmi_nmi_errno,
+ KPTR_DTOR_NMI_DELETE_ERR, -EINVAL);
+ break;
+ }
+
+ return 0;
+}
+
+SEC("fentry/bpf_cpumask_release")
+int BPF_PROG(count_cpumask_release, struct bpf_cpumask *mask)
+{
+ (void)mask;
+ __sync_fetch_and_add(&kptr_dtor_nmi_release_calls, 1);
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.53.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
@ 2026-05-07 18:43 ` bot+bpf-ci
2026-05-07 18:52 ` Justin Suess
2026-05-07 23:45 ` sashiko-bot
1 sibling, 1 reply; 23+ messages in thread
From: bot+bpf-ci @ 2026-05-07 18:43 UTC (permalink / raw)
To: utilityemal77, ast, daniel, andrii, eddyz87, memxor
Cc: martin.lau, song, yonghong.song, jolsa, bpf, utilityemal77,
mykyta.yatsenko5, alexei.starovoitov, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
[ ... ]
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
[ ... ]
> diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
[ ... ]
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
[ ... ]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3b1f0ba02f61..d34fdb99eb8a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -7,6 +7,7 @@
> #include <linux/bpf_trace.h>
> #include <linux/bpf_lirc.h>
> #include <linux/bpf_verifier.h>
> +#include <linux/bpf_mem_alloc.h>
> #include <linux/bsearch.h>
> #include <linux/btf.h>
> #include <linux/hex.h>
> @@ -19,6 +20,7 @@
> #include <linux/fdtable.h>
> #include <linux/file.h>
> #include <linux/fs.h>
> +#include <linux/irq_work.h>
> #include <linux/license.h>
> #include <linux/filter.h>
> #include <linux/kernel.h>
> @@ -42,6 +44,8 @@
> #include <linux/cookie.h>
> #include <linux/verification.h>
>
> +#include "percpu_freelist.h"
> +
> #include <net/netfilter/nf_bpf_link.h>
> #include <net/netkit.h>
> #include <net/tcx.h>
> @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock);
> static DEFINE_IDR(link_idr);
> static DEFINE_SPINLOCK(link_idr_lock);
>
> +struct bpf_dtor_kptr_work {
> + struct pcpu_freelist_node fnode;
> + void *obj;
> + btf_dtor_kfunc_t dtor;
> +};
> +
> +/* Queue pending dtors; the idle pool uses a global pcpu_freelist. */
> +static struct pcpu_freelist bpf_dtor_kptr_jobs;
> +static struct pcpu_freelist bpf_dtor_kptr_idle;
> +/* Keep surplus = total - needed = idle - refs >= 0 so NMI frees never need to allocate. */
> +static atomic_long_t bpf_dtor_kptr_surplus = ATOMIC_LONG_INIT(0);
An unaddressed review comment from the v2 discussion raises a design
concern about the complexity of this implementation. Mykyta Yatsenko
suggested an alternative design where bpf_dtor_kptr_work is stored in the
kptr map slot itself (a bpf_kptr_box approach) instead of using two lists
(idle and armed) with 3 atomics and headroom/trimming management.
From the lore discussion at:
https://lore.kernel.org/bpf/51a054a0-e57f-49dc-9527-36da0535087c@gmail.com/
Mykyta's suggestion was:
"I think this still looks too complex:
* 2 lists - idle list and armed list
* 3 atomics, controlling demand/supply
* headroom/trimming management
The complexity introduced for performance reasons, but I'm not sure if
the tradeoff is worth it.
What about the next design:
Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot
itself. Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first
xchg just once per map value, then reuse it across xchg in/out."
While Justin responded with an explanation of ABBA ordering issues, and
Mykyta provided a more detailed counter-proposal with RCU protection and
atomic detach/attach, the v3 implementation does not adopt this
suggestion and still uses the pcpu_freelist approach with a surplus
counter.
Was there a specific technical reason to continue with the two-list
approach rather than exploring the bpf_kptr_box alternative, or would it
be worth revisiting this design choice?
[ ... ]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
[ ... ]
This commit fixes a deadlock bug but lacks a Fixes: tag. The bug was
introduced in commit 14a324f6a67e ("bpf: Wire up freeing of referenced
kptr") which added bpf_map_free_kptrs() that calls kptr destructors
inline without checking for NMI context.
When BPF programs attached to tp_btf/nmi_handler delete map entries,
destructors like bpf_cpumask_release() that take RCU locks are called
from NMI, causing system deadlock.
Suggested: Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25513277524
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-07 18:43 ` bot+bpf-ci
@ 2026-05-07 18:52 ` Justin Suess
0 siblings, 0 replies; 23+ messages in thread
From: Justin Suess @ 2026-05-07 18:52 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
yonghong.song, jolsa, bpf, mykyta.yatsenko5, alexei.starovoitov,
martin.lau, clm, ihor.solodrai
On Thu, May 07, 2026 at 06:43:29PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>
> [ ... ]
>
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>
> [ ... ]
>
> > diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
>
> [ ... ]
>
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>
> [ ... ]
>
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 3b1f0ba02f61..d34fdb99eb8a 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -7,6 +7,7 @@
> > #include <linux/bpf_trace.h>
> > #include <linux/bpf_lirc.h>
> > #include <linux/bpf_verifier.h>
> > +#include <linux/bpf_mem_alloc.h>
> > #include <linux/bsearch.h>
> > #include <linux/btf.h>
> > #include <linux/hex.h>
> > @@ -19,6 +20,7 @@
> > #include <linux/fdtable.h>
> > #include <linux/file.h>
> > #include <linux/fs.h>
> > +#include <linux/irq_work.h>
> > #include <linux/license.h>
> > #include <linux/filter.h>
> > #include <linux/kernel.h>
> > @@ -42,6 +44,8 @@
> > #include <linux/cookie.h>
> > #include <linux/verification.h>
> >
> > +#include "percpu_freelist.h"
> > +
> > #include <net/netfilter/nf_bpf_link.h>
> > #include <net/netkit.h>
> > #include <net/tcx.h>
> > @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock);
> > static DEFINE_IDR(link_idr);
> > static DEFINE_SPINLOCK(link_idr_lock);
> >
> > +struct bpf_dtor_kptr_work {
> > + struct pcpu_freelist_node fnode;
> > + void *obj;
> > + btf_dtor_kfunc_t dtor;
> > +};
> > +
> > +/* Queue pending dtors; the idle pool uses a global pcpu_freelist. */
> > +static struct pcpu_freelist bpf_dtor_kptr_jobs;
> > +static struct pcpu_freelist bpf_dtor_kptr_idle;
> > +/* Keep surplus = total - needed = idle - refs >= 0 so NMI frees never need to allocate. */
> > +static atomic_long_t bpf_dtor_kptr_surplus = ATOMIC_LONG_INIT(0);
>
> An unaddressed review comment from the v2 discussion raises a design
> concern about the complexity of this implementation. Mykyta Yatsenko
> suggested an alternative design where bpf_dtor_kptr_work is stored in the
> kptr map slot itself (a bpf_kptr_box approach) instead of using two lists
> (idle and armed) with 3 atomics and headroom/trimming management.
>
> From the lore discussion at:
> https://lore.kernel.org/bpf/51a054a0-e57f-49dc-9527-36da0535087c@gmail.com/
>
> Mykyta's suggestion was:
>
> "I think this still looks too complex:
> * 2 lists - idle list and armed list
> * 3 atomics, controlling demand/supply
> * headroom/trimming management
>
> The complexity introduced for performance reasons, but I'm not sure if
> the tradeoff is worth it.
>
> What about the next design:
>
> Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot
> itself. Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first
> xchg just once per map value, then reuse it across xchg in/out."
>
> While Justin responded with an explanation of ABBA ordering issues, and
> Mykyta provided a more detailed counter-proposal with RCU protection and
> atomic detach/attach, the v3 implementation does not adopt this
> suggestion and still uses the pcpu_freelist approach with a surplus
> counter.
>
> Was there a specific technical reason to continue with the two-list
> approach rather than exploring the bpf_kptr_box alternative, or would it
> be worth revisiting this design choice?
>
This was addressed.
See https://lore.kernel.org/bpf/afy8yl1bleXmO78H@suesslenovo/
Storing a pointer to the bpf_kptr_box, or any pointer other than the
original kptr, in the map slot would break direct load operations on
referenced kptrs in maps; supporting that would require patching the
program and would break existing BPF assumptions.
> [ ... ]
>
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>
> [ ... ]
>
> This commit fixes a deadlock bug but lacks a Fixes: tag. The bug was
> introduced in commit 14a324f6a67e ("bpf: Wire up freeing of referenced
> kptr") which added bpf_map_free_kptrs() that calls kptr destructors
> inline without checking for NMI context.
>
> When BPF programs attached to tp_btf/nmi_handler delete map entries,
> destructors like bpf_cpumask_release() that take RCU locks are called
> from NMI, causing system deadlock.
>
> Suggested: Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
>
>
I'll add the Fixes: tag in next reroll.
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25513277524
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-07 18:43 ` bot+bpf-ci
@ 2026-05-07 23:45 ` sashiko-bot
2026-05-10 15:13 ` Justin Suess
1 sibling, 1 reply; 23+ messages in thread
From: sashiko-bot @ 2026-05-07 23:45 UTC (permalink / raw)
To: Justin Suess; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] TOCTOU race in `bpf_dtor_kptr_release_one()` can lead to NMI deadlocks.
- [Low] Memory leak on initialization failure in `bpf_dtor_kptr_init()`.
--
commit 488cdc390b90427a591d47211600c09addc95fcd
Author: Justin Suess <utilityemal77@gmail.com>
Date: Thu May 7 17:54:52 2026 +0000
bpf: Offload kptr destructors that run from NMI
This commit queues destructor-backed kptr teardowns to irq_work when
running from NMI context. It maintains an idle-slot surplus counter to
ensure offload slots are preallocated to prevent deadlocks.
Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3b1f0ba02f613..d34fdb99eb8a9 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
[ ... ]
> @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock);
> +static void bpf_dtor_kptr_release_one(void)
> +{
> + struct bpf_dtor_kptr_work *job;
> + long surplus;
> +
> + for (;;) {
> + surplus = atomic_long_read(&bpf_dtor_kptr_surplus);
> + if (surplus <= 0)
> + return;
> +
> + job = bpf_dtor_kptr_pop_idle();
Could this sequence lead to a race condition that drains the idle pool
too early?
If multiple threads read a positive surplus value here, they might all call
bpf_dtor_kptr_pop_idle() before the cmpxchg executes. This could temporarily
drain the idle pool below the required number of reserved slots.
If an NMI fires during this window and attempts to tear down a kptr, would
bpf_dtor_kptr_offload() find the idle pool empty? If so, it looks like it
would trigger the fallback warning and run the destructor inline from the
NMI context, which might reintroduce the deadlocks this patch aims to
prevent.
> + if (!job)
> + return;
> +
> + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_surplus, &surplus,
> + surplus - 1)) {
> + pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
> + continue;
> + }
> +
> + bpf_mem_free(&bpf_global_ma, job);
> + return;
> + }
> +}
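As an illustration of one possible fix: if the ordering were inverted so
the surplus is claimed before the pop, the window described above would
close, because a successful decrement from a positive count reserves a
pooled node for the caller. A userspace sketch of that pattern follows
(hypothetical names, C11 atomics; a sketch of the general
reserve-then-take pattern, not a claim about the patch's exact
invariants):

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace model of reserve-before-pop. 'pool' is a stand-in for the
 * idle freelist depth; all names are hypothetical.
 */
static atomic_long surplus;
static atomic_int pool;

static bool release_one(void)
{
	long cur = atomic_load(&surplus);

	/* Claim capacity first: a successful decrement from a positive
	 * count reserves exactly one pooled item for this caller, so
	 * the subsequent take cannot race another releaser for it. */
	while (cur > 0) {
		if (atomic_compare_exchange_weak(&surplus, &cur, cur - 1)) {
			atomic_fetch_sub(&pool, 1); /* cannot underflow now */
			return true;
		}
	}
	return false;
}
```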
[ ... ]
> +static int __init bpf_dtor_kptr_init(void)
> +{
> + int err;
> +
> + err = pcpu_freelist_init(&bpf_dtor_kptr_idle);
> + if (err)
> + return err;
> +
> + err = pcpu_freelist_init(&bpf_dtor_kptr_jobs);
> + if (err)
> + return err;
Does this error path leak the per-cpu memory allocated for the idle list?
If the second pcpu_freelist_init() fails, the function returns immediately.
Should there be a cleanup path here to free the already initialized
bpf_dtor_kptr_idle list?
> +
> + return 0;
> +}
> +late_initcall(bpf_dtor_kptr_init);
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260507175453.1140400-1-utilityemal77@gmail.com?part=1
* Re: [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser
2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
@ 2026-05-08 0:03 ` sashiko-bot
0 siblings, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-05-08 0:03 UTC (permalink / raw)
To: Justin Suess; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Medium] Failure to acquire the PMU file descriptor (e.g., due to EACCES permissions) causes the test to silently pass rather than fail or skip.
- [Medium] The test executes the `cleanup_kptrs` BPF program but fails to verify if it encountered any internal errors.
- [Medium] Concurrent NMI handlers can cause double-increments of `kptr_dtor_nmi_deleted` and loop aborts due to a TOCTOU race on `slot_is_live`.
--
commit f53640a66945f8f1050d74f78ad482b51aeaf343
Author: Justin Suess <utilityemal77@gmail.com>
selftests/bpf: Add kptr destructor NMI exerciser
This patch adds a BPF selftest program that populates hash and array maps
with cpumask kptrs and clears them from the NMI handler. The test verifies
that the destructor is called on all objects queued for destruction in
hardirq from NMI.
> diff --git a/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
> new file mode 100644
> index 0000000000000..21452b3cf9eb8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c
> @@ -0,0 +1,258 @@
[ ... ]
> +__maybe_unused
> +static int open_nmi_pmu_event_on_cpu(int cpu)
> +{
> + struct perf_event_attr attr = {
> + .size = sizeof(attr),
> + .type = PERF_TYPE_HARDWARE,
> + .config = PERF_COUNT_HW_CPU_CYCLES,
> + .freq = 1,
> + .sample_freq = 1000,
> + };
> + int pmu_fd;
> +
> + pmu_fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1,
> + PERF_FLAG_FD_CLOEXEC);
> + if (pmu_fd == -1) {
> + if (errno == ENOENT || errno == EOPNOTSUPP) {
> + printf("SKIP:no PERF_COUNT_HW_CPU_CYCLES\n");
> + test__skip();
> + }
> + return -1;
> + }
If the syscall fails with an error other than ENOENT or EOPNOTSUPP (like
EACCES), this returns -1 without calling test__skip().
In run_kptr_dtor_nmi_case():
pmu_fd = open_nmi_pmu_event_on_cpu(cpu);
if (pmu_fd < 0)
goto cleanup;
Since no assertions are triggered and the test isn't explicitly skipped,
will the selftest framework register this as a PASS, masking the resource
acquisition failure?
[ ... ]
> +static void run_kptr_dtor_nmi_case(const struct kptr_dtor_nmi_case *test)
> +{
[ ... ]
> + if (!run_syscall_prog(skel->progs.cleanup_kptrs, "cleanup_kptrs"))
> + goto cleanup;
> + /*
> + * The grace period for rcu cannot complete until the CPU that ran the
> + * hard irq_work has passed through a quiescent state after running
> + * our dtor work. This effectively flushes our pending work and allows
> + * the test to verify the dtor was called the expected number of times.
> + */
> + kern_sync_rcu();
> + ASSERT_EQ(skel->bss->kptr_dtor_nmi_cleanup_deleted, 0,
> + "kptr_dtor_nmi_cleanup_deleted");
Does this test need to assert that skel->bss->kptr_dtor_nmi_cleanup_err
is 0?
The cleanup_kptrs BPF program sets kptr_dtor_nmi_cleanup_err if an error
occurs, but it looks like only kptr_dtor_nmi_cleanup_deleted is verified
here. If an internal error happens but no masks were exchanged out,
cleanup_deleted will be 0 and the test might silently pass.
> +cleanup:
> + if (pmu_fd >= 0)
> + close(pmu_fd);
[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
> new file mode 100644
> index 0000000000000..693b07215006b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c
[ ... ]
> +static bool clear_array_slot_from_nmi(__u32 slot)
> +{
> + struct kptr_dtor_nmi_value init = {};
> + int err;
> +
> + if (!slot_is_live(slot))
> + return true;
> +
> + err = bpf_map_update_elem(&kptr_dtor_nmi_array, &slot, &init, BPF_EXIST);
> + if (err) {
> + set_err(&kptr_dtor_nmi_nmi_err,
> + &kptr_dtor_nmi_nmi_errno,
> + KPTR_DTOR_NMI_DELETE_ERR, err);
> + return false;
> + }
> +
> + clear_slot_live(slot);
> + __sync_fetch_and_add(&kptr_dtor_nmi_deleted, 1);
Could concurrent NMI handlers cause a double-increment of kptr_dtor_nmi_deleted
due to a TOCTOU race on slot_is_live()?
Because clear_kptrs_from_nmi is attached to the global tp_btf/nmi_handler
tracepoint, it executes on NMIs across all CPUs. If two NMIs fire concurrently
on different CPUs, both could read slot_is_live(slot) as true and successfully
execute bpf_map_update_elem(..., BPF_EXIST) on the array map.
Both would then call clear_slot_live() and increment kptr_dtor_nmi_deleted.
This double-increment could cause wait_for_nmi_drain() to return prematurely
before all slots are truly drained. The next round's populate_kptrs syscall
would then encounter uncleared kptrs, causing stash_mask() to fail with -EEXIST.
Would it be better to use an atomic test-and-clear operation here?
> + return true;
> +}
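For illustration, an atomic test-and-clear over a liveness bitmap could
look like the following userspace model (names hypothetical; the BPF
program could use __sync_fetch_and_and on a global mask the same way):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of an atomic test-and-clear on a 64-bit liveness
 * bitmap. Only the single caller that observes the bit set wins the
 * slot; concurrent callers see the bit already cleared.
 */
static bool test_and_clear_live(uint64_t *mask, uint32_t slot)
{
	uint64_t bit = 1ULL << slot;
	uint64_t old = __sync_fetch_and_and(mask, ~bit);

	return (old & bit) != 0;
}
```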
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260507175453.1140400-1-utilityemal77@gmail.com?part=2
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-07 23:45 ` sashiko-bot
@ 2026-05-10 15:13 ` Justin Suess
2026-05-10 22:38 ` Alexei Starovoitov
0 siblings, 1 reply; 23+ messages in thread
From: Justin Suess @ 2026-05-10 15:13 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Thu, May 07, 2026 at 11:45:19PM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [High] TOCTOU race in `bpf_dtor_kptr_release_one()` can lead to NMI deadlocks.
> - [Low] Memory leak on initialization failure in `bpf_dtor_kptr_init()`.
> --
>
> commit 488cdc390b90427a591d47211600c09addc95fcd
> Author: Justin Suess <utilityemal77@gmail.com>
> Date: Thu May 7 17:54:52 2026 +0000
>
> bpf: Offload kptr destructors that run from NMI
>
> This commit queues destructor-backed kptr teardowns to irq_work when
> running from NMI context. It maintains an idle-slot surplus counter to
> ensure offload slots are preallocated to prevent deadlocks.
>
> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
>
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 3b1f0ba02f613..d34fdb99eb8a9 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> [ ... ]
> > @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock);
> > +static void bpf_dtor_kptr_release_one(void)
> > +{
> > + struct bpf_dtor_kptr_work *job;
> > + long surplus;
> > +
> > + for (;;) {
> > + surplus = atomic_long_read(&bpf_dtor_kptr_surplus);
> > + if (surplus <= 0)
> > + return;
> > +
> > + job = bpf_dtor_kptr_pop_idle();
>
> Could this sequence lead to a race condition that drains the idle pool
> too early?
>
> If multiple threads read a positive surplus value here, they might all call
> bpf_dtor_kptr_pop_idle() before the cmpxchg executes. This could temporarily
> drain the idle pool below the required number of reserved slots.
>
> If an NMI fires during this window and attempts to tear down a kptr, would
> bpf_dtor_kptr_offload() find the idle pool empty? If so, it looks like it
> would trigger the fallback warning and run the destructor inline from the
> NMI context, which might reintroduce the deadlocks this patch aims to
> prevent.
>
Yes, this race is real.
But there's a bigger problem: the push side of pcpu_freelist is not
NMI-reentrant.
If an NMI arrives mid-push, the pcpu_freelist head is locked by the
interrupted push, which prevents the NMI from popping. Thus we just
deadlock again.
I think we technically have exclusivity here on our percpu head
(we can't be in NMI and process context simultaneously on the same CPU).
So maybe it is safe to forcibly acquire that specific lock? I'm not 100%
sure.
llist won't work since we can't have multiple llist_del_first consumers,
and llist_del_all would render the list null and unusable if we jump to
it from NMI while we're borrowing it.
Making percpu llists won't work either, since the CPU we allocate the
job on might not be the one the dtor runs on, so we need percpu
stealing, which then runs into the same multiple-consumers problem when
threads steal from the same CPU's head.
So this code is not ready either way.
I think a 128-bit cmpxchg could technically solve this problem, but I
don't think we can rely on it on all architectures BPF supports.
If anyone has any guidance on this it would be much appreciated. Been
stuck on this for several weeks...
The construct I need is a data structure with atomic push, pop, and move
between sets. No specific ordering is needed, but updates from producers
must be immediately visible to all consumers. I've looked into Treiber
stacks.
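For concreteness, the minimal Treiber-stack shape I've been looking at,
as a userspace C11 model (hypothetical names; the comment in tpop()
marks exactly where it breaks with multiple consumers):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal Treiber stack, userspace C11 model. */
struct tnode {
	struct tnode *next;
};

struct tstack {
	_Atomic(struct tnode *) head;
};

static void tpush(struct tstack *s, struct tnode *n)
{
	struct tnode *old = atomic_load(&s->head);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&s->head, &old, n));
}

static struct tnode *tpop(struct tstack *s)
{
	struct tnode *old = atomic_load(&s->head);

	while (old) {
		/*
		 * ABA hazard: if 'old' is popped, freed, and reused by
		 * another consumer between the load above and this
		 * cmpxchg, old->next is stale yet the cmpxchg still
		 * "succeeds". This is why naive pop is unsafe with
		 * multiple concurrent consumers; versioned heads
		 * (DCAS) or hazard pointers are the usual cures.
		 */
		if (atomic_compare_exchange_weak(&s->head, &old, old->next))
			break;
	}
	return old;
}
```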
Any help or guidance on this would be appreciated!
> > + if (!job)
> > + return;
> > +
> > + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_surplus, &surplus,
> > + surplus - 1)) {
> > + pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
> > + continue;
> > + }
> > +
> > + bpf_mem_free(&bpf_global_ma, job);
> > + return;
> > + }
> > +}
> [ ... ]
> > +static int __init bpf_dtor_kptr_init(void)
> > +{
> > + int err;
> > +
> > + err = pcpu_freelist_init(&bpf_dtor_kptr_idle);
> > + if (err)
> > + return err;
> > +
> > + err = pcpu_freelist_init(&bpf_dtor_kptr_jobs);
> > + if (err)
> > + return err;
>
> Does this error path leak the per-cpu memory allocated for the idle list?
>
> If the second pcpu_freelist_init() fails, the function returns immediately.
> Should there be a cleanup path here to free the already initialized
> bpf_dtor_kptr_idle list?
>
> > +
> > + return 0;
> > +}
> > +late_initcall(bpf_dtor_kptr_init);
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260507175453.1140400-1-utilityemal77@gmail.com?part=1
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-10 15:13 ` Justin Suess
@ 2026-05-10 22:38 ` Alexei Starovoitov
2026-05-11 1:49 ` Justin Suess
0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2026-05-10 22:38 UTC (permalink / raw)
To: Justin Suess; +Cc: sashiko, bpf
On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
>
>
> Any help or guidance on this would be appreciated!
sorry for the delay. Everyone was at lsfmmbpf for a week+.
All of the solutions so far are way too complicated.
bpf_kptr_xchg() has to remain inlined as single atomic xchg
without slowpath otherwise it ruins the concept
and makes usage unpredictable.
Let's step back.
What is the issue you're trying to solve?
the commit log say:
> A BPF program attached to tp_btf/nmi_handler can delete map entries or
> swap out referenced kptrs from NMI context. Today that runs the kptr
> destructor inline. Destructors such as bpf_cpumask_release() can take
> RCU-related locks, so running them from NMI can deadlock the system.
and looking at selftest from patch 2 you do:
old = bpf_kptr_xchg(&value->mask, old);
if (old)
bpf_cpumask_release(old);
so?
bpf_cpumask_release() is fine to call from any context,
because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
hashtab introduced dtor in bpf_mem_alloc,
so bpf_obj_free_fields() and corresponding dtor's of kptr-s
are called from valid context.
What is the problematic sequence?
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-10 22:38 ` Alexei Starovoitov
@ 2026-05-11 1:49 ` Justin Suess
2026-05-11 15:51 ` Alexei Starovoitov
0 siblings, 1 reply; 23+ messages in thread
From: Justin Suess @ 2026-05-11 1:49 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: sashiko, bpf
On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
> >
> >
> > Any help or guidance on this would be appreciated!
>
> sorry for the delay. Everyone was at lsfmmbpf for a week+.
>
No worries! I hope it was an enjoyable trip and I look forward to
hearing about the conference.
> All of the solutions so far are way too complicated.
> bpf_kptr_xchg() has to remain inlined as single atomic xchg
> without slowpath otherwise it ruins the concept
> and makes usage unpredictable.
>
> Let's step back.
> What is the issue you're trying to solve?
>
> the commit log say:
>
> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
> > swap out referenced kptrs from NMI context. Today that runs the kptr
> > destructor inline. Destructors such as bpf_cpumask_release() can take
> > RCU-related locks, so running them from NMI can deadlock the system.
>
> and looking at selftest from patch 2 you do:
>
> old = bpf_kptr_xchg(&value->mask, old);
> if (old)
> bpf_cpumask_release(old);
>
> so?
> bpf_cpumask_release() is fine to call from any context,
> because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
>
My mistake on that. I picked a bad example for the test, but the test
just exercises the NMI dtor path and doesn't rely on the particular
type of kptr; I picked one that was easy to acquire a reference to.
This dtor is safe, but the task_struct, cgroup, and crypto_ctx dtors
are not. I've annotated why below:
crypto_ctx:
__bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx)
{
if (refcount_dec_and_test(&ctx->usage))
call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */
}
__bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx)
{
bpf_crypto_ctx_release(ctx);
}
task_struct:
__bpf_kfunc void bpf_task_release(struct task_struct *p)
{
put_task_struct_rcu_user(p);
}
__bpf_kfunc void bpf_task_release_dtor(void *p)
{
put_task_struct_rcu_user(p);
}
void put_task_struct_rcu_user(struct task_struct *task)
{
if (refcount_dec_and_test(&task->rcu_users))
call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu
*/
}
cgroup_release_dtor is more complex; it ultimately goes through
callbacks to:
static void css_release(struct percpu_ref *ref)
{
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);
INIT_WORK(&css->destroy_work, css_release_work_fn);
queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE:
workqueue */
}
More generally, a dtor is NMI-unsafe unless the object is BPF-allocated
or the dtor avoids locking, call_rcu(), and workqueues.
> hashtab introduced dtor in bpf_mem_alloc,
> so bpf_obj_free_fields() and corresponding dtor's of kptr-s
> are called from valid context.
>
> What is the problematic sequence?
Stepping back to the beginning, the problematic sequence is:
1. A ref kptr (e.g. task_struct, cgroup, crypto_ctx) is xchg'd into a map.
2. A tp_btf/nmi_handler (NMI ctx) program drops the map element carrying
that referenced kptr field.
3. The dtor runs in NMI context.
4. The dtor calls call_rcu()/queue_work()/some other NMI-unsafe thing,
causing a deadlock.
You can see this demonstrated in my task_struct reproducer [1].
It causes a deadlock by deliberately releasing the last reference to a
task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock.
The typical solution to this is to run the nmi unsafe code in non-nmi
context by offloading to NMI work, as you proposed.
The problem is we need space to for the jobs we enqueue. The required
information to run the dtor is the dtor function and the original object
pointer. The number of dtors that can run in a single tp_btf/nmi_handler
prog is nearly unbounded.
The other problem is that even though bpf_mem_alloc is generally safe in
NMI, we cannot allocate in a path that destroys an object: if the
allocation fails due to memory pressure, we leak the object.
There are a few options, all with drawbacks.
1. Dynamically allocate the job. Non-starter: failing to allocate is
unrecoverable, and under memory pressure we can never schedule the dtor
to run.
2. Store the job intrusively in the object. Requires a safe place for it
within the object; a non-starter because not all objects have space we
can write to.
3. Store the job within the map slot (after the actual kptr). Taken in my
initial v1 approach. Feasible but very complex: it requires per-map
changes and would need DCAS or locking to make updating the map slot and
the job information atomic.
4. Wrap the kptr in a box and store the box in place of the kptr [2].
Proposed by Mykyta; would break direct load access to kptr objects.
5. Make every dtor NMI-safe individually. This would require a lot of
duplicated code and updating every destructor individually. Technically
feasible, but brittle.
6. The least complex option: forbid xchg operations that can run the dtor
in NMI context. That would preserve the inlining fix, but limit the use of
referenced kptrs in BPF programs that run in NMI context.
The approach here:
7. Allocate a new slot for a free job every time we xchg into the map
and put it on a global list. When we run a dtor in NMI, we pop a job
from that list and use it to offload the work via irq_work. Outside NMI
we run normally. The downside is that this breaks inlining for ref kptrs.
...
I may be missing something critical, but everything I've looked at
points to this problem being much more complex than it initially seemed.
I'm happy to discuss further and do a respin.
Justin
[1] https://lore.kernel.org/bpf/383afba6-f732-49bc-a197-147b9d8b1a29@gmail.com/
[2] https://lore.kernel.org/bpf/51a054a0-e57f-49dc-9527-36da0535087c@gmail.com/
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 1:49 ` Justin Suess
@ 2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
2026-05-11 19:22 ` Justin Suess
0 siblings, 2 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2026-05-11 15:51 UTC (permalink / raw)
To: Justin Suess; +Cc: sashiko, bpf
On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote:
> On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
>> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
>> >
>> >
>> > Any help or guidance on this would be appreciated!
>>
>> sorry for the delay. Everyone was at lsfmmbpf for a week+.
>>
>
> No worries! I hope it was an enjoyable trip and I look forward to
> hearing about the conference.
>
>> All of the solutions so far are way too complicated.
>> bpf_kptr_xchg() has to remain inlined as single atomic xchg
>> without slowpath otherwise it ruins the concept
>> and makes usage unpredictable.
>>
>> Let's step back.
>> What is the issue you're trying to solve?
>>
>> the commit log say:
>>
>> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
>> > swap out referenced kptrs from NMI context. Today that runs the kptr
>> > destructor inline. Destructors such as bpf_cpumask_release() can take
>> > RCU-related locks, so running them from NMI can deadlock the system.
>>
>> and looking at selftest from patch 2 you do:
>>
>> old = bpf_kptr_xchg(&value->mask, old);
>> if (old)
>> bpf_cpumask_release(old);
>>
>> so?
>> bpf_cpumask_release() is fine to call from any context,
>> because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
>>
>
> My mistake on that. I picked a bad example for the test, but the test is
> just exercising the nmi dtor path, and doesn't rely on the particular
> type of kptr. I just picked one that was easy to acquire a reference to.
>
> This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are
> not. I've annotated why here:
>
> crypto_ctx:
>
> __bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx)
> {
> if (refcount_dec_and_test(&ctx->usage))
> call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */
> }
>
> __bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx)
> {
> bpf_crypto_ctx_release(ctx);
> }
bpf_crypto_ctx_release() is only allowed in syscall prog types
and dtor via hashmap free will execute in safe context as well.
So not an issue.
> task_struct:
>
> __bpf_kfunc void bpf_task_release(struct task_struct *p)
> {
> put_task_struct_rcu_user(p);
> }
>
> __bpf_kfunc void bpf_task_release_dtor(void *p)
> {
> put_task_struct_rcu_user(p);
> }
>
> void put_task_struct_rcu_user(struct task_struct *task)
> {
> if (refcount_dec_and_test(&task->rcu_users))
> call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu
> */
> }
In theory. I don't think there is a reproducer.
> cgroup_release_dtor is more complex, goes through ultimately through
> callbacks to:
>
> static void css_release(struct percpu_ref *ref)
> {
> struct cgroup_subsys_state *css =
> container_of(ref, struct cgroup_subsys_state, refcnt);
>
> INIT_WORK(&css->destroy_work, css_release_work_fn);
> queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE:
> workqueue */
> }
similar to task_struct. I don't think it's exploitable.
> More generally, unless it's a BPF allocated object or doesn't rely on
> locking/call_rcu or workqueues, the dtor is unsafe.
>
>> hashtab introduced dtor in bpf_mem_alloc,
>> so bpf_obj_free_fields() and corresponding dtor's of kptr-s
>> are called from valid context.
>>
>> What is the problematic sequence?
>
> So from the beginning stepping back.
>
> The problematic sequence:
>
> 1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map.
>
> 2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map
> with that referenced kptr field.
>
> 3. dtor runs in the nmi context
>
> 4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock.
>
> You can see this demonstrated in my task_struct reproducer [1].
did you?
That link points to your v2 with cpumask.
I don't recall seeing task_struct repro.
> It causes a deadlock by deliberately releasing the last reference to a
> task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock.
>
> The typical solution to this is to run the nmi unsafe code in non-nmi
> context by offloading to NMI work, as you proposed.
>
> The problem is we need space to for the jobs we enqueue. The required
> information to run the dtor is the dtor function and the original object
> pointer. The number of dtors that can run in a single tp_btf/nmi_handler
> prog is nearly unbounded.
>
> The other problem is even though bpf_mem_alloc is safe in NMI generally,
> we cannot allocate in path that destroys an object. If the allocation
> fails due to memory pressure, we leak the object.
>
> There are a few options, all with drawbacks.
>
> 1. Dynamically allocate the job. Non-starter, failing to allocate is
> unrecoverable, memory pressure means we can't ever schedule the dtor to
> run.
>
> 2. Store job ntrusively in the object : Requires a safe place to place
> it within the object. Bad because not all objects have a space we can write to.
> Non-starter.
>
> 3. Within the map slot (after actual kptr): Taken with my initial approach in v1.
> Significant complexity and requires per-map changes. Feasible but very
> complex and would need DCAS or locking to make updating the map slot and
> our job information atomic.
>
> 4. Wrapping the kptr in a box and storing it in place of the kptr [2] :
> Proposed by Mykyta. Would break direct load access to kptr objects.
>
> 5. Make every dtor nmi safe individually. This would require a lot of
> duplicated code and require updating every destructor invididually.
> Feasible technically, but seems brittle.
>
> 6. One that would be the least complex, would be forbidding xchg operations
> that can run the dtor in NMI context. That would preserve the inlining fix,
> but limit our usage of referenced kptrs in BPF programs that run in NMI context.
>
> The approach here:
>
> 7. Allocate a new spot for a free job every time we xchg into the map
> and put it in a global list. When in NMI and we run a dtor, we pop a
> job from that slot and use it to offload our work via irq_work. If
> we're not in NMI we run normally. Downside is this breaks inlining for
> ref kptrs.
>
> ...
>
> I may be missing something critical, but everything I've looked at
> points to this problem being much more complex that it initially seemed.
yes. it is complex. all 7 options are not good.
I recall the whole thing started with desire to add bpf_put_file_dtor().
This was discussed with VFS maintainers and they didn't like the idea,
since it needs a ton of work to make it safe:
. umount notifier to make sure stashed file doesn't hold umount
. potential circular refcnt issue if file to bpf map stashed into the same map
. scm_rights-like facility with garbage collection
So generic file stash is really no go.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 15:51 ` Alexei Starovoitov
@ 2026-05-11 16:38 ` Justin Suess
2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 19:22 ` Justin Suess
1 sibling, 1 reply; 23+ messages in thread
From: Justin Suess @ 2026-05-11 16:38 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: sashiko, bpf
On Mon, May 11, 2026 at 08:51:53AM -0700, Alexei Starovoitov wrote:
> On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote:
> > On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
> >> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
> >> >
> >> >
> >> > Any help or guidance on this would be appreciated!
> >>
> >> sorry for the delay. Everyone was at lsfmmbpf for a week+.
> >>
> >
> > No worries! I hope it was an enjoyable trip and I look forward to
> > hearing about the conference.
> >
> >> All of the solutions so far are way too complicated.
> >> bpf_kptr_xchg() has to remain inlined as single atomic xchg
> >> without slowpath otherwise it ruins the concept
> >> and makes usage unpredictable.
> >>
> >> Let's step back.
> >> What is the issue you're trying to solve?
> >>
> >> the commit log say:
> >>
> >> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
> >> > swap out referenced kptrs from NMI context. Today that runs the kptr
> >> > destructor inline. Destructors such as bpf_cpumask_release() can take
> >> > RCU-related locks, so running them from NMI can deadlock the system.
> >>
> >> and looking at selftest from patch 2 you do:
> >>
> >> old = bpf_kptr_xchg(&value->mask, old);
> >> if (old)
> >> bpf_cpumask_release(old);
> >>
> >> so?
> >> bpf_cpumask_release() is fine to call from any context,
> >> because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
> >>
> >
> > My mistake on that. I picked a bad example for the test, but the test is
> > just exercising the nmi dtor path, and doesn't rely on the particular
> > type of kptr. I just picked one that was easy to acquire a reference to.
> >
> > This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are
> > not. I've annotated why here:
> >
> > crypto_ctx:
> >
> > __bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx)
> > {
> > if (refcount_dec_and_test(&ctx->usage))
> > call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */
> > }
> >
> > __bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx)
> > {
> > bpf_crypto_ctx_release(ctx);
> > }
>
> bpf_crypto_ctx_release() is only allowed in syscall prog types
> and dtor via hashmap free will execute in safe context as well.
> So not an issue.
>
It doesn't matter that bpf_crypto_ctx_release() is only allowed in syscall
program types; we don't need to call the kfunc directly to invoke it from
a tracing program. All we have to do is delete a map element that already
has that type of kptr in it.
The release function is not the issue; it's bpf_map_delete_elem(), which
ultimately invokes bpf_obj_free_fields() and then the dtor.
Hashmap free does not execute in a safe context. See the kernel splat
below.
> > task_struct:
> >
> > __bpf_kfunc void bpf_task_release(struct task_struct *p)
> > {
> > put_task_struct_rcu_user(p);
> > }
> >
> > __bpf_kfunc void bpf_task_release_dtor(void *p)
> > {
> > put_task_struct_rcu_user(p);
> > }
> >
> > void put_task_struct_rcu_user(struct task_struct *task)
> > {
> > if (refcount_dec_and_test(&task->rcu_users))
> > call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu
> > */
> > }
>
> In theory. I don't think there is a reproducer.
>
I sent the wrong link in my reply; apologies.
This does indeed reproduce, even on bpf-next/master.
Using this reproducer:
https://gist.githubusercontent.com/RazeLighter777/5539336d79ab1854f9e9550c6dcab118/raw/082f1eeb2dd445936e64dd3a33861764690bde82/task_struct_dtor_deadlock.patch
On the latest bpf-next/master : "7e033543a2ab: Merge
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.1-rc3":
With these kconfig options set (not necessary, but increase
reproducibility):
[justin@zenbox linux-next (repro $%)]$ rg 'RCU_NOCB|RCU_EXPERT' .config
171:CONFIG_RCU_EXPERT=y
188:CONFIG_RCU_NOCB_CPU=y
Run ./test_progs -t task_kptr_nmi_deadlock_repro
[ 21.604612] ================================
[ 21.604612] WARNING: inconsistent lock state
[ 21.604613] 7.1.0-rc2-g7e033543a2ab #126 Tainted: G OE
[ 21.604614] --------------------------------
[ 21.604615] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 21.604616] test_progs/494 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 21.604617] ffffa2077d66f0e8 (&rdp->nocb_lock){....}-{2:2}, at: __call_rcu_common.constprop.0+0x309/0x730
[ 21.604625] {INITIAL USE} state was registered at:
[ 21.604625] lock_acquire+0xbf/0x2e0
[ 21.604628] _raw_spin_lock+0x30/0x40
[ 21.604632] rcu_nocb_gp_kthread+0x13f/0xb90
[ 21.604633] kthread+0xf4/0x130
[ 21.604636] ret_from_fork+0x264/0x330
[ 21.604639] ret_from_fork_asm+0x1a/0x30
[ 21.604642] irq event stamp: 194
[ 21.604642] hardirqs last enabled at (193): [<ffffffff9700012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 21.604644] hardirqs last disabled at (194): [<ffffffff980e7f4f>] exc_nmi+0x7f/0x110
[ 21.604646] softirqs last enabled at (0): [<ffffffff972d9838>] copy_process+0xfd8/0x2b80
[ 21.604648] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 21.604650]
other info that might help us debug this:
[ 21.604651] Possible unsafe locking scenario:
[ 21.604651] CPU0
[ 21.604651] ----
[ 21.604652] lock(&rdp->nocb_lock);
[ 21.604653] <Interrupt>
[ 21.604653] lock(&rdp->nocb_lock);
[ 21.604654]
*** DEADLOCK ***
[ 21.604654] no locks held by test_progs/494.
[ 21.604655]
stack backtrace:
[ 21.604657] CPU: 1 UID: 0 PID: 494 Comm: test_progs Tainted: G OE 7.1.0-rc2-g7e033543a2ab #126 PREEMPT(full)
[ 21.604659] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 21.604660] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 21.604660] Call Trace:
[ 21.604662] <TASK>
[ 21.604663] dump_stack_lvl+0x5d/0x80
[ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
[ 21.604669] lock_acquire+0x295/0x2e0
[ 21.604671] ? terminate_walk+0x33/0x160
[ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
[ 21.604679] _raw_spin_lock+0x30/0x40
[ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
[ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
[ 21.604686] bpf_obj_free_fields+0x118/0x250
[ 21.604691] free_htab_elem+0x85/0xd0
[ 21.604694] htab_map_delete_elem+0x168/0x230
[ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
[ 21.604700] bpf_trace_run3+0x126/0x430
[ 21.604703] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 21.604707] nmi_handle.part.0+0x15b/0x250
[ 21.604710] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 21.604712] default_do_nmi+0x120/0x180
[ 21.604715] exc_nmi+0xe3/0x110
[ 21.604717] asm_exc_nmi+0xb7/0x100
[ 21.604722] RIP: 0033:0x7fd27529a7ee
[ 21.604724] Code: ff 0f 1f 00 49 89 ca 48 8b 44 24 20 0f 05 48 83 c4 18 c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 10 ff 74 24 18 e8 63 ff ff ff 5a <59> 48 3d 00 f0 ff ff 77 09 48 83 c4 08 c3 0f 1f 40 00 48 8b 15 e9
[ 21.604725] RSP: 002b:00007ffe5e3eca08 EFLAGS: 00000202
[ 21.604726] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fd2756cef48
[ 21.604727] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 000000000000000c
[ 21.604728] RBP: 00007ffe5e3ecae0 R08: 0000000000000000 R09: 0000000000000000
[ 21.604728] R10: 0000000000000000 R11: 00007fd2756cec40 R12: 0000000000000003
[ 21.604729] R13: 00007fd275887000 R14: 00007ffe5e3eceb8 R15: 000055cd1271d8d0
[ 21.604734] </TASK>
[ 21.666237] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 21.691290] perf: interrupt took too long (3144 > 3126), lowering kernel.perf_event_max_sample_rate to 63000
[ 21.730227] perf: interrupt took too long (3932 > 3930), lowering kernel.perf_event_max_sample_rate to 50000
[ 21.805475] perf: interrupt took too long (4916 > 4915), lowering kernel.perf_event_max_sample_rate to 40000
> > cgroup_release_dtor is more complex, goes through ultimately through
> > callbacks to:
> >
> > static void css_release(struct percpu_ref *ref)
> > {
> > struct cgroup_subsys_state *css =
> > container_of(ref, struct cgroup_subsys_state, refcnt);
> >
> > INIT_WORK(&css->destroy_work, css_release_work_fn);
> > queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE:
> > workqueue */
> > }
>
> similar to task_struct. I don't think it's exploitable.
>
> > More generally, unless it's a BPF allocated object or doesn't rely on
> > locking/call_rcu or workqueues, the dtor is unsafe.
> >
> >> hashtab introduced dtor in bpf_mem_alloc,
> >> so bpf_obj_free_fields() and corresponding dtor's of kptr-s
> >> are called from valid context.
> >>
> >> What is the problematic sequence?
> >
> > So from the beginning stepping back.
> >
> > The problematic sequence:
> >
> > 1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map.
> >
> > 2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map
> > with that referenced kptr field.
> >
> > 3. dtor runs in the nmi context
> >
> > 4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock.
> >
> > You can see this demonstrated in my task_struct reproducer [1].
>
> did you?
> That link points to your v2 with cpumask.
> I don't recall seeing task_struct repro.
>
I misplaced the link; see above.
> > It causes a deadlock by deliberately releasing the last reference to a
> > task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock.
> >
> > The typical solution to this is to run the nmi unsafe code in non-nmi
> > context by offloading to NMI work, as you proposed.
> >
> > The problem is we need space to for the jobs we enqueue. The required
> > information to run the dtor is the dtor function and the original object
> > pointer. The number of dtors that can run in a single tp_btf/nmi_handler
> > prog is nearly unbounded.
> >
> > The other problem is even though bpf_mem_alloc is safe in NMI generally,
> > we cannot allocate in path that destroys an object. If the allocation
> > fails due to memory pressure, we leak the object.
> >
> > There are a few options, all with drawbacks.
> >
> > 1. Dynamically allocate the job. Non-starter, failing to allocate is
> > unrecoverable, memory pressure means we can't ever schedule the dtor to
> > run.
> >
> > 2. Store job ntrusively in the object : Requires a safe place to place
> > it within the object. Bad because not all objects have a space we can write to.
> > Non-starter.
> >
> > 3. Within the map slot (after actual kptr): Taken with my initial approach in v1.
> > Significant complexity and requires per-map changes. Feasible but very
> > complex and would need DCAS or locking to make updating the map slot and
> > our job information atomic.
> >
> > 4. Wrapping the kptr in a box and storing it in place of the kptr [2] :
> > Proposed by Mykyta. Would break direct load access to kptr objects.
> >
> > 5. Make every dtor nmi safe individually. This would require a lot of
> > duplicated code and require updating every destructor invididually.
> > Feasible technically, but seems brittle.
> >
> > 6. One that would be the least complex, would be forbidding xchg operations
> > that can run the dtor in NMI context. That would preserve the inlining fix,
> > but limit our usage of referenced kptrs in BPF programs that run in NMI context.
> >
> > The approach here:
> >
> > 7. Allocate a new spot for a free job every time we xchg into the map
> > and put it in a global list. When in NMI and we run a dtor, we pop a
> > job from that slot and use it to offload our work via irq_work. If
> > we're not in NMI we run normally. Downside is this breaks inlining for
> > ref kptrs.
> >
> > ...
> >
> > I may be missing something critical, but everything I've looked at
> > points to this problem being much more complex that it initially seemed.
>
> yes. it is complex. all 7 options are not good.
>
Agreed. None of them are good but I just don't see any good
alternatives.
> I recall the whole thing started with desire to add bpf_put_file_dtor().
> This was discussed with VFS maintainers and they didn't like the idea,
> since it needs a ton of work to make it safe:
> . umount notifier to make sure stashed file doesn't hold umount
> . potential circular refcnt issue if file to bpf map stashed into the same map
> . scm_rights-like facility with garbage collection
>
> So generic file stash is really no go.
That may be the case, but this is a separate issue. This bug could apply
to future refcounted kptrs as well.
Thanks,
Justin
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 16:38 ` Justin Suess
@ 2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2026-05-11 17:18 UTC (permalink / raw)
To: Justin Suess; +Cc: sashiko, bpf
On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> [ 21.604660] Call Trace:
> [ 21.604662] <TASK>
> [ 21.604663] dump_stack_lvl+0x5d/0x80
> [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> [ 21.604669] lock_acquire+0x295/0x2e0
> [ 21.604671] ? terminate_walk+0x33/0x160
> [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> [ 21.604679] _raw_spin_lock+0x30/0x40
> [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> [ 21.604686] bpf_obj_free_fields+0x118/0x250
> [ 21.604691] free_htab_elem+0x85/0xd0
> [ 21.604694] htab_map_delete_elem+0x168/0x230
> [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> [ 21.604700] bpf_trace_run3+0x126/0x430
that's better.
Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
but left check_and_free_fields() in free_htab_elem().
I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
and fallback to bpf_mem_alloc at map create time when map has kptrs
with dtors. Even when BPF_F_NO_PREALLOC is not specified.
Kumar,
thoughts?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
@ 2026-05-11 19:22 ` Justin Suess
1 sibling, 0 replies; 23+ messages in thread
From: Justin Suess @ 2026-05-11 19:22 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: sashiko, bpf
On Mon, May 11, 2026 at 08:51:53AM -0700, Alexei Starovoitov wrote:
> On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote:
> > On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
> >> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
Here's a reproducer for the cgroup case:
https://gist.githubusercontent.com/RazeLighter777/5f77cdfe035a4e22ee2642ae7db6387d/raw/10898d27040a07098cccc5d0785d9ad6620344e7/cgroup_kptr_nmi_deadlock_repro
Hacked together with an AI prompt but functional.
Exercises a different path, but more consistently splats even without
CONFIG_RCU_NOCB_CPU / CONFIG_RCU_EXPERT since this dtor uses workqueue.
Had to use an fexit hook to get the timing condition right to release
the last cgroup reference.
But this lets you see the deadlock is indeed in the dtor in NMI.
This is on the same bpf-next/master
7e033543a2ab4c72319201298ed458e3bbddd82f:
[ 15.160694] ================================
[ 15.160695] WARNING: inconsistent lock state
[ 15.160695] 7.1.0-rc2-g7e033543a2ab-dirty #130 Not tainted
[ 15.160697] --------------------------------
[ 15.160697] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 15.160698] test_progs/434 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 15.160700] ffff9096fd66ced8 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xde/0x720
[ 15.160707] {INITIAL USE} state was registered at:
[ 15.160708] lock_acquire+0xbf/0x2e0
[ 15.160711] _raw_spin_lock+0x30/0x40
[ 15.160715] __queue_work+0xde/0x720
[ 15.160716] queue_work_on+0x54/0xa0
[ 15.160716] start_poll_synchronize_rcu_expedited+0xaf/0x110
[ 15.160719] rcu_init+0x958/0x990
[ 15.160722] start_kernel+0x746/0x980
[ 15.160725] x86_64_start_reservations+0x24/0x30
[ 15.160727] __pfx_reserve_bios_regions+0x0/0x10
[ 15.160729] common_startup_64+0x12c/0x138
[ 15.160731] irq event stamp: 18704
[ 15.160732] hardirqs last enabled at (18703): [<ffffffffa200148a>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 15.160734] hardirqs last disabled at (18704): [<ffffffffa30e3f8f>] exc_nmi+0x7f/0x110
[ 15.160737] softirqs last enabled at (18698): [<ffffffffa22e5800>] __irq_exit_rcu+0xc0/0x100
[ 15.160739] softirqs last disabled at (18687): [<ffffffffa22e5800>] __irq_exit_rcu+0xc0/0x100
[ 15.160741]
[ 15.160741] other info that might help us debug this:
[ 15.160741] Possible unsafe locking scenario:
[ 15.160741]
[ 15.160742] CPU0
[ 15.160742] ----
[ 15.160742] lock(&pool->lock);
[ 15.160743] <Interrupt>
[ 15.160743] lock(&pool->lock);
[ 15.160744]
[ 15.160744] *** DEADLOCK ***
[ 15.160744]
[ 15.160744] no locks held by test_progs/434.
[ 15.160745]
[ 15.160745] stack backtrace:
[ 15.160747] CPU: 1 UID: 0 PID: 434 Comm: test_progs Not tainted 7.1.0-rc2-g7e033543a2ab-dirty #130 PREEMPT(full)
[ 15.160749] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 15.160750] Call Trace:
[ 15.160751] <TASK>
[ 15.160753] dump_stack_lvl+0x5d/0x80
[ 15.160757] print_usage_bug.part.0+0x22b/0x2c0
[ 15.160760] lock_acquire+0x295/0x2e0
[ 15.160762] ? srso_alias_return_thunk+0x5/0xfbef5
[ 15.160763] ? __queue_work+0xde/0x720
[ 15.160767] _raw_spin_lock+0x30/0x40
[ 15.160768] ? __queue_work+0xde/0x720
[ 15.160769] __queue_work+0xde/0x720
[ 15.160772] queue_work_on+0x54/0xa0
[ 15.160774] bpf_cgroup_release_dtor+0x12e/0x140
[ 15.160778] bpf_obj_free_fields+0x118/0x250
[ 15.160782] free_htab_elem+0x85/0xd0
[ 15.160785] htab_map_delete_elem+0x168/0x230
[ 15.160790] bpf_prog_23fcbbeb395ac6b4_clear_cgroup_kptrs_from_nmi+0x54/0x74
[ 15.160792] bpf_trace_run3+0x126/0x430
[ 15.160795] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 15.160799] nmi_handle.part.0+0x15b/0x250
[ 15.160802] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 15.160804] default_do_nmi+0x120/0x180
[ 15.160807] exc_nmi+0xe3/0x110
[ 15.160809] asm_exc_nmi+0xb7/0x100
[ 15.160810] RIP: 0033:0x5607a669541b
[ 15.160813] Code: c7 45 f0 00 00 00 00 eb 1a 8b 55 f0 8b 45 f4 01 d0 48 63 d0 48 8b 45 a8 48 01 d0 48 89 45 a8 83 45 f0 01 81 7d f0 3f 42 0f 00 <7e> dd e8 7e f5 ff ff 48 89 45 f8 48 8b 45 f8 48 3b 45 e8 73 16 83
[ 15.160814] RSP: 002b:00007ffdb09c1dc0 EFLAGS: 00000293
[ 15.160816] RAX: 0000003aced4e2f4 RBX: 00007f1d8d574000 RCX: 000000000000000f
[ 15.160816] RDX: 00000000000ad857 RSI: 00007f1d8d577000 RDI: 0000000000000001
[ 15.160817] RBP: 00007ffdb09c1e30 R08: 00007ffdb09c1da0 R09: 00007f1d8d577010
[ 15.160818] R10: 0000000000001614 R11: 0009718b9187183f R12: 0000000000000003
[ 15.160818] R13: 00007f1d8d5b6000 R14: 00007ffdb09c3358 R15: 00005607a9daf890
[ 15.160824] </TASK>
[ 15.214040] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 15.246002] perf: interrupt took too long (3135 > 3126), lowering kernel.perf_event_max_sample_rate to 63000
[ 15.308032] perf: interrupt took too long (3928 > 3918), lowering kernel.perf_event_max_sample_rate to 50000
[ 15.500072] perf: interrupt took too long (4912 > 4910), lowering kernel.perf_event_max_sample_rate to 40000
Justin
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 17:18 ` Alexei Starovoitov
@ 2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
2026-05-12 1:43 ` Justin Suess
0 siblings, 1 reply; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-11 20:10 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf
On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> > [ 21.604660] Call Trace:
> > [ 21.604662] <TASK>
> > [ 21.604663] dump_stack_lvl+0x5d/0x80
> > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> > [ 21.604669] lock_acquire+0x295/0x2e0
> > [ 21.604671] ? terminate_walk+0x33/0x160
> > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> > [ 21.604679] _raw_spin_lock+0x30/0x40
> > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> > [ 21.604691] free_htab_elem+0x85/0xd0
> > [ 21.604694] htab_map_delete_elem+0x168/0x230
> > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> > [ 21.604700] bpf_trace_run3+0x126/0x430
>
> that's better.
> Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> but left check_and_free_fields() in free_htab_elem().
>
> I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> and fallback to bpf_mem_alloc at map create time when map has kptrs
> with dtors. Even when BPF_F_NO_PREALLOC is not specified.
>
> Kumar,
>
> thoughts?
>
>
Yeah, removing it from the path that helpers can invoke seems simpler.
Remember though, this splat is just for hashtab, we have similar
bpf_obj_free_fields() in array map on update. I think fundamentally
the main issue here is that we logically free special fields when a
map value is freed or deleted. When updating array maps we logically
'free' and then 'update' the same map value together. For hashtab, it
happens on update/delete.
We could relax this behavior to avoid eagerly freeing these special
fields on update or deletion. The only worry is how this would impact
programs that have come to rely on the existing behavior. There are
patterns where people expect kptr to be NULL on some new map value,
which causes programs to return errors when that expectation is not
met. Just doing the skip when irqs_disabled() doesn't save us from the
surprise side-effect. We need to decide upon this first before
discussing the shape of the solution.
This is the theoretical concern; In practice, I think most people who
depend on such behavior use kptr in local storage maps (in
schedulers). So it probably won't be a problem in practice, even
though we can't judge this ahead of time. Also, we eagerly reuse map
values when using memalloc, so the guarantees are already pretty weak
I guess.
So, if we are not going to go through a grace period (like local
storage) and free back to kernel allocator before reuse, we should
relax field freeing behavior. At best, we should cancel work for
timer, wq, and task_work, leaving other items as-is. E.g.
BPF_UPTR is used in task storage which I think is accessible to
tracing programs, I am not sure how safe unpin_user_page() is when
called from random reentrant contexts. We might have more cases in the
future; we cannot guarantee we can handle everything in NMIs
universally.
So the best course of action seems to be relaxing
bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
on async work (timer, wq, task_work) for delete / update and let other
fields be as-is. We likely need to do bpf_obj_free_fields()
additionally before prealloc_destroy() now, but that should be simple.
Whether or not to use bpf_ma when kptrs are used in prealloc map is a
separate change.
This should hopefully resolve the issue, unless I missed other cases.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 1:43 ` Justin Suess
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 23+ messages in thread
From: Justin Suess @ 2026-05-12 1:43 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi; +Cc: Alexei Starovoitov, sashiko, bpf
On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> > > [ 21.604660] Call Trace:
> > > [ 21.604662] <TASK>
> > > [ 21.604663] dump_stack_lvl+0x5d/0x80
> > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> > > [ 21.604669] lock_acquire+0x295/0x2e0
> > > [ 21.604671] ? terminate_walk+0x33/0x160
> > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604679] _raw_spin_lock+0x30/0x40
> > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> > > [ 21.604691] free_htab_elem+0x85/0xd0
> > > [ 21.604694] htab_map_delete_elem+0x168/0x230
> > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> > > [ 21.604700] bpf_trace_run3+0x126/0x430
> >
> > that's better.
> > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> > but left check_and_free_fields() in free_htab_elem().
> >
> > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> > and fallback to bpf_mem_alloc at map create time when map has kptrs
> > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
> >
> > Kumar,
> >
> > thoughts?
> >
> >
>
> Yeah, removing it from the path that helpers can invoke seems simpler.
> Remember though, this splat is just for hashtab, we have similar
> bpf_obj_free_fields() in array map on update. I think fundamentally
> the main issue here is that we logically free special fields when a
> map value is freed or deleted. When updating array maps we logically
> 'free' and then 'update' the same map value together. For hashtab, it
> happens on update/delete.
>
> We could relax this behavior to avoid eagerly freeing these special
> fields on update or deletion. The only worry is how this would impact
> programs that have come to rely on the existing behavior. There are
> patterns where people expect kptr to be NULL on some new map value,
> which causes programs to return errors when that expectation is not
> met. Just doing the skip when irqs_disabled() doesn't save us from the
> surprise side-effect. We need to decide upon this first before
> discussing the shape of the solution.
>
> This is the theoretical concern; In practice, I think most people who
> depend on such behavior use kptr in local storage maps (in
> schedulers). So it probably won't be a problem in practice, even
> though we can't judge this ahead of time. Also, we eagerly reuse map
> values when using memalloc, so the guarantees are already pretty weak
> I guess.
>
> So, if we are not going to go through a grace period (like local
> storage) and free back to kernel allocator before reuse, we should
> relax field freeing behavior. At most, we should cancel work for
> timer, wq, and task_work, leaving other items as-is. E.g.
> BPF_UPTR is used in task storage which I think is accessible to
> tracing programs, I am not sure how safe unpin_user_page() is when
> called from random reentrant contexts. We might have more cases in the
> future, we cannot guarantee we can handle everything in NMIs
> universally.
>
> So the best course of action seems to be relaxing
> bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
> on async work (timer, wq, task_work) for delete / update and let other
> fields be as-is. We likely need to do bpf_obj_free_fields()
> additionally before prealloc_destroy() now, but that should be simple.
> Whether or not to use bpf_ma when kptrs are used in prealloc map is a
> separate change.
>
> This should hopefully resolve the issue, unless I missed other cases.
This does sound good; so you'd set up bpf_obj_free_fields() in the
htab allocator dtor for the final free and rely on the allocator's
existing NMI deferral?
The missing piece is whether to handle this differently in NMI or just
always do it with the deferral. Also the prealloc question needs
answering.
This fix would preserve the xchg inlining, allow failing with -ENOMEM
on exhaustion, and avoid the weird locking and atomic counting.
Appreciate the time from both of you; I learned a lot more about NMI
than I bargained for writing this. There are probably more weird
NMI cases out there. FWIW, there are zero tp_btf/nmi_handler progs anywhere I can
find on the internet, so this is a weird edge case.
Since I brought this issue up I'll finish it out and send patches next
week.
Even if a generic file kptr isn't possible, hopefully this can stop sashiko
from complaining about every newly introduced kptr dtor being unsafe in NMI :)
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 1:43 ` Justin Suess
@ 2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
2026-05-12 1:55 ` Alexei Starovoitov
0 siblings, 1 reply; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 1:46 UTC (permalink / raw)
To: Justin Suess; +Cc: Alexei Starovoitov, sashiko, bpf
On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> > > > [ 21.604660] Call Trace:
> > > > [ 21.604662] <TASK>
> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80
> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> > > > [ 21.604669] lock_acquire+0x295/0x2e0
> > > > [ 21.604671] ? terminate_walk+0x33/0x160
> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> > > > [ 21.604679] _raw_spin_lock+0x30/0x40
> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> > > > [ 21.604691] free_htab_elem+0x85/0xd0
> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230
> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> > > > [ 21.604700] bpf_trace_run3+0x126/0x430
> > >
> > > that's better.
> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> > > but left check_and_free_fields() in free_htab_elem().
> > >
> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
> > >
> > > Kumar,
> > >
> > > thoughts?
> > >
> > >
> >
> > Yeah, removing it from the path that helpers can invoke seems simpler.
> > Remember though, this splat is just for hashtab, we have similar
> > bpf_obj_free_fields() in array map on update. I think fundamentally
> > the main issue here is that we logically free special fields when a
> > map value is freed or deleted. When updating array maps we logically
> > 'free' and then 'update' the same map value together. For hashtab, it
> > happens on update/delete.
> >
> > We could relax this behavior to avoid eagerly freeing these special
> > fields on update or deletion. The only worry is how this would impact
> > programs that have come to rely on the existing behavior. There are
> > patterns where people expect kptr to be NULL on some new map value,
> > which causes programs to return errors when that expectation is not
> > met. Just doing the skip when irqs_disabled() doesn't save us from the
> > surprise side-effect. We need to decide upon this first before
> > discussing the shape of the solution.
> >
> > This is the theoretical concern; In practice, I think most people who
> > depend on such behavior use kptr in local storage maps (in
> > schedulers). So it probably won't be a problem in practice, even
> > though we can't judge this ahead of time. Also, we eagerly reuse map
> > values when using memalloc, so the guarantees are already pretty weak
> > I guess.
> >
> > So, if we are not going to go through a grace period (like local
> > storage) and free back to kernel allocator before reuse, we should
> > relax field freeing behavior. At most, we should cancel work for
> > timer, wq, and task_work, leaving other items as-is. E.g.
> > BPF_UPTR is used in task storage which I think is accessible to
> > tracing programs, I am not sure how safe unpin_user_page() is when
> > called from random reentrant contexts. We might have more cases in the
> > future, we cannot guarantee we can handle everything in NMIs
> > universally.
> >
> > So the best course of action seems to be relaxing
> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
> > on async work (timer, wq, task_work) for delete / update and let other
> > fields be as-is. We likely need to do bpf_obj_free_fields()
> > additionally before prealloc_destroy() now, but that should be simple.
> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
> > separate change.
> >
> > This should hopefully resolve the issue, unless I missed other cases.
> This does sound good, so you'd set the bpf_obj_free_fields up in the
> htab allocator dtor for the final free and rely on the allocators
> existing nmi deferral?
It is already set, except for prealloc maps. But we can call it before
destroying the pcpu freelist etc.
>
> The missing piece is whether to handle this differently in NMI or just
> always do it with the deferral. Also the prealloc question needs
> answering.
There is no deferral here. I'm saying that we just cancel for timer,
wq, task work, and leave other fields as is. So we don't have active
work pending for async items.
So as long as the item keeps getting recycled in the allocator, we
don't free these fields. Once the memalloc is destroyed, the dtor runs
in a known safe context where we can assume bpf_obj_free_fields won't
deadlock or run into any problems.
>
> [...]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 1:55 ` Alexei Starovoitov
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:07 ` Justin Suess
0 siblings, 2 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2026-05-12 1:55 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Justin Suess; +Cc: sashiko, bpf
On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
>>
>> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
>> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
>> > <alexei.starovoitov@gmail.com> wrote:
>> > >
>> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
>> > > > [...]
>> > >
>> > > that's better.
>> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
>> > > but left check_and_free_fields() in free_htab_elem().
>> > >
>> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
>> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
>> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
>> > >
>> > > Kumar,
>> > >
>> > > thoughts?
>> > >
>> > >
>> >
>> > [...]
>> This does sound good, so you'd set the bpf_obj_free_fields up in the
>> htab allocator dtor for the final free and rely on the allocators
>> existing nmi deferral?
>
> It is already set, except for prealloc maps. But we can call it before
> destroying the pcpu freelist etc.
htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already.
So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps.
>>
>> The missing piece is whether to handle this differently in NMI or just
>> always do it with the deferral. Also the prealloc question needs
>> answering.
>
> There is no deferral here. I'm saying that we just cancel for timer,
> wq, task work, and leave other fields as is. So we don't have active
> work pending for async items.
>
> So as long as the item keeps getting recycled in the allocator, we
> don't free these fields. Once the memalloc is destroyed, the dtor runs
> in a known safe context where we can assume bpf_obj_free_fields won't
> deadlock or run into any problems.
So the plan is to do
if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
just ignore it?
And no other changes anywhere at all?
That would be too good to be true :)
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 1:55 ` Alexei Starovoitov
@ 2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:07 ` Justin Suess
1 sibling, 1 reply; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:03 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf
On Tue, 12 May 2026 at 03:55, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > [...]
>
> So the plan is to do
> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> just ignore it?
> And no other changes anywhere at all?
>
> That would be too good to be true :)
I don't know whether in_nmi() would be sufficient; we likely need
irqs_disabled(). At that point, why not always ignore it, since whether
freeing the fields is safe depends on where you're running? I would still
cancel async fields, since they're already any-context safe.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 1:55 ` Alexei Starovoitov
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 2:07 ` Justin Suess
2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 23+ messages in thread
From: Justin Suess @ 2026-05-12 2:07 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Kumar Kartikeya Dwivedi, sashiko, bpf
On Mon, May 11, 2026 at 06:55:50PM -0700, Alexei Starovoitov wrote:
> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
> >>
> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> >> > <alexei.starovoitov@gmail.com> wrote:
> > There is no deferral here. I'm saying that we just cancel for timer,
> > wq, task work, and leave other fields as is. So we don't have active
> > work pending for async items.
> >
> > So as long as the item keeps getting recycled in the allocator, we
> > don't free these fields. Once the memalloc is destroyed, the dtor runs
> > in a known safe context where we can assume bpf_obj_free_fields won't
> > deadlock or run into any problems.
>
> So the plan is to do
> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> just ignore it?
> And no other changes anywhere at all?
>
> That would be too good to be true :)
I guess the last question I have is what would prevent an xchg() out of
a map on a recycled kptr from causing a UAF?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 2:07 ` Justin Suess
@ 2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:08 UTC (permalink / raw)
To: Justin Suess; +Cc: Alexei Starovoitov, sashiko, bpf
On Tue, 12 May 2026 at 04:07, Justin Suess <utilityemal77@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 06:55:50PM -0700, Alexei Starovoitov wrote:
> > On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
> > >>
> > >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> > >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> > >> > <alexei.starovoitov@gmail.com> wrote:
> > > There is no deferral here. I'm saying that we just cancel for timer,
> > > wq, task work, and leave other fields as is. So we don't have active
> > > work pending for async items.
> > >
> > > So as long as the item keeps getting recycled in the allocator, we
> > > don't free these fields. Once the memalloc is destroyed, the dtor runs
> > > in a known safe context where we can assume bpf_obj_free_fields won't
> > > deadlock or run into any problems.
> >
> > So the plan is to do
> > if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> > just ignore it?
> > And no other changes anywhere at all?
> >
> > That would be too good to be true :)
> I guess the last question I have is what would prevent an xchg() out of
> a map on a recycled kptr from causing a UAF?
xchg() is atomic, so either a program now has the value, or the map
has it. Whoever has it will end up freeing it.
If the program has it then the program frees it. I don't see why there
would be a UAF.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2026-05-12 2:10 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi; +Cc: Justin Suess, sashiko, bpf
On Mon May 11, 2026 at 7:03 PM PDT, Kumar Kartikeya Dwivedi wrote:
> On Tue, 12 May 2026 at 03:55, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
>> > [...]
>>
>> So the plan is to do
>> if (in_nmi() && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT)
>> just ignore it?
>> And no other changes anywhere at all?
>>
>> That would be too good to be true :)
>
> I don't know whether in_nmi() would be sufficient, we likely need
> irqs_disabled()?
fair, irqs_disabled() is safer.
> At that point, why not always ignore it, since
> freeing the fields is dependent on where you're running. I would still
> cancel async fields, since they're already any-context safe.
you mean never touch case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
regardless of running context and rely on final cleanup?
That's an idea to consider, but I suspect some rbtree, linked list
tests will fail.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 2:10 ` Alexei Starovoitov
@ 2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:13 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf
On Tue, 12 May 2026 at 04:10, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon May 11, 2026 at 7:03 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 12 May 2026 at 03:55, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> >> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
> >> >>
> >> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> >> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> >> >> > <alexei.starovoitov@gmail.com> wrote:
> >> >> > >
> >> >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> >> >> > > > [ 21.604660] Call Trace:
> >> >> > > > [ 21.604662] <TASK>
> >> >> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80
> >> >> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> >> >> > > > [ 21.604669] lock_acquire+0x295/0x2e0
> >> >> > > > [ 21.604671] ? terminate_walk+0x33/0x160
> >> >> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> >> >> > > > [ 21.604679] _raw_spin_lock+0x30/0x40
> >> >> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> >> >> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> >> >> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> >> >> > > > [ 21.604691] free_htab_elem+0x85/0xd0
> >> >> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230
> >> >> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> >> >> > > > [ 21.604700] bpf_trace_run3+0x126/0x430
> >> >> > >
> >> >> > > that's better.
> >> >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> >> >> > > but left check_and_free_fields() in free_htab_elem().
> >> >> > >
> >> >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> >> >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
> >> >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
> >> >> > >
> >> >> > > Kumar,
> >> >> > >
> >> >> > > thoughts?
> >> >> > >
> >> >> > >
> >> >> >
> >> >> > Yeah, removing it from the path that helpers can invoke seems simpler.
> >> >> > Remember though, this splat is just for hashtab, we have similar
> >> >> > bpf_obj_free_fields() in array map on update. I think fundamentally
> >> >> > the main issue here is that we logically free special fields when a
> >> >> > map value is freed or deleted. When updating array maps we logically
> >> >> > 'free' and then 'update' the same map value together. For hashtab, it
> >> >> > happens on update/delete.
> >> >> >
> >> >> > We could relax this behavior to avoid eagerly freeing these special
> >> >> > fields on update or deletion. The only worry is how this would impact
> >> >> > programs that have come to rely on the existing behavior. There are
> >> >> > patterns where people expect kptr to be NULL on some new map value,
> >> >> > which causes programs to return errors when that expectation is not
> >> >> > met. Just doing the skip when irqs_disabled() doesn't save us from the
> >> >> > surprise side-effect. We need to decide upon this first before
> >> >> > discussing the shape of the solution.
> >> >> >
> >> >> > This is the theoretical concern; in practice, I think most people who
> >> >> > depend on such behavior use kptr in local storage maps (in
> >> >> > schedulers). So it probably won't be a problem in practice, even
> >> >> > though we can't judge this ahead of time. Also, we eagerly reuse map
> >> >> > values when using memalloc, so the guarantees are already pretty weak
> >> >> > I guess.
> >> >> >
> >> >> > So, if we are not going to go through a grace period (like local
> >> >> > storage) and free back to kernel allocator before reuse, we should
> >> >> > relax field freeing behavior. At best, we should cancel work for
> >> >> > timer, wq, and task_work, leaving other items as-is. E.g.
> >> >> > BPF_UPTR is used in task storage which I think is accessible to
> >> >> > tracing programs, I am not sure how safe unpin_user_page() is when
> >> >> > called from random reentrant contexts. We might have more cases in the
> >> >> > future, we cannot guarantee we can handle everything in NMIs
> >> >> > universally.
> >> >> >
> >> >> > So the best course of action seems to be relaxing
> >> >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
> >> >> > on async work (timer, wq, task_work) for delete / update and let other
> >> >> > fields be as-is. We likely need to do bpf_obj_free_fields()
> >> >> > additionally before prealloc_destroy() now, but that should be simple.
> >> >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
> >> >> > separate change.
> >> >> >
> >> >> > This should hopefully resolve the issue, unless I missed other cases.
> >> >> This does sound good, so you'd set the bpf_obj_free_fields up in the
> >> >> htab allocator dtor for the final free and rely on the allocator's
> >> >> existing NMI deferral?
> >> >
> >> > It is already set, except for prealloc maps. But we can call it before
> >> > destroying the pcpu freelist etc.
> >>
> >> htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already.
> >> So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps.
> >>
> >> >>
> >> >> The missing piece is whether to handle this differently in NMI or just
> >> >> always do it with the deferral. Also the prealloc question needs
> >> >> answering.
> >> >
> >> > There is no deferral here. I'm saying that we just cancel for timer,
> >> > wq, task work, and leave other fields as is. So we don't have active
> >> > work pending for async items.
> >> >
> >> > So as long as the item keeps getting recycled in the allocator, we
> >> > don't free these fields. Once the memalloc is destroyed, the dtor runs
> >> > in a known safe context where we can assume bpf_obj_free_fields won't
> >> > deadlock or run into any problems.
> >>
> >> So the plan is to do
> >> if (in_nmi() && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT)
> >> just ignore it?
> >> And no other changes anywhere at all?
> >>
> >> That would be too good to be true :)
> >
> > I don't know whether in_nmi() would be sufficient, we likely need
> > irqs_disabled()?
>
> fair, irqs_disabled() is safer.
>
> > At that point, why not always ignore it, since
> > freeing the fields is dependent on where you're running. I would still
> > cancel async fields, since they're already any-context safe.
>
> you mean never touch case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> regardless of running context and rely on final cleanup?
> That's an idea to consider, but I suspect some rbtree, linked list
> tests will fail.
Yeah, but we should see which ones do; some of them were written with
the expectation of testing kernel cleanup on map value delete / update
etc., so those don't count as much. Unless this change trips a real
example in the tests, it will likely go unnoticed (famous last words).
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2026-05-12 2:13 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-07 18:43 ` bot+bpf-ci
2026-05-07 18:52 ` Justin Suess
2026-05-07 23:45 ` sashiko-bot
2026-05-10 15:13 ` Justin Suess
2026-05-10 22:38 ` Alexei Starovoitov
2026-05-11 1:49 ` Justin Suess
2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
2026-05-12 1:43 ` Justin Suess
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
2026-05-12 1:55 ` Alexei Starovoitov
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
2026-05-12 2:07 ` Justin Suess
2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
2026-05-11 19:22 ` Justin Suess
2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
2026-05-08 0:03 ` sashiko-bot