From: Justin Suess <utilityemal77@gmail.com>
To: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	eddyz87@gmail.com, memxor@gmail.com
Cc: martin.lau@linux.dev, song@kernel.org, yonghong.song@linux.dev,
	jolsa@kernel.org, bpf@vger.kernel.org,
	Justin Suess <utilityemal77@gmail.com>,
	Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Subject: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
Date: Thu,  7 May 2026 13:54:52 -0400
Message-ID: <20260507175453.1140400-2-utilityemal77@gmail.com>
In-Reply-To: <20260507175453.1140400-1-utilityemal77@gmail.com>

A BPF program attached to tp_btf/nmi_handler can delete map entries or
swap out referenced kptrs from NMI context. Today that runs the kptr
destructor inline. Destructors such as bpf_cpumask_release() can take
RCU-related locks, so running them from NMI can deadlock the system.

Queue destructor-backed teardown to irq_work and track the preallocated
capacity with an idle-slot surplus counter. Two pcpu_freelists are
maintained: one for idle slots and one for active jobs. pcpu_freelist is
built for NMI context, using raw_res_spin_lock and per-CPU list heads.

The counter is positive when the number of idle work slots exceeds the
number of referenced kptrs exchanged into maps, and negative when there
are fewer idle work slots than referenced kptrs exchanged into maps.

The accounting always drives the counter toward zero, keeping exactly as
many work slots ready as the worst case requires: excess idle jobs are
freed, and new ones are allocated when the pool falls short.

This keeps NMI teardown on the fast path: it only has to pop an idle job
and never allocates. Once a job has run, its memory is returned to the
idle list for reuse.

Each successful install can consume at most one future offload slot, so
bpf_kptr_offload_slot_acquire() decrements surplus and provisions one
additional idle job only when the pool falls short. Teardown and
irq_work completion return that slot with
bpf_kptr_offload_slot_release(), which rechecks the surplus before
freeing an extra idle job.
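
The acquire/release accounting described above can be modelled in user
space. The following is only a minimal single-threaded sketch: the names
slot_pool, slot_pool_init, slot_acquire and slot_release are hypothetical,
a plain counter stands in for the idle pcpu_freelist, and the cmpxchg
retry loop of the patch is collapsed into straight-line code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model, not the kernel code: idle jobs are just counted. */
struct slot_pool {
	atomic_long surplus;	/* idle jobs minus installed dtor kptrs */
	long idle;		/* idle jobs currently provisioned */
	bool alloc_fails;	/* knob: simulate an allocation failure */
};

static void slot_pool_init(struct slot_pool *p)
{
	atomic_init(&p->surplus, 0);
	p->idle = 0;
	p->alloc_fails = false;
}

/* Return one slot; free one idle job if the pool now exceeds the demand. */
static void slot_release(struct slot_pool *p)
{
	if (atomic_fetch_add(&p->surplus, 1) + 1 > 0 && p->idle > 0) {
		p->idle--;
		atomic_fetch_sub(&p->surplus, 1);
	}
	assert(p->idle >= 0);
}

/* Reserve one slot for a future teardown; 0 on success, -1 on failure. */
static int slot_acquire(struct slot_pool *p)
{
	if (atomic_fetch_sub(&p->surplus, 1) - 1 >= 0)
		return 0;	/* an idle job was already available */

	if (p->alloc_fails) {
		atomic_fetch_add(&p->surplus, 1);	/* undo the reservation */
		return -1;
	}

	p->idle++;		/* provision exactly one extra idle job */
	slot_release(p);	/* rebalance the counter after provisioning */
	return 0;
}
```

In this model every successful slot_acquire() leaves the counter at zero
with one more idle job than before, and a matching slot_release() frees
that job again, which is the "trend to zero" behaviour the accounting is
meant to have.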

There are two main safety nets against memory exhaustion:

If reserving another offload slot fails while installing a new
destructor-backed kptr through bpf_kptr_xchg(), leave the destination
unchanged and return the incoming pointer so the caller keeps ownership.

This should only happen under severe memory pressure, as each work item
is only 24 bytes on 64-bit architectures, and slots are reused whenever
possible.
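
From the caller's side, the failed-reservation case looks the same as a
successful exchange: whatever non-NULL pointer comes back is still owned
by the caller and must be released. A rough user-space sketch of that
contract follows. The names obj, obj_put, model_ref_kptr_xchg, install
and the reserve_ok knob are hypothetical, and the slot bookkeeping itself
is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	int refs;
};

/* Knob standing in for whether the slot reservation succeeds. */
static bool reserve_ok = true;

static void obj_put(struct obj *o)
{
	assert(o->refs > 0);
	o->refs--;
}

/*
 * Model of the exchange: on a failed reservation the destination is left
 * untouched and the incoming pointer is handed straight back.
 */
static struct obj *model_ref_kptr_xchg(struct obj **dst, struct obj *ptr)
{
	struct obj *old;

	if (ptr && !reserve_ok)
		return ptr;

	old = *dst;
	*dst = ptr;
	return old;
}

/* Usual caller pattern: release whatever the exchange returns. */
static void install(struct obj **dst, struct obj *new_obj)
{
	struct obj *ret = model_ref_kptr_xchg(dst, new_obj);

	if (ret)
		obj_put(ret);
}
```

A caller written this way never leaks and never double-frees: on failure
it drops the reference it tried to install, and the stored kptr keeps its
own reference.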

If NMI teardown still fails to grab an idle offload job despite that
accounting, warn once and run the destructor inline rather than leak the
object permanently. This warning is a canary for accounting bugs, since
the xchg safety net should keep this state from being reached. Falling
back to running the dtor in NMI is justified because a memory leak is
harder to debug and less visible to users than an explicit warning and a
possible deadlock that the watchdog can recover from.
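
The offload-or-fallback decision can be illustrated with a small
user-space model. This is only a sketch under simplifying assumptions:
singly linked lists replace the two pcpu_freelists, run_pending() stands
in for the irq_work callback, a flag replaces WARN_ON_ONCE(), and all
names (job, job_lists, offload, run_pending, count_dtor) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*dtor_fn)(void *);

struct job {
	struct job *next;
	void *obj;
	dtor_fn dtor;
};

struct job_lists {
	struct job *idle;	/* preallocated, unused jobs */
	struct job *pending;	/* jobs waiting for the deferred worker */
	int warned;		/* set when the inline fallback was taken */
};

static struct job *pop(struct job **head)
{
	struct job *j = *head;

	if (j)
		*head = j->next;
	return j;
}

static void push(struct job **head, struct job *j)
{
	j->next = *head;
	*head = j;
}

/* Returns 1 if teardown was queued, 0 if the dtor had to run inline. */
static int offload(struct job_lists *l, void *obj, dtor_fn dtor)
{
	struct job *j = pop(&l->idle);

	if (!j) {
		l->warned = 1;	/* accounting bug canary */
		dtor(obj);	/* unsafe, but better than a leak */
		return 0;
	}

	j->obj = obj;
	j->dtor = dtor;
	push(&l->pending, j);
	return 1;
}

/* Deferred worker: run each pending dtor, then recycle the job. */
static void run_pending(struct job_lists *l)
{
	struct job *j;

	while ((j = pop(&l->pending))) {
		assert(j->dtor != NULL);
		j->dtor(j->obj);
		push(&l->idle, j);
	}
}

/* Example destructor: counts how often it ran on the given int. */
static void count_dtor(void *obj)
{
	++*(int *)obj;
}
```

With one idle job available, the first offload() defers the destructor
until run_pending(), while a second offload() issued before the worker
runs finds the idle list empty and takes the inline fallback.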

This fix trades a small amount of performance for safety. xchg can no
longer be inlined for referenced kptrs, as inlining would bypass the
slot accounting. Inlining is preserved for kptrs with no destructor
defined.

The xchg path for kptrs with dtors pays for a potential allocation and a
few atomic ops. The NMI dtor path pays for a few atomic ops plus the IPI
used to raise the irq_work.

This keeps refcounted kptr teardown out of NMI context without slowing
down raw kptr exchanges that never need destructor handling.

With this change, dtors only need to be safe to run in hardirq context
rather than in NMI context.

Cc: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Justin Suess <utilityemal77@gmail.com>
Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
---
 include/linux/bpf.h          |  16 ++++
 include/linux/bpf_verifier.h |   2 +
 kernel/bpf/fixups.c          |  33 +++++---
 kernel/bpf/helpers.c         |  24 +++++-
 kernel/bpf/syscall.c         | 159 +++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c        |  13 +++
 6 files changed, 232 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 715b6df9c403..583e8551d162 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
 
 void __bpf_free_used_maps(struct bpf_prog_aux *aux,
 			  struct bpf_map **used_maps, u32 len);
+/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
+u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+
+#ifdef CONFIG_BPF_SYSCALL
+int bpf_kptr_offload_slot_acquire(void);
+void bpf_kptr_offload_slot_release(void);
+#else
+static inline int bpf_kptr_offload_slot_acquire(void)
+{
+	return 0;
+}
+
+static inline void bpf_kptr_offload_slot_release(void)
+{
+}
+#endif
 
 bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
 
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 976e2b2f40e8..26bc8b5c9030 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -672,6 +672,8 @@ struct bpf_insn_aux_data {
 	bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
 	bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
 	bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
+	u8 kptr_has_dtor:1;
+	u8 kptr_has_dtor_seen:1;
 	u8 alu_state; /* used in combination with alu_limit */
 	/* true if STX or LDX instruction is a part of a spill/fill
 	 * pattern for a bpf_fastcall call.
diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
index fba9e8c00878..459e855e86a5 100644
--- a/kernel/bpf/fixups.c
+++ b/kernel/bpf/fixups.c
@@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
-		/* Implement bpf_kptr_xchg inline */
-		if (prog->jit_requested && BITS_PER_LONG == 64 &&
-		    insn->imm == BPF_FUNC_kptr_xchg &&
-		    bpf_jit_supports_ptr_xchg()) {
-			insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
-			insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
-			cnt = 2;
+		/* Implement bpf_kptr_xchg inline. */
+		if (insn->imm == BPF_FUNC_kptr_xchg &&
+		    !env->insn_aux_data[i + delta].kptr_has_dtor) {
+			if (prog->jit_requested && BITS_PER_LONG == 64 &&
+			    bpf_jit_supports_ptr_xchg()) {
+				insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
+				insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
+						     BPF_REG_1, BPF_REG_0, 0);
+				cnt = 2;
 
-			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
-			if (!new_prog)
-				return -ENOMEM;
+				new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+				if (!new_prog)
+					return -ENOMEM;
 
-			delta    += cnt - 1;
-			env->prog = prog = new_prog;
-			insn      = new_prog->insnsi + i + delta;
+				delta    += cnt - 1;
+				env->prog = prog = new_prog;
+				insn      = new_prog->insnsi + i + delta;
+				goto next_insn;
+			}
+
+			insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
 			goto next_insn;
 		}
+
 patch_call_imm:
 		fn = env->ops->get_func_proto(insn->imm, env->prog);
 		/* all functions that have prototype and verifier allowed
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index baa12b24bb64..51717a88f627 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
 	bpf_async_cancel_and_free(val);
 }
 
-BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
+BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
 {
 	unsigned long *kptr = dst;
 
@@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
 	return xchg(kptr, (unsigned long)ptr);
 }
 
+BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
+{
+	unsigned long *kptr = dst;
+	void *old;
+
+	/*
+	 * If the incoming pointer cannot be torn down safely from NMI later on,
+	 * leave the destination untouched and return ptr so the caller keeps
+	 * ownership.
+	 */
+	if (ptr && bpf_kptr_offload_slot_acquire())
+		return (unsigned long)ptr;
+
+	old = (void *)xchg(kptr, (unsigned long)ptr);
+	if (old)
+		bpf_kptr_offload_slot_release();
+	return (unsigned long)old;
+}
+
 /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
  * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
  * denote type that verifier will determine.
+ * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
  */
 static const struct bpf_func_proto bpf_kptr_xchg_proto = {
-	.func         = bpf_kptr_xchg,
+	.func         = bpf_ref_kptr_xchg,
 	.gpl_only     = false,
 	.ret_type     = RET_PTR_TO_BTF_ID_OR_NULL,
 	.ret_btf_id   = BPF_PTR_POISON,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3b1f0ba02f61..d34fdb99eb8a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -7,6 +7,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
 #include <linux/bpf_verifier.h>
+#include <linux/bpf_mem_alloc.h>
 #include <linux/bsearch.h>
 #include <linux/btf.h>
 #include <linux/hex.h>
@@ -19,6 +20,7 @@
 #include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/irq_work.h>
 #include <linux/license.h>
 #include <linux/filter.h>
 #include <linux/kernel.h>
@@ -42,6 +44,8 @@
 #include <linux/cookie.h>
 #include <linux/verification.h>
 
+#include "percpu_freelist.h"
+
 #include <net/netfilter/nf_bpf_link.h>
 #include <net/netkit.h>
 #include <net/tcx.h>
@@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock);
 static DEFINE_IDR(link_idr);
 static DEFINE_SPINLOCK(link_idr_lock);
 
+struct bpf_dtor_kptr_work {
+	struct pcpu_freelist_node fnode;
+	void *obj;
+	btf_dtor_kfunc_t dtor;
+};
+
+/* Queue pending dtors; the idle pool uses a global pcpu_freelist. */
+static struct pcpu_freelist bpf_dtor_kptr_jobs;
+static struct pcpu_freelist bpf_dtor_kptr_idle;
+/* Keep surplus = total - needed = idle - refs >= 0 so NMI frees never need to allocate. */
+static atomic_long_t bpf_dtor_kptr_surplus = ATOMIC_LONG_INIT(0);
+
+static void bpf_dtor_kptr_worker(struct irq_work *work);
+static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
+	IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
+
+static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
+{
+	struct pcpu_freelist_node *node;
+
+	node = pcpu_freelist_pop(&bpf_dtor_kptr_idle);
+	if (!node)
+		return NULL;
+
+	return container_of(node, struct bpf_dtor_kptr_work, fnode);
+}
+
+static void bpf_dtor_kptr_release_one(void)
+{
+	struct bpf_dtor_kptr_work *job;
+	long surplus;
+
+	for (;;) {
+		surplus = atomic_long_read(&bpf_dtor_kptr_surplus);
+		if (surplus <= 0)
+			return;
+
+		job = bpf_dtor_kptr_pop_idle();
+		if (!job)
+			return;
+
+		if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_surplus, &surplus,
+						     surplus - 1)) {
+			pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
+			continue;
+		}
+
+		bpf_mem_free(&bpf_global_ma, job);
+		return;
+	}
+}
+
+void bpf_kptr_offload_slot_release(void)
+{
+	if (atomic_long_inc_return(&bpf_dtor_kptr_surplus) > 0)
+		bpf_dtor_kptr_release_one();
+}
+
+int bpf_kptr_offload_slot_acquire(void)
+{
+	struct bpf_dtor_kptr_work *job;
+	long surplus;
+
+	if (unlikely(!bpf_global_ma_set ||
+		     !READ_ONCE(bpf_dtor_kptr_idle.freelist) ||
+		     !READ_ONCE(bpf_dtor_kptr_jobs.freelist)))
+		return -ENOMEM;
+
+	/*
+	 * Each successful install can decrease the surplus by at most one, so it only
+	 * ever needs to provision one additional idle job.
+	 */
+	surplus = atomic_long_dec_return(&bpf_dtor_kptr_surplus);
+	if (surplus >= 0)
+		return 0;
+
+	job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
+	if (!job) {
+		atomic_long_inc(&bpf_dtor_kptr_surplus);
+		return -ENOMEM;
+	}
+
+	pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
+	/* A racing teardown may have already removed the demand that forced this. */
+	bpf_kptr_offload_slot_release();
+
+	return 0;
+}
+
+static int __init bpf_dtor_kptr_init(void)
+{
+	int err;
+
+	err = pcpu_freelist_init(&bpf_dtor_kptr_idle);
+	if (err)
+		return err;
+
+	err = pcpu_freelist_init(&bpf_dtor_kptr_jobs);
+	if (err)
+		return err;
+
+	return 0;
+}
+late_initcall(bpf_dtor_kptr_init);
+
 int sysctl_unprivileged_bpf_disabled __read_mostly =
 	IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
 
@@ -807,6 +916,43 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
 	bpf_task_work_cancel_and_free(obj + rec->task_work_off);
 }
 
+static void bpf_dtor_kptr_worker(struct irq_work *work)
+{
+	struct pcpu_freelist_node *fnode;
+	struct bpf_dtor_kptr_work *job;
+
+	while ((fnode = pcpu_freelist_pop(&bpf_dtor_kptr_jobs))) {
+		job = container_of(fnode, struct bpf_dtor_kptr_work, fnode);
+		job->dtor(job->obj);
+		pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
+		bpf_kptr_offload_slot_release();
+	}
+}
+
+static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
+{
+	struct bpf_dtor_kptr_work *job;
+
+	/* Handing storage teardown off to irq_work consumes one idle slot. */
+	atomic_long_dec(&bpf_dtor_kptr_surplus);
+	job = bpf_dtor_kptr_pop_idle();
+	if (WARN_ON_ONCE(!job)) {
+		atomic_long_inc(&bpf_dtor_kptr_surplus);
+		/*
+		 * This should stay unreachable if reserve accounting is correct. If it
+		 * ever breaks, running the destructor unsafely is still better than
+		 * leaking the object permanently.
+		 */
+		dtor(obj);
+		return;
+	}
+
+	job->obj = obj;
+	job->dtor = dtor;
+	pcpu_freelist_push(&bpf_dtor_kptr_jobs, &job->fnode);
+	irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
+}
+
 void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 {
 	const struct btf_field *fields;
@@ -842,6 +988,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 			xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
 			if (!xchgd_field)
 				break;
+			if (in_nmi() && field->kptr.dtor) {
+				bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
+				bpf_kptr_offload_slot_release();
+				break;
+			}
+			if (field->kptr.dtor)
+				/*
+				 * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
+				 * pairs installation with bpf_kptr_offload_slot_acquire(). Return
+				 * that slot on non-NMI teardown once no active transition is
+				 * needed.
+				 */
+				bpf_kptr_offload_slot_release();
 
 			if (!btf_is_kernel(field->kptr.btf)) {
 				pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 11054ad89c14..d042be8ed789 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9891,6 +9891,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 	int insn_idx = *insn_idx_p;
 	bool changes_data;
 	int i, err, func_id;
+	bool kptr_has_dtor;
 
 	/* find function prototype */
 	func_id = insn->imm;
@@ -9950,6 +9951,18 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		if (err)
 			return err;
 	}
+	if (func_id == BPF_FUNC_kptr_xchg) {
+		kptr_has_dtor = !!meta.kptr_field->kptr.dtor;
+		if (env->insn_aux_data[insn_idx].kptr_has_dtor_seen &&
+		    env->insn_aux_data[insn_idx].kptr_has_dtor != kptr_has_dtor) {
+			verbose(env,
+				"same insn cannot call bpf_kptr_xchg() on both dtor and non-dtor kptrs\n");
+			return -EINVAL;
+		}
+
+		env->insn_aux_data[insn_idx].kptr_has_dtor_seen = true;
+		env->insn_aux_data[insn_idx].kptr_has_dtor = kptr_has_dtor;
+	}
 
 	err = record_func_map(env, &meta, func_id, insn_idx);
 	if (err)
-- 
2.53.0

