BPF List
From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: Justin Suess <utilityemal77@gmail.com>,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	eddyz87@gmail.com, memxor@gmail.com
Cc: martin.lau@linux.dev, song@kernel.org, yonghong.song@linux.dev,
	jolsa@kernel.org, bpf@vger.kernel.org, mic@digikod.net,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Subject: Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
Date: Wed, 6 May 2026 17:43:51 +0100	[thread overview]
Message-ID: <51a054a0-e57f-49dc-9527-36da0535087c@gmail.com> (raw)
In-Reply-To: <20260505150851.3090688-2-utilityemal77@gmail.com>

On 5/5/26 4:08 PM, Justin Suess wrote:
> A BPF program attached to tp_btf/nmi_handler can delete map entries or
> swap out referenced kptrs from NMI context. Today that runs the kptr
> destructor inline. Destructors such as bpf_cpumask_release() can take
> RCU-related locks, so running them from NMI can deadlock the system.
> 
> Preallocate offload jobs from the global BPF memory allocator, track the
> number of live destructor-backed references so the pool stays ahead of
> NMI frees, and let the worker invoke the destructor after NMI exits.
> 
> The preallocation algorithm is simple. The invariant is total >=
> refs + active, where refs is the number of referenced kptrs installed in
> maps, active is the number of jobs being executed by the irq_work worker,
> and total is the number of job structures allocated. To avoid excessive
> pre-allocation calls while maintaining the invariant, allocate the needed
> slots plus a small amount of extra headroom,
> min(needed, BPF_DTOR_KPTR_RESERVE_HEADROOM), where
> BPF_DTOR_KPTR_RESERVE_HEADROOM is 64 in this patch.
> 
> A small but harmless ordering subtlety: the active atomic is read before
> refs. This can result in a small amount of over-allocation, but nothing is
> leaked and the excess is properly reclaimed by the trim stage.
> 
> The trim stage is simple. It uses a CAS loop to free excess leftover
> idle job slots. It snapshots total, refs and active, pops an idle job if
> the pool is excessively large, and attempts a cmpxchg to decrement total
> atomically. On failure it pushes the job back onto the idle list and
> retries.
> 
> Several best-effort mitigations tackle the memory pressure problem while
> preserving integrity under this unlikely scenario.
> 
> If reserving another offload slot fails while installing a new
> destructor-backed kptr through bpf_kptr_xchg(), leave the destination
> unchanged and return the incoming pointer so the caller keeps ownership.
> 
> This is superior to leaking the pointer, and should only happen under
> memory pressure, when reserving a job slot fails. Moreover, this is a
> condition the caller can check for and recover from.
> 
> If NMI teardown still fails to grab an idle offload job despite that
> reserve accounting, warn once and run the destructor inline rather than
> leak the object permanently. Attempt to repair the counter safely with
> another CAS loop in that case, preserving concurrent increments.
> 
> This fix does come with a small performance tradeoff for safety: xchg can
> no longer be inlined for destructor-backed kptrs, as inlining would bypass
> the reference accounting. The inline fast path is preserved for kptrs with
> no destructor defined.
> 
> This keeps refcounted kptr teardown out of NMI context without slowing
> down raw kptr exchanges that never need destructor handling.
> 
> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Reported-by: Justin Suess <utilityemal77@gmail.com>
> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> ---
>  include/linux/bpf.h          |  16 ++++
>  include/linux/bpf_verifier.h |   1 +
>  kernel/bpf/fixups.c          |  33 ++++---
>  kernel/bpf/helpers.c         |  24 ++++-
>  kernel/bpf/syscall.c         | 181 +++++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c        |   2 +
>  6 files changed, 242 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 715b6df9c403..307de5caa646 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
>  
>  void __bpf_free_used_maps(struct bpf_prog_aux *aux,
>  			  struct bpf_map **used_maps, u32 len);
> +/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */
> +u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +int bpf_kptr_offload_inc(void);
> +void bpf_kptr_offload_dec(void);
> +#else
> +static inline int bpf_kptr_offload_inc(void)
> +{
> +	return 0;
> +}
> +
> +static inline void bpf_kptr_offload_dec(void)
> +{
> +}
> +#endif
>  
>  bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
>  
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 976e2b2f40e8..8e39ff92dd2c 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -672,6 +672,7 @@ struct bpf_insn_aux_data {
>  	bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
>  	bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
>  	bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
> +	bool kptr_has_dtor;
>  	u8 alu_state; /* used in combination with alu_limit */
>  	/* true if STX or LDX instruction is a part of a spill/fill
>  	 * pattern for a bpf_fastcall call.
> diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c
> index fba9e8c00878..459e855e86a5 100644
> --- a/kernel/bpf/fixups.c
> +++ b/kernel/bpf/fixups.c
> @@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env)
>  			goto next_insn;
>  		}
>  
> -		/* Implement bpf_kptr_xchg inline */
> -		if (prog->jit_requested && BITS_PER_LONG == 64 &&
> -		    insn->imm == BPF_FUNC_kptr_xchg &&
> -		    bpf_jit_supports_ptr_xchg()) {
> -			insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
> -			insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
> -			cnt = 2;
> +		/* Implement bpf_kptr_xchg inline. */
> +		if (insn->imm == BPF_FUNC_kptr_xchg &&
> +		    !env->insn_aux_data[i + delta].kptr_has_dtor) {
> +			if (prog->jit_requested && BITS_PER_LONG == 64 &&
> +			    bpf_jit_supports_ptr_xchg()) {
> +				insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
> +				insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG,
> +						     BPF_REG_1, BPF_REG_0, 0);
> +				cnt = 2;
>  
> -			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> -			if (!new_prog)
> -				return -ENOMEM;
> +				new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +				if (!new_prog)
> +					return -ENOMEM;
>  
> -			delta    += cnt - 1;
> -			env->prog = prog = new_prog;
> -			insn      = new_prog->insnsi + i + delta;
> +				delta    += cnt - 1;
> +				env->prog = prog = new_prog;
> +				insn      = new_prog->insnsi + i + delta;
> +				goto next_insn;
> +			}
> +
> +			insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base;
>  			goto next_insn;
>  		}
> +
>  patch_call_imm:
>  		fn = env->ops->get_func_proto(insn->imm, env->prog);
>  		/* all functions that have prototype and verifier allowed
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index baa12b24bb64..cdc64ab83ef6 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val)
>  	bpf_async_cancel_and_free(val);
>  }
>  
> -BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
> +BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr)
>  {
>  	unsigned long *kptr = dst;
>  
> @@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
>  	return xchg(kptr, (unsigned long)ptr);
>  }
>  
> +BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr)
> +{
> +	unsigned long *kptr = dst;
> +	void *old;
> +
> +	/*
> +	 * If the incoming pointer cannot be torn down safely from NMI later on,
> +	 * leave the destination untouched and return ptr so the caller keeps
> +	 * ownership.
> +	 */
> +	if (ptr && bpf_kptr_offload_inc())
> +		return (unsigned long)ptr;
> +
> +	old = (void *)xchg(kptr, (unsigned long)ptr);
> +	if (old)
> +		bpf_kptr_offload_dec();
> +	return (unsigned long)old;
> +}
> +
>  /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
>   * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to
>   * denote type that verifier will determine.
> + * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups.
>   */
>  static const struct bpf_func_proto bpf_kptr_xchg_proto = {
> -	.func         = bpf_kptr_xchg,
> +	.func         = bpf_ref_kptr_xchg,
>  	.gpl_only     = false,
>  	.ret_type     = RET_PTR_TO_BTF_ID_OR_NULL,
>  	.ret_btf_id   = BPF_PTR_POISON,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3b1f0ba02f61..162bfd4796ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -7,6 +7,7 @@
>  #include <linux/bpf_trace.h>
>  #include <linux/bpf_lirc.h>
>  #include <linux/bpf_verifier.h>
> +#include <linux/bpf_mem_alloc.h>
>  #include <linux/bsearch.h>
>  #include <linux/btf.h>
>  #include <linux/hex.h>
> @@ -19,6 +20,8 @@
>  #include <linux/fdtable.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
> +#include <linux/irq_work.h>
> +#include <linux/llist.h>
>  #include <linux/license.h>
>  #include <linux/filter.h>
>  #include <linux/kernel.h>
> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
>  static DEFINE_IDR(link_idr);
>  static DEFINE_SPINLOCK(link_idr_lock);
>  
> +struct bpf_dtor_kptr_work {
> +	struct llist_node node;
> +	void *obj;
> +	btf_dtor_kfunc_t dtor;
> +};
> +
> +/* Queue pending dtors per CPU; the idle pool stays global. */
> +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
> +static LLIST_HEAD(bpf_dtor_kptr_idle);
> +/* Keep total >= refs + active so NMI frees never need to allocate. */
> +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
> +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
> +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
> +
> +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
> +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
> +
> +static void bpf_dtor_kptr_worker(struct irq_work *work);
> +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
> +	IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
> +

I think this still looks too complex:
 * 2 lists - the idle list and the armed list
 * 3 atomics controlling demand/supply
 * headroom/trimming management

The complexity is introduced for performance reasons, but
I'm not sure the tradeoff is worth it.

What about the following design:

Instead of an idle list, store bpf_dtor_kptr_work in the kptr map slot itself.
Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first xchg,
just once per map value, then reuse it across xchg in/out.

Detach: when the map value is deleted, atomically set the kptr map field
storing bpf_dtor_kptr_work to NULL (so the next xchg-in allocates a new
bpf_dtor_kptr_work).
After detaching, insert the bpf_dtor_kptr_work into the global list and run
irq_work. Free bpf_dtor_kptr_work from call_rcu_tasks_trace().

This is based on the bpf_timer and bpf_task_work
implementations.
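
Roughly, an untested sketch just to illustrate the shape (the helper names
here are made up, and the exact kmalloc_nolock()/kfree_nolock() signatures
are from memory, so treat this as a sketch rather than a finished
implementation):

/* In kernel/bpf/syscall.c, reusing the irq_work/llist includes your patch adds. */

/* One per kptr map field, stored next to the kptr itself in the map value. */
struct bpf_dtor_kptr_work {
	struct llist_node node;
	struct rcu_head rcu;
	void *obj;
	btf_dtor_kfunc_t dtor;
};

static LLIST_HEAD(bpf_dtor_kptr_pending);

static void bpf_dtor_kptr_worker(struct irq_work *work);
static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
	IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);

/* xchg-in path: allocate on first use, then reuse across xchg in/out. */
static struct bpf_dtor_kptr_work *bpf_dtor_kptr_get(struct bpf_dtor_kptr_work **slot)
{
	struct bpf_dtor_kptr_work *w = READ_ONCE(*slot);

	if (w)
		return w;
	w = kmalloc_nolock(sizeof(*w), __GFP_ZERO, NUMA_NO_NODE);
	if (!w)
		return NULL;
	/* Another CPU may have installed one concurrently; keep the winner. */
	if (cmpxchg(slot, NULL, w)) {
		kfree_nolock(w);
		w = READ_ONCE(*slot);
	}
	return w;
}

/* Map value teardown: detach the work item and defer the dtor to irq_work. */
static void bpf_dtor_kptr_detach(struct bpf_dtor_kptr_work **slot,
				 void *obj, btf_dtor_kfunc_t dtor)
{
	struct bpf_dtor_kptr_work *w = xchg(slot, NULL);

	if (!w)
		return;
	w->obj = obj;
	w->dtor = dtor;
	if (llist_add(&w->node, &bpf_dtor_kptr_pending))
		irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
}

static void bpf_dtor_kptr_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct bpf_dtor_kptr_work, rcu));
}

static void bpf_dtor_kptr_worker(struct irq_work *work)
{
	struct llist_node *head = llist_del_all(&bpf_dtor_kptr_pending);
	struct bpf_dtor_kptr_work *w, *tmp;

	llist_for_each_entry_safe(w, tmp, head, node) {
		w->dtor(w->obj);
		call_rcu_tasks_trace(&w->rcu, bpf_dtor_kptr_free_rcu);
	}
}

No global refs/active/total accounting and no trimming: the only global
state is the pending list, and an allocation failure shows up at xchg-in
time where the caller can keep ownership, same as in your patch.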

> +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
> +{
> +	llist_add(&job->node, &bpf_dtor_kptr_idle);
> +}
> +
> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> +{
> +	struct llist_node *node;
> +
> +	node = llist_del_first(&bpf_dtor_kptr_idle);
> +	if (!node)
> +		return NULL;
> +
> +	return llist_entry(node, struct bpf_dtor_kptr_work, node);
> +}
> +
> +static void bpf_dtor_kptr_trim(void)
> +{
> +	struct bpf_dtor_kptr_work *job;
> +	long total;
> +	long needed;
> +
> +	for (;;) {
> +		total = atomic_long_read(&bpf_dtor_kptr_total);
> +		needed = atomic_long_read(&bpf_dtor_kptr_refs) +
> +			 atomic_long_read(&bpf_dtor_kptr_active);
> +		if (total <= needed)
> +			return;
> +
> +		job = bpf_dtor_kptr_pop_idle();
> +		if (!job)
> +			return;
> +
> +		if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
> +			bpf_dtor_kptr_push_idle(job);
> +			continue;
> +		}
> +
> +		bpf_mem_free(&bpf_global_ma, job);
> +	}
> +}
> +
> +static int bpf_dtor_kptr_reserve(long needed)
> +{
> +	struct bpf_dtor_kptr_work *job;
> +	long headroom;
> +	long target;
> +
> +	headroom = min_t(long, needed, BPF_DTOR_KPTR_RESERVE_HEADROOM);
> +	if (check_add_overflow(needed, headroom, &target))
> +		target = needed;
> +
> +	while (atomic_long_read(&bpf_dtor_kptr_total) < target) {
> +		job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job));
> +		if (!job)
> +			return -ENOMEM;
> +		atomic_long_inc(&bpf_dtor_kptr_total);
> +		bpf_dtor_kptr_push_idle(job);
> +	}
> +
> +	return 0;
> +}
> +
> +int bpf_kptr_offload_inc(void)
> +{
> +	long needed;
> +	int err;
> +
> +	if (unlikely(!bpf_global_ma_set))
> +		return -ENOMEM;
> +
> +	/*
> +	 * Read active before incrementing refs so a free path moving one slot from
> +	 * refs to active cannot shrink the reservation snapshot below the steady
> +	 * state we need to cover. Racing results worst case in a larger reservation.
> +	 */
> +	needed = atomic_long_read(&bpf_dtor_kptr_active);
> +	needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
> +	err = bpf_dtor_kptr_reserve(needed);
> +	if (err)
> +		atomic_long_dec(&bpf_dtor_kptr_refs);
> +
> +	return err;
> +}
> +
> +void bpf_kptr_offload_dec(void)
> +{
> +	long val;
> +
> +	val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
> +	if (!WARN_ON_ONCE(val < 0))
> +		return;
> +
> +	/*
> +	 * Clamp a mismatched decrement back to zero without overwriting a
> +	 * concurrent increment that already repaired the counter.
> +	 */
> +	do {
> +		val = atomic_long_read(&bpf_dtor_kptr_refs);
> +		if (val >= 0)
> +			break;
> +	} while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
> +}
> +
>  int sysctl_unprivileged_bpf_disabled __read_mostly =
>  	IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
>  
> @@ -807,6 +935,46 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
>  	bpf_task_work_cancel_and_free(obj + rec->task_work_off);
>  }
>  
> +static void bpf_dtor_kptr_worker(struct irq_work *work)
> +{
> +	struct llist_node *jobs, *node, *next;
> +
> +	jobs = llist_del_all(this_cpu_ptr(&bpf_dtor_kptr_jobs));
> +	llist_for_each_safe(node, next, jobs) {
> +		struct bpf_dtor_kptr_work *job;
> +
> +		job = llist_entry(node, struct bpf_dtor_kptr_work, node);
> +		job->dtor(job->obj);
> +		atomic_long_dec(&bpf_dtor_kptr_active);
> +		bpf_dtor_kptr_push_idle(job);
> +	}
> +
> +	bpf_dtor_kptr_trim();
> +}
> +
> +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor)
> +{
> +	struct bpf_dtor_kptr_work *job;
> +
> +	atomic_long_inc(&bpf_dtor_kptr_active);
> +	job = bpf_dtor_kptr_pop_idle();
> +	if (WARN_ON_ONCE(!job)) {
> +		atomic_long_dec(&bpf_dtor_kptr_active);
> +		/*
> +		 * This should stay unreachable if reserve accounting is correct. If it
> +		 * ever breaks, running the destructor unsafely is still better than
> +		 * leaking the object permanently.
> +		 */
> +		dtor(obj);
> +		return;
> +	}
> +
> +	job->obj = obj;
> +	job->dtor = dtor;
> +	if (llist_add(&job->node, this_cpu_ptr(&bpf_dtor_kptr_jobs)))
> +		irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work));
> +}
> +
>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  {
>  	const struct btf_field *fields;
> @@ -842,6 +1010,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  			xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>  			if (!xchgd_field)
>  				break;
> +			if (in_nmi() && field->kptr.dtor) {
> +				bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor);
> +				bpf_kptr_offload_dec();
> +				break;
> +			}
> +			if (field->kptr.dtor)
> +				/*
> +				 * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which
> +				 * pairs installation with bpf_kptr_offload_inc(). Drop that
> +				 * reservation on non-NMI teardown once no active transition is
> +				 * needed.
> +				 */
> +				bpf_kptr_offload_dec();
>  
>  			if (!btf_is_kernel(field->kptr.btf)) {
>  				pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 11054ad89c14..2c7b21bda666 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  		if (err)
>  			return err;
>  	}
> +	env->insn_aux_data[insn_idx].kptr_has_dtor =
> +		func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
>  
>  	err = record_func_map(env, &meta, func_id, insn_idx);
>  	if (err)


Thread overview: 12+ messages
2026-05-05 15:08 [bpf-next v2 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-05 15:08 ` [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-05 16:06   ` bot+bpf-ci
2026-05-05 19:48     ` Justin Suess
2026-05-05 19:49   ` sashiko-bot
2026-05-06 16:43   ` Mykyta Yatsenko [this message]
2026-05-06 19:52     ` Justin Suess
2026-05-07 14:59       ` Mykyta Yatsenko
2026-05-07 16:41         ` Justin Suess
2026-05-07 17:19           ` Mykyta Yatsenko
2026-05-05 15:08 ` [bpf-next v2 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
2026-05-05 20:15   ` sashiko-bot
