Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI

BPF List
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Justin Suess" <utilityemal77@gmail.com>
Cc: bpf@vger.kernel.org
Subject: Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
Date: Tue, 05 May 2026 19:49:05 +0000	[thread overview]
Message-ID: <20260505194906.536EAC2BCF4@smtp.kernel.org> (raw)
In-Reply-To: <20260505150851.3090688-2-utilityemal77@gmail.com>

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [Critical] Multiple concurrent consumers on the lockless `bpf_dtor_kptr_idle` list cause ABA memory corruption.
- [High] Unrestricted assignment to `env->insn_aux_data[insn_idx].kptr_has_dtor` causes mismatched JIT patching for subprograms.
- [High] Time-of-Check to Time-of-Use (TOCTOU) race in `bpf_kptr_offload_inc` under-allocates the job pool.
- [High] TOCTOU race in `bpf_dtor_kptr_trim` steals jobs from concurrent insertions, leading to NMI deadlocks.
- [Medium] Preallocated offload jobs are never trimmed during non-NMI teardown, causing unbounded memory retention.
--

commit 274934ceea98ba86682bc3758a94297eee420620
Author: Justin Suess <utilityemal77@gmail.com>

bpf: Offload kptr destructors that run from NMI

A BPF program attached to tp_btf/nmi_handler can delete map entries or
swap out referenced kptrs from NMI context. This patch introduces an
offload pool to let a worker invoke destructors after NMI exits.

Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3b1f0ba02f613..162bfd4796ea0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c

[ ... ]

> @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);

[ ... ]

> +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> +{
> +	struct llist_node *node;
> +
> +	node = llist_del_first(&bpf_dtor_kptr_idle);
> +	if (!node)
> +		return NULL;
> +
> +	return llist_entry(node, struct bpf_dtor_kptr_work, node);
> +}

Is this susceptible to an ABA issue resulting in list corruption?

The llist_del_first() function is strictly single-consumer because it lacks
ABA protection. Here, bpf_dtor_kptr_pop_idle() is accessed concurrently by
multiple consumers: bpf_dtor_kptr_offload() running in NMI context across
multiple CPUs, and bpf_dtor_kptr_trim() running in irq_work context.

If CPU 0 begins popping a node, is interrupted, and CPU 1 pops and re-pushes
nodes, could CPU 0's cmpxchg succeed but set the list head to a stale or
freed next pointer?

[ ... ]

> +static void bpf_dtor_kptr_trim(void)
> +{
> +	struct bpf_dtor_kptr_work *job;
> +	long total;
> +	long needed;
> +
> +	for (;;) {
> +		total = atomic_long_read(&bpf_dtor_kptr_total);
> +		needed = atomic_long_read(&bpf_dtor_kptr_refs) +
> +			 atomic_long_read(&bpf_dtor_kptr_active);
> +		if (total <= needed)
> +			return;
> +
> +		job = bpf_dtor_kptr_pop_idle();
> +		if (!job)
> +			return;
> +
> +		if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_total, &total, total - 1)) {
> +			bpf_dtor_kptr_push_idle(job);
> +			continue;
> +		}

Can a concurrent bpf_kptr_offload_inc() cause this to violate the
total >= refs + active invariant?

If bpf_dtor_kptr_trim() reads needed, and then another CPU increments refs
in bpf_kptr_offload_inc():

CPU1
bpf_dtor_kptr_trim()
   needed = refs + active;

CPU2
bpf_kptr_offload_inc()
   atomic_long_inc_return(&bpf_dtor_kptr_refs);
   bpf_dtor_kptr_reserve() observes the old high total and skips allocation

CPU1 would see the cmpxchg succeed because total hasn't changed, successfully
freeing the job. Does this lead to an empty idle list when the kptr is later
freed in an NMI?

[ ... ]

> +int bpf_kptr_offload_inc(void)
> +{
> +	long needed;
> +	int err;
> +
> +	if (unlikely(!bpf_global_ma_set))
> +		return -ENOMEM;
> +
> +	/*
> +	 * Read active before incrementing refs so a free path moving one slot from
> +	 * refs to active cannot shrink the reservation snapshot below the steady
> +	 * state we need to cover. Racing results worst case in a larger reservation.
> +	 */
> +	needed = atomic_long_read(&bpf_dtor_kptr_active);
> +	needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);
> +	err = bpf_dtor_kptr_reserve(needed);

Could an NMI interrupting this sequence result in under-allocating the job
pool?

If an NMI free occurs precisely after active is read but before refs is
incremented:

CPU1
bpf_kptr_offload_inc()
   needed = atomic_long_read(&bpf_dtor_kptr_active);
   // NMI interrupts CPU1
   bpf_dtor_kptr_offload()
      atomic_long_inc(&bpf_dtor_kptr_active);
      // pops job
   bpf_kptr_offload_dec()
      atomic_long_dec(&bpf_dtor_kptr_refs);
   // NMI exits
   needed += atomic_long_inc_return(&bpf_dtor_kptr_refs);

In this scenario, needed is calculated with the old active value and the new
refs value, making it 1 less than the true requirement. Would a subsequent
NMI free encounter an empty job list?

[ ... ]

> +void bpf_kptr_offload_dec(void)
> +{
> +	long val;
> +
> +	val = atomic_long_dec_return(&bpf_dtor_kptr_refs);
> +	if (!WARN_ON_ONCE(val < 0))
> +		return;
> +
> +	/*
> +	 * Clamp a mismatched decrement back to zero without overwriting a
> +	 * concurrent increment that already repaired the counter.
> +	 */
> +	do {
> +		val = atomic_long_read(&bpf_dtor_kptr_refs);
> +		if (val >= 0)
> +			break;
> +	} while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
> +}

When kptrs are deleted from normal process or softirq contexts, does this code
cause unbounded memory retention?

Since bpf_dtor_kptr_trim() is exclusively invoked by the irq_work worker
during NMI teardown, the global idle list might never be trimmed for non-NMI
teardowns. This could cause the pool memory to stay at the high-water mark of
concurrent kptrs indefinitely.

[ ... ]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 11054ad89c14d..2c7b21bda666a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c

[ ... ]

> @@ -9950,6 +9950,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  		if (err)
>  			return err;
>  	}
> +	env->insn_aux_data[insn_idx].kptr_has_dtor =
> +		func_id == BPF_FUNC_kptr_xchg && !!meta.kptr_field->kptr.dtor;
>  
>  	err = record_func_map(env, &meta, func_id, insn_idx);
>  	if (err)

Does this unrestricted assignment cause mismatched JIT patching for subprograms?

If a BPF static subprogram contains a bpf_kptr_xchg instruction and is called
twice by the main program - once with a map containing a destructor kptr,
and once without - the verifier will simulate the subprogram twice. The
second pass will blindly overwrite kptr_has_dtor for that instruction.

If the last simulated path had no destructor, the JIT patches the instruction
to bpf_kptr_xchg_nodtor. At runtime, would the path with a destructor execute
this un-instrumented variant, skipping bpf_kptr_offload_inc() and causing
an NMI deadlock when freed?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505150851.3090688-1-utilityemal77@gmail.com?part=1

next prev parent reply	other threads:[~2026-05-05 19:49 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 15:08 [bpf-next v2 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-05 15:08 ` [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-05 16:06   ` bot+bpf-ci
2026-05-05 19:48     ` Justin Suess
2026-05-05 19:49   ` sashiko-bot [this message]
2026-05-06 16:43   ` Mykyta Yatsenko
2026-05-06 19:52     ` Justin Suess
2026-05-07 14:59       ` Mykyta Yatsenko
2026-05-07 16:41         ` Justin Suess
2026-05-07 17:19           ` Mykyta Yatsenko
2026-05-05 15:08 ` [bpf-next v2 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
2026-05-05 20:15   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260505194906.536EAC2BCF4@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=sashiko@lists.linux.dev \
    --cc=utilityemal77@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox