From: Justin Suess <utilityemal77@gmail.com>
To: bot+bpf-ci@kernel.org
Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
song@kernel.org, yonghong.song@linux.dev, jolsa@kernel.org,
bpf@vger.kernel.org, mic@digikod.net,
alexei.starovoitov@gmail.com, martin.lau@kernel.org,
clm@meta.com, ihor.solodrai@linux.dev
Subject: Re: [bpf-next v2 1/2] bpf: Offload kptr destructors that run from NMI
Date: Tue, 5 May 2026 15:48:07 -0400
Message-ID: <afpJdw5bpO0fxyel@zenbox>
In-Reply-To: <c2bc74937f3da0174c896daa2d129c9572a5c6434f8d71da26283fe7f011c4af@mail.kernel.org>
On Tue, May 05, 2026 at 04:06:23PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 3b1f0ba02f61..162bfd4796ea 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
>
> [ ... ]
>
> > @@ -65,6 +68,131 @@ static DEFINE_SPINLOCK(map_idr_lock);
> > static DEFINE_IDR(link_idr);
> > static DEFINE_SPINLOCK(link_idr_lock);
> >
> > +struct bpf_dtor_kptr_work {
> > + struct llist_node node;
> > + void *obj;
> > + btf_dtor_kfunc_t dtor;
> > +};
> > +
> > +/* Queue pending dtors per CPU; the idle pool stays global. */
> > +static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
> > +static LLIST_HEAD(bpf_dtor_kptr_idle);
> > +/* Keep total >= refs + active so NMI frees never need to allocate. */
> > +static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
> > +static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
> > +static atomic_long_t bpf_dtor_kptr_total = ATOMIC_LONG_INIT(0);
> > +
> > +/* Bound reserve overshoot so the pool tracks demand instead of growing on itself. */
> > +#define BPF_DTOR_KPTR_RESERVE_HEADROOM 64L
> > +
> > +static void bpf_dtor_kptr_worker(struct irq_work *work);
> > +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
> > + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker);
> > +
> > +static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
> > +{
> > + llist_add(&job->node, &bpf_dtor_kptr_idle);
> > +}
> > +
> > +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
> > +{
> > + struct llist_node *node;
> > +
> > + node = llist_del_first(&bpf_dtor_kptr_idle);
> > + if (!node)
> > + return NULL;
> > +
> > + return llist_entry(node, struct bpf_dtor_kptr_work, node);
> > +}
>
> Does bpf_dtor_kptr_pop_idle() handle multiple concurrent consumers
> safely?
>
> The documentation for llist_del_first() in lib/llist.c explicitly
> states:
>
> Only one llist_del_first user can be used simultaneously with multiple
> llist_add users without lock. Because otherwise llist_del_first,
> llist_add, llist_add (or llist_del_all, llist_add, llist_add)
> sequence in another user may change @head->first->next, but keep
> @head->first.
>
> This is the classic ABA race that llist_del_first cannot defend against.
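>
> For reference, llist_del_first() in lib/llist.c is essentially the
> following (simplified from the current tree):
>
>     struct llist_node *llist_del_first(struct llist_head *head)
>     {
>             struct llist_node *entry, *next;
>
>             entry = smp_load_acquire(&head->first);
>             do {
>                     if (entry == NULL)
>                             return NULL;
>                     next = READ_ONCE(entry->next);
>             } while (!try_cmpxchg(&head->first, &entry, next));
>
>             return entry;
>     }
>
> The cmpxchg compares only the head pointer against the cached entry;
> the cached next is trusted blindly, which is exactly the window the
> timeline below exploits.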
>
> In this patch, bpf_dtor_kptr_pop_idle() has multiple concurrent
> consumers:
>
> 1. bpf_dtor_kptr_offload() called from NMI context on any CPU
> (the function installed below, called from bpf_obj_free_fields
> when in_nmi()).
>
> 2. bpf_dtor_kptr_trim() called from the per-CPU irq_work worker on
> any CPU (called from bpf_dtor_kptr_worker below).
>
> Both paths call bpf_dtor_kptr_pop_idle() which calls
> llist_del_first(&bpf_dtor_kptr_idle) on the global idle list. Two or
> more CPUs can be popping simultaneously with no external lock and no
> single-consumer guarantee.
>
> Can the following race timeline produce list corruption or
> double-ownership?
>
> Initial state: idle list = A -> B -> C -> NULL
>
> CPU0 (NMI): enters bpf_dtor_kptr_pop_idle
> entry = head->first = A
> next = A->next = B
> (stalls before cmpxchg)
>
> CPU1 (NMI/worker): runs bpf_dtor_kptr_pop_idle
> entry = A, next = B, cmpxchg(&head->first, A, B) succeeds.
> Now idle = B -> C -> NULL. CPU1 uses A (for offload or mem_free).
>
> CPU1: pops again, takes B. idle = C -> NULL.
>
> CPU1 eventually runs worker -> push_idle(A). Now idle = A -> C -> NULL.
> A->next = C (not B anymore).
>
> CPU0 resumes:
> Its cached next variable is still B, cmpxchg(&head->first, A, B)
> finds head->first is currently A, so the cmpxchg succeeds.
> Result: head->first = B, but B has been popped and is in CPU1's
> per-CPU bpf_dtor_kptr_jobs list already (not free).
>
> Node B is now simultaneously:
> (a) considered removed from idle (CPU1 popped it and is using it)
> (b) referenced as idle->first by the head pointer
>
> Another pop_idle() may grab B again as if it were idle, while it is
> still live in a per-CPU jobs list. Then bpf_dtor_kptr_worker would run
> job->dtor() on an already-freed or reused job->obj, or the same job
> ends up pushed onto two lists, corrupting the llist_node chains.
>
> Concurrency is not hypothetical here. A BPF tp_btf/nmi_handler program
> calling bpf_map_delete_elem can fire on multiple CPUs simultaneously.
> The irq_work worker also runs on any CPU that queued work. Thus both
> bpf_dtor_kptr_offload and bpf_dtor_kptr_trim can run concurrently on
> different CPUs, giving at least 2 concurrent llist_del_first consumers.
>
> Typical fixes include:
> - using a locked list (spinlock + list_head),
> - using llist_del_all() with batch processing (sketched below), or
> - using a cmpxchg-based stack that properly handles ABA (e.g., a
>   tagged/versioned pointer).
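>
> A minimal sketch of the llist_del_all() option (illustrative and
> untested): drain the whole idle list, keep the first node, and give
> the rest back. Both llist_del_all() and llist_add() are safe with any
> number of concurrent users, so there is no single-consumer rule to
> violate:
>
>     static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
>     {
>             struct llist_node *first, *pos, *tmp;
>
>             first = llist_del_all(&bpf_dtor_kptr_idle);
>             if (!first)
>                     return NULL;
>
>             /* Return everything after the first node to the pool. */
>             llist_for_each_safe(pos, tmp, first->next)
>                     llist_add(pos, &bpf_dtor_kptr_idle);
>
>             return llist_entry(first, struct bpf_dtor_kptr_work, node);
>     }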
>
This is indeed ABA.

Switching the idle list to a pcpu_freelist instead should fix this.
That's a better fit for this model anyway, since we never need
anything beyond push and pop. (The bpf_dtor_kptr_jobs llist is fine
because it never goes through llist_del_first.)

Something like:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 162bfd4796ea..f1e7dc9ef4fe 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -45,6 +45,8 @@
#include <linux/cookie.h>
#include <linux/verification.h>
+#include "percpu_freelist.h"
+
#include <net/netfilter/nf_bpf_link.h>
#include <net/netkit.h>
#include <net/tcx.h>
@@ -69,14 +71,17 @@ static DEFINE_IDR(link_idr);
static DEFINE_SPINLOCK(link_idr_lock);
struct bpf_dtor_kptr_work {
- struct llist_node node;
+ union {
+ struct llist_node node;
+ struct pcpu_freelist_node fnode;
+ };
void *obj;
btf_dtor_kfunc_t dtor;
};
-/* Queue pending dtors per CPU; the idle pool stays global. */
+/* Queue pending dtors per CPU; the idle pool uses a global pcpu_freelist. */
static DEFINE_PER_CPU(struct llist_head, bpf_dtor_kptr_jobs);
-static LLIST_HEAD(bpf_dtor_kptr_idle);
+static struct pcpu_freelist bpf_dtor_kptr_idle;
/* Keep total >= refs + active so NMI frees never need to allocate. */
static atomic_long_t bpf_dtor_kptr_refs = ATOMIC_LONG_INIT(0);
static atomic_long_t bpf_dtor_kptr_active = ATOMIC_LONG_INIT(0);
@@ -91,18 +96,18 @@ static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) =
static void bpf_dtor_kptr_push_idle(struct bpf_dtor_kptr_work *job)
{
- llist_add(&job->node, &bpf_dtor_kptr_idle);
+ pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
}
static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void)
{
- struct llist_node *node;
+ struct pcpu_freelist_node *node;
- node = llist_del_first(&bpf_dtor_kptr_idle);
+ node = pcpu_freelist_pop(&bpf_dtor_kptr_idle);
if (!node)
return NULL;
- return llist_entry(node, struct bpf_dtor_kptr_work, node);
+ return container_of(node, struct bpf_dtor_kptr_work, fnode);
}
static void bpf_dtor_kptr_trim(void)
@@ -157,7 +162,7 @@ int bpf_kptr_offload_inc(void)
long needed;
int err;
- if (unlikely(!bpf_global_ma_set))
+ if (unlikely(!bpf_global_ma_set || !READ_ONCE(bpf_dtor_kptr_idle.freelist)))
return -ENOMEM;
/*
@@ -193,6 +198,18 @@ void bpf_kptr_offload_dec(void)
} while (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_refs, &val, 0));
}
+static int __init bpf_dtor_kptr_init(void)
+{
+ int err;
+
+ err = pcpu_freelist_init(&bpf_dtor_kptr_idle);
+ if (err)
+ return err;
+
+ return 0;
+}
+late_initcall(bpf_dtor_kptr_init);
+
int sysctl_unprivileged_bpf_disabled __read_mostly =
IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
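
FWIW, pcpu_freelist already special-cases NMI callers, so the pop on
the NMI free path should be safe. Paraphrasing from
kernel/bpf/percpu_freelist.c (from memory, worth double-checking
against the current tree):

    struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
    {
            if (in_nmi())
                    return ___pcpu_freelist_pop_nmi(s); /* trylock-only variant */
            return ___pcpu_freelist_pop(s);
    }

IIRC the push side likewise has an _nmi variant that uses
raw_spin_trylock() and falls back to the extralist, so neither
direction should ever spin on a lock from NMI.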
> [...]