From: Justin Suess <utilityemal77@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
song@kernel.org, yonghong.song@linux.dev, jolsa@kernel.org,
bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi
Date: Sat, 2 May 2026 10:33:08 -0400
Message-ID: <afYLJAT9brXkWxz2@zenbox>
In-Reply-To: <afIlTiyoJUqf2YBN@suesslenovo>
On Wed, Apr 29, 2026 at 12:21:17PM -0400, Justin Suess wrote:
> On Wed, Apr 29, 2026 at 02:37:32AM -0700, Alexei Starovoitov wrote:
> > On Tue Apr 28, 2026 at 1:14 PM PDT, Justin Suess wrote:
> > > Defer freeing of referenced kptrs using irq_work queue.
> > >
> > > This fixes a deadlock in BPF tracing programs running under NMI.
> > >
> > > Each kptr is tagged with an auxiliary data field storing an llist_node
> > > and a pointer to the object to be freed. These are assembled together
> > > to form a queue for deletion outside NMI.
> > >
> > > Add a field to each data structure capable of holding referenced kptrs
> > > to store the llist_head, as well as an irq_work struct to the btf kptr
> > > field to store the task callback.
> > >
> > > The llist_nodes are linked in the queue safely, allowing them to be torn
> > > down once NMI is over.
> > >
> > > This irq_work struct is forcibly synchronized on btf teardown, enabled by
> > > the change in btf cleanup code introduced in the previous commit, adding
> > > the rcu_work teardown.
> > >
> > > At dtor time, if the execution is in an nmi context, enqueue the
> > > referenced kptr nodes in the llist_head and enqueue a job to drain the
> > > list, calling the respective dtor callback from a safe context.
> > >
> > > If running outside nmi, use synchronous dtor path.
> > >
> > > This touches arraymap, hashtab, and bpf local storage. Note, however,
> > > that the bpf_local_storage code already rejects updates in NMI context;
> > > the changes there only accommodate the changes to the record extending
> > > the kptr.
> > >
> > > Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > > Reported-by: Justin Suess <utilityemal77@gmail.com>
> > > Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> > > Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> > > ---
> > > include/linux/bpf.h | 69 ++++++++++++
> > > kernel/bpf/arraymap.c | 36 ++++++-
> > > kernel/bpf/bpf_local_storage.c | 13 ++-
> > > kernel/bpf/btf.c | 6 +-
> > > kernel/bpf/hashtab.c | 181 +++++++++++++++++++++++++++----
> > > kernel/bpf/syscall.c | 190 +++++++++++++++++++++++++++++++--
> > > 6 files changed, 456 insertions(+), 39 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 715b6df9c403..037bdadbed96 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -9,6 +9,8 @@
> > >
> > > #include <crypto/sha2.h>
> > > #include <linux/workqueue.h>
> > > +#include <linux/irq_work.h>
> > > +#include <linux/llist.h>
> > > #include <linux/file.h>
> > > #include <linux/percpu.h>
> > > #include <linux/err.h>
> > > @@ -234,6 +236,10 @@ struct btf_field_kptr {
> > > * program-allocated, dtor is NULL, and __bpf_obj_drop_impl is used
> > > */
> > > btf_dtor_kfunc_t dtor;
> > > + struct irq_work irq_work;
> > > + struct llist_head irq_work_items;
> > > + struct llist_head free_list;
> > > + u32 aux_off;
> > > u32 btf_id;
> > > };
> >
> > This is extreme per-field overhead
> > 500 extra lines to fix a corner case?
> > That's not what I suggested. Can the whole thing be
> > done with a _single_ global llist and irq_work? I bet yes.
> > Think it through and send us a proposal for review instead of code.
> >
Alexei,
I did some more thinking and came up with a possible solution.
There is some pseudo-C code below to make the idea easier to follow.

This approach involves hijacking the slab freepointer, and thus only
works for slab-allocated objects. (I'm not aware of any vmalloc-allocated
referenced kptrs, or whether that is something that needs to be
supported.)
When bpf_obj_free_fields() is called, then for each referenced kptr
field that:

  a) is freed from an NMI context, and
  b) has a dtor defined,

we take the following steps after the map slot has been zeroed but
before the dtor runs:
1. Hijack the slab freepointer slot for the object. This is safe only
   because the dtor has not run yet, so we still hold a valid reference
   to the underlying kernel object even though the pointer no longer
   exists in the map. We reuse the freepointer slot as an llist_node
   for our deferred free list, storing the head in a percpu queue for
   later use. The kasan_reset_tag() call avoids KASAN warnings.

   We do this like:

	struct kmem_cache *s;
	struct slab *slab;
	struct llist_node *node;

	slab = virt_to_slab(obj);
	s = slab->slab_cache;
	node = (struct llist_node *)(kasan_reset_tag(obj) + s->offset);
2. We queue the irq_work to execute our dtor in hardirq context:

	llist_add(node, &our_dtor_work_llist);
	irq_work_queue(&our_irq_work);
3. Once our irq_work runs, we move all the items off our llist in one
   shot, allowing concurrent producers to start queueing again. This is
   safe because we still hold a valid reference: again, the dtor has
   not executed yet.

	struct llist_node *llnode, *pos, *tmp;

	llnode = llist_del_all(&our_dtor_work_llist);
	llist_for_each_safe(pos, tmp, llnode) {
		/* step 4 */
		void *obj = btf_kptr_llist_to_obj(pos);
		/* step 5 */
		if (obj)
			our_dtor_metadata->dtor(obj);
	}
4. We iterate through our llist. For each entry we replace the hijacked
   freepointer with a pointer to the current freelist head,
   un-hijacking it:

	/*
	 * Returns the object for the hijacked freepointer and
	 * un-hijacks it by relinking it into the freelist.
	 */
	static void *btf_kptr_llist_to_obj(struct llist_node *node)
	{
		struct kmem_cache *s;
		struct slab *slab;
		void *object;

		object = kasan_reset_tag(node);
		slab = virt_to_slab(object);
		if (WARN_ON_ONCE(!slab))
			return NULL;
		s = slab->slab_cache;
		object -= s->offset;
		btf_kptr_relink_freepointer(s, slab, object);
		return object;
	}
	static inline void btf_kptr_relink_freepointer(struct kmem_cache *s,
						       struct slab *slab,
						       void *object)
	{
		/* atomics may be needed here instead */
		void *fp = READ_ONCE(slab->freelist);

		if (WARN_ON_ONCE(fp == object))
			fp = NULL;
		btf_kptr_set_freepointer(s, object, fp);
	}
5. Finally, we call the destructor. The destructor is free to decide
   whether or not to free the object, since its freelist linkage is
   valid again after step 4.

	if (obj)
		our_dtor_metadata->dtor(obj);
Benefits of this:

1. We don't allocate any new memory; we basically reuse the freepointer
   as space for the nodes of our dtor queue.
2. No changes to maps at all.
3. Doesn't require any new data structures on the BPF side.
4. The whole thing is about ~215 LOC, mostly slab address helpers and
   the freelist pointer/relinking logic.
This approach requires including "../../mm/slab.h", which is already
included by kernel/bpf/kmem_cache_iter.c.

This would definitely need a look from the mm folks; I'm not sure
whether freelist manipulation (abuse?) like this is OK.

I have a working POC, but I'm going to hold off on sending it until I
know whether this idea is the right way to go.
Thanks for your time with this and I appreciate you pushing me to get
the right solution :) I'm learning a lot from this.
Justin Suess
Thread overview: 11+ messages
2026-04-28 20:14 [PATCH bpf-next 0/4] bpf: Fix NMI deadlock in referenced kptr destructors Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 1/4] bpf: Limit fields used in btf_record_equal comparisons Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 2/4] bpf: Use rcu_work in BTF teardown Justin Suess
2026-04-29 1:49 ` sashiko-bot
2026-04-28 20:14 ` [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-04-29 2:29 ` sashiko-bot
2026-04-29 9:37 ` Alexei Starovoitov
2026-04-29 16:21 ` Justin Suess
2026-05-02 14:33 ` Justin Suess [this message]
2026-04-28 20:14 ` [PATCH bpf-next 4/4] selftests/bpf: Add kptr nmi deadlock reproducer Justin Suess
2026-04-29 3:39 ` sashiko-bot