BPF List
From: Justin Suess <utilityemal77@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
	song@kernel.org, yonghong.song@linux.dev, jolsa@kernel.org,
	bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi
Date: Sat, 2 May 2026 10:33:08 -0400	[thread overview]
Message-ID: <afYLJAT9brXkWxz2@zenbox> (raw)
In-Reply-To: <afIlTiyoJUqf2YBN@suesslenovo>

On Wed, Apr 29, 2026 at 12:21:17PM -0400, Justin Suess wrote:
> On Wed, Apr 29, 2026 at 02:37:32AM -0700, Alexei Starovoitov wrote:
> > On Tue Apr 28, 2026 at 1:14 PM PDT, Justin Suess wrote:
> > > Defer freeing of referenced kptrs using irq_work queue.
> > >
> > > This fixes a deadlock in BPF tracing programs running under NMI.
> > >
> > > Each kptr is tagged with an auxiliary data field storing an llist_node
> > > and a pointer to the object to be freed. These are assembled together
> > > to form a queue for deletion outside NMI.
> > >
> > > Add a field to each data structure capable of holding referenced kptrs
> > > to store the llist_head, as well as an irq_work struct to the btf kptr
> > > field to store the task callback.
> > >
> > > The llist_nodes are linked in the queue safely, allowing them to be torn
> > > down once NMI is over.
> > >
> > > This irq_work struct is forcibly synchronized on btf teardown,
> > > enabled by the change to btf cleanup code in the previous commit
> > > adding the rcu_work teardown.
> > >
> > > At dtor time, if the execution is in an nmi context, enqueue the
> > > referenced kptr nodes in the llist_head and enqueue a job to drain the
> > > list, calling the respective dtor callback from a safe context.
> > >
> > > If running outside nmi, use synchronous dtor path.
> > >
> > > This touches arraymap, hashtab, and bpf local storage. Note,
> > > however, that the bpf_local_storage code already rejects NMI
> > > updates; the changes there only accommodate the extension of the
> > > kptr record.
> > >
> > > Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > > Reported-by: Justin Suess <utilityemal77@gmail.com>
> > > Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> > > Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> > > ---
> > >  include/linux/bpf.h            |  69 ++++++++++++
> > >  kernel/bpf/arraymap.c          |  36 ++++++-
> > >  kernel/bpf/bpf_local_storage.c |  13 ++-
> > >  kernel/bpf/btf.c               |   6 +-
> > >  kernel/bpf/hashtab.c           | 181 +++++++++++++++++++++++++++----
> > >  kernel/bpf/syscall.c           | 190 +++++++++++++++++++++++++++++++--
> > >  6 files changed, 456 insertions(+), 39 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 715b6df9c403..037bdadbed96 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -9,6 +9,8 @@
> > >  
> > >  #include <crypto/sha2.h>
> > >  #include <linux/workqueue.h>
> > > +#include <linux/irq_work.h>
> > > +#include <linux/llist.h>
> > >  #include <linux/file.h>
> > >  #include <linux/percpu.h>
> > >  #include <linux/err.h>
> > > @@ -234,6 +236,10 @@ struct btf_field_kptr {
> > >  	 * program-allocated, dtor is NULL,  and __bpf_obj_drop_impl is used
> > >  	 */
> > >  	btf_dtor_kfunc_t dtor;
> > > +	struct irq_work irq_work;
> > > +	struct llist_head irq_work_items;
> > > +	struct llist_head free_list;
> > > +	u32 aux_off;
> > >  	u32 btf_id;
> > >  };
> > 
> > This is extreme per-field overhead
> > 500 extra lines to fix a corner case?
> > That's not what I suggested. Can the whole thing be
> > done with a _single_ global llist and irq_work? I bet yes.
> > Think it through and send us a prompt for review instead of code.
> >
> >
Alexei,

I did some more thinking and came up with a possible solution.

There is some pseudo-C code to make the idea easier to understand.

This approach involves hijacking the slab freepointer, and thus only
works for slab-allocated objects. (I'm not aware of any vmalloc-allocated
referenced kptrs, or whether that's something that needs to be
supported.)

    When bpf_obj_free_fields is called, for each referenced kptr field
    where:

    a) the call happens in NMI context, and

    b) the kptr has a dtor defined,

    we take the following steps after the map slot has been zeroed but
    before the dtor runs:

    1. Hijack the slab freepointer slot for the object. This is safe only
       because we have not yet run the dtor, and thus still hold a valid
       reference to the underlying kernel object, even though the pointer
       no longer exists in the map. We use the freepointer slot as an
       llist_node for our deferred free list, storing the head in a percpu
       queue for later use. The kasan_reset_tag call avoids KASAN
       warnings.

       We do this like:

        struct kmem_cache *s;
        struct slab *slab = virt_to_slab(obj);
        struct llist_node *node;

        s = slab->slab_cache;
        node = (struct llist_node *)(kasan_reset_tag(obj) + s->offset);

    2. We queue the irq_work to execute our dtor in
       hardirq context:

        llist_add(node, &our_dtor_work_llist)
        irq_work_queue(&our_workqueue)

    3. Once our work executes, we move all items off our llist, allowing
       concurrent jobs to start pending again. We know this is safe
       because we still hold a valid reference: again, the dtor has not
       executed.

        llnode = llist_del_all(&our_dtor_work_llist);
        llist_for_each_safe(pos, tmp, llnode) {
                /* step 4 */
                void *obj = btf_kptr_llist_to_obj(pos);
                /* step 5 */
                if (obj)
                        our_dtor_metadata->dtor(obj);
        }

    4. We iterate through our llist. For each of our pointers, we replace
       the hijacked entry's freepointer with a pointer to the current
       freelist head, un-hijacking it:

        /* Returns the object for the hijacked freepointer and
         * un-hijacks it by relinking it into the freelist. */
        static void *btf_kptr_llist_to_obj(struct llist_node *node)
        {
        	struct kmem_cache *s;
        	struct slab *slab;
        	void *object;

        	object = kasan_reset_tag(node);
        	slab = virt_to_slab(object);
        	if (WARN_ON_ONCE(!slab))
        		return NULL;

        	s = slab->slab_cache;
        	object -= s->offset;
        	/* reset our freepointer */
        	btf_kptr_relink_freepointer(s, slab, object);
        	return object;
        }

        static inline void btf_kptr_relink_freepointer(struct kmem_cache *s,
        					       struct slab *slab, void *object)
        {
        	/* atomics may be needed here instead */
        	void *fp = READ_ONCE(slab->freelist);

        	if (WARN_ON_ONCE(fp == object))
        		fp = NULL;

        	btf_kptr_set_freepointer(s, object, fp);
        }

    5. Finally, we call our destructor. The destructor is free to decide
       whether or not to free the object, since the freepointer is valid
       again because of step 4.

        if (obj)
        	our_dtor_metadata->dtor(obj);

Benefits of this:

1. We don't allocate any new memory; we basically reuse the freepointer
as space for nodes in our dtor queue.

2. No changes to maps at all.

3. Doesn't require any new data structures on the BPF side.

4. The whole thing is ~215 LOC, mostly slab address helpers and the
   freelist pointer / relinking logic.

This approach requires including "../../mm/slab.h", which is already
included by kernel/bpf/kmem_cache_iter.c.

This would definitely need review from the mm folks; I'm not sure
whether freelist manipulation (abuse?) like this is acceptable.

I have a working POC, but I'm going to hold off until I know whether
this idea is the right way to do it.

Thanks for your time with this and I appreciate you pushing me to get
the right solution :) I'm learning a lot from this.

Justin Suess


Thread overview: 11+ messages
2026-04-28 20:14 [PATCH bpf-next 0/4] bpf: Fix NMI deadlock in referenced kptr destructors Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 1/4] bpf: Limit fields used in btf_record_equal comparisons Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 2/4] bpf: Use rcu_work in BTF teardown Justin Suess
2026-04-29  1:49   ` sashiko-bot
2026-04-28 20:14 ` [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-04-29  2:29   ` sashiko-bot
2026-04-29  9:37   ` Alexei Starovoitov
2026-04-29 16:21     ` Justin Suess
2026-05-02 14:33       ` Justin Suess [this message]
2026-04-28 20:14 ` [PATCH bpf-next 4/4] selftests/bpf: Add kptr nmi deadlock reproducer Justin Suess
2026-04-29  3:39   ` sashiko-bot
