BPF List
 help / color / mirror / Atom feed
From: Justin Suess <utilityemal77@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	 eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
	song@kernel.org,  yonghong.song@linux.dev, jolsa@kernel.org,
	bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi
Date: Wed, 29 Apr 2026 12:21:17 -0400	[thread overview]
Message-ID: <afIlTiyoJUqf2YBN@suesslenovo> (raw)
In-Reply-To: <DI5JDE24ICIE.1PAT8HWYK9ENS@gmail.com>

On Wed, Apr 29, 2026 at 02:37:32AM -0700, Alexei Starovoitov wrote:
> On Tue Apr 28, 2026 at 1:14 PM PDT, Justin Suess wrote:
> > Defer freeing of referenced kptrs using irq_work queue.
> >
> > This fixes a deadlock in BPF tracing programs running under NMI.
> >
> > Each kptr is tagged with an auxiliary data field storing an llist_node
> > and a pointer to the object to be freed. These are assembled together
> > to form a queue for deletion outside NMI.
> >
> > Add a field to each data structure capable of holding referenced kptrs
> > to store the llist_head, as well as an irq_work struct to the btf kptr
> > field to store the task callback.
> >
> > The llist_nodes are linked in the queue safely, allowing them to be torn
> > down once NMI is over.
> >
> > This irq_work struct is foribly synchronized on btf teardown, enabled by
> > the change in btf cleanup code introduced in the previous commit, adding
> > the rcu_work teardown.
> >
> > At dtor time, if the execution is in an nmi context, enqueue the
> > referenced kptr nodes in the llist_head and enqueue a job to drain the
> > list, calling the respective dtor callback from a safe context.
> >
> > If running outside nmi, use synchronous dtor path.
> >
> > This touches arraymap, hashtab, and bpf local storage. It's important to
> > note however, that the bpf_local_storage code rejects nmi updates
> > already, the code changes in that case are just to accommodate the changes
> > to the record extending the kptr.
> >
> > Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> > Reported-by: Justin Suess <utilityemal77@gmail.com>
> > Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
> > Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> > ---
> >  include/linux/bpf.h            |  69 ++++++++++++
> >  kernel/bpf/arraymap.c          |  36 ++++++-
> >  kernel/bpf/bpf_local_storage.c |  13 ++-
> >  kernel/bpf/btf.c               |   6 +-
> >  kernel/bpf/hashtab.c           | 181 +++++++++++++++++++++++++++----
> >  kernel/bpf/syscall.c           | 190 +++++++++++++++++++++++++++++++--
> >  6 files changed, 456 insertions(+), 39 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 715b6df9c403..037bdadbed96 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -9,6 +9,8 @@
> >  
> >  #include <crypto/sha2.h>
> >  #include <linux/workqueue.h>
> > +#include <linux/irq_work.h>
> > +#include <linux/llist.h>
> >  #include <linux/file.h>
> >  #include <linux/percpu.h>
> >  #include <linux/err.h>
> > @@ -234,6 +236,10 @@ struct btf_field_kptr {
> >  	 * program-allocated, dtor is NULL,  and __bpf_obj_drop_impl is used
> >  	 */
> >  	btf_dtor_kfunc_t dtor;
> > +	struct irq_work irq_work;
> > +	struct llist_head irq_work_items;
> > +	struct llist_head free_list;
> > +	u32 aux_off;
> >  	u32 btf_id;
> >  };
> 
> This is extreme per-field overhead
> 500 extra lines to fix a corner case?
> That's not what I suggested. Can the whole thing will be
> done with _single_ global llist and irq_work? I bet yes.
> Think it through and send us prompt for review instead of code.
>
This is definitely over-engineered, I agree. I'm not the most
experienced at this so I appreciate the patience.

The main issue I'm running into is (avoiding) memory allocation in the
dtor path. My solution does *technically* avoid that, but causes a bunch
of other problems.

So we work in these constraints:

1. Even though BPF provides nmi safe allocation methods, it's not
acceptable in my understanding to allocate because memory exhaustion
will cause a failure to free.

2. We obviously can't do anything unsafe like spinlock, lock, block,
rcu, etc under nmi.

3. We need to defer dtor work from nmi for refcounted kptrs in nmi with
irq_work.

It's just the placement of the llist_node that I'm struggling with.
kptrs are just stored as memory addresses, there's no clean location
I can just stick a llist_node, (hence the auxiliary data hack I did),
and I can't just allocate it dynamically either.

---

What do you think of this general approach / prompt?

We make a global llist_head and irq_work queue for dtor work.

Whenever a slot in a map with a record containing refcounted kptr fields
is allocated, we allocate a struct per kptr field like

  struct bpf_kptr_free_slot {
    struct llist_node free_node;
    void *obj;
    void *elem;
    btf_dtor_kfunc_t dtor;
  };

Allocating it using the nmi safe bpf allocator in kernel/bpf/memalloc.c

(Allocating it at slot allocation time seems like the only reasonable way to
do it, since we can safely fail allocating new map items with -ENOMEM)

These structs would ideally be stored in some way that gives efficient
elem* to bpf_kptr_free_slot lookup. (and be nmi safe).

They would need to live outside the map itself.

Then at dtor time we look up the bpf_kptr_free_list for the elem, set
the obj to our pointer to be freed, set the dtor, and
link it into the global free llist_head.

Then we trigger the global irq_work.

Our irq work jobs will execute all the queued dtors and then free the
bpf_kptr_free_slot struct.

---

This still would add overhead, but it wouldn't require any changes to
the map structure itself.

Let me know what you think or if there is some obvious thing I'm
missing.

Justin
> pw-bot: cr
> 

  reply	other threads:[~2026-04-29 16:21 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-28 20:14 [PATCH bpf-next 0/4] bpf: Fix NMI deadlock in referenced kptr destructors Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 1/4] bpf: Limit fields used in btf_record_equal comparisons Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 2/4] bpf: Use rcu_work in BTF teardown Justin Suess
2026-04-29  1:49   ` sashiko-bot
2026-04-28 20:14 ` [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-04-29  2:29   ` sashiko-bot
2026-04-29  9:37   ` Alexei Starovoitov
2026-04-29 16:21     ` Justin Suess [this message]
2026-05-02 14:33       ` Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 4/4] selftests/bpf: Add kptr nmi deadlock reproducer Justin Suess
2026-04-29  3:39   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afIlTiyoJUqf2YBN@suesslenovo \
    --to=utilityemal77@gmail.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=eddyz87@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=memxor@gmail.com \
    --cc=song@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox