Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Justin Suess <utilityemal77@gmail.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: sashiko-reviews@lists.linux.dev, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
Date: Sat, 6 Jun 2026 12:27:49 -0400	[thread overview]
Message-ID: <aiRG9nsbRi2cO3XT@zenbox> (raw)
In-Reply-To: <DIMWGCWL1AG2.14F7CN23GMS8N@gmail.com>

On Tue, May 19, 2026 at 09:27:58PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Tue May 19, 2026 at 5:59 PM CEST, Justin Suess wrote:
> > On Tue, May 19, 2026 at 10:42:44AM -0400, Justin Suess wrote:
> >> Looks like this kfunc also tries to eagerly free fields and is callable
> >> from tp_btf/nmi_handler with no recovery mechanism.
> >>
> >> I don't see a good way to handle this at all... it also is effectively
> >> already bugged in the existing code unless I'm missing something
> >>
> >> Since you can use it to call abritrary dtors in NMI.
> >>
> >> 1.
> >>
> >> In some context where task_acquire is valid:
> >>
> >> struct blah {
> >>   struct task_struct __kptr* task;
> >> }
> >>
> >> b = bpf_task_acquire(...)
> >> a = bpf_obj_new(typeof(*blah_ptr));
> >> bpf_kptr_xchg(a->task, b);
> >>
> >> 2.
> >>
> >> Then xchg a into map
> >> bpf_kptr_xchg([map slot], a);
> >> ...
> >>
> >> 3.
> >>
> >> In tp_btf/nmi_handler we exchange the obj out of the map
> >> and then drop it.
> >>
> >> c = bpf_kptr_xchg([slot where a is stored], NULL);
> >> bpf_obj_drop(c);
> >>
> >> And because this can be called on an object not in a map, we don't have
> >> a good recovery mechanism.
> >>
> >> Because tp_btf/nmi_handler is always in_nmi and hence irqs_disabled, in
> >> the code before this patch the task_struct dtor will be invoked and
> >> cause a deadlock.
> >>
> >> This patch will not fix the issue and just free a struct blah holding the
> >> reference without actually dropping it.
> >>
> >> Unless I'm missing something then this whole approach taken by this
> >> patch can't handle this case since the bpf_obj is detached from the map
> >> entirely and has no allocator recovery mechanism at all.
> >>
> >> Then we are back to square 1 needing to defer the freeing with irq work.
> >>
> >> Kumar / Alexei do you have anything on this?
> >
> > I've confirmed this is a viable mechanism to deadlock as well
> >
> > Using the above approach, just exchanging out of a map and dropping the
> > bpf_obj containing a task_struct kptr can be used to cause a deadlock
> > outside any map path:
> >
> > On 576482b55c19e7ec00e162a0fde4c4f1a95128c7 (bpf-next/master) w/ RCU_NOCB_CPU/RCU_EXPERT:
> >
> >   [   10.773819] ================================
> >   [   10.773820] WARNING: inconsistent lock state
> >   [   10.773821] 7.1.0-rc2-g576482b55c19 #150 Not tainted
> >   [   10.773822] --------------------------------
> >   [   10.773822] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> >   [   10.773823] test_progs/471 [HC1[1]:SC0[0]:HE0:SE1] takes:
> >   [   10.773825] ffff98623d66f0e8 (&rdp->nocb_lock){-.-.}-{2:2}, at: __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773832] {INITIAL USE} state was registered at:
> >   [   10.773832]   lock_acquire+0xbf/0x2e0
> >   [   10.773835]   _raw_spin_lock+0x30/0x40
> >   [   10.773838]   rcu_nocb_gp_kthread+0x13f/0xb90
> >   [   10.773840]   kthread+0xf4/0x130
> >   [   10.773845]   ret_from_fork+0x264/0x330
> >   [   10.773848]   ret_from_fork_asm+0x1a/0x30
> >   [   10.773851] irq event stamp: 1127908
> >   [   10.773851] hardirqs last  enabled at (1127907): [<ffffffff87104484>] _raw_spin_unlock_irqrestore+0x44/0x60
> >   [   10.773853] hardirqs last disabled at (1127908): [<ffffffff871041a1>] _raw_spin_lock_irqsave+0x51/0x60
> >   [   10.773854] softirqs last  enabled at (1127782): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
> >   [   10.773857] softirqs last disabled at (1127777): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
> >   [   10.773858]
> >                  other info that might help us debug this:
> >   [   10.773859]  Possible unsafe locking scenario:
> >
> >   [   10.773859]        CPU0
> >   [   10.773859]        ----
> >   [   10.773860]   lock(&rdp->nocb_lock);
> >   [   10.773861]   <Interrupt>
> >   [   10.773861]     lock(&rdp->nocb_lock);
> >   [   10.773862]
> >                   *** DEADLOCK ***
> >
> >   [   10.773862] 4 locks held by test_progs/471:
> >   [   10.773863]  #0: ffffffff885ccaf0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1e39/0x2b80
> >   [   10.773866]  #1: ffff9861c33a2ff8 (&mm->mmap_lock){++++}-{4:4}, at: dup_mmap+0x83/0x990
> >   [   10.773871]  #2: ffff9861c1152ff8 (&mm->mmap_lock/1){+.+.}-{4:4}, at: dup_mmap+0xc8/0x990
> >   [   10.773874]  #3: ffffffff885f43b8 (kmemleak_lock){-.-.}-{2:2}, at: __create_object+0x3a/0xa0
> >   [   10.773877]
> >                  stack backtrace:
> >   [   10.773879] CPU: 1 UID: 0 PID: 471 Comm: test_progs Not tainted 7.1.0-rc2-g576482b55c19 #150 PREEMPT(full)
> >   [   10.773881] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> >   [   10.773882] Call Trace:
> >   [   10.773883]  <NMI>
> >   [   10.773884]  dump_stack_lvl+0x5d/0x80
> >   [   10.773887]  print_usage_bug.part.0+0x22b/0x2c0
> >   [   10.773891]  lock_acquire+0x295/0x2e0
> >   [   10.773893]  ? __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773896]  _raw_spin_lock+0x30/0x40
> >   [   10.773898]  ? __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773899]  __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773903]  bpf_obj_free_fields+0x118/0x250
> >   [   10.773907]  bpf_obj_drop+0x4f/0x80
> >   [   10.773910]  bpf_prog_b7ff9c2e6c583824_clear_task_kptrs_from_nmi+0x174/0x1f4
> >   [   10.773912]  bpf_trace_run3+0x126/0x430
> >   [   10.773916]  ? __pfx_perf_event_nmi_handler+0x10/0x10
> >   [   10.773919]  nmi_handle.part.0+0x15b/0x250
> >   [   10.773922]  ? __pfx_perf_event_nmi_handler+0x10/0x10
> >   [   10.773924]  default_do_nmi+0x120/0x180
> >   [   10.773927]  exc_nmi+0xe3/0x110
> >   [   10.773928]  end_repeat_nmi+0xf/0x53
> >   [   10.773931] RIP: 0010:__link_object+0xb4/0x270
> >   [   10.773932] Code: 00 00 00 48 c7 c6 80 36 bb 89 48 c7 43 70 00 00 00 00 48 c7 43 78 00 00 00 00 48 89 3d 35 62 50 03 eb 66 48 89 c8 48 8b 48 30 <48> 8d 70 10 48 39 d1 73 11 48 03 48 38 48 39 cf 0f 82 2f 01 00 00
> >   [   10.773933] RSP: 0018:ffffb516013b7a60 EFLAGS: 00000082
> >   [   10.773935] RAX: ffff9861c37b5c80 RBX: ffff9861df869878 RCX: ffff9861c3776420
> >   [   10.773936] RDX: ffff9861c3769e00 RSI: ffff9861c43564f8 RDI: ffff9861c3769d00
> >   [   10.773936] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
> >   [   10.773937] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9861c3769d00
> >   [   10.773937] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000001
> >   [   10.773943]  ? __link_object+0xb4/0x270
> >   [   10.773945]  ? __link_object+0xb4/0x270
> >   [   10.773947]  </NMI>
> >   [   10.773947]  <TASK>
> >   [   10.773948]  __create_object+0x51/0xa0
> >   [   10.773951]  kmem_cache_alloc_from_sheaf_noprof+0x1e2/0x290
> >   [   10.773956]  mas_dup_build+0x41d/0x7f0
> >   [   10.773960]  __mt_dup+0x78/0xf0
> >   [   10.773967]  dup_mmap+0x12e/0x990
> >   [   10.773972]  ? srso_alias_return_thunk+0x5/0xfbef5
> >   [   10.773973]  ? lock_acquire+0xbf/0x2e0
> >   [   10.773979]  copy_process+0x1e45/0x2b80
> >   [   10.773985]  kernel_clone+0xbb/0x3d0
> >   [   10.773986]  ? srso_alias_return_thunk+0x5/0xfbef5
> >   [   10.773987]  ? __lock_acquire+0x481/0x2590
> >   [   10.773993]  __do_sys_clone+0x65/0x90
> >   [   10.773998]  do_syscall_64+0xa1/0x5f0
> >   [   10.774002]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >   [   10.774003] RIP: 0033:0x7f9da26e8f2f
> >   [   10.774005] Code: 89 e7 e8 c4 0e f6 ff 45 31 c0 31 d2 31 f6 64 48 8b 04 25 10 00 00 00 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 89 c5 85 c0 75 31 64 48 8b 04 25 10 00 00
> >   [   10.774005] RSP: 002b:00007ffc278b4540 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
> >   [   10.774007] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9da26e8f2f
> >   [   10.774007] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
> >   [   10.774008] RBP: 00007ffc278b4740 R08: 0000000000000000 R09: 0000000000000001
> >   [   10.774008] R10: 00007f9da2acaf10 R11: 0000000000000246 R12: 0000000000000003
> >   [   10.774009] R13: 0000000000000001 R14: 0000000000000000 R15: 00005625de8b38d0
> >   [   10.774014]  </TASK>
> >   [   10.866584] perf: interrupt took too long (6193 > 6190), lowering kernel.perf_event_max_sample_rate to 32000
> >   [   16.630663] kauditd_printk_skb: 30 callbacks suppressed
> >
> > I can give the reproducer if wished.
> >
> > You can see the path involves bpf_obj_drop and not any map methods at
> > all.
> >
> > But basically this makes the whole mechanism of relying on map allocators to
> > recycle and free objects in irqs_disabled contexts unsound because kptrs
> > can live and be released outside map contexts.
> 
> The approach in this version is still different than what I had proposed. For
> the particular case you highlighted, we should just reject bpf_obj_drop() in
> is_tracing_prog_type() once the object has kptr fields etc. If someone still
> wants it they should just use bpf_wq to defer manually.
>
Would you be opposed to me submitting a fix for *just* the bpf_obj_drop case?

Since your proposed solution for that is just a tiny verifier fix w/ btf_record_has_field and a prog_type check and is mostly unrelated to this fix for the map cases.

It's a small fix, but just don't want to be rude and submit anything if
you're working on that too in your patch. :)

Thanks for your guidance with this fix in general and apologize for my
misinterpretation of the intended route.

Kind Regards,
Justin
> The whole prealloc related change to the map is unnecessary, as I said before.
> It's an orthogonal change. We should also not use irqs_disabled() and just do
> the skipping unconditionally.
> 
> Let me work on a patch, since the nuance is obviously getting lost in email.

next prev parent reply	other threads:[~2026-06-06 16:27 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19  1:14 [PATCH bpf-next 0/1] bpf: Fix deadlock in freeing of special fields in NMI Justin Suess
2026-05-19  1:14 ` [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction " Justin Suess
2026-05-19  1:51   ` sashiko-bot
2026-05-19 14:42     ` Justin Suess
2026-05-19 15:59       ` Justin Suess
2026-05-19 19:27         ` Kumar Kartikeya Dwivedi
2026-05-19 23:19           ` Justin Suess
2026-06-06 16:27           ` Justin Suess [this message]
2026-06-07  2:40             ` Kumar Kartikeya Dwivedi
2026-05-19 10:25   ` kernel test robot
2026-05-19 12:31   ` Mykyta Yatsenko
2026-05-19 13:22     ` Justin Suess
2026-05-19 13:55       ` Mykyta Yatsenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aiRG9nsbRi2cO3XT@zenbox \
    --to=utilityemal77@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=memxor@gmail.com \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.