BPF List
 help / color / mirror / Atom feed
From: Justin Suess <utilityemal77@gmail.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: sashiko-reviews@lists.linux.dev, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI
Date: Sat, 6 Jun 2026 12:27:49 -0400	[thread overview]
Message-ID: <aiRG9nsbRi2cO3XT@zenbox> (raw)
In-Reply-To: <DIMWGCWL1AG2.14F7CN23GMS8N@gmail.com>

On Tue, May 19, 2026 at 09:27:58PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Tue May 19, 2026 at 5:59 PM CEST, Justin Suess wrote:
> > On Tue, May 19, 2026 at 10:42:44AM -0400, Justin Suess wrote:
> >> Looks like this kfunc also tries to eagerly free fields and is callable
> >> from tp_btf/nmi_handler with no recovery mechanism.
> >>
> >> I don't see a good way to handle this at all... it also is effectively
> >> already bugged in the existing code unless I'm missing something
> >>
> >> Since you can use it to call abritrary dtors in NMI.
> >>
> >> 1.
> >>
> >> In some context where task_acquire is valid:
> >>
> >> struct blah {
> >>   struct task_struct __kptr* task;
> >> }
> >>
> >> b = bpf_task_acquire(...)
> >> a = bpf_obj_new(typeof(*blah_ptr));
> >> bpf_kptr_xchg(a->task, b);
> >>
> >> 2.
> >>
> >> Then xchg a into map
> >> bpf_kptr_xchg([map slot], a);
> >> ...
> >>
> >> 3.
> >>
> >> In tp_btf/nmi_handler we exchange the obj out of the map
> >> and then drop it.
> >>
> >> c = bpf_kptr_xchg([slot where a is stored], NULL);
> >> bpf_obj_drop(c);
> >>
> >> And because this can be called on an object not in a map, we don't have
> >> a good recovery mechanism.
> >>
> >> Because tp_btf/nmi_handler is always in_nmi and hence irqs_disabled, in
> >> the code before this patch the task_struct dtor will be invoked and
> >> cause a deadlock.
> >>
> >> This patch will not fix the issue and just free a struct blah holding the
> >> reference without actually dropping it.
> >>
> >> Unless I'm missing something then this whole approach taken by this
> >> patch can't handle this case since the bpf_obj is detached from the map
> >> entirely and has no allocator recovery mechanism at all.
> >>
> >> Then we are back to square 1 needing to defer the freeing with irq work.
> >>
> >> Kumar / Alexei do you have anything on this?
> >
> > I've confirmed this is a viable mechanism to deadlock as well
> >
> > Using the above approach, just exchanging out of a map and dropping the
> > bpf_obj containing a task_struct kptr can be used to cause a deadlock
> > outside any map path:
> >
> > On 576482b55c19e7ec00e162a0fde4c4f1a95128c7 (bpf-next/master) w/ RCU_NOCB_CPU/RCU_EXPERT:
> >
> >   [   10.773819] ================================
> >   [   10.773820] WARNING: inconsistent lock state
> >   [   10.773821] 7.1.0-rc2-g576482b55c19 #150 Not tainted
> >   [   10.773822] --------------------------------
> >   [   10.773822] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> >   [   10.773823] test_progs/471 [HC1[1]:SC0[0]:HE0:SE1] takes:
> >   [   10.773825] ffff98623d66f0e8 (&rdp->nocb_lock){-.-.}-{2:2}, at: __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773832] {INITIAL USE} state was registered at:
> >   [   10.773832]   lock_acquire+0xbf/0x2e0
> >   [   10.773835]   _raw_spin_lock+0x30/0x40
> >   [   10.773838]   rcu_nocb_gp_kthread+0x13f/0xb90
> >   [   10.773840]   kthread+0xf4/0x130
> >   [   10.773845]   ret_from_fork+0x264/0x330
> >   [   10.773848]   ret_from_fork_asm+0x1a/0x30
> >   [   10.773851] irq event stamp: 1127908
> >   [   10.773851] hardirqs last  enabled at (1127907): [<ffffffff87104484>] _raw_spin_unlock_irqrestore+0x44/0x60
> >   [   10.773853] hardirqs last disabled at (1127908): [<ffffffff871041a1>] _raw_spin_lock_irqsave+0x51/0x60
> >   [   10.773854] softirqs last  enabled at (1127782): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
> >   [   10.773857] softirqs last disabled at (1127777): [<ffffffff862e6b80>] __irq_exit_rcu+0xc0/0x100
> >   [   10.773858]
> >                  other info that might help us debug this:
> >   [   10.773859]  Possible unsafe locking scenario:
> >
> >   [   10.773859]        CPU0
> >   [   10.773859]        ----
> >   [   10.773860]   lock(&rdp->nocb_lock);
> >   [   10.773861]   <Interrupt>
> >   [   10.773861]     lock(&rdp->nocb_lock);
> >   [   10.773862]
> >                   *** DEADLOCK ***
> >
> >   [   10.773862] 4 locks held by test_progs/471:
> >   [   10.773863]  #0: ffffffff885ccaf0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1e39/0x2b80
> >   [   10.773866]  #1: ffff9861c33a2ff8 (&mm->mmap_lock){++++}-{4:4}, at: dup_mmap+0x83/0x990
> >   [   10.773871]  #2: ffff9861c1152ff8 (&mm->mmap_lock/1){+.+.}-{4:4}, at: dup_mmap+0xc8/0x990
> >   [   10.773874]  #3: ffffffff885f43b8 (kmemleak_lock){-.-.}-{2:2}, at: __create_object+0x3a/0xa0
> >   [   10.773877]
> >                  stack backtrace:
> >   [   10.773879] CPU: 1 UID: 0 PID: 471 Comm: test_progs Not tainted 7.1.0-rc2-g576482b55c19 #150 PREEMPT(full)
> >   [   10.773881] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> >   [   10.773882] Call Trace:
> >   [   10.773883]  <NMI>
> >   [   10.773884]  dump_stack_lvl+0x5d/0x80
> >   [   10.773887]  print_usage_bug.part.0+0x22b/0x2c0
> >   [   10.773891]  lock_acquire+0x295/0x2e0
> >   [   10.773893]  ? __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773896]  _raw_spin_lock+0x30/0x40
> >   [   10.773898]  ? __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773899]  __call_rcu_common.constprop.0+0x54a/0x730
> >   [   10.773903]  bpf_obj_free_fields+0x118/0x250
> >   [   10.773907]  bpf_obj_drop+0x4f/0x80
> >   [   10.773910]  bpf_prog_b7ff9c2e6c583824_clear_task_kptrs_from_nmi+0x174/0x1f4
> >   [   10.773912]  bpf_trace_run3+0x126/0x430
> >   [   10.773916]  ? __pfx_perf_event_nmi_handler+0x10/0x10
> >   [   10.773919]  nmi_handle.part.0+0x15b/0x250
> >   [   10.773922]  ? __pfx_perf_event_nmi_handler+0x10/0x10
> >   [   10.773924]  default_do_nmi+0x120/0x180
> >   [   10.773927]  exc_nmi+0xe3/0x110
> >   [   10.773928]  end_repeat_nmi+0xf/0x53
> >   [   10.773931] RIP: 0010:__link_object+0xb4/0x270
> >   [   10.773932] Code: 00 00 00 48 c7 c6 80 36 bb 89 48 c7 43 70 00 00 00 00 48 c7 43 78 00 00 00 00 48 89 3d 35 62 50 03 eb 66 48 89 c8 48 8b 48 30 <48> 8d 70 10 48 39 d1 73 11 48 03 48 38 48 39 cf 0f 82 2f 01 00 00
> >   [   10.773933] RSP: 0018:ffffb516013b7a60 EFLAGS: 00000082
> >   [   10.773935] RAX: ffff9861c37b5c80 RBX: ffff9861df869878 RCX: ffff9861c3776420
> >   [   10.773936] RDX: ffff9861c3769e00 RSI: ffff9861c43564f8 RDI: ffff9861c3769d00
> >   [   10.773936] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
> >   [   10.773937] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9861c3769d00
> >   [   10.773937] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000001
> >   [   10.773943]  ? __link_object+0xb4/0x270
> >   [   10.773945]  ? __link_object+0xb4/0x270
> >   [   10.773947]  </NMI>
> >   [   10.773947]  <TASK>
> >   [   10.773948]  __create_object+0x51/0xa0
> >   [   10.773951]  kmem_cache_alloc_from_sheaf_noprof+0x1e2/0x290
> >   [   10.773956]  mas_dup_build+0x41d/0x7f0
> >   [   10.773960]  __mt_dup+0x78/0xf0
> >   [   10.773967]  dup_mmap+0x12e/0x990
> >   [   10.773972]  ? srso_alias_return_thunk+0x5/0xfbef5
> >   [   10.773973]  ? lock_acquire+0xbf/0x2e0
> >   [   10.773979]  copy_process+0x1e45/0x2b80
> >   [   10.773985]  kernel_clone+0xbb/0x3d0
> >   [   10.773986]  ? srso_alias_return_thunk+0x5/0xfbef5
> >   [   10.773987]  ? __lock_acquire+0x481/0x2590
> >   [   10.773993]  __do_sys_clone+0x65/0x90
> >   [   10.773998]  do_syscall_64+0xa1/0x5f0
> >   [   10.774002]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >   [   10.774003] RIP: 0033:0x7f9da26e8f2f
> >   [   10.774005] Code: 89 e7 e8 c4 0e f6 ff 45 31 c0 31 d2 31 f6 64 48 8b 04 25 10 00 00 00 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 89 c5 85 c0 75 31 64 48 8b 04 25 10 00 00
> >   [   10.774005] RSP: 002b:00007ffc278b4540 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
> >   [   10.774007] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9da26e8f2f
> >   [   10.774007] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
> >   [   10.774008] RBP: 00007ffc278b4740 R08: 0000000000000000 R09: 0000000000000001
> >   [   10.774008] R10: 00007f9da2acaf10 R11: 0000000000000246 R12: 0000000000000003
> >   [   10.774009] R13: 0000000000000001 R14: 0000000000000000 R15: 00005625de8b38d0
> >   [   10.774014]  </TASK>
> >   [   10.866584] perf: interrupt took too long (6193 > 6190), lowering kernel.perf_event_max_sample_rate to 32000
> >   [   16.630663] kauditd_printk_skb: 30 callbacks suppressed
> >
> > I can give the reproducer if wished.
> >
> > You can see the path involves bpf_obj_drop and not any map methods at
> > all.
> >
> > But basically this makes the whole mechanism of relying on map allocators to
> > recycle and free objects in irqs_disabled contexts unsound because kptrs
> > can live and be released outside map contexts.
> 
> The approach in this version is still different than what I had proposed. For
> the particular case you highlighted, we should just reject bpf_obj_drop() in
> is_tracing_prog_type() once the object has kptr fields etc. If someone still
> wants it they should just use bpf_wq to defer manually.
>
Would you be opposed to me submitting a fix for *just* the bpf_obj_drop case?

Since your proposed solution for that is just a tiny verifier fix w/ btf_record_has_field and a prog_type check and is mostly unrelated to this fix for the map cases.

It's a small fix, but just don't want to be rude and submit anything if
you're working on that too in your patch. :)

Thanks for your guidance with this fix in general and apologize for my
misinterpretation of the intended route.

Kind Regards,
Justin
> The whole prealloc related change to the map is unnecessary, as I said before.
> It's an orthogonal change. We should also not use irqs_disabled() and just do
> the skipping unconditionally.
> 
> Let me work on a patch, since the nuance is obviously getting lost in email.


  parent reply	other threads:[~2026-06-06 16:27 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19  1:14 [PATCH bpf-next 0/1] bpf: Fix deadlock in freeing of special fields in NMI Justin Suess
2026-05-19  1:14 ` [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction " Justin Suess
2026-05-19  1:51   ` sashiko-bot
2026-05-19 14:42     ` Justin Suess
2026-05-19 15:59       ` Justin Suess
2026-05-19 19:27         ` Kumar Kartikeya Dwivedi
2026-05-19 23:19           ` Justin Suess
2026-06-06 16:27           ` Justin Suess [this message]
2026-06-07  2:40             ` Kumar Kartikeya Dwivedi
2026-06-07 15:05               ` Justin Suess
2026-05-19 10:25   ` kernel test robot
2026-05-19 12:31   ` Mykyta Yatsenko
2026-05-19 13:22     ` Justin Suess
2026-05-19 13:55       ` Mykyta Yatsenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aiRG9nsbRi2cO3XT@zenbox \
    --to=utilityemal77@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=memxor@gmail.com \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox