From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yw1-f174.google.com (mail-yw1-f174.google.com [209.85.128.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 498CA2D3A7B for ; Sat, 6 Jun 2026 16:27:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.174 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780763273; cv=none; b=TwOEP9RizUWZFZn5FBIS9ZePr1LtUv/9pBwV/aeUZNorVydZt6tvpt4ZxKBB0TurfQ8yPpjBt1mLNrawHoNkFGcFwRTNFUa/Cw5Apr/29joyg16Su261tLxeGXYzI+3mnY3pD69rsPj8KWciN20jFd32puEwyZ/K4lgsk7NsbOg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780763273; c=relaxed/simple; bh=dF7QO6ZjPt2ZR9/UvbN/N52OV08VOWQaY8itXBTTf2c=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=YL6YifPoiU1ACy4n7RteFl475sC9M3n4rD1kJpk/E7fxS83RuCJNUmc+j0qLZLyZqJZdVek9suLzYg3up6qnbdVlES/aBxvbsDRkIEhK3TAcq6Abm1NEwmNPdeTLwFmeXYuZtchf2FJ1ZeoAI6yiixDZs7XbzKQCfPfU28SoiTg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=Pybbn/JM; arc=none smtp.client-ip=209.85.128.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Pybbn/JM" Received: by mail-yw1-f174.google.com with SMTP id 00721157ae682-7dd3f176f84so34382447b3.0 for ; Sat, 06 Jun 2026 09:27:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1780763271; x=1781368071; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=7BttLOF29gVUPciUNpnBFamecNHnBLVu+zJ6G0qMKk0=; b=Pybbn/JMz/sJHzD0EzKvAF+MiZTvbspGyQOhVccDTR7WcTL9B651iL/QgVC4RXptax j3lY6n9O12p0QCIe1OklATLhTuo1X64OtRFD2ROXku/43Bir3pOrgORNVx/tX7iPS3ey vEZaKhzB+BjD0/dj6HlIW17KEGvrpq4PdK2azeKzxV7MAtuAKicS3B93LCIlK/zNecZ8 i2byXHy1F6w+hNnw8dRen4zY/snjmeytugoAFnbMNtYxV4oxq1Avke/PI8EN479kLaEJ 023bhRReXK99WihFPTmo7wIRkORD4749LK8n6Lo8sMEIFyPwWn28FQYqzIhAq4YCx0Uh n+qQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1780763271; x=1781368071; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=7BttLOF29gVUPciUNpnBFamecNHnBLVu+zJ6G0qMKk0=; b=W/RtD65t2Ab9V/j+S3s7cmv/InzX1dJr7YjseI92QswlEDeEYUcuCsxpKFwcHYkf7K dbhUH9XqDqLGtzrwzJwxXxPs8GVnrTYWyYiqWET+s+IrDHefB+Sk9dL6siyO737m8uAN aisV3mvlGES9v7HNGnZMp2bvDI+w/zHqp4p8Cb4TptT8ETAGa48eFWVRnFIuYc3CrvX6 9bewt+PH2//vuZGdRBHC/koLpcLcC2ZcDkHLWlk9uWmOsGUHaYw8R/P2o3KHpEHnKXTE jyawUn7g1ji3urQBS8dWTT5gPxiArcLXr7xzqcSSyHCZS5Xft1FMJG6oyy7C2gnvD0vN HI7g== X-Forwarded-Encrypted: i=1; AFNElJ/GOnlB68QyOBT+Nujvh4TmBNazEgIKUmOPHt9GiNzVJKN90Ijx/83f98J/XQU4TqFps94=@vger.kernel.org X-Gm-Message-State: AOJu0Ywkoy+30aTAUDUc8/axWmBbFNt0x/2B1AiUp4vQJPHZty12RQJs /73cwkSroYOsls4DhfIzJ/ZJ1ExKQMTL6y4LcQ5Ap70w5qkUVqq57c5aik3AUg== X-Gm-Gg: Acq92OFYXg6W/1VMP3szVfy9hGMfUzVSeNFv3gACxJ5wA4g34kARrA0TTHe1u2Qy3qq kHwCyaOyNMXKhIdDRO/PUITk5O/TyhaMiWtoxfo/YQl+QG0XJHkWnMmy8gn4mGndZ8F4aPNE6t3 7E55E7sep8yG8U69uLI+k1G3pCmeJWpis2NVmpww8B4ja/thq64xMW1cwG8hoNkWeu0P49lycmK PuRvpa6XKeq2OOK+ug7TeR1anr1RCAC44MNgZYsSWKzpKOKuxUgZBOlMn5IF1XenKphldlCgAnR Yls3eiv/WIFNktqTMMx5gj9xI9JnnCkzDsyDVdYIcS1+Jjlnhc4v9DSvCRq2018U/TFhjZj5qGW 59tOcqpLApT6saOHv0+g0+GwO0VLTcECO6IbWxVCAmtOKzmRiwXDl4gViGA/+l1SupuNLQaUF3I gLOSifdQoSmChhjtbnrANxB28oeR6a0+U3pq7a43G9V+YiL6g7XEgxPnjU0WSXLEs0ww== X-Received: by 2002:a05:690c:d87:b0:7b9:d6b8:b9ac with SMTP id 00721157ae682-7ed0ed4eef8mr81055147b3.26.1780763270872; Sat, 06 Jun 2026 09:27:50 -0700 (PDT) Received: from zenbox ([2600:1700:18fb:6011:694:b67f:fc7:4c0a]) by smtp.gmail.com with ESMTPSA id 00721157ae682-7ea2167b40bsm62834257b3.17.2026.06.06.09.27.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 06 Jun 2026 09:27:50 -0700 (PDT) Date: Sat, 6 Jun 2026 12:27:49 -0400 From: Justin Suess To: Kumar Kartikeya Dwivedi Cc: sashiko-reviews@lists.linux.dev, bpf@vger.kernel.org Subject: Re: [PATCH bpf-next 1/1] bpf: fix deadlock in special field destruction in NMI Message-ID: References: <20260519011450.1144935-2-utilityemal77@gmail.com> <20260519015123.36096C2BCB7@smtp.kernel.org> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Tue, May 19, 2026 at 09:27:58PM +0200, Kumar Kartikeya Dwivedi wrote: > On Tue May 19, 2026 at 5:59 PM CEST, Justin Suess wrote: > > On Tue, May 19, 2026 at 10:42:44AM -0400, Justin Suess wrote: > >> Looks like this kfunc also tries to eagerly free fields and is callable > >> from tp_btf/nmi_handler with no recovery mechanism. > >> > >> I don't see a good way to handle this at all... it also is effectively > >> already bugged in the existing code unless I'm missing something > >> > >> Since you can use it to call abritrary dtors in NMI. > >> > >> 1. > >> > >> In some context where task_acquire is valid: > >> > >> struct blah { > >> struct task_struct __kptr* task; > >> } > >> > >> b = bpf_task_acquire(...) > >> a = bpf_obj_new(typeof(*blah_ptr)); > >> bpf_kptr_xchg(a->task, b); > >> > >> 2. > >> > >> Then xchg a into map > >> bpf_kptr_xchg([map slot], a); > >> ... > >> > >> 3. > >> > >> In tp_btf/nmi_handler we exchange the obj out of the map > >> and then drop it. > >> > >> c = bpf_kptr_xchg([slot where a is stored], NULL); > >> bpf_obj_drop(c); > >> > >> And because this can be called on an object not in a map, we don't have > >> a good recovery mechanism. > >> > >> Because tp_btf/nmi_handler is always in_nmi and hence irqs_disabled, in > >> the code before this patch the task_struct dtor will be invoked and > >> cause a deadlock. > >> > >> This patch will not fix the issue and just free a struct blah holding the > >> reference without actually dropping it. > >> > >> Unless I'm missing something then this whole approach taken by this > >> patch can't handle this case since the bpf_obj is detached from the map > >> entirely and has no allocator recovery mechanism at all. > >> > >> Then we are back to square 1 needing to defer the freeing with irq work. > >> > >> Kumar / Alexei do you have anything on this? > > > > I've confirmed this is a viable mechanism to deadlock as well > > > > Using the above approach, just exchanging out of a map and dropping the > > bpf_obj containing a task_struct kptr can be used to cause a deadlock > > outside any map path: > > > > On 576482b55c19e7ec00e162a0fde4c4f1a95128c7 (bpf-next/master) w/ RCU_NOCB_CPU/RCU_EXPERT: > > > > [ 10.773819] ================================ > > [ 10.773820] WARNING: inconsistent lock state > > [ 10.773821] 7.1.0-rc2-g576482b55c19 #150 Not tainted > > [ 10.773822] -------------------------------- > > [ 10.773822] inconsistent {INITIAL USE} -> {IN-NMI} usage. > > [ 10.773823] test_progs/471 [HC1[1]:SC0[0]:HE0:SE1] takes: > > [ 10.773825] ffff98623d66f0e8 (&rdp->nocb_lock){-.-.}-{2:2}, at: __call_rcu_common.constprop.0+0x54a/0x730 > > [ 10.773832] {INITIAL USE} state was registered at: > > [ 10.773832] lock_acquire+0xbf/0x2e0 > > [ 10.773835] _raw_spin_lock+0x30/0x40 > > [ 10.773838] rcu_nocb_gp_kthread+0x13f/0xb90 > > [ 10.773840] kthread+0xf4/0x130 > > [ 10.773845] ret_from_fork+0x264/0x330 > > [ 10.773848] ret_from_fork_asm+0x1a/0x30 > > [ 10.773851] irq event stamp: 1127908 > > [ 10.773851] hardirqs last enabled at (1127907): [] _raw_spin_unlock_irqrestore+0x44/0x60 > > [ 10.773853] hardirqs last disabled at (1127908): [] _raw_spin_lock_irqsave+0x51/0x60 > > [ 10.773854] softirqs last enabled at (1127782): [] __irq_exit_rcu+0xc0/0x100 > > [ 10.773857] softirqs last disabled at (1127777): [] __irq_exit_rcu+0xc0/0x100 > > [ 10.773858] > > other info that might help us debug this: > > [ 10.773859] Possible unsafe locking scenario: > > > > [ 10.773859] CPU0 > > [ 10.773859] ---- > > [ 10.773860] lock(&rdp->nocb_lock); > > [ 10.773861] > > [ 10.773861] lock(&rdp->nocb_lock); > > [ 10.773862] > > *** DEADLOCK *** > > > > [ 10.773862] 4 locks held by test_progs/471: > > [ 10.773863] #0: ffffffff885ccaf0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1e39/0x2b80 > > [ 10.773866] #1: ffff9861c33a2ff8 (&mm->mmap_lock){++++}-{4:4}, at: dup_mmap+0x83/0x990 > > [ 10.773871] #2: ffff9861c1152ff8 (&mm->mmap_lock/1){+.+.}-{4:4}, at: dup_mmap+0xc8/0x990 > > [ 10.773874] #3: ffffffff885f43b8 (kmemleak_lock){-.-.}-{2:2}, at: __create_object+0x3a/0xa0 > > [ 10.773877] > > stack backtrace: > > [ 10.773879] CPU: 1 UID: 0 PID: 471 Comm: test_progs Not tainted 7.1.0-rc2-g576482b55c19 #150 PREEMPT(full) > > [ 10.773881] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 > > [ 10.773882] Call Trace: > > [ 10.773883] > > [ 10.773884] dump_stack_lvl+0x5d/0x80 > > [ 10.773887] print_usage_bug.part.0+0x22b/0x2c0 > > [ 10.773891] lock_acquire+0x295/0x2e0 > > [ 10.773893] ? __call_rcu_common.constprop.0+0x54a/0x730 > > [ 10.773896] _raw_spin_lock+0x30/0x40 > > [ 10.773898] ? __call_rcu_common.constprop.0+0x54a/0x730 > > [ 10.773899] __call_rcu_common.constprop.0+0x54a/0x730 > > [ 10.773903] bpf_obj_free_fields+0x118/0x250 > > [ 10.773907] bpf_obj_drop+0x4f/0x80 > > [ 10.773910] bpf_prog_b7ff9c2e6c583824_clear_task_kptrs_from_nmi+0x174/0x1f4 > > [ 10.773912] bpf_trace_run3+0x126/0x430 > > [ 10.773916] ? __pfx_perf_event_nmi_handler+0x10/0x10 > > [ 10.773919] nmi_handle.part.0+0x15b/0x250 > > [ 10.773922] ? __pfx_perf_event_nmi_handler+0x10/0x10 > > [ 10.773924] default_do_nmi+0x120/0x180 > > [ 10.773927] exc_nmi+0xe3/0x110 > > [ 10.773928] end_repeat_nmi+0xf/0x53 > > [ 10.773931] RIP: 0010:__link_object+0xb4/0x270 > > [ 10.773932] Code: 00 00 00 48 c7 c6 80 36 bb 89 48 c7 43 70 00 00 00 00 48 c7 43 78 00 00 00 00 48 89 3d 35 62 50 03 eb 66 48 89 c8 48 8b 48 30 <48> 8d 70 10 48 39 d1 73 11 48 03 48 38 48 39 cf 0f 82 2f 01 00 00 > > [ 10.773933] RSP: 0018:ffffb516013b7a60 EFLAGS: 00000082 > > [ 10.773935] RAX: ffff9861c37b5c80 RBX: ffff9861df869878 RCX: ffff9861c3776420 > > [ 10.773936] RDX: ffff9861c3769e00 RSI: ffff9861c43564f8 RDI: ffff9861c3769d00 > > [ 10.773936] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000 > > [ 10.773937] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9861c3769d00 > > [ 10.773937] R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000001 > > [ 10.773943] ? __link_object+0xb4/0x270 > > [ 10.773945] ? __link_object+0xb4/0x270 > > [ 10.773947] > > [ 10.773947] > > [ 10.773948] __create_object+0x51/0xa0 > > [ 10.773951] kmem_cache_alloc_from_sheaf_noprof+0x1e2/0x290 > > [ 10.773956] mas_dup_build+0x41d/0x7f0 > > [ 10.773960] __mt_dup+0x78/0xf0 > > [ 10.773967] dup_mmap+0x12e/0x990 > > [ 10.773972] ? srso_alias_return_thunk+0x5/0xfbef5 > > [ 10.773973] ? lock_acquire+0xbf/0x2e0 > > [ 10.773979] copy_process+0x1e45/0x2b80 > > [ 10.773985] kernel_clone+0xbb/0x3d0 > > [ 10.773986] ? srso_alias_return_thunk+0x5/0xfbef5 > > [ 10.773987] ? __lock_acquire+0x481/0x2590 > > [ 10.773993] __do_sys_clone+0x65/0x90 > > [ 10.773998] do_syscall_64+0xa1/0x5f0 > > [ 10.774002] entry_SYSCALL_64_after_hwframe+0x76/0x7e > > [ 10.774003] RIP: 0033:0x7f9da26e8f2f > > [ 10.774005] Code: 89 e7 e8 c4 0e f6 ff 45 31 c0 31 d2 31 f6 64 48 8b 04 25 10 00 00 00 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 89 c5 85 c0 75 31 64 48 8b 04 25 10 00 00 > > [ 10.774005] RSP: 002b:00007ffc278b4540 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 > > [ 10.774007] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9da26e8f2f > > [ 10.774007] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 > > [ 10.774008] RBP: 00007ffc278b4740 R08: 0000000000000000 R09: 0000000000000001 > > [ 10.774008] R10: 00007f9da2acaf10 R11: 0000000000000246 R12: 0000000000000003 > > [ 10.774009] R13: 0000000000000001 R14: 0000000000000000 R15: 00005625de8b38d0 > > [ 10.774014] > > [ 10.866584] perf: interrupt took too long (6193 > 6190), lowering kernel.perf_event_max_sample_rate to 32000 > > [ 16.630663] kauditd_printk_skb: 30 callbacks suppressed > > > > I can give the reproducer if wished. > > > > You can see the path involves bpf_obj_drop and not any map methods at > > all. > > > > But basically this makes the whole mechanism of relying on map allocators to > > recycle and free objects in irqs_disabled contexts unsound because kptrs > > can live and be released outside map contexts. > > The approach in this version is still different than what I had proposed. For > the particular case you highlighted, we should just reject bpf_obj_drop() in > is_tracing_prog_type() once the object has kptr fields etc. If someone still > wants it they should just use bpf_wq to defer manually. > Would you be opposed to me submitting a fix for *just* the bpf_obj_drop case? Since your proposed solution for that is just a tiny verifier fix w/ btf_record_has_field and a prog_type check and is mostly unrelated to this fix for the map cases. It's a small fix, but just don't want to be rude and submit anything if you're working on that too in your patch. :) Thanks for your guidance with this fix in general and apologize for my misinterpretation of the intended route. Kind Regards, Justin > The whole prealloc related change to the map is unnecessary, as I said before. > It's an orthogonal change. We should also not use irqs_disabled() and just do > the skipping unconditionally. > > Let me work on a patch, since the nuance is obviously getting lost in email.