X-Mailing-List: bpf@vger.kernel.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Date: Mon, 11 May 2026 08:51:53 -0700
Subject: Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
From: "Alexei Starovoitov"
To: "Justin Suess"
Cc: "bpf"
X-Mailer: aerc
References: <20260507175453.1140400-2-utilityemal77@gmail.com>
 <20260507234520.646C4C2BCB2@smtp.kernel.org>

On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote:
> On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
>> On Sun, May 10, 2026 at 8:14 AM Justin Suess wrote:
>> >
>> > Any help or guidance on this would be appreciated!
>>
>> sorry for the delay. Everyone was at lsfmmbpf for a week+.
>
> No worries! I hope it was an enjoyable trip and I look forward to
> hearing about the conference.
>
>> All of the solutions so far are way too complicated.
>> bpf_kptr_xchg() has to remain inlined as a single atomic xchg
>> without a slowpath, otherwise it ruins the concept
>> and makes usage unpredictable.
>>
>> Let's step back.
>> What is the issue you're trying to solve?
>>
>> the commit log says:
>>
>> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
>> > swap out referenced kptrs from NMI context. Today that runs the kptr
>> > destructor inline.
>> > Destructors such as bpf_cpumask_release() can take
>> > RCU-related locks, so running them from NMI can deadlock the system.
>>
>> and looking at the selftest from patch 2 you do:
>>
>> old = bpf_kptr_xchg(&value->mask, old);
>> if (old)
>> 	bpf_cpumask_release(old);
>>
>> so?
>> bpf_cpumask_release() is fine to call from any context,
>> because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
>
> My mistake on that. I picked a bad example for the test, but the test is
> just exercising the NMI dtor path and doesn't rely on the particular
> type of kptr. I just picked one that was easy to acquire a reference to.
>
> This dtor is safe. The task_struct, cgroup, and crypto_ctx dtors are
> not. I've annotated why here:
>
> crypto_ctx:
>
> __bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx)
> {
> 	if (refcount_dec_and_test(&ctx->usage))
> 		call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */
> }
>
> __bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx)
> {
> 	bpf_crypto_ctx_release(ctx);
> }

bpf_crypto_ctx_release() is only allowed in syscall prog types, and the
dtor via hashmap free will execute in a safe context as well.
So not an issue.

> task_struct:
>
> __bpf_kfunc void bpf_task_release(struct task_struct *p)
> {
> 	put_task_struct_rcu_user(p);
> }
>
> __bpf_kfunc void bpf_task_release_dtor(void *p)
> {
> 	put_task_struct_rcu_user(p);
> }
>
> void put_task_struct_rcu_user(struct task_struct *task)
> {
> 	if (refcount_dec_and_test(&task->rcu_users))
> 		call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu */
> }

In theory. I don't think there is a reproducer.
> cgroup_release_dtor is more complex; it ultimately goes through
> callbacks to:
>
> static void css_release(struct percpu_ref *ref)
> {
> 	struct cgroup_subsys_state *css =
> 		container_of(ref, struct cgroup_subsys_state, refcnt);
>
> 	INIT_WORK(&css->destroy_work, css_release_work_fn);
> 	queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE: workqueue */
> }

similar to task_struct. I don't think it's exploitable.

> More generally, unless it's a BPF-allocated object or doesn't rely on
> locking, call_rcu, or workqueues, the dtor is unsafe.
>
>> hashtab introduced dtor in bpf_mem_alloc,
>> so bpf_obj_free_fields() and the corresponding dtor's of kptr-s
>> are called from a valid context.
>>
>> What is the problematic sequence?
>
> So stepping back to the beginning.
>
> The problematic sequence:
>
> 1. A ref kptr (i.e. task_struct, cgroup, crypto_ctx) is xchg'd into a map.
>
> 2. In a tp_btf/nmi_handler (NMI ctx) program we drop the item from the map
>    with that referenced kptr field.
>
> 3. The dtor runs in NMI context.
>
> 4. The dtor runs call_rcu/queue_work/some other bad thing in NMI, causing
>    a deadlock.
>
> You can see this demonstrated in my task_struct reproducer [1].

did you? That link points to your v2 with cpumask.
I don't recall seeing a task_struct repro.

> It causes a deadlock by deliberately releasing the last reference to a
> task_struct via a ref kptr in NMI, getting it to call_rcu and deadlock.
>
> The typical solution to this is to run the NMI-unsafe code in non-NMI
> context by offloading to NMI work, as you proposed.
>
> The problem is we need space for the jobs we enqueue. The required
> information to run the dtor is the dtor function and the original object
> pointer. The number of dtors that can run in a single tp_btf/nmi_handler
> prog is nearly unbounded.
>
> The other problem is that even though bpf_mem_alloc is generally safe in
> NMI, we cannot allocate in the path that destroys an object.
> If the allocation
> fails due to memory pressure, we leak the object.
>
> There are a few options, all with drawbacks.
>
> 1. Dynamically allocate the job. Non-starter: failing to allocate is
> unrecoverable, and under memory pressure we can't ever schedule the dtor
> to run.
>
> 2. Store the job intrusively in the object: requires a safe place for it
> within the object. Bad because not all objects have a space we can write to.
> Non-starter.
>
> 3. Within the map slot (after the actual kptr): taken with my initial
> approach in v1. Significant complexity and requires per-map changes.
> Feasible but very complex, and would need DCAS or locking to make updating
> the map slot and our job information atomic.
>
> 4. Wrapping the kptr in a box and storing it in place of the kptr [2]:
> proposed by Mykyta. Would break direct load access to kptr objects.
>
> 5. Make every dtor NMI-safe individually. This would require a lot of
> duplicated code and require updating every destructor individually.
> Technically feasible, but seems brittle.
>
> 6. The least complex option would be forbidding xchg operations that can
> run the dtor in NMI context. That would preserve the inlining fix, but
> limit our usage of referenced kptrs in BPF programs that run in NMI context.
>
> The approach here:
>
> 7. Allocate a new spot for a free job every time we xchg into the map
> and put it in a global list. When in NMI and we run a dtor, we pop a
> job from that list and use it to offload our work via irq_work. If
> we're not in NMI we run normally. The downside is this breaks inlining
> for ref kptrs.
>
> ...
>
> I may be missing something critical, but everything I've looked at
> points to this problem being much more complex than it initially seemed.

yes. it is complex.
all 7 options are not good.
I recall the whole thing started with the desire to add bpf_put_file_dtor().
This was discussed with VFS maintainers and they didn't like the idea,
since it needs a ton of work to make it safe:
. an umount notifier to make sure a stashed file doesn't hold up umount
. a potential circular refcnt issue if a file of a bpf map is stashed into
  the same map
. an scm_rights-like facility with garbage collection

So a generic file stash is really a no-go.