* [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi @ 2026-05-07 17:54 Justin Suess 2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess 2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess 0 siblings, 2 replies; 23+ messages in thread From: Justin Suess @ 2026-05-07 17:54 UTC (permalink / raw) To: ast, daniel, andrii, eddyz87, memxor Cc: martin.lau, song, yonghong.song, jolsa, bpf, Justin Suess Hello, While following up on a Sashiko report [1], I found that referenced kptr destructors can run from NMI context. One way to trigger this is from a tracing program attached to tp_btf/nmi_handler while a map element is being torn down. That is problematic because referenced kptr destructor paths are not universally NMI-safe. In particular, they may rely on operations such as call_rcu(), which can deadlock when reached from NMI context. This is v3 of the series. Changes since v2: The previous version of the series used three atomics and had an ABA race condition. This version of the series moves both the idle and active job queues to pcpu_freelists, which are designed for the push/pop operations and better handle NMI. The number of atomics was reduced from 3 -> 1, using a counter that tracks only demand, reducing complexity significantly. See the patch one commit message for the full details on the new surplus accounting mechanism. The verifier changes were fixed as well to account for a case found by Sashiko; because we are now only inlining in the non-dtor case, there was a bug (introduced by the patch) where an xchg call insn could be polymorphic with respect to referenced and unreferenced kptrs. This is fixed with a new verifier check. Finally, the selftests had some small adjustments. The counters were moved to u64 atomics from u32 non-atomics to decrease potential test flakiness. There was a small change on when kern_sync_rcu is called. 1. bpf: Offload kptr destructors that run from NMI 2. selftests/bpf: Add kptr destructor NMI exerciser Kind regards, Justin Suess [1] https://lore.kernel.org/bpf/20260421010536.17FB1C19425@smtp.kernel.org/ [2] https://lore.kernel.org/bpf/afYLJAT9brXkWxz2@zenbox/ [3] https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/ v2: https://lore.kernel.org/bpf/20260505150851.3090688-1-utilityemal77@gmail.com/ v1: https://lore.kernel.org/bpf/20260428201422.1518903-1-utilityemal77@gmail.com/ Justin Suess (2): bpf: Offload kptr destructors that run from NMI selftests/bpf: Add kptr destructor NMI exerciser include/linux/bpf.h | 16 + include/linux/bpf_verifier.h | 2 + kernel/bpf/fixups.c | 33 +- kernel/bpf/helpers.c | 24 +- kernel/bpf/syscall.c | 159 +++++++ kernel/bpf/verifier.c | 13 + .../selftests/bpf/prog_tests/kptr_dtor_nmi.c | 258 +++++++++++ .../selftests/bpf/progs/kptr_dtor_nmi.c | 412 ++++++++++++++++++ 8 files changed, 902 insertions(+), 15 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c create mode 100644 tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c base-commit: 2ca6723a5f7b68c739dba47b2639e3eaa7884b09 -- 2.53.0 ^ permalink raw reply [flat|nested] 23+ messages in thread
* [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess @ 2026-05-07 17:54 ` Justin Suess 2026-05-07 18:43 ` bot+bpf-ci 2026-05-07 23:45 ` sashiko-bot 2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess 1 sibling, 2 replies; 23+ messages in thread From: Justin Suess @ 2026-05-07 17:54 UTC (permalink / raw) To: ast, daniel, andrii, eddyz87, memxor Cc: martin.lau, song, yonghong.song, jolsa, bpf, Justin Suess, Mykyta Yatsenko, Alexei Starovoitov A BPF program attached to tp_btf/nmi_handler can delete map entries or swap out referenced kptrs from NMI context. Today that runs the kptr destructor inline. Destructors such as bpf_cpumask_release() can take RCU-related locks, so running them from NMI can deadlock the system. Queue destructor-backed teardown to irq_work and track the preallocated capacity with an idle-slot surplus counter. Two pcpu_freelists are maintained: one for the idle slots and one for active jobs. pcpu_freelist is built for NMI context, using raw_res_spin_lock and per cpu list heads. This counter is positive when the idle slots for work exceeds the number of kptrs exchanged into maps. The counter is negative when the idle slots for work is less than the number of ref kptrs exchanged into maps. The counter will always attempt to trend to zero, keeping exactly the amount of work slots ready for the worst case, freeing excess surplus while allocating new surplus when needed. This keeps NMI teardown on the fast path: it only has to pop an idle job and never allocate. On consumption of a job, the memory is returned to the idle queue for reuse. Each successful install can consume at most one future offload slot, so bpf_kptr_offload_slot_acquire() decrements surplus and provisions one additional idle job only when the pool falls short. Teardown and irq_work completion return that slot with bpf_kptr_offload_slot_release(), which rechecks the surplus before freeing an extra idle job. There are two main safety nets against memory exhaustion: If reserving another offload slot fails while installing a new destructor-backed kptr through bpf_kptr_xchg(), leave the destination unchanged and return the incoming pointer so the caller keeps ownership. This should only happen under severe memory pressure, as the work is only 24 bytes on 64 bit architectures, and slots are reused whenever possible. If NMI teardown still fails to grab an idle offload job despite that accounting, warn once and run the destructor inline rather than leak the object permanently. This warn condition is a canary for accounting bugs, since the xchg safety net shouldn't let this state be reached. The justification for falling back to running the dtor in nmi is leaking memory is less debuggable and visible to users than an explicit warning and a possible deadlock that is recoverable by the watchdog. This fix does come with small performance tradeoffs for safety. xchg can no longer be inlined for referenced kptrs, as inlining would break the slot accounting. The inlining fix is preserved for kptrs with no destructor defined. There is a small price paid of a potential alloc and a few atomic ops in the xchg path for kptrs with dtors. The nmi dtor path has a few atomic ops and an IPI for the irq_work overhead. This keeps refcounted kptr teardown out of NMI context without slowing down raw kptr exchanges that never need destructor handling. 
This change ensures that dtors can be written for hardirq context instead of NMI. Cc: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Justin Suess <utilityemal77@gmail.com> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/ Signed-off-by: Justin Suess <utilityemal77@gmail.com> --- include/linux/bpf.h | 16 ++++ include/linux/bpf_verifier.h | 2 + kernel/bpf/fixups.c | 33 +++++--- kernel/bpf/helpers.c | 24 +++++- kernel/bpf/syscall.c | 159 +++++++++++++++++++++++++++++++++++ kernel/bpf/verifier.c | 13 +++ 6 files changed, 232 insertions(+), 15 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 715b6df9c403..583e8551d162 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -3454,6 +3454,22 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd, void __bpf_free_used_maps(struct bpf_prog_aux *aux, struct bpf_map **used_maps, u32 len); +/* Direct-call target used by fixups for bpf_kptr_xchg() sites without dtors. */ +u64 bpf_kptr_xchg_nodtor(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); + +#ifdef CONFIG_BPF_SYSCALL +int bpf_kptr_offload_slot_acquire(void); +void bpf_kptr_offload_slot_release(void); +#else +static inline int bpf_kptr_offload_slot_acquire(void) +{ + return 0; +} + +static inline void bpf_kptr_offload_slot_release(void) +{ +} +#endif bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool); diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 976e2b2f40e8..26bc8b5c9030 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -672,6 +672,8 @@ struct bpf_insn_aux_data { bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */ bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */ bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */ + u8 kptr_has_dtor:1; + u8 kptr_has_dtor_seen:1; u8 alu_state; /* used in combination with alu_limit */ /* true if STX or LDX instruction is a part of a spill/fill * pattern for a bpf_fastcall call. diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c index fba9e8c00878..459e855e86a5 100644 --- a/kernel/bpf/fixups.c +++ b/kernel/bpf/fixups.c @@ -2284,23 +2284,30 @@ int bpf_do_misc_fixups(struct bpf_verifier_env *env) goto next_insn; } - /* Implement bpf_kptr_xchg inline */ - if (prog->jit_requested && BITS_PER_LONG == 64 && - insn->imm == BPF_FUNC_kptr_xchg && - bpf_jit_supports_ptr_xchg()) { - insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2); - insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0); - cnt = 2; + /* Implement bpf_kptr_xchg inline. 
*/ + if (insn->imm == BPF_FUNC_kptr_xchg && + !env->insn_aux_data[i + delta].kptr_has_dtor) { + if (prog->jit_requested && BITS_PER_LONG == 64 && + bpf_jit_supports_ptr_xchg()) { + insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2); + insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, + BPF_REG_1, BPF_REG_0, 0); + cnt = 2; - new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); - if (!new_prog) - return -ENOMEM; + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); + if (!new_prog) + return -ENOMEM; - delta += cnt - 1; - env->prog = prog = new_prog; - insn = new_prog->insnsi + i + delta; + delta += cnt - 1; + env->prog = prog = new_prog; + insn = new_prog->insnsi + i + delta; + goto next_insn; + } + + insn->imm = bpf_kptr_xchg_nodtor - __bpf_call_base; goto next_insn; } + patch_call_imm: fn = env->ops->get_func_proto(insn->imm, env->prog); /* all functions that have prototype and verifier allowed diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index baa12b24bb64..51717a88f627 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1728,7 +1728,7 @@ void bpf_wq_cancel_and_free(void *val) bpf_async_cancel_and_free(val); } -BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr) +BPF_CALL_2(bpf_kptr_xchg_nodtor, void *, dst, void *, ptr) { unsigned long *kptr = dst; @@ -1736,12 +1736,32 @@ BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr) return xchg(kptr, (unsigned long)ptr); } +BPF_CALL_2(bpf_ref_kptr_xchg, void *, dst, void *, ptr) +{ + unsigned long *kptr = dst; + void *old; + + /* + * If the incoming pointer cannot be torn down safely from NMI later on, + * leave the destination untouched and return ptr so the caller keeps + * ownership. + */ + if (ptr && bpf_kptr_offload_slot_acquire()) + return (unsigned long)ptr; + + old = (void *)xchg(kptr, (unsigned long)ptr); + if (old) + bpf_kptr_offload_slot_release(); + return (unsigned long)old; +} + /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg() * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to * denote type that verifier will determine. + * No-dtor callsites are redirected to bpf_kptr_xchg_nodtor() from fixups. */ static const struct bpf_func_proto bpf_kptr_xchg_proto = { - .func = bpf_kptr_xchg, + .func = bpf_ref_kptr_xchg, .gpl_only = false, .ret_type = RET_PTR_TO_BTF_ID_OR_NULL, .ret_btf_id = BPF_PTR_POISON, diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 3b1f0ba02f61..d34fdb99eb8a 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -7,6 +7,7 @@ #include <linux/bpf_trace.h> #include <linux/bpf_lirc.h> #include <linux/bpf_verifier.h> +#include <linux/bpf_mem_alloc.h> #include <linux/bsearch.h> #include <linux/btf.h> #include <linux/hex.h> @@ -19,6 +20,7 @@ #include <linux/fdtable.h> #include <linux/file.h> #include <linux/fs.h> +#include <linux/irq_work.h> #include <linux/license.h> #include <linux/filter.h> #include <linux/kernel.h> @@ -42,6 +44,8 @@ #include <linux/cookie.h> #include <linux/verification.h> +#include "percpu_freelist.h" + #include <net/netfilter/nf_bpf_link.h> #include <net/netkit.h> #include <net/tcx.h> @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock); static DEFINE_IDR(link_idr); static DEFINE_SPINLOCK(link_idr_lock); +struct bpf_dtor_kptr_work { + struct pcpu_freelist_node fnode; + void *obj; + btf_dtor_kfunc_t dtor; +}; + +/* Queue pending dtors; the idle pool uses a global pcpu_freelist. 
*/ +static struct pcpu_freelist bpf_dtor_kptr_jobs; +static struct pcpu_freelist bpf_dtor_kptr_idle; +/* Keep surplus = total - needed = idle - refs >= 0 so NMI frees never need to allocate. */ +static atomic_long_t bpf_dtor_kptr_surplus = ATOMIC_LONG_INIT(0); + +static void bpf_dtor_kptr_worker(struct irq_work *work); +static DEFINE_PER_CPU(struct irq_work, bpf_dtor_kptr_irq_work) = + IRQ_WORK_INIT_HARD(bpf_dtor_kptr_worker); + +static struct bpf_dtor_kptr_work *bpf_dtor_kptr_pop_idle(void) +{ + struct pcpu_freelist_node *node; + + node = pcpu_freelist_pop(&bpf_dtor_kptr_idle); + if (!node) + return NULL; + + return container_of(node, struct bpf_dtor_kptr_work, fnode); +} + +static void bpf_dtor_kptr_release_one(void) +{ + struct bpf_dtor_kptr_work *job; + long surplus; + + for (;;) { + surplus = atomic_long_read(&bpf_dtor_kptr_surplus); + if (surplus <= 0) + return; + + job = bpf_dtor_kptr_pop_idle(); + if (!job) + return; + + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_surplus, &surplus, + surplus - 1)) { + pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode); + continue; + } + + bpf_mem_free(&bpf_global_ma, job); + return; + } +} + +void bpf_kptr_offload_slot_release(void) +{ + if (atomic_long_inc_return(&bpf_dtor_kptr_surplus) > 0) + bpf_dtor_kptr_release_one(); +} + +int bpf_kptr_offload_slot_acquire(void) +{ + struct bpf_dtor_kptr_work *job; + long surplus; + + if (unlikely(!bpf_global_ma_set || + !READ_ONCE(bpf_dtor_kptr_idle.freelist) || + !READ_ONCE(bpf_dtor_kptr_jobs.freelist))) + return -ENOMEM; + + /* + * Each successful install can decrease the surplus by at most one, so it only + * ever needs to provision one additional idle job. + */ + surplus = atomic_long_dec_return(&bpf_dtor_kptr_surplus); + if (surplus >= 0) + return 0; + + job = bpf_mem_alloc(&bpf_global_ma, sizeof(*job)); + if (!job) { + atomic_long_inc(&bpf_dtor_kptr_surplus); + return -ENOMEM; + } + + pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode); + /* A racing teardown may have already removed the demand that forced this. */ + bpf_kptr_offload_slot_release(); + + return 0; +} + +static int __init bpf_dtor_kptr_init(void) +{ + int err; + + err = pcpu_freelist_init(&bpf_dtor_kptr_idle); + if (err) + return err; + + err = pcpu_freelist_init(&bpf_dtor_kptr_jobs); + if (err) + return err; + + return 0; +} +late_initcall(bpf_dtor_kptr_init); + int sysctl_unprivileged_bpf_disabled __read_mostly = IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0; @@ -807,6 +916,43 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj) bpf_task_work_cancel_and_free(obj + rec->task_work_off); } +static void bpf_dtor_kptr_worker(struct irq_work *work) +{ + struct pcpu_freelist_node *fnode; + struct bpf_dtor_kptr_work *job; + + while ((fnode = pcpu_freelist_pop(&bpf_dtor_kptr_jobs))) { + job = container_of(fnode, struct bpf_dtor_kptr_work, fnode); + job->dtor(job->obj); + pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode); + bpf_kptr_offload_slot_release(); + } +} + +static void bpf_dtor_kptr_offload(void *obj, btf_dtor_kfunc_t dtor) +{ + struct bpf_dtor_kptr_work *job; + + /* Handing storage teardown off to irq_work consumes one idle slot. */ + atomic_long_dec(&bpf_dtor_kptr_surplus); + job = bpf_dtor_kptr_pop_idle(); + if (WARN_ON_ONCE(!job)) { + atomic_long_inc(&bpf_dtor_kptr_surplus); + /* + * This should stay unreachable if reserve accounting is correct. If it + * ever breaks, running the destructor unsafely is still better than + * leaking the object permanently. 
+ */ + dtor(obj); + return; + } + + job->obj = obj; + job->dtor = dtor; + pcpu_freelist_push(&bpf_dtor_kptr_jobs, &job->fnode); + irq_work_queue(this_cpu_ptr(&bpf_dtor_kptr_irq_work)); +} + void bpf_obj_free_fields(const struct btf_record *rec, void *obj) { const struct btf_field *fields; @@ -842,6 +988,19 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj) xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0); if (!xchgd_field) break; + if (in_nmi() && field->kptr.dtor) { + bpf_dtor_kptr_offload(xchgd_field, field->kptr.dtor); + bpf_kptr_offload_slot_release(); + break; + } + if (field->kptr.dtor) + /* + * Dtor kptrs reach storage through bpf_ref_kptr_xchg(), which + * pairs installation with bpf_kptr_offload_slot_acquire(). Return + * that slot on non-NMI teardown once no active transition is + * needed. + */ + bpf_kptr_offload_slot_release(); if (!btf_is_kernel(field->kptr.btf)) { pointee_struct_meta = btf_find_struct_meta(field->kptr.btf, diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 11054ad89c14..d042be8ed789 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -9891,6 +9891,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn int insn_idx = *insn_idx_p; bool changes_data; int i, err, func_id; + bool kptr_has_dtor; /* find function prototype */ func_id = insn->imm; @@ -9950,6 +9951,18 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn if (err) return err; } + if (func_id == BPF_FUNC_kptr_xchg) { + kptr_has_dtor = !!meta.kptr_field->kptr.dtor; + if (env->insn_aux_data[insn_idx].kptr_has_dtor_seen && + env->insn_aux_data[insn_idx].kptr_has_dtor != kptr_has_dtor) { + verbose(env, + "same insn cannot call bpf_kptr_xchg() on both dtor and non-dtor kptrs\n"); + return -EINVAL; + } + + env->insn_aux_data[insn_idx].kptr_has_dtor_seen = true; + env->insn_aux_data[insn_idx].kptr_has_dtor = kptr_has_dtor; + } err = record_func_map(env, &meta, func_id, insn_idx); if (err) -- 2.53.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess @ 2026-05-07 18:43 ` bot+bpf-ci 2026-05-07 18:52 ` Justin Suess 2026-05-07 23:45 ` sashiko-bot 1 sibling, 1 reply; 23+ messages in thread From: bot+bpf-ci @ 2026-05-07 18:43 UTC (permalink / raw) To: utilityemal77, ast, daniel, andrii, eddyz87, memxor Cc: martin.lau, song, yonghong.song, jolsa, bpf, utilityemal77, mykyta.yatsenko5, alexei.starovoitov, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 3972 bytes --] > diff --git a/include/linux/bpf.h b/include/linux/bpf.h [ ... ] > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h [ ... ] > diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c [ ... ] > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c [ ... ] > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 3b1f0ba02f61..d34fdb99eb8a 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -7,6 +7,7 @@ > #include <linux/bpf_trace.h> > #include <linux/bpf_lirc.h> > #include <linux/bpf_verifier.h> > +#include <linux/bpf_mem_alloc.h> > #include <linux/bsearch.h> > #include <linux/btf.h> > #include <linux/hex.h> > @@ -19,6 +20,7 @@ > #include <linux/fdtable.h> > #include <linux/file.h> > #include <linux/fs.h> > +#include <linux/irq_work.h> > #include <linux/license.h> > #include <linux/filter.h> > #include <linux/kernel.h> > @@ -42,6 +44,8 @@ > #include <linux/cookie.h> > #include <linux/verification.h> > > +#include "percpu_freelist.h" > + > #include <net/netfilter/nf_bpf_link.h> > #include <net/netkit.h> > #include <net/tcx.h> > @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock); > static DEFINE_IDR(link_idr); > static DEFINE_SPINLOCK(link_idr_lock); > > +struct bpf_dtor_kptr_work { > + struct pcpu_freelist_node fnode; > + void *obj; > + btf_dtor_kfunc_t dtor; > +}; > + > +/* Queue pending dtors; the idle pool uses a global pcpu_freelist. */ > +static struct pcpu_freelist bpf_dtor_kptr_jobs; > +static struct pcpu_freelist bpf_dtor_kptr_idle; > +/* Keep surplus = total - needed = idle - refs >= 0 so NMI frees never need to allocate. */ > +static atomic_long_t bpf_dtor_kptr_surplus = ATOMIC_LONG_INIT(0); An unaddressed review comment from the v2 discussion raises a design concern about the complexity of this implementation. Mykyta Yatsenko suggested an alternative design where bpf_dtor_kptr_work is stored in the kptr map slot itself (a bpf_kptr_box approach) instead of using two lists (idle and armed) with 3 atomics and headroom/trimming management. From the lore discussion at: https://lore.kernel.org/bpf/51a054a0-e57f-49dc-9527-36da0535087c@gmail.com/ Mykyta's suggestion was: "I think this still looks too complex: * 2 lists - idle list and armed list * 3 atomics, controlling demand/supply * headroom/trimming management The complexity introduced for performance reasons, but I'm not sure if the tradeoff is worth it. What about the next design: Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot itself. Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first xchg just once per map value, then reuse it across xchg in/out." 
While Justin responded with an explanation of ABBA ordering issues, and Mykyta provided a more detailed counter-proposal with RCU protection and atomic detach/attach, the v3 implementation does not adopt this suggestion and still uses the pcpu_freelist approach with a surplus counter. Was there a specific technical reason to continue with the two-list approach rather than exploring the bpf_kptr_box alternative, or would it be worth revisiting this design choice? [ ... ] > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c [ ... ] This commit fixes a deadlock bug but lacks a Fixes: tag. The bug was introduced in commit 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") which added bpf_map_free_kptrs() that calls kptr destructors inline without checking for NMI context. When BPF programs attached to tp_btf/nmi_handler delete map entries, destructors like bpf_cpumask_release() that take RCU locks are called from NMI, causing system deadlock. Suggested: Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25513277524 ^ permalink raw reply [flat|nested] 23+ messages in thread
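For orientation, a minimal sketch of the shape the boxed-kptr alternative discussed above might take; all names here are hypothetical and do not come from any posted patch. The idea is that the box, allocated once per map value with kmalloc_nolock() and reused across xchg in/out, would live in the map slot instead of the raw kptr:

/* Illustrative only: a possible "bpf_kptr_box" layout (hypothetical names). */
struct bpf_kptr_box {
	void *obj;                /* the referenced kptr itself */
	btf_dtor_kfunc_t dtor;    /* destructor to run when obj is dropped */
	struct llist_node node;   /* queued for deferred teardown when dropped in NMI */
};

The objection raised in the reply below is that the map slot would then hold the box pointer rather than the object itself, which breaks direct loads of the kptr from the map value.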
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-07 18:43 ` bot+bpf-ci @ 2026-05-07 18:52 ` Justin Suess 0 siblings, 0 replies; 23+ messages in thread From: Justin Suess @ 2026-05-07 18:52 UTC (permalink / raw) To: bot+bpf-ci Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, bpf, mykyta.yatsenko5, alexei.starovoitov, martin.lau, clm, ihor.solodrai On Thu, May 07, 2026 at 06:43:29PM +0000, bot+bpf-ci@kernel.org wrote: > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > [ ... ] > > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h > > [ ... ] > > > diff --git a/kernel/bpf/fixups.c b/kernel/bpf/fixups.c > > [ ... ] > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > [ ... ] > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > index 3b1f0ba02f61..d34fdb99eb8a 100644 > > --- a/kernel/bpf/syscall.c > > +++ b/kernel/bpf/syscall.c > > @@ -7,6 +7,7 @@ > > #include <linux/bpf_trace.h> > > #include <linux/bpf_lirc.h> > > #include <linux/bpf_verifier.h> > > +#include <linux/bpf_mem_alloc.h> > > #include <linux/bsearch.h> > > #include <linux/btf.h> > > #include <linux/hex.h> > > @@ -19,6 +20,7 @@ > > #include <linux/fdtable.h> > > #include <linux/file.h> > > #include <linux/fs.h> > > +#include <linux/irq_work.h> > > #include <linux/license.h> > > #include <linux/filter.h> > > #include <linux/kernel.h> > > @@ -42,6 +44,8 @@ > > #include <linux/cookie.h> > > #include <linux/verification.h> > > > > +#include "percpu_freelist.h" > > + > > #include <net/netfilter/nf_bpf_link.h> > > #include <net/netkit.h> > > #include <net/tcx.h> > > @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock); > > static DEFINE_IDR(link_idr); > > static DEFINE_SPINLOCK(link_idr_lock); > > > > +struct bpf_dtor_kptr_work { > > + struct pcpu_freelist_node fnode; > > + void *obj; > > + btf_dtor_kfunc_t dtor; > > +}; > > + > > +/* Queue pending dtors; the idle pool uses a global pcpu_freelist. */ > > +static struct pcpu_freelist bpf_dtor_kptr_jobs; > > +static struct pcpu_freelist bpf_dtor_kptr_idle; > > +/* Keep surplus = total - needed = idle - refs >= 0 so NMI frees never need to allocate. */ > > +static atomic_long_t bpf_dtor_kptr_surplus = ATOMIC_LONG_INIT(0); > > An unaddressed review comment from the v2 discussion raises a design > concern about the complexity of this implementation. Mykyta Yatsenko > suggested an alternative design where bpf_dtor_kptr_work is stored in the > kptr map slot itself (a bpf_kptr_box approach) instead of using two lists > (idle and armed) with 3 atomics and headroom/trimming management. > > From the lore discussion at: > https://lore.kernel.org/bpf/51a054a0-e57f-49dc-9527-36da0535087c@gmail.com/ > > Mykyta's suggestion was: > > "I think this still looks too complex: > * 2 lists - idle list and armed list > * 3 atomics, controlling demand/supply > * headroom/trimming management > > The complexity introduced for performance reasons, but I'm not sure if > the tradeoff is worth it. > > What about the next design: > > Instead of idle list, store bpf_dtor_kptr_work in the kptr map slot > itself. Use kmalloc_nolock() to allocate bpf_dtor_kptr_work on the first > xchg just once per map value, then reuse it across xchg in/out." 
> > While Justin responded with an explanation of ABBA ordering issues, and > Mykyta provided a more detailed counter-proposal with RCU protection and > atomic detach/attach, the v3 implementation does not adopt this > suggestion and still uses the pcpu_freelist approach with a surplus > counter. > > Was there a specific technical reason to continue with the two-list > approach rather than exploring the bpf_kptr_box alternative, or would it > be worth revisiting this design choice? > This was addressed. See https://lore.kernel.org/bpf/afy8yl1bleXmO78H@suesslenovo/. Storing a pointer to the bpf_kptr_box, or any other pointer that isn't the original kptr, in the map slot would break direct load operations on ref kptrs in maps, requiring the program to be patched and breaking BPF assumptions. > [ ... ] > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > > [ ... ] > > This commit fixes a deadlock bug but lacks a Fixes: tag. The bug was > introduced in commit 14a324f6a67e ("bpf: Wire up freeing of referenced > kptr") which added bpf_map_free_kptrs() that calls kptr destructors > inline without checking for NMI context. > > When BPF programs attached to tp_btf/nmi_handler delete map entries, > destructors like bpf_cpumask_release() that take RCU locks are called > from NMI, causing system deadlock. > > Suggested: Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") > > I'll add the Fixes: tag in the next reroll. > --- > AI reviewed your patch. Please fix the bug or email reply why it's not a bug. > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25513277524 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess 2026-05-07 18:43 ` bot+bpf-ci @ 2026-05-07 23:45 ` sashiko-bot 2026-05-10 15:13 ` Justin Suess 1 sibling, 1 reply; 23+ messages in thread From: sashiko-bot @ 2026-05-07 23:45 UTC (permalink / raw) To: Justin Suess; +Cc: bpf Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider: - [High] TOCTOU race in `bpf_dtor_kptr_release_one()` can lead to NMI deadlocks. - [Low] Memory leak on initialization failure in `bpf_dtor_kptr_init()`. -- commit 488cdc390b90427a591d47211600c09addc95fcd Author: Justin Suess <utilityemal77@gmail.com> Date: Thu May 7 17:54:52 2026 +0000 bpf: Offload kptr destructors that run from NMI This commit queues destructor-backed kptr teardowns to irq_work when running from NMI context. It maintains an idle-slot surplus counter to ensure offload slots are preallocated to prevent deadlocks. Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/ > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 3b1f0ba02f613..d34fdb99eb8a9 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c [ ... ] > @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock); > +static void bpf_dtor_kptr_release_one(void) > +{ > + struct bpf_dtor_kptr_work *job; > + long surplus; > + > + for (;;) { > + surplus = atomic_long_read(&bpf_dtor_kptr_surplus); > + if (surplus <= 0) > + return; > + > + job = bpf_dtor_kptr_pop_idle(); Could this sequence lead to a race condition that drains the idle pool too early? If multiple threads read a positive surplus value here, they might all call bpf_dtor_kptr_pop_idle() before the cmpxchg executes. This could temporarily drain the idle pool below the required number of reserved slots. If an NMI fires during this window and attempts to tear down a kptr, would bpf_dtor_kptr_offload() find the idle pool empty? If so, it looks like it would trigger the fallback warning and run the destructor inline from the NMI context, which might reintroduce the deadlocks this patch aims to prevent. > + if (!job) > + return; > + > + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_surplus, &surplus, > + surplus - 1)) { > + pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode); > + continue; > + } > + > + bpf_mem_free(&bpf_global_ma, job); > + return; > + } > +} [ ... ] > +static int __init bpf_dtor_kptr_init(void) > +{ > + int err; > + > + err = pcpu_freelist_init(&bpf_dtor_kptr_idle); > + if (err) > + return err; > + > + err = pcpu_freelist_init(&bpf_dtor_kptr_jobs); > + if (err) > + return err; Does this error path leak the per-cpu memory allocated for the idle list? If the second pcpu_freelist_init() fails, the function returns immediately. Should there be a cleanup path here to free the already initialized bpf_dtor_kptr_idle list? > + > + return 0; > +} > +late_initcall(bpf_dtor_kptr_init); -- Sashiko AI review · https://sashiko.dev/#/patchset/20260507175453.1140400-1-utilityemal77@gmail.com?part=1 ^ permalink raw reply [flat|nested] 23+ messages in thread
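To make the window described above concrete, one possible interleaving is written out below as a C comment; the starting state (exactly two idle jobs and a surplus of 1) is hypothetical and chosen only for illustration.

/*
 * Hypothetical interleaving of bpf_dtor_kptr_release_one() on two CPUs,
 * starting from surplus == 1 with exactly two idle jobs A and B:
 *
 *   CPU0                                  CPU1
 *   read surplus            -> 1
 *                                         read surplus            -> 1
 *   pop_idle()              -> A
 *                                         pop_idle()              -> B
 *                                         (idle pool is now empty)
 *   cmpxchg surplus 1 -> 0   succeeds
 *   bpf_mem_free(A)
 *                                         cmpxchg surplus 1 -> 0   fails
 *                                         push B back, re-read 0, return
 *
 * Any NMI teardown that lands between the two pops and CPU1's push-back
 * finds no idle job and takes the WARN + inline-dtor fallback.
 */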
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-07 23:45 ` sashiko-bot @ 2026-05-10 15:13 ` Justin Suess 2026-05-10 22:38 ` Alexei Starovoitov 0 siblings, 1 reply; 23+ messages in thread From: Justin Suess @ 2026-05-10 15:13 UTC (permalink / raw) To: sashiko; +Cc: bpf On Thu, May 07, 2026 at 11:45:19PM +0000, sashiko-bot@kernel.org wrote: > Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider: > - [High] TOCTOU race in `bpf_dtor_kptr_release_one()` can lead to NMI deadlocks. > - [Low] Memory leak on initialization failure in `bpf_dtor_kptr_init()`. > -- > > commit 488cdc390b90427a591d47211600c09addc95fcd > Author: Justin Suess <utilityemal77@gmail.com> > Date: Thu May 7 17:54:52 2026 +0000 > > bpf: Offload kptr destructors that run from NMI > > This commit queues destructor-backed kptr teardowns to irq_work when > running from NMI context. It maintains an idle-slot surplus counter to > ensure offload slots are preallocated to prevent deadlocks. > > Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/ > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > index 3b1f0ba02f613..d34fdb99eb8a9 100644 > > --- a/kernel/bpf/syscall.c > > +++ b/kernel/bpf/syscall.c > [ ... ] > > @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock); > > +static void bpf_dtor_kptr_release_one(void) > > +{ > > + struct bpf_dtor_kptr_work *job; > > + long surplus; > > + > > + for (;;) { > > + surplus = atomic_long_read(&bpf_dtor_kptr_surplus); > > + if (surplus <= 0) > > + return; > > + > > + job = bpf_dtor_kptr_pop_idle(); > > Could this sequence lead to a race condition that drains the idle pool > too early? > > If multiple threads read a positive surplus value here, they might all call > bpf_dtor_kptr_pop_idle() before the cmpxchg executes. This could temporarily > drain the idle pool below the required number of reserved slots. > > If an NMI fires during this window and attempts to tear down a kptr, would > bpf_dtor_kptr_offload() find the idle pool empty? If so, it looks like it > would trigger the fallback warning and run the destructor inline from the > NMI context, which might reintroduce the deadlocks this patch aims to > prevent. > Yes, this is real. But there's a bigger problem: the push side of pcpu_freelist is not interruptible. If we jump to NMI during this stage then the pcpu_freelist will be locked up because we were pushing to it, thus preventing us from popping. So we just deadlock again. I think we technically have exclusivity here on our percpu head (we can't be in NMI and process context simultaneously on the cpu). So maybe it is safe to forcibly acquire that specific lock? I'm not 100% sure. llist won't work since we can't have multiple llist_del_first consumers, and llist_del_all would render the list null and unusable if we jump to it from NMI while we're borrowing it. Making percpu llists won't work either, since the cpu we allocate the job on might not be the one the dtor runs on, so we need percpu stealing, which then runs into the same multiple consumers problem if threads steal from the same cpu head. So this code is not ready either way. I think 128 bit cmpxchg could technically solve this problem, but I don't think we can rely on it for all architectures BPF supports. If anyone has any guidance on this it would be much appreciated. I've been stuck on this for several weeks... The construct I need is a data structure with atomic push, pop, and move between sets.
No specific ordering needed, but updates from producers must be immediately visible to all consumers. I've looked into Treiber stacks. Any help or guidance on this would be appreciated! > > + if (!job) > > + return; > > + > > + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_surplus, &surplus, > > + surplus - 1)) { > > + pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode); > > + continue; > > + } > > + > > + bpf_mem_free(&bpf_global_ma, job); > > + return; > > + } > > +} > [ ... ] > > +static int __init bpf_dtor_kptr_init(void) > > +{ > > + int err; > > + > > + err = pcpu_freelist_init(&bpf_dtor_kptr_idle); > > + if (err) > > + return err; > > + > > + err = pcpu_freelist_init(&bpf_dtor_kptr_jobs); > > + if (err) > > + return err; > > Does this error path leak the per-cpu memory allocated for the idle list? > > If the second pcpu_freelist_init() fails, the function returns immediately. > Should there be a cleanup path here to free the already initialized > bpf_dtor_kptr_idle list? > > > + > > + return 0; > > +} > > +late_initcall(bpf_dtor_kptr_init); > > -- > Sashiko AI review · https://sashiko.dev/#/patchset/20260507175453.1140400-1-utilityemal77@gmail.com?part=1 ^ permalink raw reply [flat|nested] 23+ messages in thread
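For reference, a minimal sketch of the Treiber stack mentioned above, written against the kernel's cmpxchg() purely for illustration; none of these names exist in the series. Push and pop are each a single CAS loop, but with multiple concurrent poppers the pop below is exposed to the classic ABA problem (the head node can be popped, reused and pushed back between the read and the cmpxchg), which is why tagged pointers via a double-width cmpxchg come up and why that cannot be relied on across all architectures BPF supports.

struct tstack_node {
	struct tstack_node *next;
};

struct tstack {
	struct tstack_node *head;
};

static void tstack_push(struct tstack *s, struct tstack_node *n)
{
	struct tstack_node *old;

	do {
		old = READ_ONCE(s->head);
		n->next = old;
	} while (cmpxchg(&s->head, old, n) != old);
}

/* Not ABA-safe with concurrent poppers; shown only to illustrate the shape. */
static struct tstack_node *tstack_pop(struct tstack *s)
{
	struct tstack_node *old, *next;

	do {
		old = READ_ONCE(s->head);
		if (!old)
			return NULL;
		next = READ_ONCE(old->next);
	} while (cmpxchg(&s->head, old, next) != old);

	return old;
}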
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-10 15:13 ` Justin Suess @ 2026-05-10 22:38 ` Alexei Starovoitov 2026-05-11 1:49 ` Justin Suess 0 siblings, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2026-05-10 22:38 UTC (permalink / raw) To: Justin Suess; +Cc: sashiko, bpf On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote: > > > Any help or guidance on this would be appreciated! sorry for the delay. Everyone was at lsfmmbpf for a week+. All of the solutions so far are way too complicated. bpf_kptr_xchg() has to remain inlined as single atomic xchg without slowpath otherwise it ruins the concept and makes usage unpredictable. Let's step back. What is the issue you're trying to solve? the commit log say: > A BPF program attached to tp_btf/nmi_handler can delete map entries or > swap out referenced kptrs from NMI context. Today that runs the kptr > destructor inline. Destructors such as bpf_cpumask_release() can take > RCU-related locks, so running them from NMI can deadlock the system. and looking at selftest from patch 2 you do: old = bpf_kptr_xchg(&value->mask, old); if (old) bpf_cpumask_release(old); so? bpf_cpumask_release() is fine to call from any context, because bpf_mem_cache_free_rcu() is safe everywhere including NMI. hashtab introduced dtor in bpf_mem_alloc, so bpf_obj_free_fields() and corresponding dtor's of kptr-s are called from valid context. What is the problematic sequence? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-10 22:38 ` Alexei Starovoitov @ 2026-05-11 1:49 ` Justin Suess 2026-05-11 15:51 ` Alexei Starovoitov 0 siblings, 1 reply; 23+ messages in thread From: Justin Suess @ 2026-05-11 1:49 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: sashiko, bpf On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote: > On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote: > > > > > > Any help or guidance on this would be appreciated! > > sorry for the delay. Everyone was at lsfmmbpf for a week+. > No worries! I hope it was an enjoyable trip and I look forward to hearing about the conference. > All of the solutions so far are way too complicated. > bpf_kptr_xchg() has to remain inlined as single atomic xchg > without slowpath otherwise it ruins the concept > and makes usage unpredictable. > > Let's step back. > What is the issue you're trying to solve? > > the commit log say: > > > A BPF program attached to tp_btf/nmi_handler can delete map entries or > > swap out referenced kptrs from NMI context. Today that runs the kptr > > destructor inline. Destructors such as bpf_cpumask_release() can take > > RCU-related locks, so running them from NMI can deadlock the system. > > and looking at selftest from patch 2 you do: > > old = bpf_kptr_xchg(&value->mask, old); > if (old) > bpf_cpumask_release(old); > > so? > bpf_cpumask_release() is fine to call from any context, > because bpf_mem_cache_free_rcu() is safe everywhere including NMI. > My mistake on that. I picked a bad example for the test, but the test is just exercising the nmi dtor path, and doesn't rely on the particular type of kptr. I just picked one that was easy to acquire a reference to. This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are not. I've annotated why here: crypto_ctx: __bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx) { if (refcount_dec_and_test(&ctx->usage)) call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */ } __bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx) { bpf_crypto_ctx_release(ctx); } task_struct: __bpf_kfunc void bpf_task_release(struct task_struct *p) { put_task_struct_rcu_user(p); } __bpf_kfunc void bpf_task_release_dtor(void *p) { put_task_struct_rcu_user(p); } void put_task_struct_rcu_user(struct task_struct *task) { if (refcount_dec_and_test(&task->rcu_users)) call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu */ } cgroup_release_dtor is more complex, goes through ultimately through callbacks to: static void css_release(struct percpu_ref *ref) { struct cgroup_subsys_state *css = container_of(ref, struct cgroup_subsys_state, refcnt); INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE: workqueue */ } More generally, unless it's a BPF allocated object or doesn't rely on locking/call_rcu or workqueues, the dtor is unsafe. > hashtab introduced dtor in bpf_mem_alloc, > so bpf_obj_free_fields() and corresponding dtor's of kptr-s > are called from valid context. > > What is the problematic sequence? So from the beginning stepping back. The problematic sequence: 1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map. 2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map with that referenced kptr field. 3. dtor runs in the nmi context 4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock. 
You can see this demonstrated in my task_struct reproducer [1]. It causes a deadlock by deliberately releasing the last reference to a task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock. The typical solution to this is to run the nmi unsafe code in non-nmi context by offloading the work out of NMI, as you proposed. The problem is we need space for the jobs we enqueue. The required information to run the dtor is the dtor function and the original object pointer. The number of dtors that can run in a single tp_btf/nmi_handler prog is nearly unbounded. The other problem is that even though bpf_mem_alloc is generally safe in NMI, we cannot allocate in a path that destroys an object. If the allocation fails due to memory pressure, we leak the object. There are a few options, all with drawbacks. 1. Dynamically allocate the job. Non-starter: failing to allocate is unrecoverable, and memory pressure means we can't ever schedule the dtor to run. 2. Store the job intrusively in the object: Requires a safe place to place it within the object. Bad because not all objects have a space we can write to. Non-starter. 3. Within the map slot (after the actual kptr): Taken in my initial v1 approach. Significant complexity and requires per-map changes. Feasible but very complex and would need DCAS or locking to make updating the map slot and our job information atomic. 4. Wrapping the kptr in a box and storing it in place of the kptr [2]: Proposed by Mykyta. Would break direct load access to kptr objects. 5. Make every dtor nmi safe individually. This would require a lot of duplicated code and require updating every destructor individually. Feasible technically, but seems brittle. 6. The least complex option would be forbidding xchg operations that can run the dtor in NMI context. That would preserve the inlining fix, but limit our usage of referenced kptrs in BPF programs that run in NMI context. The approach here: 7. Allocate a new spot for a free job every time we xchg into the map and put it in a global list. When in NMI and we run a dtor, we pop a job from that list and use it to offload our work via irq_work. If we're not in NMI we run normally. Downside is this breaks inlining for ref kptrs. ... I may be missing something critical, but everything I've looked at points to this problem being much more complex than it initially seemed. I'm happy to discuss further and do a respin. Justin [1] https://lore.kernel.org/bpf/383afba6-f732-49bc-a197-147b9d8b1a29@gmail.com/ [2] https://lore.kernel.org/bpf/51a054a0-e57f-49dc-9527-36da0535087c@gmail.com/ ^ permalink raw reply [flat|nested] 23+ messages in thread
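A condensed, hypothetical sketch of the problematic sequence above from the BPF side; the map, value and program names are made up here and are not copied from the linked reproducer. A task_struct kptr is stashed into a hash map value from process context (via bpf_task_acquire() + bpf_kptr_xchg() in another program), and a program on tp_btf/nmi_handler later deletes that element, so the element free path runs the task dtor, and therefore call_rcu(), from NMI:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct map_val {
	/* stashed elsewhere via bpf_task_acquire() + bpf_kptr_xchg() */
	struct task_struct __kptr *task;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct map_val);
} stash SEC(".maps");

SEC("tp_btf/nmi_handler")
int BPF_PROG(clear_stash_from_nmi)
{
	__u32 key = 0;

	/* Deleting the element frees its fields inline, so the task dtor runs in NMI. */
	bpf_map_delete_elem(&stash, &key);
	return 0;
}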
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-11 1:49 ` Justin Suess @ 2026-05-11 15:51 ` Alexei Starovoitov 2026-05-11 16:38 ` Justin Suess 2026-05-11 19:22 ` Justin Suess 0 siblings, 2 replies; 23+ messages in thread From: Alexei Starovoitov @ 2026-05-11 15:51 UTC (permalink / raw) To: Justin Suess; +Cc: sashiko, bpf On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote: > On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote: >> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote: >> > >> > >> > Any help or guidance on this would be appreciated! >> >> sorry for the delay. Everyone was at lsfmmbpf for a week+. >> > > No worries! I hope it was an enjoyable trip and I look forward to > hearing about the conference. > >> All of the solutions so far are way too complicated. >> bpf_kptr_xchg() has to remain inlined as single atomic xchg >> without slowpath otherwise it ruins the concept >> and makes usage unpredictable. >> >> Let's step back. >> What is the issue you're trying to solve? >> >> the commit log say: >> >> > A BPF program attached to tp_btf/nmi_handler can delete map entries or >> > swap out referenced kptrs from NMI context. Today that runs the kptr >> > destructor inline. Destructors such as bpf_cpumask_release() can take >> > RCU-related locks, so running them from NMI can deadlock the system. >> >> and looking at selftest from patch 2 you do: >> >> old = bpf_kptr_xchg(&value->mask, old); >> if (old) >> bpf_cpumask_release(old); >> >> so? >> bpf_cpumask_release() is fine to call from any context, >> because bpf_mem_cache_free_rcu() is safe everywhere including NMI. >> > > My mistake on that. I picked a bad example for the test, but the test is > just exercising the nmi dtor path, and doesn't rely on the particular > type of kptr. I just picked one that was easy to acquire a reference to. > > This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are > not. I've annotated why here: > > crypto_ctx: > > __bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx) > { > if (refcount_dec_and_test(&ctx->usage)) > call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */ > } > > __bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx) > { > bpf_crypto_ctx_release(ctx); > } bpf_crypto_ctx_release() is only allowed in syscall prog types and dtor via hashmap free will execute in safe context as well. So not an issue. > task_struct: > > __bpf_kfunc void bpf_task_release(struct task_struct *p) > { > put_task_struct_rcu_user(p); > } > > __bpf_kfunc void bpf_task_release_dtor(void *p) > { > put_task_struct_rcu_user(p); > } > > void put_task_struct_rcu_user(struct task_struct *task) > { > if (refcount_dec_and_test(&task->rcu_users)) > call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu > */ > } In theory. I don't think there is a reproducer. > cgroup_release_dtor is more complex, goes through ultimately through > callbacks to: > > static void css_release(struct percpu_ref *ref) > { > struct cgroup_subsys_state *css = > container_of(ref, struct cgroup_subsys_state, refcnt); > > INIT_WORK(&css->destroy_work, css_release_work_fn); > queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE: > workqueue */ > } similar to task_struct. I don't think it's exploitable. > More generally, unless it's a BPF allocated object or doesn't rely on > locking/call_rcu or workqueues, the dtor is unsafe. 
> >> hashtab introduced dtor in bpf_mem_alloc, >> so bpf_obj_free_fields() and corresponding dtor's of kptr-s >> are called from valid context. >> >> What is the problematic sequence? > > So from the beginning stepping back. > > The problematic sequence: > > 1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map. > > 2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map > with that referenced kptr field. > > 3. dtor runs in the nmi context > > 4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock. > > You can see this demonstrated in my task_struct reproducer [1]. did you? That link points to your v2 with cpumask. I don't recall seeing task_struct repro. > It causes a deadlock by deliberately releasing the last reference to a > task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock. > > The typical solution to this is to run the nmi unsafe code in non-nmi > context by offloading to NMI work, as you proposed. > > The problem is we need space to for the jobs we enqueue. The required > information to run the dtor is the dtor function and the original object > pointer. The number of dtors that can run in a single tp_btf/nmi_handler > prog is nearly unbounded. > > The other problem is even though bpf_mem_alloc is safe in NMI generally, > we cannot allocate in path that destroys an object. If the allocation > fails due to memory pressure, we leak the object. > > There are a few options, all with drawbacks. > > 1. Dynamically allocate the job. Non-starter, failing to allocate is > unrecoverable, memory pressure means we can't ever schedule the dtor to > run. > > 2. Store job ntrusively in the object : Requires a safe place to place > it within the object. Bad because not all objects have a space we can write to. > Non-starter. > > 3. Within the map slot (after actual kptr): Taken with my initial approach in v1. > Significant complexity and requires per-map changes. Feasible but very > complex and would need DCAS or locking to make updating the map slot and > our job information atomic. > > 4. Wrapping the kptr in a box and storing it in place of the kptr [2] : > Proposed by Mykyta. Would break direct load access to kptr objects. > > 5. Make every dtor nmi safe individually. This would require a lot of > duplicated code and require updating every destructor invididually. > Feasible technically, but seems brittle. > > 6. One that would be the least complex, would be forbidding xchg operations > that can run the dtor in NMI context. That would preserve the inlining fix, > but limit our usage of referenced kptrs in BPF programs that run in NMI context. > > The approach here: > > 7. Allocate a new spot for a free job every time we xchg into the map > and put it in a global list. When in NMI and we run a dtor, we pop a > job from that slot and use it to offload our work via irq_work. If > we're not in NMI we run normally. Downside is this breaks inlining for > ref kptrs. > > ... > > I may be missing something critical, but everything I've looked at > points to this problem being much more complex that it initially seemed. yes. it is complex. all 7 options are not good. I recall the whole thing started with desire to add bpf_put_file_dtor(). This was discussed with VFS maintainers and they didn't like the idea, since it needs a ton of work to make it safe: . umount notifier to make sure stashed file doesn't hold umount . potential circular refcnt issue if file to bpf map stashed into the same map . 
scm_rights-like facility with garbage collection. So generic file stash is really a no-go. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-11 15:51 ` Alexei Starovoitov @ 2026-05-11 16:38 ` Justin Suess 2026-05-11 17:18 ` Alexei Starovoitov 2026-05-11 19:22 ` Justin Suess 1 sibling, 1 reply; 23+ messages in thread From: Justin Suess @ 2026-05-11 16:38 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: sashiko, bpf On Mon, May 11, 2026 at 08:51:53AM -0700, Alexei Starovoitov wrote: > On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote: > > On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote: > >> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote: > >> > > >> > > >> > Any help or guidance on this would be appreciated! > >> > >> sorry for the delay. Everyone was at lsfmmbpf for a week+. > >> > > > > No worries! I hope it was an enjoyable trip and I look forward to > > hearing about the conference. > > > >> All of the solutions so far are way too complicated. > >> bpf_kptr_xchg() has to remain inlined as single atomic xchg > >> without slowpath otherwise it ruins the concept > >> and makes usage unpredictable. > >> > >> Let's step back. > >> What is the issue you're trying to solve? > >> > >> the commit log say: > >> > >> > A BPF program attached to tp_btf/nmi_handler can delete map entries or > >> > swap out referenced kptrs from NMI context. Today that runs the kptr > >> > destructor inline. Destructors such as bpf_cpumask_release() can take > >> > RCU-related locks, so running them from NMI can deadlock the system. > >> > >> and looking at selftest from patch 2 you do: > >> > >> old = bpf_kptr_xchg(&value->mask, old); > >> if (old) > >> bpf_cpumask_release(old); > >> > >> so? > >> bpf_cpumask_release() is fine to call from any context, > >> because bpf_mem_cache_free_rcu() is safe everywhere including NMI. > >> > > > > My mistake on that. I picked a bad example for the test, but the test is > > just exercising the nmi dtor path, and doesn't rely on the particular > > type of kptr. I just picked one that was easy to acquire a reference to. > > > > This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are > > not. I've annotated why here: > > > > crypto_ctx: > > > > __bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx) > > { > > if (refcount_dec_and_test(&ctx->usage)) > > call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */ > > } > > > > __bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx) > > { > > bpf_crypto_ctx_release(ctx); > > } > > bpf_crypto_ctx_release() is only allowed in syscall prog types > and dtor via hashmap free will execute in safe context as well. > So not an issue. > It doesn't matter if bpf_crypto_ctx_release is only allowed in syscall program types; we don't need that to invoke it from a tracing program. All we have to do is delete a map element with that type of kptr already in it. The release function is not the issue, it's bpf_map_delete_elem, which ultimately invokes bpf_obj_free_fields and then the dtor. Hashmap free doesn't execute in safe context. See the kernel splat below. > > task_struct: > > > > __bpf_kfunc void bpf_task_release(struct task_struct *p) > > { > > put_task_struct_rcu_user(p); > > } > > > > __bpf_kfunc void bpf_task_release_dtor(void *p) > > { > > put_task_struct_rcu_user(p); > > } > > > > void put_task_struct_rcu_user(struct task_struct *task) > > { > > if (refcount_dec_and_test(&task->rcu_users)) > > call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu > > */ > > } > > In theory. 
I don't think there is a reproducer. > I sent the wrong link in my reply. apologies. This does indeed reproduce, even on bpf-next/master. Using this reproducer: https://gist.githubusercontent.com/RazeLighter777/5539336d79ab1854f9e9550c6dcab118/raw/082f1eeb2dd445936e64dd3a33861764690bde82/task_struct_dtor_deadlock.patch On the latest bpf-next/master : "7e033543a2ab: Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.1-rc3": With these kconfig options set (not necessary, but increase reproducibility): [justin@zenbox linux-next (repro $%)]$ rg 'RCU_NOCB|RCU_EXPERT' .config 171:CONFIG_RCU_EXPERT=y 188:CONFIG_RCU_NOCB_CPU=y Run ./test_progs -t task_kptr_nmi_deadlock_repro [ 21.604612] ================================ [ 21.604612] WARNING: inconsistent lock state [ 21.604613] 7.1.0-rc2-g7e033543a2ab #126 Tainted: G OE [ 21.604614] -------------------------------- [ 21.604615] inconsistent {INITIAL USE} -> {IN-NMI} usage. [ 21.604616] test_progs/494 [HC1[1]:SC0[0]:HE0:SE1] takes: [ 21.604617] ffffa2077d66f0e8 (&rdp->nocb_lock){....}-{2:2}, at: __call_rcu_common.constprop.0+0x309/0x730 [ 21.604625] {INITIAL USE} state was registered at: [ 21.604625] lock_acquire+0xbf/0x2e0 [ 21.604628] _raw_spin_lock+0x30/0x40 [ 21.604632] rcu_nocb_gp_kthread+0x13f/0xb90 [ 21.604633] kthread+0xf4/0x130 [ 21.604636] ret_from_fork+0x264/0x330 [ 21.604639] ret_from_fork_asm+0x1a/0x30 [ 21.604642] irq event stamp: 194 [ 21.604642] hardirqs last enabled at (193): [<ffffffff9700012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 21.604644] hardirqs last disabled at (194): [<ffffffff980e7f4f>] exc_nmi+0x7f/0x110 [ 21.604646] softirqs last enabled at (0): [<ffffffff972d9838>] copy_process+0xfd8/0x2b80 [ 21.604648] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 21.604650] other info that might help us debug this: [ 21.604651] Possible unsafe locking scenario: [ 21.604651] CPU0 [ 21.604651] ---- [ 21.604652] lock(&rdp->nocb_lock); [ 21.604653] <Interrupt> [ 21.604653] lock(&rdp->nocb_lock); [ 21.604654] *** DEADLOCK *** [ 21.604654] no locks held by test_progs/494. [ 21.604655] stack backtrace: [ 21.604657] CPU: 1 UID: 0 PID: 494 Comm: test_progs Tainted: G OE 7.1.0-rc2-g7e033543a2ab #126 PREEMPT(full) [ 21.604659] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 21.604660] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 21.604660] Call Trace: [ 21.604662] <TASK> [ 21.604663] dump_stack_lvl+0x5d/0x80 [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 [ 21.604669] lock_acquire+0x295/0x2e0 [ 21.604671] ? terminate_walk+0x33/0x160 [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 [ 21.604679] _raw_spin_lock+0x30/0x40 [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 [ 21.604686] bpf_obj_free_fields+0x118/0x250 [ 21.604691] free_htab_elem+0x85/0xd0 [ 21.604694] htab_map_delete_elem+0x168/0x230 [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 [ 21.604700] bpf_trace_run3+0x126/0x430 [ 21.604703] ? __pfx_perf_event_nmi_handler+0x10/0x10 [ 21.604707] nmi_handle.part.0+0x15b/0x250 [ 21.604710] ? 
__pfx_perf_event_nmi_handler+0x10/0x10 [ 21.604712] default_do_nmi+0x120/0x180 [ 21.604715] exc_nmi+0xe3/0x110 [ 21.604717] asm_exc_nmi+0xb7/0x100 [ 21.604722] RIP: 0033:0x7fd27529a7ee [ 21.604724] Code: ff 0f 1f 00 49 89 ca 48 8b 44 24 20 0f 05 48 83 c4 18 c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 10 ff 74 24 18 e8 63 ff ff ff 5a <59> 48 3d 00 f0 ff ff 77 09 48 83 c4 08 c3 0f 1f 40 00 48 8b 15 e9 [ 21.604725] RSP: 002b:00007ffe5e3eca08 EFLAGS: 00000202 [ 21.604726] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fd2756cef48 [ 21.604727] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 000000000000000c [ 21.604728] RBP: 00007ffe5e3ecae0 R08: 0000000000000000 R09: 0000000000000000 [ 21.604728] R10: 0000000000000000 R11: 00007fd2756cec40 R12: 0000000000000003 [ 21.604729] R13: 00007fd275887000 R14: 00007ffe5e3eceb8 R15: 000055cd1271d8d0 [ 21.604734] </TASK> [ 21.666237] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 [ 21.691290] perf: interrupt took too long (3144 > 3126), lowering kernel.perf_event_max_sample_rate to 63000 [ 21.730227] perf: interrupt took too long (3932 > 3930), lowering kernel.perf_event_max_sample_rate to 50000 [ 21.805475] perf: interrupt took too long (4916 > 4915), lowering kernel.perf_event_max_sample_rate to 40000 > > cgroup_release_dtor is more complex, goes through ultimately through > > callbacks to: > > > > static void css_release(struct percpu_ref *ref) > > { > > struct cgroup_subsys_state *css = > > container_of(ref, struct cgroup_subsys_state, refcnt); > > > > INIT_WORK(&css->destroy_work, css_release_work_fn); > > queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE: > > workqueue */ > > } > > similar to task_struct. I don't think it's exploitable. > > > More generally, unless it's a BPF allocated object or doesn't rely on > > locking/call_rcu or workqueues, the dtor is unsafe. > > > >> hashtab introduced dtor in bpf_mem_alloc, > >> so bpf_obj_free_fields() and corresponding dtor's of kptr-s > >> are called from valid context. > >> > >> What is the problematic sequence? > > > > So from the beginning stepping back. > > > > The problematic sequence: > > > > 1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map. > > > > 2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map > > with that referenced kptr field. > > > > 3. dtor runs in the nmi context > > > > 4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock. > > > > You can see this demonstrated in my task_struct reproducer [1]. > > did you? > That link points to your v2 with cpumask. > I don't recall seeing task_struct repro. > I misplaced the link. See above > > It causes a deadlock by deliberately releasing the last reference to a > > task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock. > > > > The typical solution to this is to run the nmi unsafe code in non-nmi > > context by offloading to NMI work, as you proposed. > > > > The problem is we need space to for the jobs we enqueue. The required > > information to run the dtor is the dtor function and the original object > > pointer. The number of dtors that can run in a single tp_btf/nmi_handler > > prog is nearly unbounded. > > > > The other problem is even though bpf_mem_alloc is safe in NMI generally, > > we cannot allocate in path that destroys an object. If the allocation > > fails due to memory pressure, we leak the object. > > > > There are a few options, all with drawbacks. > > > > 1. Dynamically allocate the job. 
Non-starter, failing to allocate is > > unrecoverable, memory pressure means we can't ever schedule the dtor to > > run. > > > > 2. Store job ntrusively in the object : Requires a safe place to place > > it within the object. Bad because not all objects have a space we can write to. > > Non-starter. > > > > 3. Within the map slot (after actual kptr): Taken with my initial approach in v1. > > Significant complexity and requires per-map changes. Feasible but very > > complex and would need DCAS or locking to make updating the map slot and > > our job information atomic. > > > > 4. Wrapping the kptr in a box and storing it in place of the kptr [2] : > > Proposed by Mykyta. Would break direct load access to kptr objects. > > > > 5. Make every dtor nmi safe individually. This would require a lot of > > duplicated code and require updating every destructor invididually. > > Feasible technically, but seems brittle. > > > > 6. One that would be the least complex, would be forbidding xchg operations > > that can run the dtor in NMI context. That would preserve the inlining fix, > > but limit our usage of referenced kptrs in BPF programs that run in NMI context. > > > > The approach here: > > > > 7. Allocate a new spot for a free job every time we xchg into the map > > and put it in a global list. When in NMI and we run a dtor, we pop a > > job from that slot and use it to offload our work via irq_work. If > > we're not in NMI we run normally. Downside is this breaks inlining for > > ref kptrs. > > > > ... > > > > I may be missing something critical, but everything I've looked at > > points to this problem being much more complex that it initially seemed. > > yes. it is complex. all 7 options are not good. > Agreed. None of them are good but I just don't see any good alternatives. > I recall the whole thing started with desire to add bpf_put_file_dtor(). > This was discussed with VFS maintainers and they didn't like the idea, > since it needs a ton of work to make it safe: > . umount notifier to make sure stashed file doesn't hold umount > . potential circular refcnt issue if file to bpf map stashed into the same map > . scm_rights-like facility with garbage collection > > So generic file stash is really no go. That may be the case, but this is a separate issue. This is a bug that could apply to new refcounted kptrs introduced in the future as well. Thanks, Justin ^ permalink raw reply [flat|nested] 23+ messages in thread
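To make the problematic sequence above concrete, here is a minimal sketch of the two-program pattern it describes. The map layout, program names, and the choice of task_struct are illustrative only and are not taken from the linked reproducer; it also assumes the task kfuncs are callable from a syscall program, in the same way the selftest in patch 2 populates its maps from a syscall program.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* kfuncs assumed to be available to these program types */
struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
void bpf_task_release(struct task_struct *p) __ksym;

struct val {
	struct task_struct __kptr *task;	/* referenced kptr whose dtor ends in call_rcu() */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct val);
} tasks SEC(".maps");

/* Step 1: acquire a task reference and move it into the map. */
SEC("syscall")
int populate(void *ctx)
{
	struct val zero = {}, *v;
	struct task_struct *p;
	u32 key = 0;

	p = bpf_task_from_pid(1);
	if (!p)
		return 1;
	bpf_map_update_elem(&tasks, &key, &zero, BPF_ANY);
	v = bpf_map_lookup_elem(&tasks, &key);
	if (!v) {
		bpf_task_release(p);
		return 1;
	}
	p = bpf_kptr_xchg(&v->task, p);	/* map now owns the reference */
	if (p)
		bpf_task_release(p);
	return 0;
}

/* Steps 2-4: delete the element from NMI context. free_htab_elem() ->
 * bpf_obj_free_fields() then runs the task dtor, which may reach
 * call_rcu() and deadlock, as in the splat above.
 */
SEC("tp_btf/nmi_handler")
int BPF_PROG(drop_from_nmi)
{
	u32 key = 0;

	bpf_map_delete_elem(&tasks, &key);
	return 0;
}

char _license[] SEC("license") = "GPL";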
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-11 16:38 ` Justin Suess @ 2026-05-11 17:18 ` Alexei Starovoitov 2026-05-11 20:10 ` Kumar Kartikeya Dwivedi 0 siblings, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2026-05-11 17:18 UTC (permalink / raw) To: Justin Suess; +Cc: sashiko, bpf On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote: > [ 21.604660] Call Trace: > [ 21.604662] <TASK> > [ 21.604663] dump_stack_lvl+0x5d/0x80 > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 > [ 21.604669] lock_acquire+0x295/0x2e0 > [ 21.604671] ? terminate_walk+0x33/0x160 > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 > [ 21.604679] _raw_spin_lock+0x30/0x40 > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 > [ 21.604686] bpf_obj_free_fields+0x118/0x250 > [ 21.604691] free_htab_elem+0x85/0xd0 > [ 21.604694] htab_map_delete_elem+0x168/0x230 > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 > [ 21.604700] bpf_trace_run3+0x126/0x430 that's better. Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(), but left check_and_free_fields() in free_htab_elem(). I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem() and fallback to bpf_mem_alloc at map create time when map has kptrs with dtors. Even when BPF_F_NO_PREALLOC is not specified. Kumar, thoughts? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-11 17:18 ` Alexei Starovoitov @ 2026-05-11 20:10 ` Kumar Kartikeya Dwivedi 2026-05-12 1:43 ` Justin Suess 0 siblings, 1 reply; 23+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-05-11 20:10 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf On Mon, 11 May 2026 at 19:29, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote: > > [ 21.604660] Call Trace: > > [ 21.604662] <TASK> > > [ 21.604663] dump_stack_lvl+0x5d/0x80 > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 > > [ 21.604669] lock_acquire+0x295/0x2e0 > > [ 21.604671] ? terminate_walk+0x33/0x160 > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 > > [ 21.604679] _raw_spin_lock+0x30/0x40 > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 > > [ 21.604686] bpf_obj_free_fields+0x118/0x250 > > [ 21.604691] free_htab_elem+0x85/0xd0 > > [ 21.604694] htab_map_delete_elem+0x168/0x230 > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 > > [ 21.604700] bpf_trace_run3+0x126/0x430 > > that's better. > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(), > but left check_and_free_fields() in free_htab_elem(). > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem() > and fallback to bpf_mem_alloc at map create time when map has kptrs > with dtors. Even when BPF_F_NO_PREALLOC is not specified. > > Kumar, > > thoughts? > > Yeah, removing it from the path that helpers can invoke seems simpler. Remember though, this splat is just for hashtab, we have similar bpf_obj_free_fields() in array map on update. I think fundamentally the main issue here is that we logically free special fields when a map value is freed or deleted. When updating array maps we logically 'free' and then 'update' the same map value together. For hashtab, it happens on update/delete. We could relax this behavior to avoid eagerly freeing these special fields on update or deletion. The only worry is how this would impact programs that have come to rely on the existing behavior. There are patterns where people expect kptr to be NULL on some new map value, which causes programs to return errors when that expectation is not met. Just doing the skip when irqs_disabled() doesn't save us from the surprise side-effect. We need to decide upon this first before discussing the shape of the solution. This is the theoretical concern; In practice, I think most people who depend on such behavior use kptr in local storage maps (in schedulers). So it probably won't be a problem in practice, even though we can't judge this ahead of time. Also, we eagerly reuse map values when using memalloc, so the guarantees are already pretty weak I guess. So, if we are not going to go through a grace period (like local storage) and free back to kernel allocator before reuse, we should relax field freeing behavior. At best, we should cancel work for timer, wq, task_work, and task_work, leaving other items as-is. E.g. BPF_UPTR is used in task storage which I think is accessible to tracing programs, I am not sure how safe unpin_user_page() is when called from random reentrant contexts. We might have more cases in the future, we cannot guarantee we can handle everything in NMIs universally. 
So the best course of action seems to be relaxing bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel on async work (timer, wq, task_work) for delete / update and let other fields be as-is. We likely need to do bpf_obj_free_fields() additionally before prealloc_destroy() now, but that should be simple. Whether or not to use bpf_ma when kptrs are used in prealloc map is a separate change. This should hopefully resolve the issue, unless I missed other cases. ^ permalink raw reply [flat|nested] 23+ messages in thread
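For illustration, a rough sketch of what such a bpf_obj_cancel_fields() could look like. It assumes the btf_record/btf_field layout and the existing timer/wq cancel helpers keep their current names, and it assumes a BPF_TASK_WORK field type with a matching cancel helper; the real patch may look different.

/* Cancel pending async work on delete/update, but leave kptrs, list heads,
 * rbtree roots, uptrs, etc. in place for the final map/allocator teardown.
 */
void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj)
{
	const struct btf_field *field;
	void *field_ptr;
	int i;

	if (IS_ERR_OR_NULL(rec))
		return;

	for (i = 0; i < rec->cnt; i++) {
		field = &rec->fields[i];
		field_ptr = obj + field->offset;

		switch (field->type) {
		case BPF_TIMER:
			bpf_timer_cancel_and_free(field_ptr);
			break;
		case BPF_WORKQUEUE:
			bpf_wq_cancel_and_free(field_ptr);
			break;
		case BPF_TASK_WORK:
			/* assumed helper name, mirroring the timer/wq ones */
			bpf_task_work_cancel_and_free(field_ptr);
			break;
		default:
			/* kptrs, graph roots/nodes, uptrs: left as-is */
			break;
		}
	}
}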
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-11 20:10 ` Kumar Kartikeya Dwivedi @ 2026-05-12 1:43 ` Justin Suess 2026-05-12 1:46 ` Kumar Kartikeya Dwivedi 0 siblings, 1 reply; 23+ messages in thread From: Justin Suess @ 2026-05-12 1:43 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi; +Cc: Alexei Starovoitov, sashiko, bpf On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote: > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote: > > > [ 21.604660] Call Trace: > > > [ 21.604662] <TASK> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80 > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 > > > [ 21.604669] lock_acquire+0x295/0x2e0 > > > [ 21.604671] ? terminate_walk+0x33/0x160 > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 > > > [ 21.604679] _raw_spin_lock+0x30/0x40 > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250 > > > [ 21.604691] free_htab_elem+0x85/0xd0 > > > [ 21.604694] htab_map_delete_elem+0x168/0x230 > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 > > > [ 21.604700] bpf_trace_run3+0x126/0x430 > > > > that's better. > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(), > > but left check_and_free_fields() in free_htab_elem(). > > > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem() > > and fallback to bpf_mem_alloc at map create time when map has kptrs > > with dtors. Even when BPF_F_NO_PREALLOC is not specified. > > > > Kumar, > > > > thoughts? > > > > > > Yeah, removing it from the path that helpers can invoke seems simpler. > Remember though, this splat is just for hashtab, we have similar > bpf_obj_free_fields() in array map on update. I think fundamentally > the main issue here is that we logically free special fields when a > map value is freed or deleted. When updating array maps we logically > 'free' and then 'update' the same map value together. For hashtab, it > happens on update/delete. > > We could relax this behavior to avoid eagerly freeing these special > fields on update or deletion. The only worry is how this would impact > programs that have come to rely on the existing behavior. There are > patterns where people expect kptr to be NULL on some new map value, > which causes programs to return errors when that expectation is not > met. Just doing the skip when irqs_disabled() doesn't save us from the > surprise side-effect. We need to decide upon this first before > discussing the shape of the solution. > > This is the theoretical concern; In practice, I think most people who > depend on such behavior use kptr in local storage maps (in > schedulers). So it probably won't be a problem in practice, even > though we can't judge this ahead of time. Also, we eagerly reuse map > values when using memalloc, so the guarantees are already pretty weak > I guess. > > So, if we are not going to go through a grace period (like local > storage) and free back to kernel allocator before reuse, we should > relax field freeing behavior. At best, we should cancel work for > timer, wq, task_work, and task_work, leaving other items as-is. E.g. > BPF_UPTR is used in task storage which I think is accessible to > tracing programs, I am not sure how safe unpin_user_page() is when > called from random reentrant contexts. 
We might have more cases in the > > future, we cannot guarantee we can handle everything in NMIs > > universally. > > > > So the best course of action seems to be relaxing > > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel > > on async work (timer, wq, task_work) for delete / update and let other > > fields be as-is. We likely need to do bpf_obj_free_fields() > > additionally before prealloc_destroy() now, but that should be simple. > > Whether or not to use bpf_ma when kptrs are used in prealloc map is a > > separate change. > > > > This should hopefully resolve the issue, unless I missed other cases. This does sound good, so you'd set the bpf_obj_free_fields up in the htab allocator dtor for the final free and rely on the allocators existing nmi deferral? The missing piece is whether to handle this differently in NMI or just always do it with the deferral. Also the prealloc question needs answering. This fix would preserve the xchg inlining and allow for failure with ENOMEM on exhaustion and not need weird locking or atomic counting. Appreciate the time from both of you; I learned a lot more about NMI than I bargained for writing this. There are probably more weird NMI cases. FWIW there are zero tp_btf/nmi_handler progs anywhere I can find on the internet, so this is a weird edge case. Since I brought this issue up, I'll finish it out and send patches next week. Even if a generic file kptr isn't possible, hopefully this can stop sashiko from complaining about every new kptr dtor getting introduced being unsafe in NMI :) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-12 1:43 ` Justin Suess @ 2026-05-12 1:46 ` Kumar Kartikeya Dwivedi 2026-05-12 1:55 ` Alexei Starovoitov 0 siblings, 1 reply; 23+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-05-12 1:46 UTC (permalink / raw) To: Justin Suess; +Cc: Alexei Starovoitov, sashiko, bpf On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote: > > On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote: > > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote: > > > > [ 21.604660] Call Trace: > > > > [ 21.604662] <TASK> > > > > [ 21.604663] dump_stack_lvl+0x5d/0x80 > > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 > > > > [ 21.604669] lock_acquire+0x295/0x2e0 > > > > [ 21.604671] ? terminate_walk+0x33/0x160 > > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 > > > > [ 21.604679] _raw_spin_lock+0x30/0x40 > > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 > > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 > > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250 > > > > [ 21.604691] free_htab_elem+0x85/0xd0 > > > > [ 21.604694] htab_map_delete_elem+0x168/0x230 > > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 > > > > [ 21.604700] bpf_trace_run3+0x126/0x430 > > > > > > that's better. > > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(), > > > but left check_and_free_fields() in free_htab_elem(). > > > > > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem() > > > and fallback to bpf_mem_alloc at map create time when map has kptrs > > > with dtors. Even when BPF_F_NO_PREALLOC is not specified. > > > > > > Kumar, > > > > > > thoughts? > > > > > > > > > > Yeah, removing it from the path that helpers can invoke seems simpler. > > Remember though, this splat is just for hashtab, we have similar > > bpf_obj_free_fields() in array map on update. I think fundamentally > > the main issue here is that we logically free special fields when a > > map value is freed or deleted. When updating array maps we logically > > 'free' and then 'update' the same map value together. For hashtab, it > > happens on update/delete. > > > > We could relax this behavior to avoid eagerly freeing these special > > fields on update or deletion. The only worry is how this would impact > > programs that have come to rely on the existing behavior. There are > > patterns where people expect kptr to be NULL on some new map value, > > which causes programs to return errors when that expectation is not > > met. Just doing the skip when irqs_disabled() doesn't save us from the > > surprise side-effect. We need to decide upon this first before > > discussing the shape of the solution. > > > > This is the theoretical concern; In practice, I think most people who > > depend on such behavior use kptr in local storage maps (in > > schedulers). So it probably won't be a problem in practice, even > > though we can't judge this ahead of time. Also, we eagerly reuse map > > values when using memalloc, so the guarantees are already pretty weak > > I guess. > > > > So, if we are not going to go through a grace period (like local > > storage) and free back to kernel allocator before reuse, we should > > relax field freeing behavior. 
At best, we should cancel work for > > timer, wq, task_work, and task_work, leaving other items as-is. E.g. > > BPF_UPTR is used in task storage which I think is accessible to > > tracing programs, I am not sure how safe unpin_user_page() is when > > called from random reentrant contexts. We might have more cases in the > > future, we cannot guarantee we can handle everything in NMIs > > universally. > > > > So the best course of action seems to be relaxing > > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel > > on async work (timer, wq, task_work) for delete / update and let other > > fields be as-is. We likely need to do bpf_obj_free_fields() > > additionally before prealloc_destroy() now, but that should be simple. > > Whether or not to use bpf_ma when kptrs are used in prealloc map is a > > separate change. > > > > This should hopefully resolve the issue, unless I missed other cases. > This does sound good, so you'd set the bpf_obj_free_fields up in the > htab allocator dtor for the final free and rely on the allocators > existing nmi deferral? It is already set, except for prealloc maps. But we can call it before destroying the pcpu freelist etc. > > The missing piece is whether to handle this differently in NMI or just > always do it with the deferral. Also the prealloc question needs > answering. There is no deferral here. I'm saying that we just cancel for timer, wq, task work, and leave other fields as is. So we don't have active work pending for async items. So as long as the item keeps getting recycled in the allocator, we don't free these fields. Once the memalloc is destroyed, the dtor runs in a known safe context where we can assume bpf_obj_free_fields won't deadlock or run into any problems. > > [...] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-12 1:46 ` Kumar Kartikeya Dwivedi @ 2026-05-12 1:55 ` Alexei Starovoitov 2026-05-12 2:03 ` Kumar Kartikeya Dwivedi 2026-05-12 2:07 ` Justin Suess 0 siblings, 2 replies; 23+ messages in thread From: Alexei Starovoitov @ 2026-05-12 1:55 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi, Justin Suess; +Cc: sashiko, bpf On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote: > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote: >> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote: >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov >> > <alexei.starovoitov@gmail.com> wrote: >> > > >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote: >> > > > [ 21.604660] Call Trace: >> > > > [ 21.604662] <TASK> >> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80 >> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 >> > > > [ 21.604669] lock_acquire+0x295/0x2e0 >> > > > [ 21.604671] ? terminate_walk+0x33/0x160 >> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 >> > > > [ 21.604679] _raw_spin_lock+0x30/0x40 >> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 >> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 >> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250 >> > > > [ 21.604691] free_htab_elem+0x85/0xd0 >> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230 >> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 >> > > > [ 21.604700] bpf_trace_run3+0x126/0x430 >> > > >> > > that's better. >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(), >> > > but left check_and_free_fields() in free_htab_elem(). >> > > >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem() >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified. >> > > >> > > Kumar, >> > > >> > > thoughts? >> > > >> > > >> > >> > Yeah, removing it from the path that helpers can invoke seems simpler. >> > Remember though, this splat is just for hashtab, we have similar >> > bpf_obj_free_fields() in array map on update. I think fundamentally >> > the main issue here is that we logically free special fields when a >> > map value is freed or deleted. When updating array maps we logically >> > 'free' and then 'update' the same map value together. For hashtab, it >> > happens on update/delete. >> > >> > We could relax this behavior to avoid eagerly freeing these special >> > fields on update or deletion. The only worry is how this would impact >> > programs that have come to rely on the existing behavior. There are >> > patterns where people expect kptr to be NULL on some new map value, >> > which causes programs to return errors when that expectation is not >> > met. Just doing the skip when irqs_disabled() doesn't save us from the >> > surprise side-effect. We need to decide upon this first before >> > discussing the shape of the solution. >> > >> > This is the theoretical concern; In practice, I think most people who >> > depend on such behavior use kptr in local storage maps (in >> > schedulers). So it probably won't be a problem in practice, even >> > though we can't judge this ahead of time. Also, we eagerly reuse map >> > values when using memalloc, so the guarantees are already pretty weak >> > I guess. 
>> > >> > So, if we are not going to go through a grace period (like local >> > storage) and free back to kernel allocator before reuse, we should >> > relax field freeing behavior. At best, we should cancel work for >> > timer, wq, task_work, and task_work, leaving other items as-is. E.g. >> > BPF_UPTR is used in task storage which I think is accessible to >> > tracing programs, I am not sure how safe unpin_user_page() is when >> > called from random reentrant contexts. We might have more cases in the >> > future, we cannot guarantee we can handle everything in NMIs >> > universally. >> > >> > So the best course of action seems to be relaxing >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel >> > on async work (timer, wq, task_work) for delete / update and let other >> > fields be as-is. We likely need to do bpf_obj_free_fields() >> > additionally before prealloc_destroy() now, but that should be simple. >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a >> > separate change. >> > >> > This should hopefully resolve the issue, unless I missed other cases. >> This does sound good, so you'd set the bpf_obj_free_fields up in the >> htab allocator dtor for the final free and rely on the allocators >> existing nmi deferral? > > It is already set, except for prealloc maps. But we can call it before > destroying the pcpu freelist etc. htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already. So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps. >> >> The missing piece is whether to handle this differently in NMI or just >> always do it with the deferral. Also the prealloc question needs >> answering. > > There is no deferral here. I'm saying that we just cancel for timer, > wq, task work, and leave other fields as is. So we don't have active > work pending for async items. > > So as long as the item keeps getting recycled in the allocator, we > don't free these fields. Once the memalloc is destroyed, the dtor runs > in a known safe context where we can assume bpf_obj_free_fields won't > deadlock or run into any problems. So the plan is to do if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT just ignore it? And no other changes anywhere at all? That would be too good to be true :) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-12 1:55 ` Alexei Starovoitov @ 2026-05-12 2:03 ` Kumar Kartikeya Dwivedi 2026-05-12 2:10 ` Alexei Starovoitov 2026-05-12 2:07 ` Justin Suess 1 sibling, 1 reply; 23+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:03 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf On Tue, 12 May 2026 at 03:55, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote: > > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote: > >> > >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote: > >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov > >> > <alexei.starovoitov@gmail.com> wrote: > >> > > > >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote: > >> > > > [ 21.604660] Call Trace: > >> > > > [ 21.604662] <TASK> > >> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80 > >> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 > >> > > > [ 21.604669] lock_acquire+0x295/0x2e0 > >> > > > [ 21.604671] ? terminate_walk+0x33/0x160 > >> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 > >> > > > [ 21.604679] _raw_spin_lock+0x30/0x40 > >> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 > >> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 > >> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250 > >> > > > [ 21.604691] free_htab_elem+0x85/0xd0 > >> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230 > >> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 > >> > > > [ 21.604700] bpf_trace_run3+0x126/0x430 > >> > > > >> > > that's better. > >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(), > >> > > but left check_and_free_fields() in free_htab_elem(). > >> > > > >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem() > >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs > >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified. > >> > > > >> > > Kumar, > >> > > > >> > > thoughts? > >> > > > >> > > > >> > > >> > Yeah, removing it from the path that helpers can invoke seems simpler. > >> > Remember though, this splat is just for hashtab, we have similar > >> > bpf_obj_free_fields() in array map on update. I think fundamentally > >> > the main issue here is that we logically free special fields when a > >> > map value is freed or deleted. When updating array maps we logically > >> > 'free' and then 'update' the same map value together. For hashtab, it > >> > happens on update/delete. > >> > > >> > We could relax this behavior to avoid eagerly freeing these special > >> > fields on update or deletion. The only worry is how this would impact > >> > programs that have come to rely on the existing behavior. There are > >> > patterns where people expect kptr to be NULL on some new map value, > >> > which causes programs to return errors when that expectation is not > >> > met. Just doing the skip when irqs_disabled() doesn't save us from the > >> > surprise side-effect. We need to decide upon this first before > >> > discussing the shape of the solution. > >> > > >> > This is the theoretical concern; In practice, I think most people who > >> > depend on such behavior use kptr in local storage maps (in > >> > schedulers). So it probably won't be a problem in practice, even > >> > though we can't judge this ahead of time. 
Also, we eagerly reuse map > >> > values when using memalloc, so the guarantees are already pretty weak > >> > I guess. > >> > > >> > So, if we are not going to go through a grace period (like local > >> > storage) and free back to kernel allocator before reuse, we should > >> > relax field freeing behavior. At best, we should cancel work for > >> > timer, wq, task_work, and task_work, leaving other items as-is. E.g. > >> > BPF_UPTR is used in task storage which I think is accessible to > >> > tracing programs, I am not sure how safe unpin_user_page() is when > >> > called from random reentrant contexts. We might have more cases in the > >> > future, we cannot guarantee we can handle everything in NMIs > >> > universally. > >> > > >> > So the best course of action seems to be relaxing > >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel > >> > on async work (timer, wq, task_work) for delete / update and let other > >> > fields be as-is. We likely need to do bpf_obj_free_fields() > >> > additionally before prealloc_destroy() now, but that should be simple. > >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a > >> > separate change. > >> > > >> > This should hopefully resolve the issue, unless I missed other cases. > >> This does sound good, so you'd set the bpf_obj_free_fields up in the > >> htab allocator dtor for the final free and rely on the allocators > >> existing nmi deferral? > > > > It is already set, except for prealloc maps. But we can call it before > > destroying the pcpu freelist etc. > > htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already. > So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps. > > >> > >> The missing piece is whether to handle this differently in NMI or just > >> always do it with the deferral. Also the prealloc question needs > >> answering. > > > > There is no deferral here. I'm saying that we just cancel for timer, > > wq, task work, and leave other fields as is. So we don't have active > > work pending for async items. > > > > So as long as the item keeps getting recycled in the allocator, we > > don't free these fields. Once the memalloc is destroyed, the dtor runs > > in a known safe context where we can assume bpf_obj_free_fields won't > > deadlock or run into any problems. > > So the plan is to do > if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT > just ignore it? > And no other changes anywhere at all? > > That would be too good to be true :) I don't know whether in_nmi() would be sufficient, we likely need irqs_disabled()? At that point, why not always ignore it, since freeing the fields is dependent on where you're running. I would still cancel async fields, since they're already any-context safe. ^ permalink raw reply [flat|nested] 23+ messages in thread
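As a concrete form of the conditional skip being debated, here is a small sketch. The helper name is invented for illustration, and whether the predicate should be in_nmi() or irqs_disabled() is exactly the open question above.

/* Decide whether eager freeing of a field must be skipped in the current
 * context and left to the final map/allocator cleanup instead.
 */
static bool bpf_obj_skip_eager_free(const struct btf_field *field)
{
	switch (field->type) {
	case BPF_KPTR_REF:
	case BPF_KPTR_PERCPU:
	case BPF_LIST_HEAD:
	case BPF_RB_ROOT:
		/* dtors and graph teardown may take spinlocks, call_rcu(),
		 * or queue work; skip them while IRQs are off (which covers
		 * NMI) and leave the fields for final cleanup
		 */
		return irqs_disabled();
	default:
		/* timer/wq/task_work cancel paths are any-context safe */
		return false;
	}
}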
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-12 2:03 ` Kumar Kartikeya Dwivedi @ 2026-05-12 2:10 ` Alexei Starovoitov 2026-05-12 2:13 ` Kumar Kartikeya Dwivedi 0 siblings, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2026-05-12 2:10 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi; +Cc: Justin Suess, sashiko, bpf On Mon May 11, 2026 at 7:03 PM PDT, Kumar Kartikeya Dwivedi wrote: > On Tue, 12 May 2026 at 03:55, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> >> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote: >> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote: >> >> >> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote: >> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov >> >> > <alexei.starovoitov@gmail.com> wrote: >> >> > > >> >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote: >> >> > > > [ 21.604660] Call Trace: >> >> > > > [ 21.604662] <TASK> >> >> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80 >> >> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 >> >> > > > [ 21.604669] lock_acquire+0x295/0x2e0 >> >> > > > [ 21.604671] ? terminate_walk+0x33/0x160 >> >> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 >> >> > > > [ 21.604679] _raw_spin_lock+0x30/0x40 >> >> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 >> >> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 >> >> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250 >> >> > > > [ 21.604691] free_htab_elem+0x85/0xd0 >> >> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230 >> >> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 >> >> > > > [ 21.604700] bpf_trace_run3+0x126/0x430 >> >> > > >> >> > > that's better. >> >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(), >> >> > > but left check_and_free_fields() in free_htab_elem(). >> >> > > >> >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem() >> >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs >> >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified. >> >> > > >> >> > > Kumar, >> >> > > >> >> > > thoughts? >> >> > > >> >> > > >> >> > >> >> > Yeah, removing it from the path that helpers can invoke seems simpler. >> >> > Remember though, this splat is just for hashtab, we have similar >> >> > bpf_obj_free_fields() in array map on update. I think fundamentally >> >> > the main issue here is that we logically free special fields when a >> >> > map value is freed or deleted. When updating array maps we logically >> >> > 'free' and then 'update' the same map value together. For hashtab, it >> >> > happens on update/delete. >> >> > >> >> > We could relax this behavior to avoid eagerly freeing these special >> >> > fields on update or deletion. The only worry is how this would impact >> >> > programs that have come to rely on the existing behavior. There are >> >> > patterns where people expect kptr to be NULL on some new map value, >> >> > which causes programs to return errors when that expectation is not >> >> > met. Just doing the skip when irqs_disabled() doesn't save us from the >> >> > surprise side-effect. We need to decide upon this first before >> >> > discussing the shape of the solution. >> >> > >> >> > This is the theoretical concern; In practice, I think most people who >> >> > depend on such behavior use kptr in local storage maps (in >> >> > schedulers). 
So it probably won't be a problem in practice, even >> >> > though we can't judge this ahead of time. Also, we eagerly reuse map >> >> > values when using memalloc, so the guarantees are already pretty weak >> >> > I guess. >> >> > >> >> > So, if we are not going to go through a grace period (like local >> >> > storage) and free back to kernel allocator before reuse, we should >> >> > relax field freeing behavior. At best, we should cancel work for >> >> > timer, wq, task_work, and task_work, leaving other items as-is. E.g. >> >> > BPF_UPTR is used in task storage which I think is accessible to >> >> > tracing programs, I am not sure how safe unpin_user_page() is when >> >> > called from random reentrant contexts. We might have more cases in the >> >> > future, we cannot guarantee we can handle everything in NMIs >> >> > universally. >> >> > >> >> > So the best course of action seems to be relaxing >> >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel >> >> > on async work (timer, wq, task_work) for delete / update and let other >> >> > fields be as-is. We likely need to do bpf_obj_free_fields() >> >> > additionally before prealloc_destroy() now, but that should be simple. >> >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a >> >> > separate change. >> >> > >> >> > This should hopefully resolve the issue, unless I missed other cases. >> >> This does sound good, so you'd set the bpf_obj_free_fields up in the >> >> htab allocator dtor for the final free and rely on the allocators >> >> existing nmi deferral? >> > >> > It is already set, except for prealloc maps. But we can call it before >> > destroying the pcpu freelist etc. >> >> htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already. >> So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps. >> >> >> >> >> The missing piece is whether to handle this differently in NMI or just >> >> always do it with the deferral. Also the prealloc question needs >> >> answering. >> > >> > There is no deferral here. I'm saying that we just cancel for timer, >> > wq, task work, and leave other fields as is. So we don't have active >> > work pending for async items. >> > >> > So as long as the item keeps getting recycled in the allocator, we >> > don't free these fields. Once the memalloc is destroyed, the dtor runs >> > in a known safe context where we can assume bpf_obj_free_fields won't >> > deadlock or run into any problems. >> >> So the plan is to do >> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT >> just ignore it? >> And no other changes anywhere at all? >> >> That would be too good to be true :) > > I don't know whether in_nmi() would be sufficient, we likely need > irqs_disabled()? fair irqs_disabled() is safer. > At that point, why not always ignore it, since > freeing the fields is dependent on where you're running. I would still > cancel async fields, since they're already any-context safe. you mean never touch case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT regardless of running context and rely on final cleanup? That's an idea to consider, but I suspect some rbtree, link list tests will fail. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-12 2:10 ` Alexei Starovoitov @ 2026-05-12 2:13 ` Kumar Kartikeya Dwivedi 0 siblings, 0 replies; 23+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:13 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf On Tue, 12 May 2026 at 04:10, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon May 11, 2026 at 7:03 PM PDT, Kumar Kartikeya Dwivedi wrote: > > On Tue, 12 May 2026 at 03:55, Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > >> > >> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote: > >> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote: > >> >> > >> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote: > >> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov > >> >> > <alexei.starovoitov@gmail.com> wrote: > >> >> > > > >> >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote: > >> >> > > > [ 21.604660] Call Trace: > >> >> > > > [ 21.604662] <TASK> > >> >> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80 > >> >> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0 > >> >> > > > [ 21.604669] lock_acquire+0x295/0x2e0 > >> >> > > > [ 21.604671] ? terminate_walk+0x33/0x160 > >> >> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730 > >> >> > > > [ 21.604679] _raw_spin_lock+0x30/0x40 > >> >> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730 > >> >> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730 > >> >> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250 > >> >> > > > [ 21.604691] free_htab_elem+0x85/0xd0 > >> >> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230 > >> >> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144 > >> >> > > > [ 21.604700] bpf_trace_run3+0x126/0x430 > >> >> > > > >> >> > > that's better. > >> >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(), > >> >> > > but left check_and_free_fields() in free_htab_elem(). > >> >> > > > >> >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem() > >> >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs > >> >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified. > >> >> > > > >> >> > > Kumar, > >> >> > > > >> >> > > thoughts? > >> >> > > > >> >> > > > >> >> > > >> >> > Yeah, removing it from the path that helpers can invoke seems simpler. > >> >> > Remember though, this splat is just for hashtab, we have similar > >> >> > bpf_obj_free_fields() in array map on update. I think fundamentally > >> >> > the main issue here is that we logically free special fields when a > >> >> > map value is freed or deleted. When updating array maps we logically > >> >> > 'free' and then 'update' the same map value together. For hashtab, it > >> >> > happens on update/delete. > >> >> > > >> >> > We could relax this behavior to avoid eagerly freeing these special > >> >> > fields on update or deletion. The only worry is how this would impact > >> >> > programs that have come to rely on the existing behavior. There are > >> >> > patterns where people expect kptr to be NULL on some new map value, > >> >> > which causes programs to return errors when that expectation is not > >> >> > met. Just doing the skip when irqs_disabled() doesn't save us from the > >> >> > surprise side-effect. We need to decide upon this first before > >> >> > discussing the shape of the solution. 
> >> >> > > >> >> > This is the theoretical concern; In practice, I think most people who > >> >> > depend on such behavior use kptr in local storage maps (in > >> >> > schedulers). So it probably won't be a problem in practice, even > >> >> > though we can't judge this ahead of time. Also, we eagerly reuse map > >> >> > values when using memalloc, so the guarantees are already pretty weak > >> >> > I guess. > >> >> > > >> >> > So, if we are not going to go through a grace period (like local > >> >> > storage) and free back to kernel allocator before reuse, we should > >> >> > relax field freeing behavior. At best, we should cancel work for > >> >> > timer, wq, task_work, and task_work, leaving other items as-is. E.g. > >> >> > BPF_UPTR is used in task storage which I think is accessible to > >> >> > tracing programs, I am not sure how safe unpin_user_page() is when > >> >> > called from random reentrant contexts. We might have more cases in the > >> >> > future, we cannot guarantee we can handle everything in NMIs > >> >> > universally. > >> >> > > >> >> > So the best course of action seems to be relaxing > >> >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel > >> >> > on async work (timer, wq, task_work) for delete / update and let other > >> >> > fields be as-is. We likely need to do bpf_obj_free_fields() > >> >> > additionally before prealloc_destroy() now, but that should be simple. > >> >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a > >> >> > separate change. > >> >> > > >> >> > This should hopefully resolve the issue, unless I missed other cases. > >> >> This does sound good, so you'd set the bpf_obj_free_fields up in the > >> >> htab allocator dtor for the final free and rely on the allocators > >> >> existing nmi deferral? > >> > > >> > It is already set, except for prealloc maps. But we can call it before > >> > destroying the pcpu freelist etc. > >> > >> htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already. > >> So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps. > >> > >> >> > >> >> The missing piece is whether to handle this differently in NMI or just > >> >> always do it with the deferral. Also the prealloc question needs > >> >> answering. > >> > > >> > There is no deferral here. I'm saying that we just cancel for timer, > >> > wq, task work, and leave other fields as is. So we don't have active > >> > work pending for async items. > >> > > >> > So as long as the item keeps getting recycled in the allocator, we > >> > don't free these fields. Once the memalloc is destroyed, the dtor runs > >> > in a known safe context where we can assume bpf_obj_free_fields won't > >> > deadlock or run into any problems. > >> > >> So the plan is to do > >> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT > >> just ignore it? > >> And no other changes anywhere at all? > >> > >> That would be too good to be true :) > > > > I don't know whether in_nmi() would be sufficient, we likely need > > irqs_disabled()? > > fair irqs_disabled() is safer. > > > At that point, why not always ignore it, since > > freeing the fields is dependent on where you're running. I would still > > cancel async fields, since they're already any-context safe. > > you mean never touch case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT > regardless of running context and rely on final cleanup? > That's an idea to consider, but I suspect some rbtree, link list > tests will fail. 
Yeah, but we should see which ones do; some of them were written with the expectation of testing kernel cleanup on map value delete / update etc., so those don't count as much. Unless this change triggers a real example in the tests, it will likely go unnoticed (famous last words). ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-12 1:55 ` Alexei Starovoitov 2026-05-12 2:03 ` Kumar Kartikeya Dwivedi @ 2026-05-12 2:07 ` Justin Suess 2026-05-12 2:08 ` Kumar Kartikeya Dwivedi 1 sibling, 1 reply; 23+ messages in thread From: Justin Suess @ 2026-05-12 2:07 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Kumar Kartikeya Dwivedi, sashiko, bpf On Mon, May 11, 2026 at 06:55:50PM -0700, Alexei Starovoitov wrote: > On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote: > > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote: > >> > >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote: > >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov > >> > <alexei.starovoitov@gmail.com> wrote: > > There is no deferral here. I'm saying that we just cancel for timer, > > wq, task work, and leave other fields as is. So we don't have active > > work pending for async items. > > > > So as long as the item keeps getting recycled in the allocator, we > > don't free these fields. Once the memalloc is destroyed, the dtor runs > > in a known safe context where we can assume bpf_obj_free_fields won't > > deadlock or run into any problems. > > So the plan is to do > if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT > just ignore it? > And no other changes anywhere at all? > > That would be too good to be true :) I guess the last question I have is what would prevent an xchg() out of a map on a recycled kptr from causing a UAF? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-12 2:07 ` Justin Suess @ 2026-05-12 2:08 ` Kumar Kartikeya Dwivedi 0 siblings, 0 replies; 23+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:08 UTC (permalink / raw) To: Justin Suess; +Cc: Alexei Starovoitov, sashiko, bpf On Tue, 12 May 2026 at 04:07, Justin Suess <utilityemal77@gmail.com> wrote: > > On Mon, May 11, 2026 at 06:55:50PM -0700, Alexei Starovoitov wrote: > > On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote: > > > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote: > > >> > > >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote: > > >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov > > >> > <alexei.starovoitov@gmail.com> wrote: > > > There is no deferral here. I'm saying that we just cancel for timer, > > > wq, task work, and leave other fields as is. So we don't have active > > > work pending for async items. > > > > > > So as long as the item keeps getting recycled in the allocator, we > > > don't free these fields. Once the memalloc is destroyed, the dtor runs > > > in a known safe context where we can assume bpf_obj_free_fields won't > > > deadlock or run into any problems. > > > > So the plan is to do > > if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT > > just ignore it? > > And no other changes anywhere at all? > > > > That would be too good to be true :) > I guess the last question I have is what would prevent an xchg() out of > a map on a recycled kptr from causing a UAF? xchg() is atomic, so either a program now has the value, or the map has it. Whoever has it will end up freeing it. If the program has it then the program frees it. I don't see why there would be a UAF. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI 2026-05-11 15:51 ` Alexei Starovoitov 2026-05-11 16:38 ` Justin Suess @ 2026-05-11 19:22 ` Justin Suess 1 sibling, 0 replies; 23+ messages in thread From: Justin Suess @ 2026-05-11 19:22 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: sashiko, bpf On Mon, May 11, 2026 at 08:51:53AM -0700, Alexei Starovoitov wrote: > On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote: > > On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote: > >> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote: Here's a reproducer for the cgroup case: https://gist.githubusercontent.com/RazeLighter777/5f77cdfe035a4e22ee2642ae7db6387d/raw/10898d27040a07098cccc5d0785d9ad6620344e7/cgroup_kptr_nmi_deadlock_repro Hacked together with an AI prompt but functional. Exercises a different path, but more consistently splats even without CONFIG_RCU_NOCB_CPU / CONFIG_RCU_EXPERT since this dtor uses workqueue. Had to use an fexit hook to get the timing condition right to release the last cgroup reference. But this lets you see the deadlock is indeed in the dtor in NMI. This is on the same bpf-next/master 7e033543a2ab4c72319201298ed458e3bbddd82f: [ 15.160694] ================================ [ 15.160695] WARNING: inconsistent lock state [ 15.160695] 7.1.0-rc2-g7e033543a2ab-dirty #130 Not tainted [ 15.160697] -------------------------------- [ 15.160697] inconsistent {INITIAL USE} -> {IN-NMI} usage. [ 15.160698] test_progs/434 [HC1[1]:SC0[0]:HE0:SE1] takes: [ 15.160700] ffff9096fd66ced8 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xde/0x720 [ 15.160707] {INITIAL USE} state was registered at: [ 15.160708] lock_acquire+0xbf/0x2e0 [ 15.160711] _raw_spin_lock+0x30/0x40 [ 15.160715] __queue_work+0xde/0x720 [ 15.160716] queue_work_on+0x54/0xa0 [ 15.160716] start_poll_synchronize_rcu_expedited+0xaf/0x110 [ 15.160719] rcu_init+0x958/0x990 [ 15.160722] start_kernel+0x746/0x980 [ 15.160725] x86_64_start_reservations+0x24/0x30 [ 15.160727] __pfx_reserve_bios_regions+0x0/0x10 [ 15.160729] common_startup_64+0x12c/0x138 [ 15.160731] irq event stamp: 18704 [ 15.160732] hardirqs last enabled at (18703): [<ffffffffa200148a>] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 15.160734] hardirqs last disabled at (18704): [<ffffffffa30e3f8f>] exc_nmi+0x7f/0x110 [ 15.160737] softirqs last enabled at (18698): [<ffffffffa22e5800>] __irq_exit_rcu+0xc0/0x100 [ 15.160739] softirqs last disabled at (18687): [<ffffffffa22e5800>] __irq_exit_rcu+0xc0/0x100 [ 15.160741] [ 15.160741] other info that might help us debug this: [ 15.160741] Possible unsafe locking scenario: [ 15.160741] [ 15.160742] CPU0 [ 15.160742] ---- [ 15.160742] lock(&pool->lock); [ 15.160743] <Interrupt> [ 15.160743] lock(&pool->lock); [ 15.160744] [ 15.160744] *** DEADLOCK *** [ 15.160744] [ 15.160744] no locks held by test_progs/434. [ 15.160745] [ 15.160745] stack backtrace: [ 15.160747] CPU: 1 UID: 0 PID: 434 Comm: test_progs Not tainted 7.1.0-rc2-g7e033543a2ab-dirty #130 PREEMPT(full) [ 15.160749] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 15.160750] Call Trace: [ 15.160751] <TASK> [ 15.160753] dump_stack_lvl+0x5d/0x80 [ 15.160757] print_usage_bug.part.0+0x22b/0x2c0 [ 15.160760] lock_acquire+0x295/0x2e0 [ 15.160762] ? srso_alias_return_thunk+0x5/0xfbef5 [ 15.160763] ? __queue_work+0xde/0x720 [ 15.160767] _raw_spin_lock+0x30/0x40 [ 15.160768] ? 
__queue_work+0xde/0x720 [ 15.160769] __queue_work+0xde/0x720 [ 15.160772] queue_work_on+0x54/0xa0 [ 15.160774] bpf_cgroup_release_dtor+0x12e/0x140 [ 15.160778] bpf_obj_free_fields+0x118/0x250 [ 15.160782] free_htab_elem+0x85/0xd0 [ 15.160785] htab_map_delete_elem+0x168/0x230 [ 15.160790] bpf_prog_23fcbbeb395ac6b4_clear_cgroup_kptrs_from_nmi+0x54/0x74 [ 15.160792] bpf_trace_run3+0x126/0x430 [ 15.160795] ? __pfx_perf_event_nmi_handler+0x10/0x10 [ 15.160799] nmi_handle.part.0+0x15b/0x250 [ 15.160802] ? __pfx_perf_event_nmi_handler+0x10/0x10 [ 15.160804] default_do_nmi+0x120/0x180 [ 15.160807] exc_nmi+0xe3/0x110 [ 15.160809] asm_exc_nmi+0xb7/0x100 [ 15.160810] RIP: 0033:0x5607a669541b [ 15.160813] Code: c7 45 f0 00 00 00 00 eb 1a 8b 55 f0 8b 45 f4 01 d0 48 63 d0 48 8b 45 a8 48 01 d0 48 89 45 a8 83 45 f0 01 81 7d f0 3f 42 0f 00 <7e> dd e8 7e f5 ff ff 48 89 45 f8 48 8b 45 f8 48 3b 45 e8 73 16 83 [ 15.160814] RSP: 002b:00007ffdb09c1dc0 EFLAGS: 00000293 [ 15.160816] RAX: 0000003aced4e2f4 RBX: 00007f1d8d574000 RCX: 000000000000000f [ 15.160816] RDX: 00000000000ad857 RSI: 00007f1d8d577000 RDI: 0000000000000001 [ 15.160817] RBP: 00007ffdb09c1e30 R08: 00007ffdb09c1da0 R09: 00007f1d8d577010 [ 15.160818] R10: 0000000000001614 R11: 0009718b9187183f R12: 0000000000000003 [ 15.160818] R13: 00007f1d8d5b6000 R14: 00007ffdb09c3358 R15: 00005607a9daf890 [ 15.160824] </TASK> [ 15.214040] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 [ 15.246002] perf: interrupt took too long (3135 > 3126), lowering kernel.perf_event_max_sample_rate to 63000 [ 15.308032] perf: interrupt took too long (3928 > 3918), lowering kernel.perf_event_max_sample_rate to 50000 [ 15.500072] perf: interrupt took too long (4912 > 4910), lowering kernel.perf_event_max_sample_rate to 40000 Justin ^ permalink raw reply [flat|nested] 23+ messages in thread
* [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser 2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess 2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess @ 2026-05-07 17:54 ` Justin Suess 2026-05-08 0:03 ` sashiko-bot 1 sibling, 1 reply; 23+ messages in thread From: Justin Suess @ 2026-05-07 17:54 UTC (permalink / raw) To: ast, daniel, andrii, eddyz87, memxor Cc: martin.lau, song, yonghong.song, jolsa, bpf, Justin Suess, Alexei Starovoitov Programs attached to tp_btf/nmi_handler can drop refcounted kptrs from NMI context by deleting map entries or clearing map values. Add a dedicated BPF-side selftest program that populates hash and array maps with cpumask kptrs and clears them again from the NMI handler. This test fails on the upstream and results in a lockdep warning, but passes when NMI dtors are properly offloaded by the previous commit. The test asserts that every object queued for destruction in hardirq from NMI had the dtor called on it. The irq_work which has the IRQ_WORK_HARD_IRQ flag is drained with kern_sync_rcu to ensure consistency. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Justin Suess <utilityemal77@gmail.com> --- .../selftests/bpf/prog_tests/kptr_dtor_nmi.c | 258 +++++++++++ .../selftests/bpf/progs/kptr_dtor_nmi.c | 412 ++++++++++++++++++ 2 files changed, 670 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c create mode 100644 tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c diff --git a/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c new file mode 100644 index 000000000000..21452b3cf9eb --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c @@ -0,0 +1,258 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define _GNU_SOURCE + +#include <errno.h> +#include <linux/perf_event.h> +#include <sched.h> +#include <sys/syscall.h> +#include <unistd.h> + +#include <test_progs.h> + +#include "kptr_dtor_nmi.skel.h" + +#define KPTR_DTOR_NMI_MAX_SLOTS 8 +#define KPTR_DTOR_NMI_ROUNDS 256 +#define DELETE_TIMEOUT_NS (5ULL * 1000 * 1000 * 1000) + +enum kptr_dtor_nmi_map_type { + KPTR_DTOR_NMI_MAP_HASH = 1, + KPTR_DTOR_NMI_MAP_ARRAY, +}; + +struct kptr_dtor_nmi_case { + const char *name; + __u32 map_type; +}; + +__maybe_unused +static int find_test_cpu(void) +{ + cpu_set_t cpuset; + int cpu, err; + + err = sched_getaffinity(0, sizeof(cpuset), &cpuset); + if (!ASSERT_OK(err, "sched_getaffinity")) + return -1; + + for (cpu = 0; cpu < CPU_SETSIZE; cpu++) { + if (CPU_ISSET(cpu, &cpuset)) + return cpu; + } + + ASSERT_TRUE(false, "cpu_available"); + return -1; +} + +__maybe_unused +static int open_nmi_pmu_event_on_cpu(int cpu) +{ + struct perf_event_attr attr = { + .size = sizeof(attr), + .type = PERF_TYPE_HARDWARE, + .config = PERF_COUNT_HW_CPU_CYCLES, + .freq = 1, + .sample_freq = 1000, + }; + int pmu_fd; + + pmu_fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, + PERF_FLAG_FD_CLOEXEC); + if (pmu_fd == -1) { + if (errno == ENOENT || errno == EOPNOTSUPP) { + printf("SKIP:no PERF_COUNT_HW_CPU_CYCLES\n"); + test__skip(); + } + return -1; + } + + return pmu_fd; +} + +__maybe_unused +static bool pin_to_cpu(int cpu, cpu_set_t *old_cpuset) +{ + cpu_set_t cpuset; + int err; + + err = sched_getaffinity(0, sizeof(*old_cpuset), old_cpuset); + if (!ASSERT_OK(err, "sched_getaffinity")) + return false; + + CPU_ZERO(&cpuset); + CPU_SET(cpu, &cpuset); + err = 
sched_setaffinity(0, sizeof(cpuset), &cpuset); + if (!ASSERT_OK(err, "sched_setaffinity")) + return false; + + return true; +} + +__maybe_unused +static void restore_affinity(const cpu_set_t *old_cpuset) +{ + ASSERT_OK(sched_setaffinity(0, sizeof(*old_cpuset), old_cpuset), + "restore_affinity"); +} + +__maybe_unused +static bool run_syscall_prog(struct bpf_program *prog, const char *name) +{ + LIBBPF_OPTS(bpf_test_run_opts, opts); + int err; + + err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts); + if (!ASSERT_OK(err, name)) + return false; + if (!ASSERT_EQ(opts.retval, 0, name)) + return false; + + return true; +} + +__maybe_unused +static bool wait_for_nmi_drain(struct kptr_dtor_nmi *skel, + __u64 expected_deleted, + __u64 expected_release_calls) +{ + u64 now_ns, timeout_time_ns; + + now_ns = get_time_ns(); + timeout_time_ns = now_ns + DELETE_TIMEOUT_NS; + while (skel->bss->kptr_dtor_nmi_deleted < expected_deleted) { + if (skel->bss->kptr_dtor_nmi_setup_err || + skel->bss->kptr_dtor_nmi_nmi_err || + skel->bss->kptr_dtor_nmi_cleanup_err) + break; + now_ns = get_time_ns(); + if (now_ns >= timeout_time_ns) + break; + sched_yield(); + } + + if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_setup_err, 0, + "kptr_dtor_nmi_setup_err")) + return false; + if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_nmi_err, 0, + "kptr_dtor_nmi_nmi_err")) + return false; + if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_cleanup_err, 0, + "kptr_dtor_nmi_cleanup_err")) + return false; + if (!ASSERT_GE(skel->bss->kptr_dtor_nmi_deleted, expected_deleted, + "kptr_dtor_nmi_deleted")) + return false; + if (!ASSERT_OK(kern_sync_rcu(), "kern_sync_rcu")) + return false; + if (!ASSERT_GE(skel->bss->kptr_dtor_nmi_release_calls, + expected_release_calls, + "kptr_dtor_nmi_release_calls")) + return false; + if (!ASSERT_LT(now_ns, timeout_time_ns, "kptr_dtor_nmi_timeout")) + return false; + + return true; +} + +__maybe_unused +static void run_kptr_dtor_nmi_case(const struct kptr_dtor_nmi_case *test) +{ + struct kptr_dtor_nmi *skel; + cpu_set_t old_cpuset; + bool pinned = false; + int cpu = -1; + int pmu_fd = -1; + int err, round; + + cpu = find_test_cpu(); + if (cpu < 0) + return; + + skel = kptr_dtor_nmi__open(); + if (!ASSERT_OK_PTR(skel, "kptr_dtor_nmi__open")) + return; + + skel->bss->kptr_dtor_nmi_map_type = test->map_type; + bpf_program__set_autoattach(skel->progs.clear_kptrs_from_nmi, false); + + err = kptr_dtor_nmi__load(skel); + if (!ASSERT_OK(err, "kptr_dtor_nmi__load")) + goto cleanup; + + err = kptr_dtor_nmi__attach(skel); + if (!ASSERT_OK(err, "kptr_dtor_nmi__attach")) + goto cleanup; + + skel->links.clear_kptrs_from_nmi = + bpf_program__attach_trace(skel->progs.clear_kptrs_from_nmi); + if (!ASSERT_OK_PTR(skel->links.clear_kptrs_from_nmi, + "attach_tp_btf_nmi_handler")) + goto cleanup; + + pinned = pin_to_cpu(cpu, &old_cpuset); + if (!pinned) + goto cleanup; + + pmu_fd = open_nmi_pmu_event_on_cpu(cpu); + if (pmu_fd < 0) + goto cleanup; + + for (round = 0; round < KPTR_DTOR_NMI_ROUNDS; round++) { + __u64 expected_total; + + if (!run_syscall_prog(skel->progs.populate_kptrs, "populate_kptrs")) + goto cleanup; + + expected_total = (round + 1) * KPTR_DTOR_NMI_MAX_SLOTS; + if (!ASSERT_EQ(skel->bss->kptr_dtor_nmi_setup_created, + expected_total, + "kptr_dtor_nmi_setup_created")) + goto cleanup; + + if (!wait_for_nmi_drain(skel, expected_total, expected_total)) + goto cleanup; + } + + if (!run_syscall_prog(skel->progs.cleanup_kptrs, "cleanup_kptrs")) + goto cleanup; + /* + * The grace period for rcu cannot complete until the CPU that ran 
the + * hard irq_work has passed through a quiescent state after running + * our dtor work. This effectively flushes our pending work and allows + * the test to verify the dtor was called the expected number of times. + */ + kern_sync_rcu(); + ASSERT_EQ(skel->bss->kptr_dtor_nmi_cleanup_deleted, 0, + "kptr_dtor_nmi_cleanup_deleted"); + +cleanup: + if (pmu_fd >= 0) + close(pmu_fd); + if (pinned) + restore_affinity(&old_cpuset); + kptr_dtor_nmi__destroy(skel); +} + +void serial_test_kptr_dtor_nmi(void) +{ +/* + * nmi_handler isn't supported for these architectures. + */ +#if defined(__aarch64__) || defined(__s390x__) + test__skip(); + return; +#else + static const struct kptr_dtor_nmi_case tests[] = { + { "hash", KPTR_DTOR_NMI_MAP_HASH }, + { "array", KPTR_DTOR_NMI_MAP_ARRAY }, + }; + int i; + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + if (!test__start_subtest(tests[i].name)) + continue; + run_kptr_dtor_nmi_case(&tests[i]); + } +#endif +} diff --git a/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c new file mode 100644 index 000000000000..693b07215006 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c @@ -0,0 +1,412 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <vmlinux.h> +#include <linux/errno.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +#define KPTR_DTOR_NMI_MAX_SLOTS 8 + +enum kptr_dtor_nmi_map_type { + KPTR_DTOR_NMI_MAP_HASH = 1, + KPTR_DTOR_NMI_MAP_ARRAY, +}; + +enum kptr_dtor_nmi_err { + KPTR_DTOR_NMI_SETUP_CREATE_ERR = 1, + KPTR_DTOR_NMI_SETUP_LOOKUP_ERR, + KPTR_DTOR_NMI_SETUP_STALE_ERR, + KPTR_DTOR_NMI_SETUP_MAP_ERR, + KPTR_DTOR_NMI_DELETE_ERR, + KPTR_DTOR_NMI_CLEANUP_ERR, +}; + +struct kptr_dtor_nmi_value { + struct bpf_cpumask __kptr * mask; +}; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, __u32); + __type(value, struct kptr_dtor_nmi_value); + __uint(max_entries, KPTR_DTOR_NMI_MAX_SLOTS); +} kptr_dtor_nmi_hash SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, __u32); + __type(value, struct kptr_dtor_nmi_value); + __uint(max_entries, KPTR_DTOR_NMI_MAX_SLOTS); +} kptr_dtor_nmi_array SEC(".maps"); + +struct bpf_cpumask *bpf_cpumask_create(void) __ksym __weak; +void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym __weak; + +__u64 kptr_dtor_nmi_live_mask; +__u32 kptr_dtor_nmi_map_type; +__u64 kptr_dtor_nmi_setup_created; +__u64 kptr_dtor_nmi_deleted; +__u64 kptr_dtor_nmi_cleanup_deleted; +__u64 kptr_dtor_nmi_release_calls; +__u32 kptr_dtor_nmi_setup_err; +__u32 kptr_dtor_nmi_nmi_err; +__u32 kptr_dtor_nmi_cleanup_err; +int kptr_dtor_nmi_setup_errno; +int kptr_dtor_nmi_nmi_errno; +int kptr_dtor_nmi_cleanup_errno; + +static void set_err(__u32 *err_dst, int *errno_dst, __u32 err, int err_no) +{ + if (!*err_dst) { + *err_dst = err; + *errno_dst = err_no; + } +} + +static bool slot_is_live(__u32 slot) +{ + return kptr_dtor_nmi_live_mask & (1ULL << slot); +} + +static void mark_slot_live(__u32 slot) +{ + __sync_fetch_and_or(&kptr_dtor_nmi_live_mask, 1ULL << slot); +} + +static void clear_slot_live(__u32 slot) +{ + __sync_fetch_and_and(&kptr_dtor_nmi_live_mask, ~(1ULL << slot)); +} + +static struct kptr_dtor_nmi_value *lookup_hash_value(__u32 slot) +{ + return bpf_map_lookup_elem(&kptr_dtor_nmi_hash, &slot); +} + +static struct kptr_dtor_nmi_value *lookup_array_value(__u32 slot) +{ + return bpf_map_lookup_elem(&kptr_dtor_nmi_array, &slot); +} + +static int stash_mask(struct kptr_dtor_nmi_value *value, 
__u32 slot) +{ + struct bpf_cpumask *mask, *old; + + mask = bpf_cpumask_create(); + if (!mask) + return -ENOMEM; + + old = bpf_kptr_xchg(&value->mask, mask); + if (old) { + bpf_cpumask_release(old); + return -EEXIST; + } + + mark_slot_live(slot); + __sync_fetch_and_add(&kptr_dtor_nmi_setup_created, 1); + return 0; +} + +static bool populate_hash_slot(__u32 slot) +{ + struct kptr_dtor_nmi_value init = {}; + struct kptr_dtor_nmi_value *value; + int err; + + err = bpf_map_update_elem(&kptr_dtor_nmi_hash, &slot, &init, BPF_NOEXIST); + if (err) { + set_err(&kptr_dtor_nmi_setup_err, + &kptr_dtor_nmi_setup_errno, + KPTR_DTOR_NMI_SETUP_CREATE_ERR, err); + return false; + } + + value = lookup_hash_value(slot); + if (!value) { + set_err(&kptr_dtor_nmi_setup_err, + &kptr_dtor_nmi_setup_errno, + KPTR_DTOR_NMI_SETUP_LOOKUP_ERR, -ENOENT); + return false; + } + + err = stash_mask(value, slot); + if (err) { + set_err(&kptr_dtor_nmi_setup_err, + &kptr_dtor_nmi_setup_errno, + KPTR_DTOR_NMI_SETUP_STALE_ERR, err); + return false; + } + + return true; +} + +static bool populate_array_slot(__u32 slot) +{ + struct kptr_dtor_nmi_value *value; + int err; + + value = lookup_array_value(slot); + if (!value) { + set_err(&kptr_dtor_nmi_setup_err, + &kptr_dtor_nmi_setup_errno, + KPTR_DTOR_NMI_SETUP_LOOKUP_ERR, -ENOENT); + return false; + } + + err = stash_mask(value, slot); + if (err) { + set_err(&kptr_dtor_nmi_setup_err, + &kptr_dtor_nmi_setup_errno, + KPTR_DTOR_NMI_SETUP_STALE_ERR, err); + return false; + } + + return true; +} + +static bool clear_hash_slot_from_nmi(__u32 slot) +{ + struct kptr_dtor_nmi_value *value; + int err; + + if (!slot_is_live(slot)) + return true; + + err = bpf_map_delete_elem(&kptr_dtor_nmi_hash, &slot); + if (!err) { + clear_slot_live(slot); + __sync_fetch_and_add(&kptr_dtor_nmi_deleted, 1); + return true; + } + + /* + * Hash deletes take rqspinlock-backed bucket locks. NMI reentry can lose + * those acquisitions with -EDEADLK or -ETIMEDOUT even though the slot is + * still valid, so leave it live and retry on a later NMI. 
+ */ + if (err == -EDEADLK || err == -ETIMEDOUT) + return true; + + value = lookup_hash_value(slot); + if (value) + set_err(&kptr_dtor_nmi_nmi_err, + &kptr_dtor_nmi_nmi_errno, + KPTR_DTOR_NMI_DELETE_ERR, err); + + return false; +} + +static bool clear_array_slot_from_nmi(__u32 slot) +{ + struct kptr_dtor_nmi_value init = {}; + int err; + + if (!slot_is_live(slot)) + return true; + + err = bpf_map_update_elem(&kptr_dtor_nmi_array, &slot, &init, BPF_EXIST); + if (err) { + set_err(&kptr_dtor_nmi_nmi_err, + &kptr_dtor_nmi_nmi_errno, + KPTR_DTOR_NMI_DELETE_ERR, err); + return false; + } + + clear_slot_live(slot); + __sync_fetch_and_add(&kptr_dtor_nmi_deleted, 1); + return true; +} + +static bool cleanup_hash_slot(__u32 slot) +{ + struct kptr_dtor_nmi_value *value; + struct bpf_cpumask *old = NULL; + + value = lookup_hash_value(slot); + if (!value) { + clear_slot_live(slot); + return true; + } + + old = bpf_kptr_xchg(&value->mask, old); + if (old) { + bpf_cpumask_release(old); + __sync_fetch_and_add(&kptr_dtor_nmi_cleanup_deleted, 1); + } + + if (bpf_map_delete_elem(&kptr_dtor_nmi_hash, &slot) && + lookup_hash_value(slot)) { + set_err(&kptr_dtor_nmi_cleanup_err, + &kptr_dtor_nmi_cleanup_errno, + KPTR_DTOR_NMI_CLEANUP_ERR, -EIO); + return false; + } + + clear_slot_live(slot); + return true; +} + +static bool cleanup_array_slot(__u32 slot) +{ + struct kptr_dtor_nmi_value *value; + struct bpf_cpumask *old = NULL; + + value = lookup_array_value(slot); + if (!value) { + set_err(&kptr_dtor_nmi_cleanup_err, + &kptr_dtor_nmi_cleanup_errno, + KPTR_DTOR_NMI_CLEANUP_ERR, -ENOENT); + return false; + } + + old = bpf_kptr_xchg(&value->mask, old); + if (old) { + bpf_cpumask_release(old); + __sync_fetch_and_add(&kptr_dtor_nmi_cleanup_deleted, 1); + } + + clear_slot_live(slot); + return true; +} + +static void populate_hash_masks(void) +{ + __u32 slot; + + for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) { + if (!populate_hash_slot(slot)) + return; + } +} + +static void populate_array_masks(void) +{ + __u32 slot; + + for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) { + if (!populate_array_slot(slot)) + return; + } +} + +static void clear_hash_masks_from_nmi(void) +{ + __u32 slot; + + for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) { + if (!clear_hash_slot_from_nmi(slot)) + return; + } +} + +static void clear_array_masks_from_nmi(void) +{ + __u32 slot; + + for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) { + if (!clear_array_slot_from_nmi(slot)) + return; + } +} + +static void cleanup_hash_masks(void) +{ + __u32 slot; + + for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) { + if (!cleanup_hash_slot(slot)) + return; + } +} + +static void cleanup_array_masks(void) +{ + __u32 slot; + + for (slot = 0; slot < KPTR_DTOR_NMI_MAX_SLOTS; slot++) { + if (!cleanup_array_slot(slot)) + return; + } +} + +SEC("syscall") +int populate_kptrs(void *ctx) +{ + (void)ctx; + + switch (kptr_dtor_nmi_map_type) { + case KPTR_DTOR_NMI_MAP_HASH: + populate_hash_masks(); + break; + case KPTR_DTOR_NMI_MAP_ARRAY: + populate_array_masks(); + break; + default: + set_err(&kptr_dtor_nmi_setup_err, + &kptr_dtor_nmi_setup_errno, + KPTR_DTOR_NMI_SETUP_MAP_ERR, -EINVAL); + break; + } + + return 0; +} + +SEC("syscall") +int cleanup_kptrs(void *ctx) +{ + (void)ctx; + + switch (kptr_dtor_nmi_map_type) { + case KPTR_DTOR_NMI_MAP_HASH: + cleanup_hash_masks(); + break; + case KPTR_DTOR_NMI_MAP_ARRAY: + cleanup_array_masks(); + break; + default: + set_err(&kptr_dtor_nmi_cleanup_err, + &kptr_dtor_nmi_cleanup_errno, + 
KPTR_DTOR_NMI_CLEANUP_ERR, -EINVAL); + break; + } + + return 0; +} + +SEC("tp_btf/nmi_handler") +int BPF_PROG(clear_kptrs_from_nmi, void *handler, s64 delta_ns, int handled) +{ + (void)handler; + (void)delta_ns; + (void)handled; + + if (kptr_dtor_nmi_deleted >= kptr_dtor_nmi_setup_created) + return 0; + + switch (kptr_dtor_nmi_map_type) { + case KPTR_DTOR_NMI_MAP_HASH: + clear_hash_masks_from_nmi(); + break; + case KPTR_DTOR_NMI_MAP_ARRAY: + clear_array_masks_from_nmi(); + break; + default: + set_err(&kptr_dtor_nmi_nmi_err, + &kptr_dtor_nmi_nmi_errno, + KPTR_DTOR_NMI_DELETE_ERR, -EINVAL); + break; + } + + return 0; +} + +SEC("fentry/bpf_cpumask_release") +int BPF_PROG(count_cpumask_release, struct bpf_cpumask *mask) +{ + (void)mask; + __sync_fetch_and_add(&kptr_dtor_nmi_release_calls, 1); + return 0; +} + +char _license[] SEC("license") = "GPL"; -- 2.53.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
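For reference, the prog_tests half registers as serial_test_kptr_dtor_nmi, so assuming the usual selftests workflow it can be run in isolation with ./test_progs -t kptr_dtor_nmi from tools/testing/selftests/bpf, as root or with kernel.perf_event_paranoid relaxed enough to open the hardware cycles event that drives the NMIs.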
* Re: [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser 2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess @ 2026-05-08 0:03 ` sashiko-bot 0 siblings, 0 replies; 23+ messages in thread From: sashiko-bot @ 2026-05-08 0:03 UTC (permalink / raw) To: Justin Suess; +Cc: bpf Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider: - [Medium] Failure to acquire the PMU file descriptor (e.g., due to EACCES permissions) causes the test to silently pass rather than fail or skip. - [Medium] The test executes the `cleanup_kptrs` BPF program but fails to verify if it encountered any internal errors. - [Medium] Concurrent NMI handlers can cause double-increments of `kptr_dtor_nmi_deleted` and loop aborts due to a TOCTOU race on `slot_is_live`. -- commit f53640a66945f8f1050d74f78ad482b51aeaf343 Author: Justin Suess <utilityemal77@gmail.com> selftests/bpf: Add kptr destructor NMI exerciser This patch adds a BPF selftest program that populates hash and array maps with cpumask kptrs and clears them from the NMI handler. The test verifies that the destructor is called on all objects queued for destruction in hardirq from NMI. > diff --git a/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c > new file mode 100644 > index 0000000000000..21452b3cf9eb8 > --- /dev/null > +++ b/tools/testing/selftests/bpf/prog_tests/kptr_dtor_nmi.c > @@ -0,0 +1,258 @@ [ ... ] > +__maybe_unused > +static int open_nmi_pmu_event_on_cpu(int cpu) > +{ > + struct perf_event_attr attr = { > + .size = sizeof(attr), > + .type = PERF_TYPE_HARDWARE, > + .config = PERF_COUNT_HW_CPU_CYCLES, > + .freq = 1, > + .sample_freq = 1000, > + }; > + int pmu_fd; > + > + pmu_fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, > + PERF_FLAG_FD_CLOEXEC); > + if (pmu_fd == -1) { > + if (errno == ENOENT || errno == EOPNOTSUPP) { > + printf("SKIP:no PERF_COUNT_HW_CPU_CYCLES\n"); > + test__skip(); > + } > + return -1; > + } If the syscall fails with an error other than ENOENT or EOPNOTSUPP (like EACCES), this returns -1 without calling test__skip(). In run_kptr_dtor_nmi_case(): pmu_fd = open_nmi_pmu_event_on_cpu(cpu); if (pmu_fd < 0) goto cleanup; Since no assertions are triggered and the test isn't explicitly skipped, will the selftest framework register this as a PASS, masking the resource acquisition failure? [ ... ] > +static void run_kptr_dtor_nmi_case(const struct kptr_dtor_nmi_case *test) > +{ [ ... ] > + if (!run_syscall_prog(skel->progs.cleanup_kptrs, "cleanup_kptrs")) > + goto cleanup; > + /* > + * The grace period for rcu cannot complete until the CPU that ran the > + * hard irq_work has passed through a quiescent state after running > + * our dtor work. This effectively flushes our pending work and allows > + * the test to verify the dtor was called the expected number of times. > + */ > + kern_sync_rcu(); > + ASSERT_EQ(skel->bss->kptr_dtor_nmi_cleanup_deleted, 0, > + "kptr_dtor_nmi_cleanup_deleted"); Does this test need to assert that skel->bss->kptr_dtor_nmi_cleanup_err is 0? The cleanup_kptrs BPF program sets kptr_dtor_nmi_cleanup_err if an error occurs, but it looks like only kptr_dtor_nmi_cleanup_deleted is verified here. If an internal error happens but no masks were exchanged out, cleanup_deleted will be 0 and the test might silently pass. > +cleanup: > + if (pmu_fd >= 0) > + close(pmu_fd); [ ... 
] > diff --git a/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c > new file mode 100644 > index 0000000000000..693b07215006b > --- /dev/null > +++ b/tools/testing/selftests/bpf/progs/kptr_dtor_nmi.c [ ... ] > +static bool clear_array_slot_from_nmi(__u32 slot) > +{ > + struct kptr_dtor_nmi_value init = {}; > + int err; > + > + if (!slot_is_live(slot)) > + return true; > + > + err = bpf_map_update_elem(&kptr_dtor_nmi_array, &slot, &init, BPF_EXIST); > + if (err) { > + set_err(&kptr_dtor_nmi_nmi_err, > + &kptr_dtor_nmi_nmi_errno, > + KPTR_DTOR_NMI_DELETE_ERR, err); > + return false; > + } > + > + clear_slot_live(slot); > + __sync_fetch_and_add(&kptr_dtor_nmi_deleted, 1); Could concurrent NMI handlers cause a double-increment of kptr_dtor_nmi_deleted due to a TOCTOU race on slot_is_live()? Because clear_kptrs_from_nmi is attached to the global tp_btf/nmi_handler tracepoint, it executes on NMIs across all CPUs. If two NMIs fire concurrently on different CPUs, both could read slot_is_live(slot) as true and successfully execute bpf_map_update_elem(..., BPF_EXIST) on the array map. Both would then call clear_slot_live() and increment kptr_dtor_nmi_deleted. This double-increment could cause wait_for_nmi_drain() to return prematurely before all slots are truly drained. The next round's populate_kptrs syscall would then encounter uncleared kptrs, causing stash_mask() to fail with -EEXIST. Would it be better to use an atomic test-and-clear operation here? > + return true; > +} -- Sashiko AI review · https://sashiko.dev/#/patchset/20260507175453.1140400-1-utilityemal77@gmail.com?part=2 ^ permalink raw reply [flat|nested] 23+ messages in thread
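The atomic test-and-clear that the last review comment asks about could look roughly like the sketch below. It reuses the patch's mark_slot_live() and set_err() helpers and is only an illustration of the suggestion, not part of the posted series.

/* Claim the live bit atomically so concurrent NMIs cannot both account the
 * same slot; __sync_fetch_and_and() returns the previous mask, so only the
 * caller that saw the bit set owns the slot.
 */
static bool try_claim_slot(__u32 slot)
{
	__u64 bit = 1ULL << slot;

	return __sync_fetch_and_and(&kptr_dtor_nmi_live_mask, ~bit) & bit;
}

static bool clear_array_slot_from_nmi(__u32 slot)
{
	struct kptr_dtor_nmi_value init = {};
	int err;

	if (!try_claim_slot(slot))
		return true;

	err = bpf_map_update_elem(&kptr_dtor_nmi_array, &slot, &init, BPF_EXIST);
	if (err) {
		/* Put the bit back so a later NMI can retry the slot. */
		mark_slot_live(slot);
		set_err(&kptr_dtor_nmi_nmi_err, &kptr_dtor_nmi_nmi_errno,
			KPTR_DTOR_NMI_DELETE_ERR, err);
		return false;
	}

	__sync_fetch_and_add(&kptr_dtor_nmi_deleted, 1);
	return true;
}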
end of thread, other threads:[~2026-05-12 2:13 UTC | newest] Thread overview: 23+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess 2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess 2026-05-07 18:43 ` bot+bpf-ci 2026-05-07 18:52 ` Justin Suess 2026-05-07 23:45 ` sashiko-bot 2026-05-10 15:13 ` Justin Suess 2026-05-10 22:38 ` Alexei Starovoitov 2026-05-11 1:49 ` Justin Suess 2026-05-11 15:51 ` Alexei Starovoitov 2026-05-11 16:38 ` Justin Suess 2026-05-11 17:18 ` Alexei Starovoitov 2026-05-11 20:10 ` Kumar Kartikeya Dwivedi 2026-05-12 1:43 ` Justin Suess 2026-05-12 1:46 ` Kumar Kartikeya Dwivedi 2026-05-12 1:55 ` Alexei Starovoitov 2026-05-12 2:03 ` Kumar Kartikeya Dwivedi 2026-05-12 2:10 ` Alexei Starovoitov 2026-05-12 2:13 ` Kumar Kartikeya Dwivedi 2026-05-12 2:07 ` Justin Suess 2026-05-12 2:08 ` Kumar Kartikeya Dwivedi 2026-05-11 19:22 ` Justin Suess 2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess 2026-05-08 0:03 ` sashiko-bot