* [PATCH bpf-next v2 0/4] Fix kptr dtor deadlock
@ 2026-06-09 9:37 Kumar Kartikeya Dwivedi
2026-06-09 9:37 ` [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-09 9:37 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Emil Tsalapatis, Justin Suess, kkd, kernel-team
Referenced kptr destruction can run from tracing/NMI contexts through
bpf_obj_drop() and map value update/delete paths, reaching NMI-unsafe
special field teardown and deadlocks. Justin reported the issue and
iterated on fixes in [0]-[2], and also confirmed the bpf_obj_drop()
reproducer in [3].
This series rejects unsafe obj drops from non-iterator tracing programs,
limits map value recycle to NMI-safe field cancellation, and adds
focused selftests for the obj_drop(), NMI delete, and recycle teardown
cases.
See patches for details.
[0]: https://lore.kernel.org/bpf/20260505150851.3090688-1-utilityemal77@gmail.com
[1]: https://lore.kernel.org/bpf/20260507175453.1140400-1-utilityemal77@gmail.com
[2]: https://lore.kernel.org/bpf/20260519011450.1144935-1-utilityemal77@gmail.com
[3]: https://lore.kernel.org/bpf/agyG3eQwgmoJwmj2@suesslenovo
Changelog:
----------
v1 -> v2
v1: https://lore.kernel.org/bpf/20260608144841.1732406-1-memxor@gmail.com
* Drop is_tracing_prog_type() fix due to compat breakage, revisit separately.
* Rework bpf_obj_drop() fix to additionally reject non-iter tracing progs.
Justin Suess (2):
bpf: Reject bpf_obj_drop() from tracing progs
bpf: Cancel special fields on map value recycle
Kumar Kartikeya Dwivedi (2):
selftests/bpf: Exercise unsafe obj drops from tracing progs
selftests/bpf: Exercise kptr map update lifetime
include/linux/bpf.h | 30 +++++++
kernel/bpf/arraymap.c | 8 +-
kernel/bpf/hashtab.c | 32 ++++---
kernel/bpf/syscall.c | 27 ++++++
kernel/bpf/verifier.c | 16 ++++
.../selftests/bpf/prog_tests/htab_update.c | 4 +-
.../selftests/bpf/prog_tests/map_kptr.c | 66 ++++++++++++--
.../selftests/bpf/prog_tests/task_kfunc.c | 42 ++++++++-
.../testing/selftests/bpf/progs/htab_update.c | 4 +-
tools/testing/selftests/bpf/progs/map_kptr.c | 89 ++++++++++++++++++-
.../selftests/bpf/progs/task_kfunc_failure.c | 43 +++++++++
.../selftests/bpf/progs/task_kfunc_success.c | 13 ++-
12 files changed, 332 insertions(+), 42 deletions(-)
base-commit: b9452b594fd3aecbfd4aa0a6a1f741330a37dab7
--
2.53.0
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs
2026-06-09 9:37 [PATCH bpf-next v2 0/4] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
@ 2026-06-09 9:37 ` Kumar Kartikeya Dwivedi
2026-06-09 10:05 ` bot+bpf-ci
2026-06-09 13:31 ` Mykyta Yatsenko
2026-06-09 9:37 ` [PATCH bpf-next v2 2/4] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
` (2 subsequent siblings)
3 siblings, 2 replies; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-09 9:37 UTC (permalink / raw)
To: bpf
Cc: Justin Suess, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eduard Zingerman, Emil Tsalapatis, kkd,
kernel-team
From: Justin Suess <utilityemal77@gmail.com>
bpf_obj_drop() runs bpf_obj_free_fields() synchronously for
program-allocated objects. When such an object contains NMI unsafe
fields, a non-iterator tracing program can reach that destruction from
arbitrary instrumented context, including NMI.
NMI is likely one instance of this problem, and other instances would
include possible unsafe reentrancy. Deferring bpf_obj_drop() is not
appealing either: it would add delayed-free machinery to a release
operation that otherwise has straightforward synchronous ownership
semantics.
Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs
unless every field in the object's BTF record is explicitly NMI safe.
Note that while bpf_rb_root and bpf_list_head would be NMI safe on their
own to free, the objects recursively held by them may not be; be
conservative and just mark them as not NMI safe for now.
Use a whitelist for the NMI-safe field set instead of listing only known
NMI unsafe fields. Locks, async fields, unreferenced kptrs, and
refcounts are known to be NMI safe because their destruction is either a
no-op, simple state reset, or async cancellation. Referenced kptrs,
percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future
field type are rejected until audited for arbitrary tracing and NMI
contexts. This is less susceptible to future changes in fields that were
previously safe by exclusion, and to new fields being added without
updating this check.
Convert the existing recursive local-object drop success case to a
syscall program in the same commit, since this verifier change makes the
old tracing program form invalid. The test still exercises
bpf_obj_drop() releasing a referenced task kptr from a safe program
type.
Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop")
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 29 +++++++++++++
kernel/bpf/verifier.c | 16 +++++++
.../selftests/bpf/prog_tests/task_kfunc.c | 42 ++++++++++++++++++-
.../selftests/bpf/progs/task_kfunc_success.c | 13 +++---
4 files changed, 92 insertions(+), 8 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 62bba7a4876f..0654d2ffadc1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -492,6 +492,35 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
return rec->field_mask & type;
}
+static inline bool btf_field_is_nmi_safe(enum btf_field_type type)
+{
+ switch (type) {
+ case BPF_SPIN_LOCK:
+ case BPF_RES_SPIN_LOCK:
+ case BPF_TIMER:
+ case BPF_WORKQUEUE:
+ case BPF_TASK_WORK:
+ case BPF_KPTR_UNREF:
+ case BPF_REFCOUNT:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static inline bool btf_record_has_nmi_unsafe_fields(const struct btf_record *rec)
+{
+ int i;
+
+ if (IS_ERR_OR_NULL(rec))
+ return false;
+ for (i = 0; i < rec->cnt; i++) {
+ if (!btf_field_is_nmi_safe(rec->fields[i].type))
+ return true;
+ }
+ return false;
+}
+
static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
{
int i;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ed7ba0e6a9ce..3b43e2bd0f2a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -205,6 +205,7 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int id);
static int release_reference(struct bpf_verifier_env *env, int id);
static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
+static bool is_tracing_prog_type(enum bpf_prog_type type);
static int ref_set_non_owning(struct bpf_verifier_env *env,
struct bpf_reg_state *reg);
static bool is_trusted_reg(struct bpf_verifier_env *env, const struct bpf_reg_state *reg);
@@ -12881,6 +12882,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
int *insn_idx_p)
{
bool sleepable, rcu_lock, rcu_unlock, preempt_disable, preempt_enable;
+ enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
struct bpf_reg_state *regs = cur_regs(env);
const char *func_name, *ptr_type_name;
const struct btf_type *t, *ptr_type;
@@ -12957,6 +12959,20 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
if (err < 0)
return err;
+ if ((is_bpf_obj_drop_kfunc(meta.func_id) ||
+ is_bpf_percpu_obj_drop_kfunc(meta.func_id)) && (is_tracing_prog_type(prog_type) ||
+ /* is_tracing_prog_type() for now doesn't cover non-iterator tracing progs. */
+ (prog_type == BPF_PROG_TYPE_TRACING && env->prog->expected_attach_type != BPF_TRACE_ITER))) {
+ struct btf_struct_meta *struct_meta;
+
+ struct_meta = btf_find_struct_meta(meta.arg_btf, meta.arg_btf_id);
+ if (struct_meta && btf_record_has_nmi_unsafe_fields(struct_meta->record)) {
+ verbose(env, "%s cannot be used in tracing programs on types with NMI unsafe fields\n",
+ func_name);
+ return -EINVAL;
+ }
+ }
+
if (is_bpf_rbtree_add_kfunc(meta.func_id)) {
err = push_callback_call(env, insn, insn_idx, meta.subprogno,
set_rbtree_add_callback_state);
diff --git a/tools/testing/selftests/bpf/prog_tests/task_kfunc.c b/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
index 83b90335967a..e6e95c1416e6 100644
--- a/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
@@ -68,6 +68,36 @@ static void run_success_test(const char *prog_name)
task_kfunc_success__destroy(skel);
}
+static void run_syscall_success_test(const char *prog_name)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts);
+ struct task_kfunc_success *skel;
+ struct bpf_program *prog;
+ int err;
+
+ skel = open_load_task_kfunc_skel();
+ if (!ASSERT_OK_PTR(skel, "open_load_skel"))
+ return;
+
+ if (!ASSERT_OK(skel->bss->err, "pre_run_err"))
+ goto cleanup;
+
+ prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto cleanup;
+
+ err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
+ if (!ASSERT_OK(err, "bpf_prog_test_run_opts"))
+ goto cleanup;
+ if (!ASSERT_EQ(opts.retval, 0, "retval"))
+ goto cleanup;
+
+ ASSERT_OK(skel->bss->err, "post_run_err");
+
+cleanup:
+ task_kfunc_success__destroy(skel);
+}
+
static int run_vpid_test(void *prog_name)
{
struct task_kfunc_success *skel;
@@ -140,7 +170,6 @@ static const char * const success_tests[] = {
"test_task_acquire_release_argument",
"test_task_acquire_release_current",
"test_task_acquire_leave_in_map",
- "test_task_xchg_release",
"test_task_map_acquire_release",
"test_task_current_acquire_release",
"test_task_from_pid_arg",
@@ -151,6 +180,10 @@ static const char * const success_tests[] = {
"test_task_kfunc_flavor_relo_not_found",
};
+static const char * const syscall_success_tests[] = {
+ "test_task_xchg_release",
+};
+
static const char * const vpid_success_tests[] = {
"test_task_from_vpid_current",
"test_task_from_vpid_invalid",
@@ -167,6 +200,13 @@ void test_task_kfunc(void)
run_success_test(success_tests[i]);
}
+ for (i = 0; i < ARRAY_SIZE(syscall_success_tests); i++) {
+ if (!test__start_subtest(syscall_success_tests[i]))
+ continue;
+
+ run_syscall_success_test(syscall_success_tests[i]);
+ }
+
for (i = 0; i < ARRAY_SIZE(vpid_success_tests); i++) {
if (!test__start_subtest(vpid_success_tests[i]))
continue;
diff --git a/tools/testing/selftests/bpf/progs/task_kfunc_success.c b/tools/testing/selftests/bpf/progs/task_kfunc_success.c
index 5fb4fc19d26a..d63a79ee33dc 100644
--- a/tools/testing/selftests/bpf/progs/task_kfunc_success.c
+++ b/tools/testing/selftests/bpf/progs/task_kfunc_success.c
@@ -140,17 +140,17 @@ int BPF_PROG(test_task_acquire_leave_in_map, struct task_struct *task, u64 clone
return 0;
}
-SEC("tp_btf/task_newtask")
-int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
+SEC("syscall")
+int test_task_xchg_release(const void *ctx)
{
- struct task_struct *kptr, *acquired;
+ struct task_struct *task, *kptr, *acquired;
struct __tasks_kfunc_map_value *v, *local;
int refcnt, refcnt_after_drop;
long status;
- if (!is_test_kfunc_task())
- return 0;
+ (void)ctx;
+ task = bpf_get_current_task_btf();
status = tasks_kfunc_map_insert(task);
if (status) {
err = 1;
@@ -191,7 +191,7 @@ int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
return 0;
}
- /* Stash a copy into local kptr and check if it is released recursively */
+ /* Stash a copy into local kptr and check if it is released recursively. */
acquired = bpf_task_acquire(kptr);
if (!acquired) {
err = 7;
@@ -220,7 +220,6 @@ int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
}
bpf_task_release(kptr);
-
return 0;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH bpf-next v2 2/4] bpf: Cancel special fields on map value recycle
2026-06-09 9:37 [PATCH bpf-next v2 0/4] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
2026-06-09 9:37 ` [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
@ 2026-06-09 9:37 ` Kumar Kartikeya Dwivedi
2026-06-09 9:55 ` sashiko-bot
2026-06-09 10:51 ` Mykyta Yatsenko
2026-06-09 9:37 ` [PATCH bpf-next v2 3/4] selftests/bpf: Exercise unsafe obj drops from tracing progs Kumar Kartikeya Dwivedi
2026-06-09 9:37 ` [PATCH bpf-next v2 4/4] selftests/bpf: Exercise kptr map update lifetime Kumar Kartikeya Dwivedi
3 siblings, 2 replies; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-09 9:37 UTC (permalink / raw)
To: bpf
Cc: Justin Suess, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eduard Zingerman, Emil Tsalapatis, kkd,
kernel-team
From: Justin Suess <utilityemal77@gmail.com>
Map update and delete paths currently call bpf_obj_free_fields() when a
value is being replaced or recycled. That makes field destruction depend
on the context of the update/delete operation. For tracing programs this
can include NMI context, where referenced kptr destructors, uptr
unpinning, and graph root destruction are not generally safe.
Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
performs NMI-safe cleanup for timer, workqueue, and task_work fields.
Fields that need full destruction are left attached to the recycled value
and are destroyed by the final cleanup path instead.
Switch array and hashtab update/delete/recycle paths to this cancel
helper. Keep bpf_obj_free_fields() for final map destruction and for
bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
destructors, so teardown continues to walk the normal and extra elements
and fully destroy their fields.
This deliberately relaxes the eager-free semantics of map update/delete
for special fields. Programs that relied on a recycled map slot becoming
empty immediately after update/delete were relying on behavior that
cannot be implemented safely from every BPF execution context without
offloading arbitrary destructors.
There is a chance this change breaks programs making assumptions
regarding the eager freeing of fields. If so, we can relax semantics to
cancellation only when irqs_disabled() is true in the future. However,
theoretically, map values that get reused eagerly already have weaker
guarantees as parallel users can recreate freed fields before the new
element becomes visible again.
Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 1 +
kernel/bpf/arraymap.c | 8 ++---
kernel/bpf/hashtab.c | 32 +++++++++++--------
kernel/bpf/syscall.c | 27 ++++++++++++++++
.../selftests/bpf/prog_tests/htab_update.c | 4 +--
.../selftests/bpf/prog_tests/map_kptr.c | 10 +-----
.../testing/selftests/bpf/progs/htab_update.c | 4 +--
7 files changed, 55 insertions(+), 31 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0654d2ffadc1..fc24b81cc206 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2717,6 +2717,7 @@ bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *r
void bpf_obj_free_timer(const struct btf_record *rec, void *obj);
void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj);
void bpf_obj_free_task_work(const struct btf_record *rec, void *obj);
+void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj);
void bpf_obj_free_fields(const struct btf_record *rec, void *obj);
void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e6271a2bf6d6..4e8d6fe6999b 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -384,7 +384,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
val = this_cpu_ptr(array->pptrs[index & array->index_mask]);
copy_map_value(map, val, value);
- bpf_obj_free_fields(array->map.record, val);
+ bpf_obj_cancel_fields(array->map.record, val);
} else {
val = array->value +
(u64)array->elem_size * (index & array->index_mask);
@@ -392,7 +392,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
copy_map_value_locked(map, val, value, false);
else
copy_map_value(map, val, value);
- bpf_obj_free_fields(array->map.record, val);
+ bpf_obj_cancel_fields(array->map.record, val);
}
return 0;
}
@@ -432,14 +432,14 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
cpu = map_flags >> 32;
ptr = per_cpu_ptr(pptr, cpu);
copy_map_value(map, ptr, value);
- bpf_obj_free_fields(array->map.record, ptr);
+ bpf_obj_cancel_fields(array->map.record, ptr);
goto unlock;
}
for_each_possible_cpu(cpu) {
ptr = per_cpu_ptr(pptr, cpu);
val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
copy_map_value(map, ptr, val);
- bpf_obj_free_fields(array->map.record, ptr);
+ bpf_obj_cancel_fields(array->map.record, ptr);
}
unlock:
rcu_read_unlock();
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b4366cad3cfa..a9b2b882b90f 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -243,6 +243,10 @@ static void htab_free_prealloced_fields(struct bpf_htab *htab)
if (IS_ERR_OR_NULL(htab->map.record))
return;
+ /*
+ * Preallocated maps do not have a bpf_mem_alloc destructor, so fully
+ * destroy every element, including the extra elements.
+ */
if (htab_has_extra_elems(htab))
num_entries += num_possible_cpus();
for (i = 0; i < num_entries; i++) {
@@ -833,8 +837,8 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
return insn - insn_buf;
}
-static void check_and_free_fields(struct bpf_htab *htab,
- struct htab_elem *elem)
+static void check_and_cancel_fields(struct bpf_htab *htab,
+ struct htab_elem *elem)
{
if (IS_ERR_OR_NULL(htab->map.record))
return;
@@ -844,11 +848,11 @@ static void check_and_free_fields(struct bpf_htab *htab,
int cpu;
for_each_possible_cpu(cpu)
- bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
+ bpf_obj_cancel_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
} else {
void *map_value = htab_elem_value(elem, htab->map.key_size);
- bpf_obj_free_fields(htab->map.record, map_value);
+ bpf_obj_cancel_fields(htab->map.record, map_value);
}
}
@@ -883,7 +887,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
htab_unlock_bucket(b, flags);
if (l == tgt_l)
- check_and_free_fields(htab, l);
+ check_and_cancel_fields(htab, l);
return l == tgt_l;
}
@@ -948,7 +952,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
{
- check_and_free_fields(htab, l);
+ check_and_cancel_fields(htab, l);
if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
@@ -1001,7 +1005,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
if (htab_is_prealloc(htab)) {
bpf_map_dec_elem_count(&htab->map);
- check_and_free_fields(htab, l);
+ check_and_cancel_fields(htab, l);
pcpu_freelist_push(&htab->freelist, &l->fnode);
} else {
dec_elem_count(htab);
@@ -1018,7 +1022,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
/* copy true value_size bytes */
ptr = this_cpu_ptr(pptr);
copy_map_value(&htab->map, ptr, value);
- bpf_obj_free_fields(htab->map.record, ptr);
+ bpf_obj_cancel_fields(htab->map.record, ptr);
} else {
u32 size = round_up(htab->map.value_size, 8);
void *val;
@@ -1028,7 +1032,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
cpu = map_flags >> 32;
ptr = per_cpu_ptr(pptr, cpu);
copy_map_value(&htab->map, ptr, value);
- bpf_obj_free_fields(htab->map.record, ptr);
+ bpf_obj_cancel_fields(htab->map.record, ptr);
return;
}
@@ -1036,7 +1040,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
ptr = per_cpu_ptr(pptr, cpu);
val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
copy_map_value(&htab->map, ptr, val);
- bpf_obj_free_fields(htab->map.record, ptr);
+ bpf_obj_cancel_fields(htab->map.record, ptr);
}
}
}
@@ -1252,11 +1256,11 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
if (l_old) {
hlist_nulls_del_rcu(&l_old->hash_node);
- /* l_old has already been stashed in htab->extra_elems, free
- * its special fields before it is available for reuse.
+ /* l_old has already been stashed in htab->extra_elems, cancel
+ * its reusable special fields before it is available for reuse.
*/
if (htab_is_prealloc(htab))
- check_and_free_fields(htab, l_old);
+ check_and_cancel_fields(htab, l_old);
}
htab_unlock_bucket(b, flags);
if (l_old && !htab_is_prealloc(htab))
@@ -1269,7 +1273,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
{
- check_and_free_fields(htab, elem);
+ check_and_cancel_fields(htab, elem);
bpf_map_dec_elem_count(&htab->map);
bpf_lru_push_free(&htab->lru, &elem->lru_node);
}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d4188a992bd8..8d12aa3728ea 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -808,6 +808,33 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
bpf_task_work_cancel_and_free(obj + rec->task_work_off);
}
+void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj)
+{
+ const struct btf_field *fields;
+ int i;
+
+ if (IS_ERR_OR_NULL(rec))
+ return;
+ fields = rec->fields;
+ for (i = 0; i < rec->cnt; i++) {
+ void *field_ptr = obj + fields[i].offset;
+
+ switch (fields[i].type) {
+ case BPF_TIMER:
+ bpf_timer_cancel_and_free(field_ptr);
+ break;
+ case BPF_WORKQUEUE:
+ bpf_wq_cancel_and_free(field_ptr);
+ break;
+ case BPF_TASK_WORK:
+ bpf_task_work_cancel_and_free(field_ptr);
+ break;
+ default:
+ break;
+ }
+ }
+}
+
void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
{
const struct btf_field *fields;
diff --git a/tools/testing/selftests/bpf/prog_tests/htab_update.c b/tools/testing/selftests/bpf/prog_tests/htab_update.c
index ea1a6766fbe9..0a28d4346924 100644
--- a/tools/testing/selftests/bpf/prog_tests/htab_update.c
+++ b/tools/testing/selftests/bpf/prog_tests/htab_update.c
@@ -23,7 +23,7 @@ static void test_reenter_update(void)
if (!ASSERT_OK_PTR(skel, "htab_update__open"))
return;
- bpf_program__set_autoload(skel->progs.bpf_obj_free_fields, true);
+ bpf_program__set_autoload(skel->progs.bpf_obj_cancel_fields, true);
err = htab_update__load(skel);
if (!ASSERT_TRUE(!err, "htab_update__load") || err)
goto out;
@@ -50,7 +50,7 @@ static void test_reenter_update(void)
/*
* Second update: replace existing element with same key and trigger
* the reentrancy of bpf_map_update_elem().
- * check_and_free_fields() calls bpf_obj_free_fields() on the old
+ * check_and_cancel_fields() calls bpf_obj_cancel_fields() on the old
* value, which is where fentry program runs and performs a nested
* bpf_map_update_elem(), triggering -EDEADLK.
*/
diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr.c b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
index 03b46f17cf53..ec6f2f2e8308 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_kptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
@@ -51,7 +51,6 @@ static void test_map_kptr_success(bool test_run)
ret = bpf_map__update_elem(skel->maps.array_map,
&key, sizeof(key), buf, sizeof(buf), 0);
ASSERT_OK(ret, "array_map update");
- skel->data->ref--;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
@@ -59,49 +58,42 @@ static void test_map_kptr_success(bool test_run)
ret = bpf_map__update_elem(skel->maps.pcpu_array_map,
&key, sizeof(key), pbuf, cpu * sizeof(buf), 0);
ASSERT_OK(ret, "pcpu_array_map update");
- skel->data->ref--;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
ret = bpf_map__delete_elem(skel->maps.hash_map, &key, sizeof(key), 0);
ASSERT_OK(ret, "hash_map delete");
- skel->data->ref--;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
ret = bpf_map__delete_elem(skel->maps.pcpu_hash_map, &key, sizeof(key), 0);
ASSERT_OK(ret, "pcpu_hash_map delete");
- skel->data->ref--;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
ret = bpf_map__delete_elem(skel->maps.hash_malloc_map, &key, sizeof(key), 0);
ASSERT_OK(ret, "hash_malloc_map delete");
- skel->data->ref--;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
ret = bpf_map__delete_elem(skel->maps.pcpu_hash_malloc_map, &key, sizeof(key), 0);
ASSERT_OK(ret, "pcpu_hash_malloc_map delete");
- skel->data->ref--;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
ret = bpf_map__delete_elem(skel->maps.lru_hash_map, &key, sizeof(key), 0);
ASSERT_OK(ret, "lru_hash_map delete");
- skel->data->ref--;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
ret = bpf_map__delete_elem(skel->maps.lru_pcpu_hash_map, &key, sizeof(key), 0);
ASSERT_OK(ret, "lru_pcpu_hash_map delete");
- skel->data->ref--;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
@@ -175,7 +167,7 @@ void serial_test_map_kptr(void)
ASSERT_OK(kern_sync_rcu(), "sync rcu");
wait_for_map_release();
- /* Observe refcount dropping to 1 on synchronous delete elem */
+ /* Observe refcount dropping to 1 on map release. */
test_map_kptr_success(true);
}
diff --git a/tools/testing/selftests/bpf/progs/htab_update.c b/tools/testing/selftests/bpf/progs/htab_update.c
index 195d3b2fba00..62c1b1325ec2 100644
--- a/tools/testing/selftests/bpf/progs/htab_update.c
+++ b/tools/testing/selftests/bpf/progs/htab_update.c
@@ -22,8 +22,8 @@ struct {
int pid = 0;
int update_err = 0;
-SEC("?fentry/bpf_obj_free_fields")
-int bpf_obj_free_fields(void *ctx)
+SEC("?fentry/bpf_obj_cancel_fields")
+int bpf_obj_cancel_fields(void *ctx)
{
__u32 key = 0;
struct val value = { .payload = 1 };
--
2.53.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH bpf-next v2 3/4] selftests/bpf: Exercise unsafe obj drops from tracing progs
2026-06-09 9:37 [PATCH bpf-next v2 0/4] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
2026-06-09 9:37 ` [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
2026-06-09 9:37 ` [PATCH bpf-next v2 2/4] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
@ 2026-06-09 9:37 ` Kumar Kartikeya Dwivedi
2026-06-09 10:02 ` sashiko-bot
2026-06-09 9:37 ` [PATCH bpf-next v2 4/4] selftests/bpf: Exercise kptr map update lifetime Kumar Kartikeya Dwivedi
3 siblings, 1 reply; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-09 9:37 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Emil Tsalapatis, Justin Suess, kkd, kernel-team
Add task_kfunc failure cases for bpf_obj_drop() on local objects with
referenced kptr fields from tracing and NMI tracing programs. These programs
must be rejected because dropping the object would run full special-field
destruction synchronously in an unsafe context.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
.../selftests/bpf/progs/task_kfunc_failure.c | 43 +++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/tools/testing/selftests/bpf/progs/task_kfunc_failure.c b/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
index 8e947d445f8e..98e00287c438 100644
--- a/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
+++ b/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
@@ -5,6 +5,7 @@
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_helpers.h>
+#include "../bpf_experimental.h"
#include "bpf_misc.h"
#include "task_kfunc_common.h"
@@ -233,6 +234,48 @@ int BPF_PROG(task_kfunc_release_unacquired, struct task_struct *task, u64 clone_
return 0;
}
+SEC("tp_btf/task_newtask")
+__failure __msg("bpf_obj_drop cannot be used in tracing programs on types with NMI unsafe fields")
+int BPF_PROG(task_kfunc_obj_drop_with_kptr, struct task_struct *task, u64 clone_flags)
+{
+ struct __tasks_kfunc_map_value *local;
+
+ local = bpf_obj_new(typeof(*local));
+ if (!local)
+ return 0;
+
+ bpf_obj_drop(local);
+ return 0;
+}
+
+SEC("tp_btf/nmi_handler")
+__failure __msg("bpf_obj_drop cannot be used in tracing programs on types with NMI unsafe fields")
+int BPF_PROG(task_kfunc_obj_drop_nmi_with_kptr, void *handler, s64 delta_ns,
+ int handled)
+{
+ struct __tasks_kfunc_map_value *local;
+ struct task_struct *task, *acquired, *old;
+
+ (void)handler;
+ (void)delta_ns;
+ (void)handled;
+
+ local = bpf_obj_new(typeof(*local));
+ if (!local)
+ return 0;
+
+ task = bpf_get_current_task_btf();
+ acquired = bpf_task_acquire(task);
+ if (acquired) {
+ old = bpf_kptr_xchg(&local->task, acquired);
+ if (old)
+ bpf_task_release(old);
+ }
+
+ bpf_obj_drop(local);
+ return 0;
+}
+
SEC("tp_btf/task_newtask")
__failure __msg("Possibly NULL pointer passed to trusted R1")
int BPF_PROG(task_kfunc_from_pid_no_null_check, struct task_struct *task, u64 clone_flags)
--
2.53.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH bpf-next v2 4/4] selftests/bpf: Exercise kptr map update lifetime
2026-06-09 9:37 [PATCH bpf-next v2 0/4] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
` (2 preceding siblings ...)
2026-06-09 9:37 ` [PATCH bpf-next v2 3/4] selftests/bpf: Exercise unsafe obj drops from tracing progs Kumar Kartikeya Dwivedi
@ 2026-06-09 9:37 ` Kumar Kartikeya Dwivedi
2026-06-09 10:20 ` sashiko-bot
3 siblings, 1 reply; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-09 9:37 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Emil Tsalapatis, Justin Suess, kkd, kernel-team
Add focused map_kptr coverage for BPF-side map updates that touch values
containing referenced kptrs.
The new syscall programs stash the testmod refcounted object in an array
map, a preallocated hash map, and a no-prealloc hash map, then update the
same map from BPF. The refcount must remain elevated after the update,
while the userspace runner destroys the skeleton and reuses the existing
refcount wait to confirm map teardown releases the kptr.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
.../selftests/bpf/prog_tests/map_kptr.c | 56 ++++++++++++
tools/testing/selftests/bpf/progs/map_kptr.c | 89 ++++++++++++++++++-
2 files changed, 142 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr.c b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
index ec6f2f2e8308..17e707dddda8 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_kptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
@@ -143,12 +143,68 @@ static void wait_for_map_release(void)
map_kptr__destroy(skel);
}
+enum map_update_kptr_case {
+ MAP_UPDATE_KPTR_ARRAY,
+ MAP_UPDATE_KPTR_HASH,
+ MAP_UPDATE_KPTR_HASH_MALLOC,
+};
+
+static struct bpf_program *map_update_kptr_prog(struct map_kptr *skel,
+ enum map_update_kptr_case test)
+{
+ switch (test) {
+ case MAP_UPDATE_KPTR_ARRAY:
+ return skel->progs.test_array_map_update_kptr;
+ case MAP_UPDATE_KPTR_HASH:
+ return skel->progs.test_hash_map_update_kptr;
+ case MAP_UPDATE_KPTR_HASH_MALLOC:
+ return skel->progs.test_hash_malloc_map_update_kptr;
+ }
+
+ return NULL;
+}
+
+static void test_map_update_kptr(enum map_update_kptr_case test)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts);
+ struct map_kptr *skel;
+ struct bpf_program *prog;
+ int ret;
+
+ skel = map_kptr__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "map_kptr__open_and_load"))
+ return;
+
+ prog = map_update_kptr_prog(skel, test);
+ if (!ASSERT_OK_PTR(prog, "map_update_kptr_prog"))
+ goto out;
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
+ if (!ASSERT_OK(ret, "map_update_kptr"))
+ goto out;
+ if (!ASSERT_OK(opts.retval, "map_update_kptr retval"))
+ goto out;
+
+ ASSERT_EQ(skel->bss->num_of_refs, 3, "refs_after_update");
+
+out:
+ map_kptr__destroy(skel);
+ wait_for_map_release();
+}
+
void serial_test_map_kptr(void)
{
struct rcu_tasks_trace_gp *skel;
RUN_TESTS(map_kptr_fail);
+ if (test__start_subtest("update_array_map_kptr"))
+ test_map_update_kptr(MAP_UPDATE_KPTR_ARRAY);
+ if (test__start_subtest("update_hash_map_kptr"))
+ test_map_update_kptr(MAP_UPDATE_KPTR_HASH);
+ if (test__start_subtest("update_hash_malloc_map_kptr"))
+ test_map_update_kptr(MAP_UPDATE_KPTR_HASH_MALLOC);
+
skel = rcu_tasks_trace_gp__open_and_load();
if (!ASSERT_OK_PTR(skel, "rcu_tasks_trace_gp__open_and_load"))
return;
diff --git a/tools/testing/selftests/bpf/progs/map_kptr.c b/tools/testing/selftests/bpf/progs/map_kptr.c
index e708ffbe1f61..3fbefc568e0a 100644
--- a/tools/testing/selftests/bpf/progs/map_kptr.c
+++ b/tools/testing/selftests/bpf/progs/map_kptr.c
@@ -489,8 +489,7 @@ int test_map_kptr_ref3(struct __sk_buff *ctx)
int num_of_refs;
-SEC("syscall")
-int count_ref(void *ctx)
+static __always_inline int read_ref_count(void)
{
struct prog_test_ref_kfunc *p;
unsigned long arg = 0;
@@ -500,11 +499,95 @@ int count_ref(void *ctx)
return 1;
num_of_refs = p->cnt.refs.counter;
-
bpf_kfunc_call_test_release(p);
return 0;
}
+SEC("syscall")
+int count_ref(void *ctx)
+{
+ return read_ref_count();
+}
+
+static __always_inline int stash_ref_ptr(struct map_value *v)
+{
+ struct prog_test_ref_kfunc *p, *old;
+ unsigned long arg = 0;
+
+ p = bpf_kfunc_call_test_acquire(&arg);
+ if (!p)
+ return 1;
+
+ old = bpf_kptr_xchg(&v->ref_ptr, p);
+ if (old) {
+ bpf_kfunc_call_test_release(old);
+ old = bpf_kptr_xchg(&v->ref_ptr, NULL);
+ if (old)
+ bpf_kfunc_call_test_release(old);
+ return 2;
+ }
+ return 0;
+}
+
+static __always_inline int check_refs(int expected)
+{
+ int ret;
+
+ ret = read_ref_count();
+ if (ret)
+ return ret;
+ return num_of_refs == expected ? 0 : 3;
+}
+
+SEC("syscall")
+int test_array_map_update_kptr(void *ctx)
+{
+ struct map_value init = {}, *v;
+ int key = 0, ret;
+
+ v = bpf_map_lookup_elem(&array_map, &key);
+ if (!v)
+ return 1;
+ ret = stash_ref_ptr(v);
+ if (ret)
+ return ret;
+ ret = check_refs(3);
+ if (ret)
+ return ret;
+ ret = bpf_map_update_elem(&array_map, &key, &init, BPF_EXIST);
+ if (ret)
+ return 4;
+ return check_refs(3);
+}
+
+#define DEFINE_HASH_UPDATE_KPTR_TEST(name, map) \
+SEC("syscall") \
+int name(void *ctx) \
+{ \
+ struct map_value init = {}, *v; \
+ int key = 0, ret; \
+ \
+ ret = bpf_map_update_elem(&map, &key, &init, BPF_NOEXIST); \
+ if (ret) \
+ return 1; \
+ v = bpf_map_lookup_elem(&map, &key); \
+ if (!v) \
+ return 2; \
+ ret = stash_ref_ptr(v); \
+ if (ret) \
+ return ret; \
+ ret = check_refs(3); \
+ if (ret) \
+ return ret; \
+ ret = bpf_map_update_elem(&map, &key, &init, BPF_EXIST); \
+ if (ret) \
+ return 4; \
+ return check_refs(3); \
+}
+
+DEFINE_HASH_UPDATE_KPTR_TEST(test_hash_map_update_kptr, hash_map)
+DEFINE_HASH_UPDATE_KPTR_TEST(test_hash_malloc_map_update_kptr, hash_malloc_map)
+
SEC("syscall")
int test_ls_map_kptr_ref1(void *ctx)
{
--
2.53.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v2 2/4] bpf: Cancel special fields on map value recycle
2026-06-09 9:37 ` [PATCH bpf-next v2 2/4] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
@ 2026-06-09 9:55 ` sashiko-bot
2026-06-09 10:51 ` Mykyta Yatsenko
1 sibling, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-06-09 9:55 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Replacing `bpf_obj_free_fields()` with `bpf_obj_cancel_fields()` fails to zero or free kernel pointers on map element deletion and update, causing new hash map keys to inherit valid kptrs from deleted elements (cross-key contamination).
--
commit a156fa02499cc2be66e3982937fa823e00008e97
Author: Justin Suess <utilityemal77@gmail.com>
bpf: Cancel special fields on map value recycle
This commit introduces bpf_obj_cancel_fields() for NMI-safe cleanup of timer,
workqueue, and task_work fields on map value recycle. It replaces
bpf_obj_free_fields() in the array and hashtab update/delete/recycle paths.
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index e6271a2bf6d64..4e8d6fe6999bf 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
[ ... ]
> @@ -392,7 +392,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
> copy_map_value_locked(map, val, value, false);
> else
> copy_map_value(map, val, value);
> - bpf_obj_free_fields(array->map.record, val);
> + bpf_obj_cancel_fields(array->map.record, val);
> }
> return 0;
[Severity: Critical]
Does replacing bpf_obj_free_fields() with bpf_obj_cancel_fields() fail to clear
old kptrs in the array slot?
Since the new bpf_obj_cancel_fields() helper only processes timers and
workqueues, the old kernel pointers (like BPF_KPTR_REF) seem to remain valid
and unzeroed in the slot indefinitely.
[ ... ]
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index b4366cad3cfa5..a9b2b882b90f2 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
[ ... ]
> @@ -844,11 +848,11 @@ static void check_and_free_fields(struct bpf_htab *htab,
> int cpu;
>
> for_each_possible_cpu(cpu)
> - bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
> + bpf_obj_cancel_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
> } else {
> void *map_value = htab_elem_value(elem, htab->map.key_size);
>
> - bpf_obj_free_fields(htab->map.record, map_value);
> + bpf_obj_cancel_fields(htab->map.record, map_value);
> }
> }
[Severity: Critical]
Can this leak kernel pointers between different hash keys?
When a hash map element is deleted, check_and_cancel_fields() is called.
Because it doesn't clear kptrs, the element is returned to the memory cache or
preallocated freelist with its kernel pointers still intact.
When a new hash map element is inserted for a different key, alloc_htab_elem()
reuses this memory. Because copy_map_value() safely skips writing to special
BTF field offsets:
alloc_htab_elem() {
...
copy_map_value(&htab->map, htab_elem_value(l_new, key_size), value);
...
}
won't the new element silently inherit the valid kernel pointers from the
previously deleted key?
[ ... ]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d4188a992bd8b..8d12aa3728ea9 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
[ ... ]
> +void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj)
> +{
> + const struct btf_field *fields;
> + int i;
> +
> + if (IS_ERR_OR_NULL(rec))
> + return;
> + fields = rec->fields;
> + for (i = 0; i < rec->cnt; i++) {
> + void *field_ptr = obj + fields[i].offset;
> +
> + switch (fields[i].type) {
> + case BPF_TIMER:
> + bpf_timer_cancel_and_free(field_ptr);
> + break;
> + case BPF_WORKQUEUE:
> + bpf_wq_cancel_and_free(field_ptr);
> + break;
> + case BPF_TASK_WORK:
> + bpf_task_work_cancel_and_free(field_ptr);
> + break;
> + default:
> + break;
> + }
> + }
> +}
[Severity: Critical]
Since this new helper handles BPF_TIMER, BPF_WORKQUEUE, and BPF_TASK_WORK,
doesn't it leave fields like BPF_KPTR_REF, BPF_LIST_HEAD, BPF_UPTR and
BPF_RB_ROOT completely intact without zeroing them?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260609093719.2858096-1-memxor@gmail.com?part=2
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v2 3/4] selftests/bpf: Exercise unsafe obj drops from tracing progs
2026-06-09 9:37 ` [PATCH bpf-next v2 3/4] selftests/bpf: Exercise unsafe obj drops from tracing progs Kumar Kartikeya Dwivedi
@ 2026-06-09 10:02 ` sashiko-bot
0 siblings, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-06-09 10:02 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The selftest `task_kfunc_obj_drop_nmi_with_kptr` uses the x86-only `nmi_handler` tracepoint, breaking the test suite on other architectures.
--
commit 1282fb89cc1ac86a478ae0747dc71f01a3bfc729
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
selftests/bpf: Exercise unsafe obj drops from tracing progs
This commit adds task_kfunc failure cases for bpf_obj_drop() on local
objects with referenced kptr fields. It introduces tests for tracing and
NMI tracing programs to ensure they are rejected by the verifier.
> diff --git a/tools/testing/selftests/bpf/progs/task_kfunc_failure.c b/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
> --- a/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
> +++ b/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
[ ... ]
> +SEC("tp_btf/nmi_handler")
> +__failure __msg("bpf_obj_drop cannot be used in tracing programs on types with NMI unsafe fields")
> +int BPF_PROG(task_kfunc_obj_drop_nmi_with_kptr, void *handler, s64 delta_ns,
> + int handled)
[Severity: Medium]
Will this break the selftest suite on non-x86 architectures?
The nmi_handler tracepoint is only defined and instantiated on x86
architectures. On non-x86 architectures like ARM64 or RISC-V, libbpf will
fail to resolve the BTF ID for this tracepoint, and the BPF program load
will abort before the verifier is invoked.
Consequently, the test_loader framework will not find the expected verifier
failure message in the log, causing the test to incorrectly report as failed.
Is there an architecture-independent tracepoint that could be used instead,
or should this test be skipped on non-x86 platforms?
> +{
> + struct __tasks_kfunc_map_value *local;
> + struct task_struct *task, *acquired, *old;
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260609093719.2858096-1-memxor@gmail.com?part=3
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs
2026-06-09 9:37 ` [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
@ 2026-06-09 10:05 ` bot+bpf-ci
2026-06-09 13:31 ` Mykyta Yatsenko
1 sibling, 0 replies; 13+ messages in thread
From: bot+bpf-ci @ 2026-06-09 10:05 UTC (permalink / raw)
To: memxor, bpf
Cc: utilityemal77, ast, andrii, daniel, eddyz87, emil, kkd,
kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 1969 bytes --]
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 62bba7a4876f..0654d2ffadc1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -492,6 +492,35 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
> return rec->field_mask & type;
> }
>
> +static inline bool btf_field_is_nmi_safe(enum btf_field_type type)
> +{
> + switch (type) {
> + case BPF_SPIN_LOCK:
> + case BPF_RES_SPIN_LOCK:
> + case BPF_TIMER:
> + case BPF_WORKQUEUE:
Can the BPF_TIMER and BPF_WORKQUEUE cases really be treated as NMI safe
here?
Dropping an object with one of these fields runs through
bpf_timer_cancel_and_free() or bpf_wq_cancel_and_free(), which calls
bpf_prog_put() synchronously via bpf_async_update_prog_callback().
If that drops the last reference to the program in NMI context,
__bpf_prog_put() falls back to schedule_work() because irqs are disabled:
schedule_work()
-> queue_work_on()
-> raw spin lock on the workqueue pool
Can this hard deadlock if the NMI interrupted a thread that was already
holding that workqueue pool lock?
BPF_TASK_WORK defers its destruction via irq_work_queue() when
irqs_disabled(), but BPF_TIMER and BPF_WORKQUEUE do not appear to take
that path, so they do not look NMI safe in the same way.
This was raised as a high-severity concern on v1 of this series by the
Sashiko AI review:
https://lore.kernel.org/bpf/20260608154008.27E011F00893@smtp.kernel.org
The v2 respin left btf_field_is_nmi_safe() unchanged, still listing
BPF_TIMER and BPF_WORKQUEUE, so this may still be unaddressed.
> + case BPF_TASK_WORK:
> + case BPF_KPTR_UNREF:
> + case BPF_REFCOUNT:
> + return true;
> + default:
> + return false;
> + }
> +}
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27197841616
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v2 4/4] selftests/bpf: Exercise kptr map update lifetime
2026-06-09 9:37 ` [PATCH bpf-next v2 4/4] selftests/bpf: Exercise kptr map update lifetime Kumar Kartikeya Dwivedi
@ 2026-06-09 10:20 ` sashiko-bot
0 siblings, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-06-09 10:20 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [Critical] The selftests incorrectly codify a resource leak and cross-key reference vulnerability for BPF kptrs in map updates.
--
commit b8d5d36d1cdd1bc97594f89c513eb183613ab05f
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
selftests/bpf: Exercise kptr map update lifetime
This commit adds selftests for BPF-side map updates that touch values
containing referenced kptrs in array and hash maps. It verifies that the
refcount behaves as expected when updating existing map elements from BPF.
> diff --git a/tools/testing/selftests/bpf/progs/map_kptr.c b/tools/testing/selftests/bpf/progs/map_kptr.c
> index e708ffbe1f610..3fbefc568e0a6 100644
> --- a/tools/testing/selftests/bpf/progs/map_kptr.c
> +++ b/tools/testing/selftests/bpf/progs/map_kptr.c
[ ... ]
> @@ -500,11 +499,95 @@ int count_ref(void *ctx)
[ ... ]
> +SEC("syscall")
> +int test_array_map_update_kptr(void *ctx)
> +{
> + struct map_value init = {}, *v;
> + int key = 0, ret;
> +
> + v = bpf_map_lookup_elem(&array_map, &key);
> + if (!v)
> + return 1;
> + ret = stash_ref_ptr(v);
> + if (ret)
> + return ret;
> + ret = check_refs(3);
> + if (ret)
> + return ret;
> + ret = bpf_map_update_elem(&array_map, &key, &init, BPF_EXIST);
> + if (ret)
> + return 4;
> + return check_refs(3);
[Severity: Critical]
This isn't a bug introduced by this patch, but are these tests codifying a
kernel resource leak and cross-key reference vulnerability?
In a prior commit, bpf_obj_free_fields() was replaced with
bpf_obj_cancel_fields() in kernel/bpf/arraymap.c:array_map_update_elem()
and check_and_cancel_fields() in kernel/bpf/hashtab.c:free_htab_elem().
Because those cancel functions do not release or zero kptrs, overwriting an
array element or freeing a preallocated hash map element to the freelist
leaves the old kptr intact.
For preallocated hash maps, when a new element is popped from the freelist
and updated via copy_map_value() (which skips special BTF fields), the new
element silently inherits the stale kptr from the deleted key.
By asserting check_refs(3) here and in DEFINE_HASH_UPDATE_KPTR_TEST below
after updating the map with an empty init value, do these tests accidentally
codify this broken kernel behavior where the refcount incorrectly remains
elevated?
> +}
> +
> +#define DEFINE_HASH_UPDATE_KPTR_TEST(name, map) \
> +SEC("syscall") \
> +int name(void *ctx) \
> +{ \
> + struct map_value init = {}, *v; \
> + int key = 0, ret; \
> + \
> + ret = bpf_map_update_elem(&map, &key, &init, BPF_NOEXIST); \
> + if (ret) \
> + return 1; \
> + v = bpf_map_lookup_elem(&map, &key); \
> + if (!v) \
> + return 2; \
> + ret = stash_ref_ptr(v); \
> + if (ret) \
> + return ret; \
> + ret = check_refs(3); \
> + if (ret) \
> + return ret; \
> + ret = bpf_map_update_elem(&map, &key, &init, BPF_EXIST); \
> + if (ret) \
> + return 4; \
> + return check_refs(3); \
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260609093719.2858096-1-memxor@gmail.com?part=4
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v2 2/4] bpf: Cancel special fields on map value recycle
2026-06-09 9:37 ` [PATCH bpf-next v2 2/4] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
2026-06-09 9:55 ` sashiko-bot
@ 2026-06-09 10:51 ` Mykyta Yatsenko
1 sibling, 0 replies; 13+ messages in thread
From: Mykyta Yatsenko @ 2026-06-09 10:51 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf
Cc: Justin Suess, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eduard Zingerman, Emil Tsalapatis, kkd,
kernel-team
On 6/9/26 10:37 AM, Kumar Kartikeya Dwivedi wrote:
> From: Justin Suess <utilityemal77@gmail.com>
>
> Map update and delete paths currently call bpf_obj_free_fields() when a
> value is being replaced or recycled. That makes field destruction depend
> on the context of the update/delete operation. For tracing programs this
> can include NMI context, where referenced kptr destructors, uptr
> unpinning, and graph root destruction are not generally safe.
>
> Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
> performs NMI-safe cleanup for timer, workqueue, and task_work fields.
> Fields that need full destruction are left attached to the recycled value
> and are destroyed by the final cleanup path instead.
>
> Switch array and hashtab update/delete/recycle paths to this cancel
> helper. Keep bpf_obj_free_fields() for final map destruction and for
> bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
> destructors, so teardown continues to walk the normal and extra elements
> and fully destroy their fields.
>
> This deliberately relaxes the eager-free semantics of map update/delete
> for special fields. Programs that relied on a recycled map slot becoming
> empty immediately after update/delete were relying on behavior that
> cannot be implemented safely from every BPF execution context without
> offloading arbitrary destructors.
>
> There is a chance this change breaks programs making assumptions
> regarding the eager freeing of fields. If so, we can relax semantics to
> cancellation only when irqs_disabled() is true in the future. However,
> theoretically, map values that get reused eagerly already have weaker
> guarantees as parallel users can recreate freed fields before the new
> element becomes visible again.
>
> Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf.h | 1 +
> kernel/bpf/arraymap.c | 8 ++---
> kernel/bpf/hashtab.c | 32 +++++++++++--------
> kernel/bpf/syscall.c | 27 ++++++++++++++++
> .../selftests/bpf/prog_tests/htab_update.c | 4 +--
> .../selftests/bpf/prog_tests/map_kptr.c | 10 +-----
> .../testing/selftests/bpf/progs/htab_update.c | 4 +--
> 7 files changed, 55 insertions(+), 31 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d4188a992bd8..8d12aa3728ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -808,6 +808,33 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
> bpf_task_work_cancel_and_free(obj + rec->task_work_off);
> }
>
> +void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj)
This function looks very similar to bpf_map_free_internal_structs() from helpers.c.
If I understand it correctly, it does exactly the same.
The rest of the changes in this patch make sense - we only free timers, workqueues and task works,
other special fields are kept in the element as is. So existing code will be holding
old fields for a longer time, potentially up until map is unloaded.
There is still a way for user to destroy those sticky kptrs by, for example,
scheduling a task_work and explicitly releasing the kptr.
> +{
> + const struct btf_field *fields;
> + int i;
> +
> + if (IS_ERR_OR_NULL(rec))
> + return;
> + fields = rec->fields;
> + for (i = 0; i < rec->cnt; i++) {
> + void *field_ptr = obj + fields[i].offset;
> +
> + switch (fields[i].type) {
> + case BPF_TIMER:
> + bpf_timer_cancel_and_free(field_ptr);
> + break;
> + case BPF_WORKQUEUE:
> + bpf_wq_cancel_and_free(field_ptr);
> + break;
> + case BPF_TASK_WORK:
> + bpf_task_work_cancel_and_free(field_ptr);
> + break;
> + default:
> + break;
> + }
> + }
> +}
> +
> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> {
> const struct btf_field *fields;
> diff --git a/tools/testing/selftests/bpf/prog_tests/htab_update.c b/tools/testing/selftests/bpf/prog_tests/htab_update.c
> index ea1a6766fbe9..0a28d4346924 100644
> --- a/tools/testing/selftests/bpf/prog_tests/htab_update.c
> +++ b/tools/testing/selftests/bpf/prog_tests/htab_update.c
> @@ -23,7 +23,7 @@ static void test_reenter_update(void)
> if (!ASSERT_OK_PTR(skel, "htab_update__open"))
> return;
>
> - bpf_program__set_autoload(skel->progs.bpf_obj_free_fields, true);
> + bpf_program__set_autoload(skel->progs.bpf_obj_cancel_fields, true);
> err = htab_update__load(skel);
> if (!ASSERT_TRUE(!err, "htab_update__load") || err)
> goto out;
> @@ -50,7 +50,7 @@ static void test_reenter_update(void)
> /*
> * Second update: replace existing element with same key and trigger
> * the reentrancy of bpf_map_update_elem().
> - * check_and_free_fields() calls bpf_obj_free_fields() on the old
> + * check_and_cancel_fields() calls bpf_obj_cancel_fields() on the old
> * value, which is where fentry program runs and performs a nested
> * bpf_map_update_elem(), triggering -EDEADLK.
> */
> diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr.c b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
> index 03b46f17cf53..ec6f2f2e8308 100644
> --- a/tools/testing/selftests/bpf/prog_tests/map_kptr.c
> +++ b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
> @@ -51,7 +51,6 @@ static void test_map_kptr_success(bool test_run)
> ret = bpf_map__update_elem(skel->maps.array_map,
> &key, sizeof(key), buf, sizeof(buf), 0);
> ASSERT_OK(ret, "array_map update");
> - skel->data->ref--;
> ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
> ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
> ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
> @@ -59,49 +58,42 @@ static void test_map_kptr_success(bool test_run)
> ret = bpf_map__update_elem(skel->maps.pcpu_array_map,
> &key, sizeof(key), pbuf, cpu * sizeof(buf), 0);
> ASSERT_OK(ret, "pcpu_array_map update");
> - skel->data->ref--;
> ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
> ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
> ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
>
> ret = bpf_map__delete_elem(skel->maps.hash_map, &key, sizeof(key), 0);
> ASSERT_OK(ret, "hash_map delete");
> - skel->data->ref--;
> ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
> ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
> ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
>
> ret = bpf_map__delete_elem(skel->maps.pcpu_hash_map, &key, sizeof(key), 0);
> ASSERT_OK(ret, "pcpu_hash_map delete");
> - skel->data->ref--;
> ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
> ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
> ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
>
> ret = bpf_map__delete_elem(skel->maps.hash_malloc_map, &key, sizeof(key), 0);
> ASSERT_OK(ret, "hash_malloc_map delete");
> - skel->data->ref--;
> ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
> ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
> ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
>
> ret = bpf_map__delete_elem(skel->maps.pcpu_hash_malloc_map, &key, sizeof(key), 0);
> ASSERT_OK(ret, "pcpu_hash_malloc_map delete");
> - skel->data->ref--;
> ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
> ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
> ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
>
> ret = bpf_map__delete_elem(skel->maps.lru_hash_map, &key, sizeof(key), 0);
> ASSERT_OK(ret, "lru_hash_map delete");
> - skel->data->ref--;
> ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
> ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
> ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
>
> ret = bpf_map__delete_elem(skel->maps.lru_pcpu_hash_map, &key, sizeof(key), 0);
> ASSERT_OK(ret, "lru_pcpu_hash_map delete");
> - skel->data->ref--;
> ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
> ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
> ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
> @@ -175,7 +167,7 @@ void serial_test_map_kptr(void)
> ASSERT_OK(kern_sync_rcu(), "sync rcu");
> wait_for_map_release();
>
> - /* Observe refcount dropping to 1 on synchronous delete elem */
> + /* Observe refcount dropping to 1 on map release. */
> test_map_kptr_success(true);
> }
>
> diff --git a/tools/testing/selftests/bpf/progs/htab_update.c b/tools/testing/selftests/bpf/progs/htab_update.c
> index 195d3b2fba00..62c1b1325ec2 100644
> --- a/tools/testing/selftests/bpf/progs/htab_update.c
> +++ b/tools/testing/selftests/bpf/progs/htab_update.c
> @@ -22,8 +22,8 @@ struct {
> int pid = 0;
> int update_err = 0;
>
> -SEC("?fentry/bpf_obj_free_fields")
> -int bpf_obj_free_fields(void *ctx)
> +SEC("?fentry/bpf_obj_cancel_fields")
> +int bpf_obj_cancel_fields(void *ctx)
> {
> __u32 key = 0;
> struct val value = { .payload = 1 };
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs
2026-06-09 9:37 ` [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
2026-06-09 10:05 ` bot+bpf-ci
@ 2026-06-09 13:31 ` Mykyta Yatsenko
2026-06-09 14:08 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 13+ messages in thread
From: Mykyta Yatsenko @ 2026-06-09 13:31 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf
Cc: Justin Suess, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eduard Zingerman, Emil Tsalapatis, kkd,
kernel-team
On 6/9/26 10:37 AM, Kumar Kartikeya Dwivedi wrote:
> From: Justin Suess <utilityemal77@gmail.com>
>
> bpf_obj_drop() runs bpf_obj_free_fields() synchronously for
> program-allocated objects. When such an object contains NMI unsafe
> fields, a non-iterator tracing program can reach that destruction from
> arbitrary instrumented context, including NMI.
>
> NMI is likely one instance of this problem, and other instances would
> include possible unsafe reentrancy. Deferring bpf_obj_drop() is not
> appealing either: it would add delayed-free machinery to a release
> operation that otherwise has straightforward synchronous ownership
> semantics.
>
> Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs
> unless every field in the object's BTF record is explicitly NMI safe.
> Note that while bpf_rb_root and bpf_list_head would be NMI safe on their
> own to free, the objects recursively held by them may not be; be
> conservative and just mark them as not NMI safe for now.
>
> Use a whitelist for the NMI-safe field set instead of listing only known
> NMI unsafe fields. Locks, async fields, unreferenced kptrs, and
> refcounts are known to be NMI safe because their destruction is either a
> no-op, simple state reset, or async cancellation. Referenced kptrs,
> percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future
> field type are rejected until audited for arbitrary tracing and NMI
> contexts. This is less susceptible to future changes in fields that were
> previously safe by exclusion, and to new fields being added without
> updating this check.
>
> Convert the existing recursive local-object drop success case to a
> syscall program in the same commit, since this verifier change makes the
> old tracing program form invalid. The test still exercises
> bpf_obj_drop() releasing a referenced task kptr from a safe program
> type.
>
> Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop")
> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf.h | 29 +++++++++++++
> kernel/bpf/verifier.c | 16 +++++++
> .../selftests/bpf/prog_tests/task_kfunc.c | 42 ++++++++++++++++++-
> .../selftests/bpf/progs/task_kfunc_success.c | 13 +++---
> 4 files changed, 92 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 62bba7a4876f..0654d2ffadc1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -492,6 +492,35 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
> return rec->field_mask & type;
> }
>
> +static inline bool btf_field_is_nmi_safe(enum btf_field_type type)
> +{
> + switch (type) {
> + case BPF_SPIN_LOCK:
> + case BPF_RES_SPIN_LOCK:
> + case BPF_TIMER:
> + case BPF_WORKQUEUE:
> + case BPF_TASK_WORK:
> + case BPF_KPTR_UNREF:
> + case BPF_REFCOUNT:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
> +static inline bool btf_record_has_nmi_unsafe_fields(const struct btf_record *rec)
> +{
> + int i;
> +
> + if (IS_ERR_OR_NULL(rec))
> + return false;
> + for (i = 0; i < rec->cnt; i++) {
> + if (!btf_field_is_nmi_safe(rec->fields[i].type))
> + return true;
> + }
> + return false;
> +}
> +
Do we need this stuff in the header file? Why not put it into the verifier.c?
> static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
> {
> int i;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ed7ba0e6a9ce..3b43e2bd0f2a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -205,6 +205,7 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int id);
> static int release_reference(struct bpf_verifier_env *env, int id);
> static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
> static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
> +static bool is_tracing_prog_type(enum bpf_prog_type type);
> static int ref_set_non_owning(struct bpf_verifier_env *env,
> struct bpf_reg_state *reg);
> static bool is_trusted_reg(struct bpf_verifier_env *env, const struct bpf_reg_state *reg);
> @@ -12881,6 +12882,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> int *insn_idx_p)
> {
> bool sleepable, rcu_lock, rcu_unlock, preempt_disable, preempt_enable;
> + enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> struct bpf_reg_state *regs = cur_regs(env);
> const char *func_name, *ptr_type_name;
> const struct btf_type *t, *ptr_type;
> @@ -12957,6 +12959,20 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> if (err < 0)
> return err;
>
> + if ((is_bpf_obj_drop_kfunc(meta.func_id) ||
> + is_bpf_percpu_obj_drop_kfunc(meta.func_id)) && (is_tracing_prog_type(prog_type) ||
We have sleepable tracepoints, are we going to reject them as well? I wonder if there is a better way
to detect NMI context in verifier.
> + /* is_tracing_prog_type() for now doesn't cover non-iterator tracing progs. */
> + (prog_type == BPF_PROG_TYPE_TRACING && env->prog->expected_attach_type != BPF_TRACE_ITER))) {
> + struct btf_struct_meta *struct_meta;
> +
> + struct_meta = btf_find_struct_meta(meta.arg_btf, meta.arg_btf_id);
> + if (struct_meta && btf_record_has_nmi_unsafe_fields(struct_meta->record)) {
> + verbose(env, "%s cannot be used in tracing programs on types with NMI unsafe fields\n",
> + func_name);
> + return -EINVAL;
> + }
> + }
> +
> if (is_bpf_rbtree_add_kfunc(meta.func_id)) {
> err = push_callback_call(env, insn, insn_idx, meta.subprogno,
> set_rbtree_add_callback_state);
> diff --git a/tools/testing/selftests/bpf/prog_tests/task_kfunc.c b/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
> index 83b90335967a..e6e95c1416e6 100644
> --- a/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
> +++ b/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
> @@ -68,6 +68,36 @@ static void run_success_test(const char *prog_name)
> task_kfunc_success__destroy(skel);
> }
>
> +static void run_syscall_success_test(const char *prog_name)
> +{
> + LIBBPF_OPTS(bpf_test_run_opts, opts);
> + struct task_kfunc_success *skel;
> + struct bpf_program *prog;
> + int err;
> +
> + skel = open_load_task_kfunc_skel();
> + if (!ASSERT_OK_PTR(skel, "open_load_skel"))
> + return;
> +
> + if (!ASSERT_OK(skel->bss->err, "pre_run_err"))
> + goto cleanup;
> +
> + prog = bpf_object__find_program_by_name(skel->obj, prog_name);
> + if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
> + goto cleanup;
> +
> + err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
> + if (!ASSERT_OK(err, "bpf_prog_test_run_opts"))
> + goto cleanup;
> + if (!ASSERT_EQ(opts.retval, 0, "retval"))
> + goto cleanup;
> +
> + ASSERT_OK(skel->bss->err, "post_run_err");
> +
> +cleanup:
> + task_kfunc_success__destroy(skel);
> +}
> +
> static int run_vpid_test(void *prog_name)
> {
> struct task_kfunc_success *skel;
> @@ -140,7 +170,6 @@ static const char * const success_tests[] = {
> "test_task_acquire_release_argument",
> "test_task_acquire_release_current",
> "test_task_acquire_leave_in_map",
> - "test_task_xchg_release",
> "test_task_map_acquire_release",
> "test_task_current_acquire_release",
> "test_task_from_pid_arg",
> @@ -151,6 +180,10 @@ static const char * const success_tests[] = {
> "test_task_kfunc_flavor_relo_not_found",
> };
>
> +static const char * const syscall_success_tests[] = {
> + "test_task_xchg_release",
> +};
> +
> static const char * const vpid_success_tests[] = {
> "test_task_from_vpid_current",
> "test_task_from_vpid_invalid",
> @@ -167,6 +200,13 @@ void test_task_kfunc(void)
> run_success_test(success_tests[i]);
> }
>
> + for (i = 0; i < ARRAY_SIZE(syscall_success_tests); i++) {
> + if (!test__start_subtest(syscall_success_tests[i]))
> + continue;
> +
> + run_syscall_success_test(syscall_success_tests[i]);
> + }
> +
> for (i = 0; i < ARRAY_SIZE(vpid_success_tests); i++) {
> if (!test__start_subtest(vpid_success_tests[i]))
> continue;
> diff --git a/tools/testing/selftests/bpf/progs/task_kfunc_success.c b/tools/testing/selftests/bpf/progs/task_kfunc_success.c
> index 5fb4fc19d26a..d63a79ee33dc 100644
> --- a/tools/testing/selftests/bpf/progs/task_kfunc_success.c
> +++ b/tools/testing/selftests/bpf/progs/task_kfunc_success.c
> @@ -140,17 +140,17 @@ int BPF_PROG(test_task_acquire_leave_in_map, struct task_struct *task, u64 clone
> return 0;
> }
>
> -SEC("tp_btf/task_newtask")
> -int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
> +SEC("syscall")
> +int test_task_xchg_release(const void *ctx)
> {
> - struct task_struct *kptr, *acquired;
> + struct task_struct *task, *kptr, *acquired;
> struct __tasks_kfunc_map_value *v, *local;
> int refcnt, refcnt_after_drop;
> long status;
>
> - if (!is_test_kfunc_task())
> - return 0;
> + (void)ctx;
>
> + task = bpf_get_current_task_btf();
> status = tasks_kfunc_map_insert(task);
> if (status) {
> err = 1;
> @@ -191,7 +191,7 @@ int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
> return 0;
> }
>
> - /* Stash a copy into local kptr and check if it is released recursively */
> + /* Stash a copy into local kptr and check if it is released recursively. */
> acquired = bpf_task_acquire(kptr);
> if (!acquired) {
> err = 7;
> @@ -220,7 +220,6 @@ int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
> }
>
> bpf_task_release(kptr);
> -
> return 0;
> }
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs
2026-06-09 13:31 ` Mykyta Yatsenko
@ 2026-06-09 14:08 ` Kumar Kartikeya Dwivedi
2026-06-09 16:14 ` Justin Suess
0 siblings, 1 reply; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-09 14:08 UTC (permalink / raw)
To: Mykyta Yatsenko, Kumar Kartikeya Dwivedi, bpf
Cc: Justin Suess, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eduard Zingerman, Emil Tsalapatis, kkd,
kernel-team
On Tue Jun 9, 2026 at 3:31 PM CEST, Mykyta Yatsenko wrote:
>
>
> On 6/9/26 10:37 AM, Kumar Kartikeya Dwivedi wrote:
>> From: Justin Suess <utilityemal77@gmail.com>
>>
>> bpf_obj_drop() runs bpf_obj_free_fields() synchronously for
>> program-allocated objects. When such an object contains NMI unsafe
>> fields, a non-iterator tracing program can reach that destruction from
>> arbitrary instrumented context, including NMI.
>>
>> NMI is likely one instance of this problem, and other instances would
>> include possible unsafe reentrancy. Deferring bpf_obj_drop() is not
>> appealing either: it would add delayed-free machinery to a release
>> operation that otherwise has straightforward synchronous ownership
>> semantics.
>>
>> Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs
>> unless every field in the object's BTF record is explicitly NMI safe.
>> Note that while bpf_rb_root and bpf_list_head would be NMI safe on their
>> own to free, the objects recursively held by them may not be; be
>> conservative and just mark them as not NMI safe for now.
>>
>> Use a whitelist for the NMI-safe field set instead of listing only known
>> NMI unsafe fields. Locks, async fields, unreferenced kptrs, and
>> refcounts are known to be NMI safe because their destruction is either a
>> no-op, simple state reset, or async cancellation. Referenced kptrs,
>> percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future
>> field type are rejected until audited for arbitrary tracing and NMI
>> contexts. This is less susceptible to future changes in fields that were
>> previously safe by exclusion, and to new fields being added without
>> updating this check.
>>
>> Convert the existing recursive local-object drop success case to a
>> syscall program in the same commit, since this verifier change makes the
>> old tracing program form invalid. The test still exercises
>> bpf_obj_drop() releasing a referenced task kptr from a safe program
>> type.
>>
>> Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop")
>> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
>> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> ---
>> include/linux/bpf.h | 29 +++++++++++++
>> kernel/bpf/verifier.c | 16 +++++++
>> .../selftests/bpf/prog_tests/task_kfunc.c | 42 ++++++++++++++++++-
>> .../selftests/bpf/progs/task_kfunc_success.c | 13 +++---
>> 4 files changed, 92 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 62bba7a4876f..0654d2ffadc1 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -492,6 +492,35 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
>> return rec->field_mask & type;
>> }
>>
>> +static inline bool btf_field_is_nmi_safe(enum btf_field_type type)
>> +{
>> + switch (type) {
>> + case BPF_SPIN_LOCK:
>> + case BPF_RES_SPIN_LOCK:
>> + case BPF_TIMER:
>> + case BPF_WORKQUEUE:
>> + case BPF_TASK_WORK:
>> + case BPF_KPTR_UNREF:
>> + case BPF_REFCOUNT:
>> + return true;
>> + default:
>> + return false;
>> + }
>> +}
>> +
>> +static inline bool btf_record_has_nmi_unsafe_fields(const struct btf_record *rec)
>> +{
>> + int i;
>> +
>> + if (IS_ERR_OR_NULL(rec))
>> + return false;
>> + for (i = 0; i < rec->cnt; i++) {
>> + if (!btf_field_is_nmi_safe(rec->fields[i].type))
>> + return true;
>> + }
>> + return false;
>> +}
>> +
>
> Do we need this stuff in the header file? Why not put it into the verifier.c?
No strong reason, I can move it to verifier.c too.
>
>> static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
>> {
>> int i;
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index ed7ba0e6a9ce..3b43e2bd0f2a 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -205,6 +205,7 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int id);
>> static int release_reference(struct bpf_verifier_env *env, int id);
>> static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
>> static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
>> +static bool is_tracing_prog_type(enum bpf_prog_type type);
>> static int ref_set_non_owning(struct bpf_verifier_env *env,
>> struct bpf_reg_state *reg);
>> static bool is_trusted_reg(struct bpf_verifier_env *env, const struct bpf_reg_state *reg);
>> @@ -12881,6 +12882,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>> int *insn_idx_p)
>> {
>> bool sleepable, rcu_lock, rcu_unlock, preempt_disable, preempt_enable;
>> + enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
>> struct bpf_reg_state *regs = cur_regs(env);
>> const char *func_name, *ptr_type_name;
>> const struct btf_type *t, *ptr_type;
>> @@ -12957,6 +12959,20 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>> if (err < 0)
>> return err;
>>
>> + if ((is_bpf_obj_drop_kfunc(meta.func_id) ||
>> + is_bpf_percpu_obj_drop_kfunc(meta.func_id)) && (is_tracing_prog_type(prog_type) ||
>
> We have sleepable tracepoints, are we going to reject them as well? I wonder if there is a better way
> to detect NMI context in verifier.
>
NMI is probably one concern, reentrancy is another, though admiteddly sleepable
tracepoints are probably fine for this case (I didn't think too hard).
That said, on the other bit about having better ways of detecting such things, I
was wondering whether we should bite the bullet and decouple context information
from prog type. We sort of use program types as a proxy for it, but we may have
more clarity around expectations and safety checks if the context a program may
run in is extracted from the type (+ more contextual information like attach BTF
ID if available, otherwise falling back to the conservative set), and operations
have their checks be done against the context. For helpers, kfuncs, except for
specific special cases that require specific program hooks to function
correctly, registration could move from program types to contexts it can be
called in safely in general.
This is something I've been wondering for a while, and it's a substantial
change, but it appears to better fit the way we reason about safety now and use
various program types in checks and kfunc registration in practice anyway.
It would be nice to hear opinions on whether this makes sense to others.
> [...]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs
2026-06-09 14:08 ` Kumar Kartikeya Dwivedi
@ 2026-06-09 16:14 ` Justin Suess
0 siblings, 0 replies; 13+ messages in thread
From: Justin Suess @ 2026-06-09 16:14 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Mykyta Yatsenko, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eduard Zingerman, Emil Tsalapatis, kkd,
kernel-team
On Tue, Jun 09, 2026 at 04:08:59PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Tue Jun 9, 2026 at 3:31 PM CEST, Mykyta Yatsenko wrote:
> >
> >
> > On 6/9/26 10:37 AM, Kumar Kartikeya Dwivedi wrote:
> >> From: Justin Suess <utilityemal77@gmail.com>
> >>
> >> bpf_obj_drop() runs bpf_obj_free_fields() synchronously for
> >> program-allocated objects. When such an object contains NMI unsafe
> >> fields, a non-iterator tracing program can reach that destruction from
> >> arbitrary instrumented context, including NMI.
> >>
> >> NMI is likely one instance of this problem, and other instances would
> >> include possible unsafe reentrancy. Deferring bpf_obj_drop() is not
> >> appealing either: it would add delayed-free machinery to a release
> >> operation that otherwise has straightforward synchronous ownership
> >> semantics.
> >>
> >> Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs
> >> unless every field in the object's BTF record is explicitly NMI safe.
> >> Note that while bpf_rb_root and bpf_list_head would be NMI safe on their
> >> own to free, the objects recursively held by them may not be; be
> >> conservative and just mark them as not NMI safe for now.
> >>
> >> Use a whitelist for the NMI-safe field set instead of listing only known
> >> NMI unsafe fields. Locks, async fields, unreferenced kptrs, and
> >> refcounts are known to be NMI safe because their destruction is either a
> >> no-op, simple state reset, or async cancellation. Referenced kptrs,
> >> percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future
> >> field type are rejected until audited for arbitrary tracing and NMI
> >> contexts. This is less susceptible to future changes in fields that were
> >> previously safe by exclusion, and to new fields being added without
> >> updating this check.
> >>
> >> Convert the existing recursive local-object drop success case to a
> >> syscall program in the same commit, since this verifier change makes the
> >> old tracing program form invalid. The test still exercises
> >> bpf_obj_drop() releasing a referenced task kptr from a safe program
> >> type.
> >>
> >> Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop")
> >> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> >> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> >> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> >> ---
> >> include/linux/bpf.h | 29 +++++++++++++
> >> kernel/bpf/verifier.c | 16 +++++++
> >> .../selftests/bpf/prog_tests/task_kfunc.c | 42 ++++++++++++++++++-
> >> .../selftests/bpf/progs/task_kfunc_success.c | 13 +++---
> >> 4 files changed, 92 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index 62bba7a4876f..0654d2ffadc1 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -492,6 +492,35 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
> >> return rec->field_mask & type;
> >> }
> >>
> >> +static inline bool btf_field_is_nmi_safe(enum btf_field_type type)
> >> +{
> >> + switch (type) {
> >> + case BPF_SPIN_LOCK:
> >> + case BPF_RES_SPIN_LOCK:
> >> + case BPF_TIMER:
> >> + case BPF_WORKQUEUE:
> >> + case BPF_TASK_WORK:
> >> + case BPF_KPTR_UNREF:
> >> + case BPF_REFCOUNT:
> >> + return true;
> >> + default:
> >> + return false;
> >> + }
> >> +}
> >> +
> >> +static inline bool btf_record_has_nmi_unsafe_fields(const struct btf_record *rec)
> >> +{
> >> + int i;
> >> +
> >> + if (IS_ERR_OR_NULL(rec))
> >> + return false;
> >> + for (i = 0; i < rec->cnt; i++) {
> >> + if (!btf_field_is_nmi_safe(rec->fields[i].type))
> >> + return true;
> >> + }
> >> + return false;
> >> +}
> >> +
> >
> > Do we need this stuff in the header file? Why not put it into the verifier.c?
>
> No strong reason, I can move it to verifier.c too.
>
> >
> >> static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
> >> {
> >> int i;
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index ed7ba0e6a9ce..3b43e2bd0f2a 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -205,6 +205,7 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int id);
> >> static int release_reference(struct bpf_verifier_env *env, int id);
> >> static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
> >> static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
> >> +static bool is_tracing_prog_type(enum bpf_prog_type type);
> >> static int ref_set_non_owning(struct bpf_verifier_env *env,
> >> struct bpf_reg_state *reg);
> >> static bool is_trusted_reg(struct bpf_verifier_env *env, const struct bpf_reg_state *reg);
> >> @@ -12881,6 +12882,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >> int *insn_idx_p)
> >> {
> >> bool sleepable, rcu_lock, rcu_unlock, preempt_disable, preempt_enable;
> >> + enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> >> struct bpf_reg_state *regs = cur_regs(env);
> >> const char *func_name, *ptr_type_name;
> >> const struct btf_type *t, *ptr_type;
> >> @@ -12957,6 +12959,20 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >> if (err < 0)
> >> return err;
> >>
> >> + if ((is_bpf_obj_drop_kfunc(meta.func_id) ||
> >> + is_bpf_percpu_obj_drop_kfunc(meta.func_id)) && (is_tracing_prog_type(prog_type) ||
> >
> > We have sleepable tracepoints, are we going to reject them as well? I wonder if there is a better way
> > to detect NMI context in verifier.
> >
>
> NMI is probably one concern, reentrancy is another, though admiteddly sleepable
> tracepoints are probably fine for this case (I didn't think too hard).
>
> That said, on the other bit about having better ways of detecting such things, I
> was wondering whether we should bite the bullet and decouple context information
> from prog type. We sort of use program types as a proxy for it, but we may have
> more clarity around expectations and safety checks if the context a program may
> run in is extracted from the type (+ more contextual information like attach BTF
> ID if available, otherwise falling back to the conservative set), and operations
> have their checks be done against the context. For helpers, kfuncs, except for
> specific special cases that require specific program hooks to function
> correctly, registration could move from program types to contexts it can be
> called in safely in general.
>
> This is something I've been wondering for a while, and it's a substantial
> change, but it appears to better fit the way we reason about safety now and use
> various program types in checks and kfunc registration in practice anyway.
>
> It would be nice to hear opinions on whether this makes sense to others.
>
For the NMI case, this seems like a function coloring problem.
Basically there's a subset of functions and tracepoints in the kernel
that are reachable in NMI.
We could naively through some instrumentation, tag any
function explicitly called by an NMI handler as NMI unsafe.
This doesn't work perfectly. Dynamic dispatch via function pointers
defeats this approach. This would also be prone to false positives,
where a function reachable from an NMI handler reachable function
could have a conditonal call that never triggers if called from NMI.
The problem here mirrors that which motivated the introduction of the
resilient spinlock. Statically analyzing the locking correctness
to prevent deadlock is impossible for all cases, therefore, the failure
state was moved to runtime. This is similar; we can't statically analyze
the interrupt state at the many programs/attach sites available to BPF
programs without overshooting or undershooting.
The approach here is the overshooting case.
*Bottom line is that statically analyzing interrupt state is impossible.*
Runtime detection is technically possible. But it opens up a different
can of worms: safely handling the error condition without causing
breakage.
I considered the following approaches, but decided against it.
For each of the below:
Introduce a helper meta attribute and a kfunc annotation (KF_NMI)
to whitelist the kfuncs and helpers that safely handle NMI. Think much
like KF_SLEEPABLE. There could be others, KF_SOFTIRQ, KF_HARDIRQ, etc.
KF_NMI could imply the lower ranked interrupt KF_HARDIRQ and KF_SOFTIRQ
Annotating the functions with the interrupt safety allows
us to determine a verdict on the highest interrupt level a given
program can be safely run in. For example, a program making a call to a
KF_NMI and a KF_HARDIRQ function could be known to be safe to run in
hardirq, but not safe in NMI. because KF_HARDIRQ < KF_NMI.
When we can determine at verification time that the program context is
safe for those contexts, we allow them to be called.
a) Require explicit guard against interrupt contexts
For cases where static analysis is impossible, we allow their emission
only in a branch that explicitly guards against that interrupt.
For a practical example:
We could allow bpf_obj_drop in fentry tracing progs if guarded
if (bpf_in_nmi()) /* helper doesn't exist yet, hypothetical */
bpf_obj_drop(some kptr);
Other kfuncs / helpers could be adjusted to move that check inside
the function call, exposing the failure through an error return code,
if the function's return type permits it.
int err = some_nmi_unsafe_function();
Obviously this isn't possible for void returning functions.
b) Implicit guard of interrupt context
We implictly emit this conditional branch through a fix up.
Basically we allow the emission of these helpers/kfuncs, but generate
bytecode that skips the call if the context for calling it is unsafe
So in pseudocode when the user calls some non-KF_NMI function, we rewrite
it to
if (bpf_in_nmi()) /* helper doesn't exist yet, hypothetical */
bpf_obj_drop(some kptr);
The issue is that this conditional would be invisible to the user,
invokation of the function would be silently skipped.
c) Literally just drop the call
Insert a guard for the highest allowed interrupt level at the beginning
of the program's invokation. (our verdict)
If the guard fails abort the program / return 0 for fmodify.
This is the simplest approach, but the error condition is not
transparent to the user.
We could add a tracepoint that catches the condition, but this in itself
would run at the same interrupt context, so it would probably need to
*require* at verification time that all functions have KF_(Whatever
interrupt safety level)
...
All of these approaches are problematic at some level.
This is just some ideas to add.
Justin
> > [...]
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2026-06-09 16:14 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-09 9:37 [PATCH bpf-next v2 0/4] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
2026-06-09 9:37 ` [PATCH bpf-next v2 1/4] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
2026-06-09 10:05 ` bot+bpf-ci
2026-06-09 13:31 ` Mykyta Yatsenko
2026-06-09 14:08 ` Kumar Kartikeya Dwivedi
2026-06-09 16:14 ` Justin Suess
2026-06-09 9:37 ` [PATCH bpf-next v2 2/4] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
2026-06-09 9:55 ` sashiko-bot
2026-06-09 10:51 ` Mykyta Yatsenko
2026-06-09 9:37 ` [PATCH bpf-next v2 3/4] selftests/bpf: Exercise unsafe obj drops from tracing progs Kumar Kartikeya Dwivedi
2026-06-09 10:02 ` sashiko-bot
2026-06-09 9:37 ` [PATCH bpf-next v2 4/4] selftests/bpf: Exercise kptr map update lifetime Kumar Kartikeya Dwivedi
2026-06-09 10:20 ` sashiko-bot
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.