[PATCH bpf-next v1 0/5] Fix kptr dtor deadlock

BPF List
 help / color / mirror / Atom feed

* [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock
@ 2026-06-08 14:48 Kumar Kartikeya Dwivedi
  2026-06-08 14:48 ` [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing Kumar Kartikeya Dwivedi
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 14:48 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, Justin Suess, kkd, kernel-team

Referenced kptr destruction can run from tracing/NMI contexts through
bpf_obj_drop() and map value update/delete paths, reaching NMI-unsafe
special field teardown and deadlocks. Justin reported the issue and
iterated on fixes in [0]-[2], and also confirmed the bpf_obj_drop()
reproducer in [3].

This series rejects unsafe obj drops from non-iterator tracing programs,
limits map value recycle to NMI-safe field cancellation, and adds
focused selftests for the obj_drop(), NMI delete, and recycle teardown
cases.

See patches for details.

  [0]: https://lore.kernel.org/bpf/20260505150851.3090688-1-utilityemal77@gmail.com
  [1]: https://lore.kernel.org/bpf/20260507175453.1140400-1-utilityemal77@gmail.com
  [2]: https://lore.kernel.org/bpf/20260519011450.1144935-1-utilityemal77@gmail.com
  [3]: https://lore.kernel.org/bpf/agyG3eQwgmoJwmj2@suesslenovo

Justin Suess (2):
  bpf: Reject bpf_obj_drop() from tracing progs
  bpf: Cancel special fields on map value recycle

Kumar Kartikeya Dwivedi (3):
  bpf: Treat non-iterator tracing progs as tracing
  selftests/bpf: Exercise unsafe obj drops from tracing progs
  selftests/bpf: Exercise kptr map update lifetime

 include/linux/bpf.h                           | 30 +++++++
 kernel/bpf/arraymap.c                         |  8 +-
 kernel/bpf/hashtab.c                          | 32 ++++---
 kernel/bpf/syscall.c                          | 27 ++++++
 kernel/bpf/verifier.c                         | 51 +++++++----
 .../selftests/bpf/prog_tests/htab_update.c    |  4 +-
 .../selftests/bpf/prog_tests/map_kptr.c       | 66 ++++++++++++--
 .../selftests/bpf/prog_tests/task_kfunc.c     | 42 ++++++++-
 .../testing/selftests/bpf/progs/htab_update.c |  4 +-
 tools/testing/selftests/bpf/progs/map_kptr.c  | 89 ++++++++++++++++++-
 .../selftests/bpf/progs/task_kfunc_failure.c  | 43 +++++++++
 .../selftests/bpf/progs/task_kfunc_success.c  | 13 ++-
 12 files changed, 352 insertions(+), 57 deletions(-)


base-commit: 50dff00615522f3ec03449680ca23beb4cfc549c
-- 
2.53.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing
  2026-06-08 14:48 [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
@ 2026-06-08 14:48 ` Kumar Kartikeya Dwivedi
  2026-06-08 14:51   ` Kumar Kartikeya Dwivedi
                     ` (3 more replies)
  2026-06-08 14:48 ` [PATCH bpf-next v1 2/5] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
                   ` (4 subsequent siblings)
  5 siblings, 4 replies; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 14:48 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, Justin Suess, kkd, kernel-team

The is_tracing_prog_type() predicate omitted BPF_PROG_TYPE_TRACING even
though fentry, fexit, fmod_ret, raw_tp BTF and similar programs have the
same execution-context concerns as the tracing program types already
covered by the helper.

This matters for map compatibility checks that reject bpf_spin_lock,
bpf_list_head and bpf_rb_root in tracing contexts. BPF_PROG_TYPE_TRACING
programs can run from arbitrary instrumented contexts, including places
where taking these locks or manipulating graph roots is not safe.
BPF_TRACE_ITER is different: iterator programs run from task context, so
we continue to exclude them.

This can reject existing fentry/fexit-style programs that use map values
with these fields. Such programs were accepted only because the
predicate missed this program type; their use depends on semantics the
verifier already rejects for equivalent tracing hooks.

Move is_tracing_prog_type() checks from check_map_prog_compatibility()
to points where the fields are actually used to avoid preemptively
rejecting tracing programs that use maps with such fields but do not
touch these fields.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/verifier.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ed7ba0e6a9ce..26bfb4465725 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6967,6 +6967,11 @@ static int process_spin_lock(struct bpf_verifier_env *env, struct bpf_reg_state
 	u32 spin_lock_off;
 	int err;
 
+	if (is_tracing_prog_type(env->prog)) {
+		verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
+		return -EINVAL;
+	}
+
 	if (!is_const) {
 		verbose(env,
 			"%s doesn't have constant offset. %s_lock has to be at the constant offset\n",
@@ -12222,6 +12227,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 				return ret;
 			break;
 		case KF_ARG_PTR_TO_LIST_HEAD:
+			if (is_tracing_prog_type(env->prog)) {
+				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
+				return -EINVAL;
+			}
 			if (reg->type != PTR_TO_MAP_VALUE &&
 			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
 				verbose(env, "%s expected pointer to map value or allocated object\n",
@@ -12238,6 +12247,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 				return ret;
 			break;
 		case KF_ARG_PTR_TO_RB_ROOT:
+			if (is_tracing_prog_type(env->prog)) {
+				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
+				return -EINVAL;
+			}
 			if (reg->type != PTR_TO_MAP_VALUE &&
 			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
 				verbose(env, "%s expected pointer to map value or allocated object\n",
@@ -17664,9 +17677,11 @@ static int check_pseudo_btf_id(struct bpf_verifier_env *env,
 	return __add_used_btf(env, btf);
 }
 
-static bool is_tracing_prog_type(enum bpf_prog_type type)
+static bool is_tracing_prog_type(const struct bpf_prog *prog)
 {
-	switch (type) {
+	switch (resolve_prog_type(prog)) {
+	case BPF_PROG_TYPE_TRACING:
+		return prog->expected_attach_type != BPF_TRACE_ITER;
 	case BPF_PROG_TYPE_KPROBE:
 	case BPF_PROG_TYPE_TRACEPOINT:
 	case BPF_PROG_TYPE_PERF_EVENT:
@@ -17697,24 +17712,16 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		return -EACCES;
 	}
 
-	if (btf_record_has_field(map->record, BPF_LIST_HEAD) ||
-	    btf_record_has_field(map->record, BPF_RB_ROOT)) {
-		if (is_tracing_prog_type(prog_type)) {
-			verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
-			return -EINVAL;
-		}
-	}
-
 	if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
 		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
 			verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
 			return -EINVAL;
 		}
-
-		if (is_tracing_prog_type(prog_type)) {
-			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
-			return -EINVAL;
-		}
+		/*
+		 * Rejecting tracing progs accessing maps with bpf_spin_lock in
+		 * them here would be too conservative; let's defer rejection
+		 * until seeing first use.
+		 */
 	}
 
 	if ((bpf_prog_is_offloaded(prog->aux) || bpf_map_is_offloaded(map)) &&
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v1 2/5] bpf: Reject bpf_obj_drop() from tracing progs
  2026-06-08 14:48 [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
  2026-06-08 14:48 ` [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing Kumar Kartikeya Dwivedi
@ 2026-06-08 14:48 ` Kumar Kartikeya Dwivedi
  2026-06-08 15:40   ` sashiko-bot
  2026-06-08 14:48 ` [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 14:48 UTC (permalink / raw)
  To: bpf
  Cc: Justin Suess, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Eduard Zingerman, Emil Tsalapatis, kkd,
	kernel-team

From: Justin Suess <utilityemal77@gmail.com>

bpf_obj_drop() runs bpf_obj_free_fields() synchronously for
program-allocated objects. When such an object contains NMI unsafe
fields, a non-iterator tracing program can reach that destruction from
arbitrary instrumented context, including NMI.

NMI is likely one instance of this problem, and other instances would
include possible unsafe reentrancy. Deferring bpf_obj_drop() is not
appealing either: it would add delayed-free machinery to a release
operation that otherwise has straightforward synchronous ownership
semantics.

Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs
unless every field in the object's BTF record is explicitly NMI safe.
Note that while bpf_rb_root and bpf_list_head would be NMI safe on their
own to free, the objects recursively held by them may not be; be
conservative and just mark them as not NMI safe for now.

Use a whitelist for the NMI-safe field set instead of listing only known
NMI unsafe fields. Locks, async fields, unreferenced kptrs, and
refcounts are known to be NMI safe because their destruction is either a
no-op, simple state reset, or async cancellation. Referenced kptrs,
percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future
field type are rejected until audited for arbitrary tracing and NMI
contexts. This is less susceptible to future changes in fields that were
previously safe by exclusion, and to new fields being added without
updating this check.

Convert the existing recursive local-object drop success case to a
syscall program in the same commit, since this verifier change makes the
old tracing program form invalid. The test still exercises
bpf_obj_drop() releasing a referenced task kptr from a safe program
type.

Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop")
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h                           | 29 +++++++++++++
 kernel/bpf/verifier.c                         | 14 +++++++
 .../selftests/bpf/prog_tests/task_kfunc.c     | 42 ++++++++++++++++++-
 .../selftests/bpf/progs/task_kfunc_success.c  | 13 +++---
 4 files changed, 90 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 62bba7a4876f..0654d2ffadc1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -492,6 +492,35 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
 	return rec->field_mask & type;
 }
 
+static inline bool btf_field_is_nmi_safe(enum btf_field_type type)
+{
+	switch (type) {
+	case BPF_SPIN_LOCK:
+	case BPF_RES_SPIN_LOCK:
+	case BPF_TIMER:
+	case BPF_WORKQUEUE:
+	case BPF_TASK_WORK:
+	case BPF_KPTR_UNREF:
+	case BPF_REFCOUNT:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static inline bool btf_record_has_nmi_unsafe_fields(const struct btf_record *rec)
+{
+	int i;
+
+	if (IS_ERR_OR_NULL(rec))
+		return false;
+	for (i = 0; i < rec->cnt; i++) {
+		if (!btf_field_is_nmi_safe(rec->fields[i].type))
+			return true;
+	}
+	return false;
+}
+
 static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
 {
 	int i;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 26bfb4465725..d636484f1d6c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -205,6 +205,7 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int id);
 static int release_reference(struct bpf_verifier_env *env, int id);
 static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
 static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
+static bool is_tracing_prog_type(const struct bpf_prog *prog);
 static int ref_set_non_owning(struct bpf_verifier_env *env,
 			      struct bpf_reg_state *reg);
 static bool is_trusted_reg(struct bpf_verifier_env *env, const struct bpf_reg_state *reg);
@@ -12970,6 +12971,19 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	if (err < 0)
 		return err;
 
+	if ((is_bpf_obj_drop_kfunc(meta.func_id) ||
+	     is_bpf_percpu_obj_drop_kfunc(meta.func_id)) && is_tracing_prog_type(env->prog)) {
+		struct btf_struct_meta *struct_meta;
+
+		struct_meta = btf_find_struct_meta(meta.arg_btf, meta.arg_btf_id);
+		if (struct_meta &&
+		    btf_record_has_nmi_unsafe_fields(struct_meta->record)) {
+			verbose(env, "%s cannot be used in tracing programs on types with NMI unsafe fields\n",
+				func_name);
+			return -EINVAL;
+		}
+	}
+
 	if (is_bpf_rbtree_add_kfunc(meta.func_id)) {
 		err = push_callback_call(env, insn, insn_idx, meta.subprogno,
 					 set_rbtree_add_callback_state);
diff --git a/tools/testing/selftests/bpf/prog_tests/task_kfunc.c b/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
index 83b90335967a..e6e95c1416e6 100644
--- a/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/task_kfunc.c
@@ -68,6 +68,36 @@ static void run_success_test(const char *prog_name)
 	task_kfunc_success__destroy(skel);
 }
 
+static void run_syscall_success_test(const char *prog_name)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct task_kfunc_success *skel;
+	struct bpf_program *prog;
+	int err;
+
+	skel = open_load_task_kfunc_skel();
+	if (!ASSERT_OK_PTR(skel, "open_load_skel"))
+		return;
+
+	if (!ASSERT_OK(skel->bss->err, "pre_run_err"))
+		goto cleanup;
+
+	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+		goto cleanup;
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
+	if (!ASSERT_OK(err, "bpf_prog_test_run_opts"))
+		goto cleanup;
+	if (!ASSERT_EQ(opts.retval, 0, "retval"))
+		goto cleanup;
+
+	ASSERT_OK(skel->bss->err, "post_run_err");
+
+cleanup:
+	task_kfunc_success__destroy(skel);
+}
+
 static int run_vpid_test(void *prog_name)
 {
 	struct task_kfunc_success *skel;
@@ -140,7 +170,6 @@ static const char * const success_tests[] = {
 	"test_task_acquire_release_argument",
 	"test_task_acquire_release_current",
 	"test_task_acquire_leave_in_map",
-	"test_task_xchg_release",
 	"test_task_map_acquire_release",
 	"test_task_current_acquire_release",
 	"test_task_from_pid_arg",
@@ -151,6 +180,10 @@ static const char * const success_tests[] = {
 	"test_task_kfunc_flavor_relo_not_found",
 };
 
+static const char * const syscall_success_tests[] = {
+	"test_task_xchg_release",
+};
+
 static const char * const vpid_success_tests[] = {
 	"test_task_from_vpid_current",
 	"test_task_from_vpid_invalid",
@@ -167,6 +200,13 @@ void test_task_kfunc(void)
 		run_success_test(success_tests[i]);
 	}
 
+	for (i = 0; i < ARRAY_SIZE(syscall_success_tests); i++) {
+		if (!test__start_subtest(syscall_success_tests[i]))
+			continue;
+
+		run_syscall_success_test(syscall_success_tests[i]);
+	}
+
 	for (i = 0; i < ARRAY_SIZE(vpid_success_tests); i++) {
 		if (!test__start_subtest(vpid_success_tests[i]))
 			continue;
diff --git a/tools/testing/selftests/bpf/progs/task_kfunc_success.c b/tools/testing/selftests/bpf/progs/task_kfunc_success.c
index 5fb4fc19d26a..d63a79ee33dc 100644
--- a/tools/testing/selftests/bpf/progs/task_kfunc_success.c
+++ b/tools/testing/selftests/bpf/progs/task_kfunc_success.c
@@ -140,17 +140,17 @@ int BPF_PROG(test_task_acquire_leave_in_map, struct task_struct *task, u64 clone
 	return 0;
 }
 
-SEC("tp_btf/task_newtask")
-int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
+SEC("syscall")
+int test_task_xchg_release(const void *ctx)
 {
-	struct task_struct *kptr, *acquired;
+	struct task_struct *task, *kptr, *acquired;
 	struct __tasks_kfunc_map_value *v, *local;
 	int refcnt, refcnt_after_drop;
 	long status;
 
-	if (!is_test_kfunc_task())
-		return 0;
+	(void)ctx;
 
+	task = bpf_get_current_task_btf();
 	status = tasks_kfunc_map_insert(task);
 	if (status) {
 		err = 1;
@@ -191,7 +191,7 @@ int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
 		return 0;
 	}
 
-	/* Stash a copy into local kptr and check if it is released recursively */
+	/* Stash a copy into local kptr and check if it is released recursively. */
 	acquired = bpf_task_acquire(kptr);
 	if (!acquired) {
 		err = 7;
@@ -220,7 +220,6 @@ int BPF_PROG(test_task_xchg_release, struct task_struct *task, u64 clone_flags)
 	}
 
 	bpf_task_release(kptr);
-
 	return 0;
 }
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle
  2026-06-08 14:48 [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
  2026-06-08 14:48 ` [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing Kumar Kartikeya Dwivedi
  2026-06-08 14:48 ` [PATCH bpf-next v1 2/5] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
@ 2026-06-08 14:48 ` Kumar Kartikeya Dwivedi
  2026-06-08 15:44   ` bot+bpf-ci
                     ` (2 more replies)
  2026-06-08 14:48 ` [PATCH bpf-next v1 4/5] selftests/bpf: Exercise unsafe obj drops from tracing progs Kumar Kartikeya Dwivedi
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 14:48 UTC (permalink / raw)
  To: bpf
  Cc: Justin Suess, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Eduard Zingerman, Emil Tsalapatis, kkd,
	kernel-team

From: Justin Suess <utilityemal77@gmail.com>

Map update and delete paths currently call bpf_obj_free_fields() when a
value is being replaced or recycled. That makes field destruction depend
on the context of the update/delete operation. For tracing programs this
can include NMI context, where referenced kptr destructors, uptr
unpinning, and graph root destruction are not generally safe.

Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
performs NMI-safe cleanup for timer, workqueue, and task_work fields.
Fields that need full destruction are left attached to the recycled value
and are destroyed by the final cleanup path instead.

Switch array and hashtab update/delete/recycle paths to this cancel
helper. Keep bpf_obj_free_fields() for final map destruction and for
bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
destructors, so teardown continues to walk the normal and extra elements
and fully destroy their fields.

This deliberately relaxes the eager-free semantics of map update/delete
for special fields. Programs that relied on a recycled map slot becoming
empty immediately after update/delete were relying on behavior that
cannot be implemented safely from every BPF execution context without
offloading arbitrary destructors.

There is a chance this change breaks programs making assumptions
regarding the eager freeing of fields. If so, we can relax semantics to
cancellation only when irqs_disabled() is true in the future. However,
theoretically, map values that get reused eagerly already have weaker
guarantees as parallel users can recreate freed fields before the new
element becomes visible again.

Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h                           |  1 +
 kernel/bpf/arraymap.c                         |  8 ++---
 kernel/bpf/hashtab.c                          | 32 +++++++++++--------
 kernel/bpf/syscall.c                          | 27 ++++++++++++++++
 .../selftests/bpf/prog_tests/htab_update.c    |  4 +--
 .../selftests/bpf/prog_tests/map_kptr.c       | 10 +-----
 .../testing/selftests/bpf/progs/htab_update.c |  4 +--
 7 files changed, 55 insertions(+), 31 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0654d2ffadc1..fc24b81cc206 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2717,6 +2717,7 @@ bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *r
 void bpf_obj_free_timer(const struct btf_record *rec, void *obj);
 void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj);
 void bpf_obj_free_task_work(const struct btf_record *rec, void *obj);
+void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj);
 void bpf_obj_free_fields(const struct btf_record *rec, void *obj);
 void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu);
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e6271a2bf6d6..4e8d6fe6999b 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -384,7 +384,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
 		val = this_cpu_ptr(array->pptrs[index & array->index_mask]);
 		copy_map_value(map, val, value);
-		bpf_obj_free_fields(array->map.record, val);
+		bpf_obj_cancel_fields(array->map.record, val);
 	} else {
 		val = array->value +
 			(u64)array->elem_size * (index & array->index_mask);
@@ -392,7 +392,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
 			copy_map_value_locked(map, val, value, false);
 		else
 			copy_map_value(map, val, value);
-		bpf_obj_free_fields(array->map.record, val);
+		bpf_obj_cancel_fields(array->map.record, val);
 	}
 	return 0;
 }
@@ -432,14 +432,14 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
 		cpu = map_flags >> 32;
 		ptr = per_cpu_ptr(pptr, cpu);
 		copy_map_value(map, ptr, value);
-		bpf_obj_free_fields(array->map.record, ptr);
+		bpf_obj_cancel_fields(array->map.record, ptr);
 		goto unlock;
 	}
 	for_each_possible_cpu(cpu) {
 		ptr = per_cpu_ptr(pptr, cpu);
 		val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
 		copy_map_value(map, ptr, val);
-		bpf_obj_free_fields(array->map.record, ptr);
+		bpf_obj_cancel_fields(array->map.record, ptr);
 	}
 unlock:
 	rcu_read_unlock();
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b4366cad3cfa..a9b2b882b90f 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -243,6 +243,10 @@ static void htab_free_prealloced_fields(struct bpf_htab *htab)
 
 	if (IS_ERR_OR_NULL(htab->map.record))
 		return;
+	/*
+	 * Preallocated maps do not have a bpf_mem_alloc destructor, so fully
+	 * destroy every element, including the extra elements.
+	 */
 	if (htab_has_extra_elems(htab))
 		num_entries += num_possible_cpus();
 	for (i = 0; i < num_entries; i++) {
@@ -833,8 +837,8 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
 	return insn - insn_buf;
 }
 
-static void check_and_free_fields(struct bpf_htab *htab,
-				  struct htab_elem *elem)
+static void check_and_cancel_fields(struct bpf_htab *htab,
+				    struct htab_elem *elem)
 {
 	if (IS_ERR_OR_NULL(htab->map.record))
 		return;
@@ -844,11 +848,11 @@ static void check_and_free_fields(struct bpf_htab *htab,
 		int cpu;
 
 		for_each_possible_cpu(cpu)
-			bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
+			bpf_obj_cancel_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
 	} else {
 		void *map_value = htab_elem_value(elem, htab->map.key_size);
 
-		bpf_obj_free_fields(htab->map.record, map_value);
+		bpf_obj_cancel_fields(htab->map.record, map_value);
 	}
 }
 
@@ -883,7 +887,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 	htab_unlock_bucket(b, flags);
 
 	if (l == tgt_l)
-		check_and_free_fields(htab, l);
+		check_and_cancel_fields(htab, l);
 	return l == tgt_l;
 }
 
@@ -948,7 +952,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
-	check_and_free_fields(htab, l);
+	check_and_cancel_fields(htab, l);
 
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
@@ -1001,7 +1005,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 
 	if (htab_is_prealloc(htab)) {
 		bpf_map_dec_elem_count(&htab->map);
-		check_and_free_fields(htab, l);
+		check_and_cancel_fields(htab, l);
 		pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
@@ -1018,7 +1022,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
 		/* copy true value_size bytes */
 		ptr = this_cpu_ptr(pptr);
 		copy_map_value(&htab->map, ptr, value);
-		bpf_obj_free_fields(htab->map.record, ptr);
+		bpf_obj_cancel_fields(htab->map.record, ptr);
 	} else {
 		u32 size = round_up(htab->map.value_size, 8);
 		void *val;
@@ -1028,7 +1032,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
 			cpu = map_flags >> 32;
 			ptr = per_cpu_ptr(pptr, cpu);
 			copy_map_value(&htab->map, ptr, value);
-			bpf_obj_free_fields(htab->map.record, ptr);
+			bpf_obj_cancel_fields(htab->map.record, ptr);
 			return;
 		}
 
@@ -1036,7 +1040,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
 			ptr = per_cpu_ptr(pptr, cpu);
 			val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
 			copy_map_value(&htab->map, ptr, val);
-			bpf_obj_free_fields(htab->map.record, ptr);
+			bpf_obj_cancel_fields(htab->map.record, ptr);
 		}
 	}
 }
@@ -1252,11 +1256,11 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	if (l_old) {
 		hlist_nulls_del_rcu(&l_old->hash_node);
 
-		/* l_old has already been stashed in htab->extra_elems, free
-		 * its special fields before it is available for reuse.
+		/* l_old has already been stashed in htab->extra_elems, cancel
+		 * its reusable special fields before it is available for reuse.
 		 */
 		if (htab_is_prealloc(htab))
-			check_and_free_fields(htab, l_old);
+			check_and_cancel_fields(htab, l_old);
 	}
 	htab_unlock_bucket(b, flags);
 	if (l_old && !htab_is_prealloc(htab))
@@ -1269,7 +1273,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 
 static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
 {
-	check_and_free_fields(htab, elem);
+	check_and_cancel_fields(htab, elem);
 	bpf_map_dec_elem_count(&htab->map);
 	bpf_lru_push_free(&htab->lru, &elem->lru_node);
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d4188a992bd8..8d12aa3728ea 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -808,6 +808,33 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
 	bpf_task_work_cancel_and_free(obj + rec->task_work_off);
 }
 
+void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj)
+{
+	const struct btf_field *fields;
+	int i;
+
+	if (IS_ERR_OR_NULL(rec))
+		return;
+	fields = rec->fields;
+	for (i = 0; i < rec->cnt; i++) {
+		void *field_ptr = obj + fields[i].offset;
+
+		switch (fields[i].type) {
+		case BPF_TIMER:
+			bpf_timer_cancel_and_free(field_ptr);
+			break;
+		case BPF_WORKQUEUE:
+			bpf_wq_cancel_and_free(field_ptr);
+			break;
+		case BPF_TASK_WORK:
+			bpf_task_work_cancel_and_free(field_ptr);
+			break;
+		default:
+			break;
+		}
+	}
+}
+
 void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 {
 	const struct btf_field *fields;
diff --git a/tools/testing/selftests/bpf/prog_tests/htab_update.c b/tools/testing/selftests/bpf/prog_tests/htab_update.c
index ea1a6766fbe9..0a28d4346924 100644
--- a/tools/testing/selftests/bpf/prog_tests/htab_update.c
+++ b/tools/testing/selftests/bpf/prog_tests/htab_update.c
@@ -23,7 +23,7 @@ static void test_reenter_update(void)
 	if (!ASSERT_OK_PTR(skel, "htab_update__open"))
 		return;
 
-	bpf_program__set_autoload(skel->progs.bpf_obj_free_fields, true);
+	bpf_program__set_autoload(skel->progs.bpf_obj_cancel_fields, true);
 	err = htab_update__load(skel);
 	if (!ASSERT_TRUE(!err, "htab_update__load") || err)
 		goto out;
@@ -50,7 +50,7 @@ static void test_reenter_update(void)
 	/*
 	 * Second update: replace existing element with same key and trigger
 	 * the reentrancy of bpf_map_update_elem().
-	 * check_and_free_fields() calls bpf_obj_free_fields() on the old
+	 * check_and_cancel_fields() calls bpf_obj_cancel_fields() on the old
 	 * value, which is where fentry program runs and performs a nested
 	 * bpf_map_update_elem(), triggering -EDEADLK.
 	 */
diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr.c b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
index 03b46f17cf53..ec6f2f2e8308 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_kptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
@@ -51,7 +51,6 @@ static void test_map_kptr_success(bool test_run)
 	ret = bpf_map__update_elem(skel->maps.array_map,
 				   &key, sizeof(key), buf, sizeof(buf), 0);
 	ASSERT_OK(ret, "array_map update");
-	skel->data->ref--;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
 	ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
 	ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
@@ -59,49 +58,42 @@ static void test_map_kptr_success(bool test_run)
 	ret = bpf_map__update_elem(skel->maps.pcpu_array_map,
 				   &key, sizeof(key), pbuf, cpu * sizeof(buf), 0);
 	ASSERT_OK(ret, "pcpu_array_map update");
-	skel->data->ref--;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
 	ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
 	ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
 
 	ret = bpf_map__delete_elem(skel->maps.hash_map, &key, sizeof(key), 0);
 	ASSERT_OK(ret, "hash_map delete");
-	skel->data->ref--;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
 	ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
 	ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
 
 	ret = bpf_map__delete_elem(skel->maps.pcpu_hash_map, &key, sizeof(key), 0);
 	ASSERT_OK(ret, "pcpu_hash_map delete");
-	skel->data->ref--;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
 	ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
 	ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
 
 	ret = bpf_map__delete_elem(skel->maps.hash_malloc_map, &key, sizeof(key), 0);
 	ASSERT_OK(ret, "hash_malloc_map delete");
-	skel->data->ref--;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
 	ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
 	ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
 
 	ret = bpf_map__delete_elem(skel->maps.pcpu_hash_malloc_map, &key, sizeof(key), 0);
 	ASSERT_OK(ret, "pcpu_hash_malloc_map delete");
-	skel->data->ref--;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
 	ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
 	ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
 
 	ret = bpf_map__delete_elem(skel->maps.lru_hash_map, &key, sizeof(key), 0);
 	ASSERT_OK(ret, "lru_hash_map delete");
-	skel->data->ref--;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
 	ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
 	ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
 
 	ret = bpf_map__delete_elem(skel->maps.lru_pcpu_hash_map, &key, sizeof(key), 0);
 	ASSERT_OK(ret, "lru_pcpu_hash_map delete");
-	skel->data->ref--;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_map_kptr_ref3), &opts);
 	ASSERT_OK(ret, "test_map_kptr_ref3 refcount");
 	ASSERT_OK(opts.retval, "test_map_kptr_ref3 retval");
@@ -175,7 +167,7 @@ void serial_test_map_kptr(void)
 		ASSERT_OK(kern_sync_rcu(), "sync rcu");
 		wait_for_map_release();
 
-		/* Observe refcount dropping to 1 on synchronous delete elem */
+		/* Observe refcount dropping to 1 on map release. */
 		test_map_kptr_success(true);
 	}
 
diff --git a/tools/testing/selftests/bpf/progs/htab_update.c b/tools/testing/selftests/bpf/progs/htab_update.c
index 195d3b2fba00..62c1b1325ec2 100644
--- a/tools/testing/selftests/bpf/progs/htab_update.c
+++ b/tools/testing/selftests/bpf/progs/htab_update.c
@@ -22,8 +22,8 @@ struct {
 int pid = 0;
 int update_err = 0;
 
-SEC("?fentry/bpf_obj_free_fields")
-int bpf_obj_free_fields(void *ctx)
+SEC("?fentry/bpf_obj_cancel_fields")
+int bpf_obj_cancel_fields(void *ctx)
 {
 	__u32 key = 0;
 	struct val value = { .payload = 1 };
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v1 4/5] selftests/bpf: Exercise unsafe obj drops from tracing progs
  2026-06-08 14:48 [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
                   ` (2 preceding siblings ...)
  2026-06-08 14:48 ` [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
@ 2026-06-08 14:48 ` Kumar Kartikeya Dwivedi
  2026-06-08 16:16   ` sashiko-bot
  2026-06-08 14:48 ` [PATCH bpf-next v1 5/5] selftests/bpf: Exercise kptr map update lifetime Kumar Kartikeya Dwivedi
  2026-06-08 14:58 ` [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
  5 siblings, 1 reply; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 14:48 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, Justin Suess, kkd, kernel-team

Add task_kfunc failure cases for bpf_obj_drop() on local objects with
referenced kptr fields from tracing and NMI tracing programs. These programs
must be rejected because dropping the object would run full special-field
destruction synchronously in an unsafe context.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/progs/task_kfunc_failure.c  | 43 +++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/task_kfunc_failure.c b/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
index 8e947d445f8e..98e00287c438 100644
--- a/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
+++ b/tools/testing/selftests/bpf/progs/task_kfunc_failure.c
@@ -5,6 +5,7 @@
 #include <bpf/bpf_tracing.h>
 #include <bpf/bpf_helpers.h>
 
+#include "../bpf_experimental.h"
 #include "bpf_misc.h"
 #include "task_kfunc_common.h"
 
@@ -233,6 +234,48 @@ int BPF_PROG(task_kfunc_release_unacquired, struct task_struct *task, u64 clone_
 	return 0;
 }
 
+SEC("tp_btf/task_newtask")
+__failure __msg("bpf_obj_drop cannot be used in tracing programs on types with NMI unsafe fields")
+int BPF_PROG(task_kfunc_obj_drop_with_kptr, struct task_struct *task, u64 clone_flags)
+{
+	struct __tasks_kfunc_map_value *local;
+
+	local = bpf_obj_new(typeof(*local));
+	if (!local)
+		return 0;
+
+	bpf_obj_drop(local);
+	return 0;
+}
+
+SEC("tp_btf/nmi_handler")
+__failure __msg("bpf_obj_drop cannot be used in tracing programs on types with NMI unsafe fields")
+int BPF_PROG(task_kfunc_obj_drop_nmi_with_kptr, void *handler, s64 delta_ns,
+	     int handled)
+{
+	struct __tasks_kfunc_map_value *local;
+	struct task_struct *task, *acquired, *old;
+
+	(void)handler;
+	(void)delta_ns;
+	(void)handled;
+
+	local = bpf_obj_new(typeof(*local));
+	if (!local)
+		return 0;
+
+	task = bpf_get_current_task_btf();
+	acquired = bpf_task_acquire(task);
+	if (acquired) {
+		old = bpf_kptr_xchg(&local->task, acquired);
+		if (old)
+			bpf_task_release(old);
+	}
+
+	bpf_obj_drop(local);
+	return 0;
+}
+
 SEC("tp_btf/task_newtask")
 __failure __msg("Possibly NULL pointer passed to trusted R1")
 int BPF_PROG(task_kfunc_from_pid_no_null_check, struct task_struct *task, u64 clone_flags)
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v1 5/5] selftests/bpf: Exercise kptr map update lifetime
  2026-06-08 14:48 [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
                   ` (3 preceding siblings ...)
  2026-06-08 14:48 ` [PATCH bpf-next v1 4/5] selftests/bpf: Exercise unsafe obj drops from tracing progs Kumar Kartikeya Dwivedi
@ 2026-06-08 14:48 ` Kumar Kartikeya Dwivedi
  2026-06-08 16:40   ` sashiko-bot
  2026-06-08 14:58 ` [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
  5 siblings, 1 reply; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 14:48 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, Justin Suess, kkd, kernel-team

Add focused map_kptr coverage for BPF-side map updates that touch values
containing referenced kptrs.

The new syscall programs stash the testmod refcounted object in an array
map, a preallocated hash map, and a no-prealloc hash map, then update the
same map from BPF. The refcount must remain elevated after the update,
while the userspace runner destroys the skeleton and reuses the existing
refcount wait to confirm map teardown releases the kptr.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/map_kptr.c       | 56 ++++++++++++
 tools/testing/selftests/bpf/progs/map_kptr.c  | 89 ++++++++++++++++++-
 2 files changed, 142 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr.c b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
index ec6f2f2e8308..17e707dddda8 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_kptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
@@ -143,12 +143,68 @@ static void wait_for_map_release(void)
 	map_kptr__destroy(skel);
 }
 
+enum map_update_kptr_case {
+	MAP_UPDATE_KPTR_ARRAY,
+	MAP_UPDATE_KPTR_HASH,
+	MAP_UPDATE_KPTR_HASH_MALLOC,
+};
+
+static struct bpf_program *map_update_kptr_prog(struct map_kptr *skel,
+						enum map_update_kptr_case test)
+{
+	switch (test) {
+	case MAP_UPDATE_KPTR_ARRAY:
+		return skel->progs.test_array_map_update_kptr;
+	case MAP_UPDATE_KPTR_HASH:
+		return skel->progs.test_hash_map_update_kptr;
+	case MAP_UPDATE_KPTR_HASH_MALLOC:
+		return skel->progs.test_hash_malloc_map_update_kptr;
+	}
+
+	return NULL;
+}
+
+static void test_map_update_kptr(enum map_update_kptr_case test)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct map_kptr *skel;
+	struct bpf_program *prog;
+	int ret;
+
+	skel = map_kptr__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "map_kptr__open_and_load"))
+		return;
+
+	prog = map_update_kptr_prog(skel, test);
+	if (!ASSERT_OK_PTR(prog, "map_update_kptr_prog"))
+		goto out;
+
+	ret = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
+	if (!ASSERT_OK(ret, "map_update_kptr"))
+		goto out;
+	if (!ASSERT_OK(opts.retval, "map_update_kptr retval"))
+		goto out;
+
+	ASSERT_EQ(skel->bss->num_of_refs, 3, "refs_after_update");
+
+out:
+	map_kptr__destroy(skel);
+	wait_for_map_release();
+}
+
 void serial_test_map_kptr(void)
 {
 	struct rcu_tasks_trace_gp *skel;
 
 	RUN_TESTS(map_kptr_fail);
 
+	if (test__start_subtest("update_array_map_kptr"))
+		test_map_update_kptr(MAP_UPDATE_KPTR_ARRAY);
+	if (test__start_subtest("update_hash_map_kptr"))
+		test_map_update_kptr(MAP_UPDATE_KPTR_HASH);
+	if (test__start_subtest("update_hash_malloc_map_kptr"))
+		test_map_update_kptr(MAP_UPDATE_KPTR_HASH_MALLOC);
+
 	skel = rcu_tasks_trace_gp__open_and_load();
 	if (!ASSERT_OK_PTR(skel, "rcu_tasks_trace_gp__open_and_load"))
 		return;
diff --git a/tools/testing/selftests/bpf/progs/map_kptr.c b/tools/testing/selftests/bpf/progs/map_kptr.c
index e708ffbe1f61..3fbefc568e0a 100644
--- a/tools/testing/selftests/bpf/progs/map_kptr.c
+++ b/tools/testing/selftests/bpf/progs/map_kptr.c
@@ -489,8 +489,7 @@ int test_map_kptr_ref3(struct __sk_buff *ctx)
 
 int num_of_refs;
 
-SEC("syscall")
-int count_ref(void *ctx)
+static __always_inline int read_ref_count(void)
 {
 	struct prog_test_ref_kfunc *p;
 	unsigned long arg = 0;
@@ -500,11 +499,95 @@ int count_ref(void *ctx)
 		return 1;
 
 	num_of_refs = p->cnt.refs.counter;
-
 	bpf_kfunc_call_test_release(p);
 	return 0;
 }
 
+SEC("syscall")
+int count_ref(void *ctx)
+{
+	return read_ref_count();
+}
+
+static __always_inline int stash_ref_ptr(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p, *old;
+	unsigned long arg = 0;
+
+	p = bpf_kfunc_call_test_acquire(&arg);
+	if (!p)
+		return 1;
+
+	old = bpf_kptr_xchg(&v->ref_ptr, p);
+	if (old) {
+		bpf_kfunc_call_test_release(old);
+		old = bpf_kptr_xchg(&v->ref_ptr, NULL);
+		if (old)
+			bpf_kfunc_call_test_release(old);
+		return 2;
+	}
+	return 0;
+}
+
+static __always_inline int check_refs(int expected)
+{
+	int ret;
+
+	ret = read_ref_count();
+	if (ret)
+		return ret;
+	return num_of_refs == expected ? 0 : 3;
+}
+
+SEC("syscall")
+int test_array_map_update_kptr(void *ctx)
+{
+	struct map_value init = {}, *v;
+	int key = 0, ret;
+
+	v = bpf_map_lookup_elem(&array_map, &key);
+	if (!v)
+		return 1;
+	ret = stash_ref_ptr(v);
+	if (ret)
+		return ret;
+	ret = check_refs(3);
+	if (ret)
+		return ret;
+	ret = bpf_map_update_elem(&array_map, &key, &init, BPF_EXIST);
+	if (ret)
+		return 4;
+	return check_refs(3);
+}
+
+#define DEFINE_HASH_UPDATE_KPTR_TEST(name, map)			\
+SEC("syscall")							\
+int name(void *ctx)						\
+{								\
+	struct map_value init = {}, *v;				\
+	int key = 0, ret;					\
+								\
+	ret = bpf_map_update_elem(&map, &key, &init, BPF_NOEXIST); \
+	if (ret)						\
+		return 1;					\
+	v = bpf_map_lookup_elem(&map, &key);			\
+	if (!v)							\
+		return 2;					\
+	ret = stash_ref_ptr(v);					\
+	if (ret)						\
+		return ret;					\
+	ret = check_refs(3);					\
+	if (ret)						\
+		return ret;					\
+	ret = bpf_map_update_elem(&map, &key, &init, BPF_EXIST); \
+	if (ret)						\
+		return 4;					\
+	return check_refs(3);					\
+}
+
+DEFINE_HASH_UPDATE_KPTR_TEST(test_hash_map_update_kptr, hash_map)
+DEFINE_HASH_UPDATE_KPTR_TEST(test_hash_malloc_map_update_kptr, hash_malloc_map)
+
 SEC("syscall")
 int test_ls_map_kptr_ref1(void *ctx)
 {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing
  2026-06-08 14:48 ` [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing Kumar Kartikeya Dwivedi
@ 2026-06-08 14:51   ` Kumar Kartikeya Dwivedi
  2026-06-08 15:13   ` sashiko-bot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 14:51 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, Justin Suess, kkd, kernel-team

On Mon Jun 8, 2026 at 4:48 PM CEST, Kumar Kartikeya Dwivedi wrote:
> The is_tracing_prog_type() predicate omitted BPF_PROG_TYPE_TRACING even
> though fentry, fexit, fmod_ret, raw_tp BTF and similar programs have the
> same execution-context concerns as the tracing program types already
> covered by the helper.
>
> This matters for map compatibility checks that reject bpf_spin_lock,
> bpf_list_head and bpf_rb_root in tracing contexts. BPF_PROG_TYPE_TRACING
> programs can run from arbitrary instrumented contexts, including places
> where taking these locks or manipulating graph roots is not safe.
> BPF_TRACE_ITER is different: iterator programs run from task context, so
> we continue to exclude them.
>
> This can reject existing fentry/fexit-style programs that use map values
> with these fields. Such programs were accepted only because the
> predicate missed this program type; their use depends on semantics the
> verifier already rejects for equivalent tracing hooks.
>
> Move is_tracing_prog_type() checks from check_map_prog_compatibility()
> to points where the fields are actually used to avoid preemptively
> rejecting tracing programs that use maps with such fields but do not
> touch these fields.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---

This particular patch causes regressions for at least one production program
using bpf_spin_lock() from fentry, and a bunch of selftests (which are mostly
due to the movement of the checks, and usage of spin locks in tracing progs that
can be adjusted). As such, there is potential of breakage here, but it also
implies the current behavior is broken.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock
  2026-06-08 14:48 [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
                   ` (4 preceding siblings ...)
  2026-06-08 14:48 ` [PATCH bpf-next v1 5/5] selftests/bpf: Exercise kptr map update lifetime Kumar Kartikeya Dwivedi
@ 2026-06-08 14:58 ` Kumar Kartikeya Dwivedi
  5 siblings, 0 replies; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 14:58 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf, Justin Suess
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team

On Mon Jun 8, 2026 at 4:48 PM CEST, Kumar Kartikeya Dwivedi wrote:
> Referenced kptr destruction can run from tracing/NMI contexts through
> bpf_obj_drop() and map value update/delete paths, reaching NMI-unsafe
> special field teardown and deadlocks. Justin reported the issue and
> iterated on fixes in [0]-[2], and also confirmed the bpf_obj_drop()
> reproducer in [3].
>
> This series rejects unsafe obj drops from non-iterator tracing programs,
> limits map value recycle to NMI-safe field cancellation, and adds
> focused selftests for the obj_drop(), NMI delete, and recycle teardown
> cases.
>
> See patches for details.
>
>   [0]: https://lore.kernel.org/bpf/20260505150851.3090688-1-utilityemal77@gmail.com
>   [1]: https://lore.kernel.org/bpf/20260507175453.1140400-1-utilityemal77@gmail.com
>   [2]: https://lore.kernel.org/bpf/20260519011450.1144935-1-utilityemal77@gmail.com
>   [3]: https://lore.kernel.org/bpf/agyG3eQwgmoJwmj2@suesslenovo
>

Justin,
If you have other reproducers that can be crafted as selftests we can exercise
along with the ones I added (esp. for the task NMI deadlock), it would be great
to include them.

Thanks

> Justin Suess (2):
>   bpf: Reject bpf_obj_drop() from tracing progs
>   bpf: Cancel special fields on map value recycle
>
> Kumar Kartikeya Dwivedi (3):
>   bpf: Treat non-iterator tracing progs as tracing
>   selftests/bpf: Exercise unsafe obj drops from tracing progs
>   selftests/bpf: Exercise kptr map update lifetime
>
>  include/linux/bpf.h                           | 30 +++++++
>  kernel/bpf/arraymap.c                         |  8 +-
>  kernel/bpf/hashtab.c                          | 32 ++++---
>  kernel/bpf/syscall.c                          | 27 ++++++
>  kernel/bpf/verifier.c                         | 51 +++++++----
>  .../selftests/bpf/prog_tests/htab_update.c    |  4 +-
>  .../selftests/bpf/prog_tests/map_kptr.c       | 66 ++++++++++++--
>  .../selftests/bpf/prog_tests/task_kfunc.c     | 42 ++++++++-
>  .../testing/selftests/bpf/progs/htab_update.c |  4 +-
>  tools/testing/selftests/bpf/progs/map_kptr.c  | 89 ++++++++++++++++++-
>  .../selftests/bpf/progs/task_kfunc_failure.c  | 43 +++++++++
>  .../selftests/bpf/progs/task_kfunc_success.c  | 13 ++-
>  12 files changed, 352 insertions(+), 57 deletions(-)
>
>
> base-commit: 50dff00615522f3ec03449680ca23beb4cfc549c


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing
  2026-06-08 14:48 ` [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing Kumar Kartikeya Dwivedi
  2026-06-08 14:51   ` Kumar Kartikeya Dwivedi
@ 2026-06-08 15:13   ` sashiko-bot
  2026-06-08 15:44   ` bot+bpf-ci
  2026-06-08 17:47   ` Justin Suess
  3 siblings, 0 replies; 19+ messages in thread
From: sashiko-bot @ 2026-06-08 15:13 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Tracing programs can deadlock the kernel by acquiring `bpf_spin_lock` via `bpf_map_update_elem` with the `BPF_F_LOCK` flag, bypassing the verifier's safety checks.
--

commit 0ab13739c010beba3fc294b839936a5123924643
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>

bpf: Treat non-iterator tracing progs as tracing

This patch updates the is_tracing_prog_type() helper to include non-iterator
tracing programs. It modifies check_map_prog_compatibility() to defer rejecting
tracing progs that access maps with bpf_spin_lock until their first use,
avoiding overly conservative rejections.

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ed7ba0e6a9ce0..26bfb44657253 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[ ... ]
> @@ -17697,24 +17712,16 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>  		return -EACCES;
>  	}
>  
>  	if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
>  		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
>  			verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
>  			return -EINVAL;
>  		}
> -
> -		if (is_tracing_prog_type(prog_type)) {
> -			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
> -			return -EINVAL;
> -		}
> +		/*
> +		 * Rejecting tracing progs accessing maps with bpf_spin_lock in
> +		 * them here would be too conservative; let's defer rejection
> +		 * until seeing first use.
> +		 */

[Severity: High]
Is it possible for tracing programs to deadlock the kernel by acquiring
the spinlock via bpf_map_update_elem()?

By deferring the rejection to process_spin_lock(), we prevent explicit
bpf_spin_lock() calls in the tracing program. However, tracing programs might
still be able to call bpf_map_update_elem() with the BPF_F_LOCK flag.

If a tracing program passes BPF_F_LOCK:

kernel/bpf/hashtab.c:htab_map_update_elem() {
	if (unlikely(map_flags & BPF_F_LOCK)) {
		if (unlikely(!btf_record_has_field(map->record, BPF_SPIN_LOCK)))
			return -EINVAL;
    ...
}

This leads to copy_map_value_locked() which takes the lock:

kernel/bpf/helpers.c:copy_map_value_locked() {
    ...
	preempt_disable();
	__bpf_spin_lock_irqsave(lock);
    ...
}

Since tracing programs can run in NMI context, if an NMI interrupts another
context that is currently holding this exact spinlock, wouldn't it result in
an AA deadlock?

Does the verifier block BPF_F_LOCK dynamically at load time, or does removing
this map compatibility check expose us to this deadlock vector?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608144841.1732406-1-memxor@gmail.com?part=1

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 2/5] bpf: Reject bpf_obj_drop() from tracing progs
  2026-06-08 14:48 ` [PATCH bpf-next v1 2/5] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
@ 2026-06-08 15:40   ` sashiko-bot
  0 siblings, 0 replies; 19+ messages in thread
From: sashiko-bot @ 2026-06-08 15:40 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] BPF_TIMER and BPF_WORKQUEUE fields are incorrectly whitelisted as NMI-safe, causing a deadlock when bpf_obj_drop() is called from NMI context.
--

commit 3287c13e44b513ce0d9b41e4aa24cd264b519b04
Author: Justin Suess <utilityemal77@gmail.com>

bpf: Reject bpf_obj_drop() from tracing progs

This commit restricts bpf_obj_drop() and bpf_percpu_obj_drop() in tracing
programs to only allow objects whose fields are explicitly whitelisted as
NMI-safe. It replaces the previous blacklist approach with a whitelist to
better handle future field types.

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -492,6 +492,35 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
>  	return rec->field_mask & type;
>  }
>  
> +static inline bool btf_field_is_nmi_safe(enum btf_field_type type)
> +{
> +	switch (type) {
> +	case BPF_SPIN_LOCK:
> +	case BPF_RES_SPIN_LOCK:
> +	case BPF_TIMER:
> +	case BPF_WORKQUEUE:

[Severity: High]
This is a pre-existing issue, but is it safe to include BPF_TIMER and
BPF_WORKQUEUE in this NMI-safe whitelist?

Dropping an object with these fields calls bpf_timer_cancel_and_free() or
bpf_wq_cancel_and_free(), which synchronously invokes bpf_prog_put() via
bpf_async_update_prog_callback():

kernel/bpf/helpers.c:bpf_async_cancel_and_free() {
    ...
    bpf_async_update_prog_callback(cb, NULL, NULL);
    ...
}

If this drops the last reference to the BPF program while in NMI context,
__bpf_prog_put() falls back to schedule_work() since irqs are disabled:

kernel/bpf/syscall.c:__bpf_prog_put() {
    if (in_hardirq() || irqs_disabled()) {
        INIT_WORK(&aux->work, bpf_prog_put_deferred);
        schedule_work(&aux->work);
    } else {
        ...
}

Since schedule_work() takes a raw spin lock on the workqueue pool, couldn't
this hard deadlock if the NMI interrupted another thread that was already
holding the workqueue pool lock?

Unlike BPF_TASK_WORK, which safely defers destruction via irq_work_queue
when irqs_disabled(), these fields do not appear to be NMI-safe.

> +	case BPF_TASK_WORK:
> +	case BPF_KPTR_UNREF:
> +	case BPF_REFCOUNT:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608144841.1732406-1-memxor@gmail.com?part=2

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing
  2026-06-08 14:48 ` [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing Kumar Kartikeya Dwivedi
  2026-06-08 14:51   ` Kumar Kartikeya Dwivedi
  2026-06-08 15:13   ` sashiko-bot
@ 2026-06-08 15:44   ` bot+bpf-ci
  2026-06-08 17:47   ` Justin Suess
  3 siblings, 0 replies; 19+ messages in thread
From: bot+bpf-ci @ 2026-06-08 15:44 UTC (permalink / raw)
  To: memxor, bpf
  Cc: ast, andrii, daniel, eddyz87, emil, utilityemal77, kkd,
	kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3953 bytes --]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ed7ba0e6a9ce..26bfb4465725 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c

[ ... ]

> @@ -17697,24 +17712,16 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>  		return -EACCES;
>  	}
>
> -	if (btf_record_has_field(map->record, BPF_LIST_HEAD) ||
> -	    btf_record_has_field(map->record, BPF_RB_ROOT)) {
> -		if (is_tracing_prog_type(prog_type)) {
> -			verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
> -			return -EINVAL;
> -		}
> -	}
> -
>  	if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
>  		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
>  			verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
>  			return -EINVAL;
>  		}
> -
> -		if (is_tracing_prog_type(prog_type)) {
> -			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
> -			return -EINVAL;
> -		}
> +		/*
> +		 * Rejecting tracing progs accessing maps with bpf_spin_lock in
> +		 * them here would be too conservative; let's defer rejection
> +		 * until seeing first use.
> +		 */
>  	}

By deferring the is_tracing_prog_type() check to process_spin_lock(), the
patch prevents explicit bpf_spin_lock() calls but may not prevent tracing
programs from calling bpf_map_update_elem() with the BPF_F_LOCK flag.

This flag triggers copy_map_value_locked() which acquires the spinlock via
__bpf_spin_lock_irqsave(). Since tracing programs can run in NMI context,
if an NMI interrupts another context holding the same spinlock, it could
result in an AA deadlock:

CPU1                                    CPU2 (NMI)
bpf_map_update_elem(BPF_F_LOCK)         bpf_map_update_elem(BPF_F_LOCK)
  copy_map_value_locked()                 copy_map_value_locked()
    __bpf_spin_lock_irqsave()               __bpf_spin_lock_irqsave()
      spin_lock()                             spin_lock() <- deadlock
      [NMI arrives] ------>

Does the verifier block BPF_F_LOCK at load time, or does removing this map
compatibility check expose us to this deadlock vector?

This concern was raised by sashiko-bot@kernel.org on lore:
https://lore.kernel.org/bpf/20260608151331.730D81F00893@smtp.kernel.org/

The commit message also states:

> bpf: Treat non-iterator tracing progs as tracing
>
> The is_tracing_prog_type() predicate omitted BPF_PROG_TYPE_TRACING even
> though fentry, fexit, fmod_ret, raw_tp BTF and similar programs have the
> same execution-context concerns as the tracing program types already
> covered by the helper.
>
> This matters for map compatibility checks that reject bpf_spin_lock,
> bpf_list_head and bpf_rb_root in tracing contexts. BPF_PROG_TYPE_TRACING
> programs can run from arbitrary instrumented contexts, including places
> where taking these locks or manipulating graph roots is not safe.
> BPF_TRACE_ITER is different: iterator programs run from task context, so
> we continue to exclude them.
>
> This can reject existing fentry/fexit-style programs that use map values
> with these fields. Such programs were accepted only because the
> predicate missed this program type; their use depends on semantics the
> verifier already rejects for equivalent tracing hooks.
>
> Move is_tracing_prog_type() checks from check_map_prog_compatibility()
> to points where the fields are actually used to avoid preemptively
> rejecting tracing programs that use maps with such fields but do not
> touch these fields.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

This looks like a bug fix for handling BPF_PROG_TYPE_TRACING programs.

Should this include:

  Fixes: f1b9509c2fb0 ("bpf: Replace prog_raw_tp+btf_id with prog_tracing")


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27147105932

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle
  2026-06-08 14:48 ` [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
@ 2026-06-08 15:44   ` bot+bpf-ci
  2026-06-08 15:56   ` sashiko-bot
  2026-06-08 18:01   ` Justin Suess
  2 siblings, 0 replies; 19+ messages in thread
From: bot+bpf-ci @ 2026-06-08 15:44 UTC (permalink / raw)
  To: memxor, bpf
  Cc: utilityemal77, ast, andrii, daniel, eddyz87, emil, kkd,
	kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 5931 bytes --]

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0654d2ffadc1..fc24b81cc206 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2717,6 +2717,7 @@ bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *r
>  void bpf_obj_free_timer(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_task_work(const struct btf_record *rec, void *obj);
> +void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj);
>  void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu);
>
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index e6271a2bf6d6..4e8d6fe6999b 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -384,7 +384,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
>  		val = this_cpu_ptr(array->pptrs[index & array->index_mask]);
>  		copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);
>  	} else {
>  		val = array->value +
>  			(u64)array->elem_size * (index & array->index_mask);
> @@ -392,7 +392,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  			copy_map_value_locked(map, val, value, false);
>  		else
>  			copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);
>  	}
>  	return 0;
>  }
> @@ -432,14 +432,14 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
>  		cpu = map_flags >> 32;
>  		ptr = per_cpu_ptr(pptr, cpu);
>  		copy_map_value(map, ptr, value);
> -		bpf_obj_free_fields(array->map.record, ptr);
> +		bpf_obj_cancel_fields(array->map.record, ptr);
>  		goto unlock;
>  	}
>  	for_each_possible_cpu(cpu) {
>  		ptr = per_cpu_ptr(pptr, cpu);
>  		val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
>  		copy_map_value(map, ptr, val);
> -		bpf_obj_free_fields(array->map.record, ptr);
> +		bpf_obj_cancel_fields(array->map.record, ptr);
>  	}
>  unlock:
>  	rcu_read_unlock();
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index b4366cad3cfa..a9b2b882b90f 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -243,6 +243,10 @@ static void htab_free_prealloced_fields(struct bpf_htab *htab)
>
>  	if (IS_ERR_OR_NULL(htab->map.record))
>  		return;
> +	/*
> +	 * Preallocated maps do not have a bpf_mem_alloc destructor, so fully
> +	 * destroy every element, including the extra elements.
> +	 */
>  	if (htab_has_extra_elems(htab))
>  		num_entries += num_possible_cpus();
>  	for (i = 0; i < num_entries; i++) {

In the prior version review thread, Kumar noted that the prealloc-related
changes are unnecessary and orthogonal to the actual fix. Could this patch
focus only on the cancel helper conversion without the htab_has_extra_elems
changes? Kumar suggested the approach should skip unconditionally rather
than use irqs_disabled().

Reference: https://lore.kernel.org/bpf/DIMWGCWL1AG2.14F7CN23GMS8N@gmail.com/

> @@ -833,8 +837,8 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
>  	return insn - insn_buf;
>  }
>
> -static void check_and_free_fields(struct bpf_htab *htab,
> -				  struct htab_elem *elem)
> +static void check_and_cancel_fields(struct bpf_htab *htab,
> +				    struct htab_elem *elem)
>  {
>  	if (IS_ERR_OR_NULL(htab->map.record))
>  		return;
> @@ -844,11 +848,11 @@ static void check_and_free_fields(struct bpf_htab *htab,
>  		int cpu;
>
>  		for_each_possible_cpu(cpu)
> -			bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
> +			bpf_obj_cancel_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
>  	} else {
>  		void *map_value = htab_elem_value(elem, htab->map.key_size);
>
> -		bpf_obj_free_fields(htab->map.record, map_value);
> +		bpf_obj_cancel_fields(htab->map.record, map_value);
>  	}
>  }

[ ... ]

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d4188a992bd8..8d12aa3728ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -808,6 +808,33 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
>  	bpf_task_work_cancel_and_free(obj + rec->task_work_off);
>  }
>
> +void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj)
> +{
> +	const struct btf_field *fields;
> +	int i;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return;
> +	fields = rec->fields;
> +	for (i = 0; i < rec->cnt; i++) {
> +		void *field_ptr = obj + fields[i].offset;
> +
> +		switch (fields[i].type) {
> +		case BPF_TIMER:
> +			bpf_timer_cancel_and_free(field_ptr);
> +			break;
> +		case BPF_WORKQUEUE:
> +			bpf_wq_cancel_and_free(field_ptr);
> +			break;
> +		case BPF_TASK_WORK:
> +			bpf_task_work_cancel_and_free(field_ptr);
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +}

This handles the map update/delete paths, but what about the bpf_obj_drop()
kfunc? During the prior version discussion, you and Kumar identified that
bpf_obj_drop() can also trigger NMI deadlocks when called from tracing
programs (for example, tp_btf/nmi_handler).

Kumar suggested rejecting bpf_obj_drop() in is_tracing_prog_type() when
the object has kptr fields. Should this patch also address that path, or
will that be a separate follow-up fix?

Reference: https://lore.kernel.org/bpf/DIMWGCWL1AG2.14F7CN23GMS8N@gmail.com/

> +
>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  {
>  	const struct btf_field *fields;

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27147105932

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle
  2026-06-08 14:48 ` [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
  2026-06-08 15:44   ` bot+bpf-ci
@ 2026-06-08 15:56   ` sashiko-bot
  2026-06-08 18:01   ` Justin Suess
  2 siblings, 0 replies; 19+ messages in thread
From: sashiko-bot @ 2026-06-08 15:56 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Replacing bpf_obj_free_fields() with bpf_obj_cancel_fields() in map update and delete paths fails to clear BTF special fields (kptrs, list heads, etc.), causing recycled map elements to silently inherit these fields from previously deleted or overwritten keys.
--

commit 0b4e0636a1d77531077ad0491c76e6b6595a28c6
Author: Justin Suess <utilityemal77@gmail.com>

bpf: Cancel special fields on map value recycle

This patch introduces bpf_obj_cancel_fields() to replace
bpf_obj_free_fields() in map update and delete paths. It restricts
eager cleanup to timers, workqueues, and task works, leaving other
fields attached for final map destruction.

> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index e6271a2bf6d64..4e8d6fe6999bf 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -384,7 +384,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
>  		val = this_cpu_ptr(array->pptrs[index & array->index_mask]);
>  		copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);
>  	} else {
>  		val = array->value +
>  			(u64)array->elem_size * (index & array->index_mask);
> @@ -392,7 +392,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  			copy_map_value_locked(map, val, value, false);
>  		else
>  			copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);

[Severity: High]
Does this prevent userspace from clearing kptrs in array maps?

Because bpf_obj_cancel_fields() leaves kptrs untouched and copy_map_value()
skips copying BTF fields from the user, array elements are updated without
freeing or zeroing old kptrs.

This permanently attaches the kptr to the array slot until map destruction,
making it impossible to clear them by updating the element with zeroes.

[ ... ]

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index b4366cad3cfa5..a9b2b882b90f2 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
[ ... ]
> @@ -833,8 +837,8 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
>  	return insn - insn_buf;
>  }
>  
> -static void check_and_free_fields(struct bpf_htab *htab,
> -				  struct htab_elem *elem)
> +static void check_and_cancel_fields(struct bpf_htab *htab,
> +				    struct htab_elem *elem)
>  {
>  	if (IS_ERR_OR_NULL(htab->map.record))
>  		return;
> @@ -844,11 +848,11 @@ static void check_and_free_fields(struct bpf_htab *htab,
>  		int cpu;
>  
>  		for_each_possible_cpu(cpu)
> -			bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
> +			bpf_obj_cancel_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
>  	} else {
>  		void *map_value = htab_elem_value(elem, htab->map.key_size);
>  
> -		bpf_obj_free_fields(htab->map.record, map_value);
> +		bpf_obj_cancel_fields(htab->map.record, map_value);
>  	}
>  }

[Severity: High]
Does this allow recycled hash map elements to silently inherit kptrs from
previously deleted keys?

When an element is deleted, it goes back to the freelist or memory allocator
cache. Since check_and_cancel_fields() no longer clears BTF special fields
like kptrs or list heads, these fields remain intact in the recycled memory.

When alloc_htab_elem() reuses this memory for a new key, copy_map_value()
intentionally skips over BTF fields.

The new element will thus silently inherit the kptr from the old element,
breaking cross-key isolation.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608144841.1732406-1-memxor@gmail.com?part=3

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 4/5] selftests/bpf: Exercise unsafe obj drops from tracing progs
  2026-06-08 14:48 ` [PATCH bpf-next v1 4/5] selftests/bpf: Exercise unsafe obj drops from tracing progs Kumar Kartikeya Dwivedi
@ 2026-06-08 16:16   ` sashiko-bot
  0 siblings, 0 replies; 19+ messages in thread
From: sashiko-bot @ 2026-06-08 16:16 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Replacing bpf_obj_free_fields() with bpf_obj_cancel_fields() in map update and delete paths breaks BTF special field initialization and isolation, leading to cross-key reference hijacking in hash maps and un-clearable fields in array maps.
--

commit 0b4e0636a1d77531077ad0491c76e6b6595a28c6
Author: Justin Suess <utilityemal77@gmail.com>

bpf: Cancel special fields on map value recycle

This commit introduces bpf_obj_cancel_fields() to defer the destruction
of special fields (like referenced kptrs and graph roots) until map
teardown, rather than destroying them during map update or delete operations.
It updates array and hash map paths to use this cancellation helper,
explicitly relaxing the eager-free semantics for these fields. Although tests
were added in related commits (like selftests/bpf: Exercise kptr map update
lifetime) asserting that the refcount remains elevated after an update, they
failed to account for memory reuse.

> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index e6271a2bf6d64..4e8d6fe6999bf 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -384,7 +384,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
>  		val = this_cpu_ptr(array->pptrs[index & array->index_mask]);
>  		copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);
>  	} else {
>  		val = array->value +
>  			(u64)array->elem_size * (index & array->index_mask);
> @@ -392,7 +392,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  			copy_map_value_locked(map, val, value, false);
>  		else
>  			copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);

[Severity: Critical]
Since array_map_update_elem() updates the element in place, does this mean
array elements permanently retain old references?

If copy_map_value() skips copying over special fields, and the new
bpf_obj_cancel_fields() no longer clears them, it seems the old kptrs
are left intact on the array element. Wouldn't this make it impossible
for userspace to clear a kptr from an array map once it is set?

>  	}
>  	return 0;
>  }

[ ... ]

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index b4366cad3cfa5..a9b2b882b90f2 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c

[ ... ]

> @@ -844,11 +848,11 @@ static void check_and_cancel_fields(struct bpf_htab *htab,
>  		int cpu;
>  
>  		for_each_possible_cpu(cpu)
> -			bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
> +			bpf_obj_cancel_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
>  	} else {
>  		void *map_value = htab_elem_value(elem, htab->map.key_size);
>  
> -		bpf_obj_free_fields(htab->map.record, map_value);
> +		bpf_obj_cancel_fields(htab->map.record, map_value);
>  	}
>  }

[Severity: Critical]
Can this lead to cross-key reference hijacking when hash map elements
are reused?

When an element is deleted or updated in a hash map, its memory is pushed
back to the bpf_mem_alloc freelist (or extra_elems for preallocated maps).
Since the special fields are no longer cleared, the pushed element retains
its old kptr and lock state.

If alloc_htab_elem() later reuses this memory for a newly inserted key,
and copy_map_value() intentionally skips copying over special fields,
will the new key inherit the old key's reference? A BPF program could then
look up the new key and extract the reference via bpf_kptr_xchg(), breaking
isolation guarantees.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608144841.1732406-1-memxor@gmail.com?part=4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 5/5] selftests/bpf: Exercise kptr map update lifetime
  2026-06-08 14:48 ` [PATCH bpf-next v1 5/5] selftests/bpf: Exercise kptr map update lifetime Kumar Kartikeya Dwivedi
@ 2026-06-08 16:40   ` sashiko-bot
  0 siblings, 0 replies; 19+ messages in thread
From: sashiko-bot @ 2026-06-08 16:40 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [Critical] The newly added selftests incorrectly assert that map updates preserve old kptr references, codifying severe preexisting kernel bugs where bpf_obj_free_fields() is missing during map updates. This leads to leaked kptrs and cross-key uninitialized memory corruption in hash maps.
--

commit 4d2aa8d889c01038cf4926ada20dda3564090fb2
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>

selftests/bpf: Exercise kptr map update lifetime

Add focused map_kptr coverage for BPF-side map updates that touch values
containing referenced kptrs. The new syscall programs stash the testmod
refcounted object in various maps, then update the same map from BPF.

> diff --git a/tools/testing/selftests/bpf/progs/map_kptr.c b/tools/testing/selftests/bpf/progs/map_kptr.c
> index e708ffbe1f610..3fbefc568e0a6 100644
> --- a/tools/testing/selftests/bpf/progs/map_kptr.c
> +++ b/tools/testing/selftests/bpf/progs/map_kptr.c

[ ... ]

> +SEC("syscall")
> +int test_array_map_update_kptr(void *ctx)
> +{
> +	struct map_value init = {}, *v;
> +	int key = 0, ret;
> +
> +	v = bpf_map_lookup_elem(&array_map, &key);
> +	if (!v)
> +		return 1;
> +	ret = stash_ref_ptr(v);
> +	if (ret)
> +		return ret;
> +	ret = check_refs(3);
> +	if (ret)
> +		return ret;
> +	ret = bpf_map_update_elem(&array_map, &key, &init, BPF_EXIST);
> +	if (ret)
> +		return 4;
> +	return check_refs(3);
> +}
> +
> +#define DEFINE_HASH_UPDATE_KPTR_TEST(name, map)			\
> +SEC("syscall")							\
> +int name(void *ctx)						\
> +{								\
> +	struct map_value init = {}, *v;				\
> +	int key = 0, ret;					\
> +								\
> +	ret = bpf_map_update_elem(&map, &key, &init, BPF_NOEXIST); \
> +	if (ret)						\
> +		return 1;					\
> +	v = bpf_map_lookup_elem(&map, &key);			\
> +	if (!v)							\
> +		return 2;					\
> +	ret = stash_ref_ptr(v);					\
> +	if (ret)						\
> +		return ret;					\
> +	ret = check_refs(3);					\
> +	if (ret)						\
> +		return ret;					\
> +	ret = bpf_map_update_elem(&map, &key, &init, BPF_EXIST); \
> +	if (ret)						\
> +		return 4;					\
> +	return check_refs(3);					\
> +}

[Severity: Critical]
This is a pre-existing issue, but does this selftest actually codify an
existing memory leak and uninitialized memory corruption in the kernel?

Looking at array_map_update_elem() and htab_map_update_elem() in the kernel,
they incorrectly use bpf_obj_cancel_fields() instead of
bpf_obj_free_fields() when overwriting elements. Because
bpf_obj_cancel_fields() explicitly skips BPF_KPTR_REF, the old kptr
references are never freed.

For hash maps specifically, htab_map_update_elem() allocates a new element
(l_new), and copy_map_value() explicitly skips copying special BTF fields
like kptrs. This means l_new receives uninitialized garbage memory for its
kptr field before being inserted into the map. Meanwhile, the old element is
placed in the freelist without its kptr being freed, leaking the reference.

By asserting check_refs(3) after the map update, is the test just measuring
this leak and ignoring the uninitialized garbage now present in the map,
rather than confirming that the kptr was gracefully preserved?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608144841.1732406-1-memxor@gmail.com?part=5

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing
  2026-06-08 14:48 ` [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing Kumar Kartikeya Dwivedi
                     ` (2 preceding siblings ...)
  2026-06-08 15:44   ` bot+bpf-ci
@ 2026-06-08 17:47   ` Justin Suess
  2026-06-08 18:53     ` Kumar Kartikeya Dwivedi
  3 siblings, 1 reply; 19+ messages in thread
From: Justin Suess @ 2026-06-08 17:47 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team

On Mon, Jun 08, 2026 at 04:48:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> The is_tracing_prog_type() predicate omitted BPF_PROG_TYPE_TRACING even
> though fentry, fexit, fmod_ret, raw_tp BTF and similar programs have the
> same execution-context concerns as the tracing program types already
> covered by the helper.
> 
> This matters for map compatibility checks that reject bpf_spin_lock,
> bpf_list_head and bpf_rb_root in tracing contexts. BPF_PROG_TYPE_TRACING
> programs can run from arbitrary instrumented contexts, including places
> where taking these locks or manipulating graph roots is not safe.
> BPF_TRACE_ITER is different: iterator programs run from task context, so
> we continue to exclude them.
> 
> This can reject existing fentry/fexit-style programs that use map values
> with these fields. Such programs were accepted only because the
> predicate missed this program type; their use depends on semantics the
> verifier already rejects for equivalent tracing hooks.
> 
> Move is_tracing_prog_type() checks from check_map_prog_compatibility()
> to points where the fields are actually used to avoid preemptively
> rejecting tracing programs that use maps with such fields but do not
> touch these fields.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
Thanks for sending this, looks pretty good to me.
>  kernel/bpf/verifier.c | 37 ++++++++++++++++++++++---------------
>  1 file changed, 22 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ed7ba0e6a9ce..26bfb4465725 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -6967,6 +6967,11 @@ static int process_spin_lock(struct bpf_verifier_env *env, struct bpf_reg_state
>  	u32 spin_lock_off;
>  	int err;
>  
> +	if (is_tracing_prog_type(env->prog)) {
> +		verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
> +		return -EINVAL;
> +	}
> +
>  	if (!is_const) {
>  		verbose(env,
>  			"%s doesn't have constant offset. %s_lock has to be at the constant offset\n",
> @@ -12222,6 +12227,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>  				return ret;
>  			break;
>  		case KF_ARG_PTR_TO_LIST_HEAD:
> +			if (is_tracing_prog_type(env->prog)) {
> +				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
> +				return -EINVAL;
> +			}
>  			if (reg->type != PTR_TO_MAP_VALUE &&
>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>  				verbose(env, "%s expected pointer to map value or allocated object\n",
> @@ -12238,6 +12247,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>  				return ret;
>  			break;
>  		case KF_ARG_PTR_TO_RB_ROOT:
> +			if (is_tracing_prog_type(env->prog)) {
> +				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
> +				return -EINVAL;
> +			}
>  			if (reg->type != PTR_TO_MAP_VALUE &&
>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>  				verbose(env, "%s expected pointer to map value or allocated object\n",
> @@ -17664,9 +17677,11 @@ static int check_pseudo_btf_id(struct bpf_verifier_env *env,
>  	return __add_used_btf(env, btf);
>  }
>  
> -static bool is_tracing_prog_type(enum bpf_prog_type type)
> +static bool is_tracing_prog_type(const struct bpf_prog *prog)
>  {
> -	switch (type) {
> +	switch (resolve_prog_type(prog)) {
> +	case BPF_PROG_TYPE_TRACING:
> +		return prog->expected_attach_type != BPF_TRACE_ITER;
Seems reasonable to exclude this attach type as they run always in task
context as stated. This seems like the safest approach.

In practice, *most* tracing type programs will be unreachable from
NMI. But the edge cases are difficult to enumerate...

fentry / fexit are nearly impossible to exclude, since there's no
good way to answer "can this function be called from an nmi handler"
at verification time. Such a list would be unmaintainable.

Raw tracepoints are similar. Indeed, it would be *possible* to
specifically exclude the tracepoints that cannot be called from NMI. But
this would require maintaining a large list and would be brittle.

fmodify_return attach types are whitelisted to LSM hooks and
ALLOW_ERROR_INJECTION functions. A quick grep for ALLOW_ERROR_INJECTION
doesn't bring anything that looks obviously NMI-relevant, but again, I
think allowing it would be asking for trouble later.

So I think this is a good balance...
>  	case BPF_PROG_TYPE_KPROBE:
>  	case BPF_PROG_TYPE_TRACEPOINT:
>  	case BPF_PROG_TYPE_PERF_EVENT:
> @@ -17697,24 +17712,16 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>  		return -EACCES;
>  	}
>  
> -	if (btf_record_has_field(map->record, BPF_LIST_HEAD) ||
> -	    btf_record_has_field(map->record, BPF_RB_ROOT)) {
> -		if (is_tracing_prog_type(prog_type)) {
> -			verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
> -			return -EINVAL;
> -		}
> -	}
> -
>  	if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
>  		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
>  			verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
>  			return -EINVAL;
>  		}
> -
> -		if (is_tracing_prog_type(prog_type)) {
> -			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
> -			return -EINVAL;
> -		}
> +		/*
> +		 * Rejecting tracing progs accessing maps with bpf_spin_lock in
> +		 * them here would be too conservative; let's defer rejection
> +		 * until seeing first use.
> +		 */

In practice, I think most programs using those spinlocks w/ fentry
weren't broken.

They were saved only by nature of the fact that the particular
functions that attached to were uncallable / never were
called from NMI. Which is the case for 99% of those programs.

So this case is where fixing the other 1% of cases are going
to have consequences for the remainder.

This is a tough decision to have to make. There's options, but all of
them come with downsides.

1. (this patch) break tracing programs using bpf_spin_lock.
2. Somehow prevent NMI reentrancy of tracing progs on the same cpu. probably
by just ignoring the reentrant calls, much how we handle the recursive
tracepoint prevention for bpf. (complicated, may have side effects)
3. Inject a fixup runtime check for in_nmi for every tracing program
and tail callable program, exiting before executing anything
(still breakage, just at runtime now)

If breakage is acceptible here, the approach this patch takes is the
most *correct* way to do it, and the way it should have been done from
day one, but having users relying on the "broken" behavior makes it
harder to fix now.

This seems like a maintainer call.

Justin
>  	}
>  
>  	if ((bpf_prog_is_offloaded(prog->aux) || bpf_map_is_offloaded(map)) &&
> -- 
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle
  2026-06-08 14:48 ` [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
  2026-06-08 15:44   ` bot+bpf-ci
  2026-06-08 15:56   ` sashiko-bot
@ 2026-06-08 18:01   ` Justin Suess
  2026-06-08 18:50     ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 19+ messages in thread
From: Justin Suess @ 2026-06-08 18:01 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team

On Mon, Jun 08, 2026 at 04:48:36PM +0200, Kumar Kartikeya Dwivedi wrote:
> From: Justin Suess <utilityemal77@gmail.com>
> 
> Map update and delete paths currently call bpf_obj_free_fields() when a
> value is being replaced or recycled. That makes field destruction depend
> on the context of the update/delete operation. For tracing programs this
> can include NMI context, where referenced kptr destructors, uptr
> unpinning, and graph root destruction are not generally safe.
> 
> Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
> performs NMI-safe cleanup for timer, workqueue, and task_work fields.
> Fields that need full destruction are left attached to the recycled value
> and are destroyed by the final cleanup path instead.
> 
> Switch array and hashtab update/delete/recycle paths to this cancel
> helper. Keep bpf_obj_free_fields() for final map destruction and for
> bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
> destructors, so teardown continues to walk the normal and extra elements
> and fully destroy their fields.
> 
> This deliberately relaxes the eager-free semantics of map update/delete
> for special fields. Programs that relied on a recycled map slot becoming
> empty immediately after update/delete were relying on behavior that
> cannot be implemented safely from every BPF execution context without
> offloading arbitrary destructors.
> 
> There is a chance this change breaks programs making assumptions
> regarding the eager freeing of fields. If so, we can relax semantics to
> cancellation only when irqs_disabled() is true in the future. However,
> theoretically, map values that get reused eagerly already have weaker
> guarantees as parallel users can recreate freed fields before the new
> element becomes visible again.
> 
> Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h                           |  1 +
>  kernel/bpf/arraymap.c                         |  8 ++---
>  kernel/bpf/hashtab.c                          | 32 +++++++++++--------
>  kernel/bpf/syscall.c                          | 27 ++++++++++++++++
>  .../selftests/bpf/prog_tests/htab_update.c    |  4 +--
>  .../selftests/bpf/prog_tests/map_kptr.c       | 10 +-----
>  .../testing/selftests/bpf/progs/htab_update.c |  4 +--
>  7 files changed, 55 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0654d2ffadc1..fc24b81cc206 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2717,6 +2717,7 @@ bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *r
>  void bpf_obj_free_timer(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_task_work(const struct btf_record *rec, void *obj);
> +void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj);
>  void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu);
>  
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index e6271a2bf6d6..4e8d6fe6999b 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -384,7 +384,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
>  		val = this_cpu_ptr(array->pptrs[index & array->index_mask]);
>  		copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);
>  	} else {
>  		val = array->value +
>  			(u64)array->elem_size * (index & array->index_mask);
> @@ -392,7 +392,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  			copy_map_value_locked(map, val, value, false);
>  		else
>  			copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);
>  	}
>  	return 0;
>  }
> @@ -432,14 +432,14 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
>  		cpu = map_flags >> 32;
>  		ptr = per_cpu_ptr(pptr, cpu);
>  		copy_map_value(map, ptr, value);
> -		bpf_obj_free_fields(array->map.record, ptr);
> +		bpf_obj_cancel_fields(array->map.record, ptr);
>  		goto unlock;
>  	}
>  	for_each_possible_cpu(cpu) {
>  		ptr = per_cpu_ptr(pptr, cpu);
>  		val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
>  		copy_map_value(map, ptr, val);
> -		bpf_obj_free_fields(array->map.record, ptr);
> +		bpf_obj_cancel_fields(array->map.record, ptr);
>  	}
>  unlock:
>  	rcu_read_unlock();
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index b4366cad3cfa..a9b2b882b90f 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -243,6 +243,10 @@ static void htab_free_prealloced_fields(struct bpf_htab *htab)
>  
>  	if (IS_ERR_OR_NULL(htab->map.record))
>  		return;
> +	/*
> +	 * Preallocated maps do not have a bpf_mem_alloc destructor, so fully
> +	 * destroy every element, including the extra elements.
> +	 */
>  	if (htab_has_extra_elems(htab))
>  		num_entries += num_possible_cpus();
>  	for (i = 0; i < num_entries; i++) {
> @@ -833,8 +837,8 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
>  	return insn - insn_buf;
>  }
>  
> -static void check_and_free_fields(struct bpf_htab *htab,
> -				  struct htab_elem *elem)
> +static void check_and_cancel_fields(struct bpf_htab *htab,
> +				    struct htab_elem *elem)
>  {
>  	if (IS_ERR_OR_NULL(htab->map.record))
>  		return;
> @@ -844,11 +848,11 @@ static void check_and_free_fields(struct bpf_htab *htab,
>  		int cpu;
>  
>  		for_each_possible_cpu(cpu)
> -			bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
> +			bpf_obj_cancel_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
>  	} else {
>  		void *map_value = htab_elem_value(elem, htab->map.key_size);
>  
> -		bpf_obj_free_fields(htab->map.record, map_value);
> +		bpf_obj_cancel_fields(htab->map.record, map_value);
>  	}
>  }
>  
> @@ -883,7 +887,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
>  	htab_unlock_bucket(b, flags);
>  
>  	if (l == tgt_l)
> -		check_and_free_fields(htab, l);
> +		check_and_cancel_fields(htab, l);
>  	return l == tgt_l;
>  }
>  
> @@ -948,7 +952,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>  
>  static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
>  {
> -	check_and_free_fields(htab, l);
> +	check_and_cancel_fields(htab, l);
>  
>  	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
>  		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
> @@ -1001,7 +1005,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
>  
>  	if (htab_is_prealloc(htab)) {
>  		bpf_map_dec_elem_count(&htab->map);
> -		check_and_free_fields(htab, l);
> +		check_and_cancel_fields(htab, l);
>  		pcpu_freelist_push(&htab->freelist, &l->fnode);
>  	} else {
>  		dec_elem_count(htab);
> @@ -1018,7 +1022,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
>  		/* copy true value_size bytes */
>  		ptr = this_cpu_ptr(pptr);
>  		copy_map_value(&htab->map, ptr, value);
> -		bpf_obj_free_fields(htab->map.record, ptr);
> +		bpf_obj_cancel_fields(htab->map.record, ptr);
>  	} else {
>  		u32 size = round_up(htab->map.value_size, 8);
>  		void *val;
> @@ -1028,7 +1032,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
>  			cpu = map_flags >> 32;
>  			ptr = per_cpu_ptr(pptr, cpu);
>  			copy_map_value(&htab->map, ptr, value);
> -			bpf_obj_free_fields(htab->map.record, ptr);
> +			bpf_obj_cancel_fields(htab->map.record, ptr);
>  			return;
>  		}
>  
> @@ -1036,7 +1040,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
>  			ptr = per_cpu_ptr(pptr, cpu);
>  			val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
>  			copy_map_value(&htab->map, ptr, val);
> -			bpf_obj_free_fields(htab->map.record, ptr);
> +			bpf_obj_cancel_fields(htab->map.record, ptr);
>  		}
>  	}
>  }
> @@ -1252,11 +1256,11 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	if (l_old) {
>  		hlist_nulls_del_rcu(&l_old->hash_node);
>  
> -		/* l_old has already been stashed in htab->extra_elems, free
> -		 * its special fields before it is available for reuse.
> +		/* l_old has already been stashed in htab->extra_elems, cancel
> +		 * its reusable special fields before it is available for reuse.
>  		 */
>  		if (htab_is_prealloc(htab))
> -			check_and_free_fields(htab, l_old);
> +			check_and_cancel_fields(htab, l_old);
>  	}
>  	htab_unlock_bucket(b, flags);
>  	if (l_old && !htab_is_prealloc(htab))
> @@ -1269,7 +1273,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  
>  static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
>  {
> -	check_and_free_fields(htab, elem);
> +	check_and_cancel_fields(htab, elem);
>  	bpf_map_dec_elem_count(&htab->map);
>  	bpf_lru_push_free(&htab->lru, &elem->lru_node);
>  }
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d4188a992bd8..8d12aa3728ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -808,6 +808,33 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
>  	bpf_task_work_cancel_and_free(obj + rec->task_work_off);
>  }
>  
> +void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj)
> +{
> +	const struct btf_field *fields;
> +	int i;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return;
> +	fields = rec->fields;
> +	for (i = 0; i < rec->cnt; i++) {
> +		void *field_ptr = obj + fields[i].offset;
> +
> +		switch (fields[i].type) {
> +		case BPF_TIMER:
> +			bpf_timer_cancel_and_free(field_ptr);
> +			break;
> +		case BPF_WORKQUEUE:
> +			bpf_wq_cancel_and_free(field_ptr);
> +			break;
> +		case BPF_TASK_WORK:
> +			bpf_task_work_cancel_and_free(field_ptr);
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +}
> +
>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  {
Would a WARN_ON_ONCE for in_nmi in bpf_obj_free_fields help spot these
kind of bugs in the future?

Usually calling bpf_obj_free_fields in_nmi won't result in a deadlock
unless specific conditions are met (ie last reference to a refcounted
object, using a workqueue, etc).

Adding a check here would quickly surface bugs of this class.

> [...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle
  2026-06-08 18:01   ` Justin Suess
@ 2026-06-08 18:50     ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 18:50 UTC (permalink / raw)
  To: Justin Suess, Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team

On Mon Jun 8, 2026 at 8:01 PM CEST, Justin Suess wrote:
> On Mon, Jun 08, 2026 at 04:48:36PM +0200, Kumar Kartikeya Dwivedi wrote:
>> From: Justin Suess <utilityemal77@gmail.com>
>>
>> Map update and delete paths currently call bpf_obj_free_fields() when a
>> value is being replaced or recycled. That makes field destruction depend
>> on the context of the update/delete operation. For tracing programs this
>> can include NMI context, where referenced kptr destructors, uptr
>> unpinning, and graph root destruction are not generally safe.
>>
>> Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
>> performs NMI-safe cleanup for timer, workqueue, and task_work fields.
>> Fields that need full destruction are left attached to the recycled value
>> and are destroyed by the final cleanup path instead.
>>
>> Switch array and hashtab update/delete/recycle paths to this cancel
>> helper. Keep bpf_obj_free_fields() for final map destruction and for
>> bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
>> destructors, so teardown continues to walk the normal and extra elements
>> and fully destroy their fields.
>>
>> This deliberately relaxes the eager-free semantics of map update/delete
>> for special fields. Programs that relied on a recycled map slot becoming
>> empty immediately after update/delete were relying on behavior that
>> cannot be implemented safely from every BPF execution context without
>> offloading arbitrary destructors.
>>
>> There is a chance this change breaks programs making assumptions
>> regarding the eager freeing of fields. If so, we can relax semantics to
>> cancellation only when irqs_disabled() is true in the future. However,
>> theoretically, map values that get reused eagerly already have weaker
>> guarantees as parallel users can recreate freed fields before the new
>> element becomes visible again.
>>
>> Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
>> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
>> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> ---
>> [...]
>>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>  {
> Would a WARN_ON_ONCE for in_nmi in bpf_obj_free_fields help spot these
> kind of bugs in the future?
>
> Usually calling bpf_obj_free_fields in_nmi won't result in a deadlock
> unless specific conditions are met (ie last reference to a refcounted
> object, using a workqueue, etc).
>
> Adding a check here would quickly surface bugs of this class.
>

In general, the idea makes sense when errors are impossible, but people
increasingly run with panic-on-warn enabled, which sort of beats the purpose,
hence I'd rather avoid it.

If you think about it, there are several guardrails that protect us here. First,
we maintain a whitelist for stuff that is invoked in the bpf_obj_cancel_fields()
path, so we don't inadvertently start operating on any new, possibly unsafe,
fields. Second, the free path mostly happens in the map destruction path now,
hence we need to worry less about random contexts it could be invoked in anymore.

>> [...]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing
  2026-06-08 17:47   ` Justin Suess
@ 2026-06-08 18:53     ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 19+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-06-08 18:53 UTC (permalink / raw)
  To: Justin Suess, Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team

On Mon Jun 8, 2026 at 7:47 PM CEST, Justin Suess wrote:
> On Mon, Jun 08, 2026 at 04:48:34PM +0200, Kumar Kartikeya Dwivedi wrote:
>> The is_tracing_prog_type() predicate omitted BPF_PROG_TYPE_TRACING even
>> though fentry, fexit, fmod_ret, raw_tp BTF and similar programs have the
>> same execution-context concerns as the tracing program types already
>> covered by the helper.
>>
>> This matters for map compatibility checks that reject bpf_spin_lock,
>> bpf_list_head and bpf_rb_root in tracing contexts. BPF_PROG_TYPE_TRACING
>> programs can run from arbitrary instrumented contexts, including places
>> where taking these locks or manipulating graph roots is not safe.
>> BPF_TRACE_ITER is different: iterator programs run from task context, so
>> we continue to exclude them.
>>
>> This can reject existing fentry/fexit-style programs that use map values
>> with these fields. Such programs were accepted only because the
>> predicate missed this program type; their use depends on semantics the
>> verifier already rejects for equivalent tracing hooks.
>>
>> Move is_tracing_prog_type() checks from check_map_prog_compatibility()
>> to points where the fields are actually used to avoid preemptively
>> rejecting tracing programs that use maps with such fields but do not
>> touch these fields.
>>
>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> ---
> Thanks for sending this, looks pretty good to me.
>>  kernel/bpf/verifier.c | 37 ++++++++++++++++++++++---------------
>>  1 file changed, 22 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index ed7ba0e6a9ce..26bfb4465725 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -6967,6 +6967,11 @@ static int process_spin_lock(struct bpf_verifier_env *env, struct bpf_reg_state
>>  	u32 spin_lock_off;
>>  	int err;
>>
>> +	if (is_tracing_prog_type(env->prog)) {
>> +		verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
>> +		return -EINVAL;
>> +	}
>> +
>>  	if (!is_const) {
>>  		verbose(env,
>>  			"%s doesn't have constant offset. %s_lock has to be at the constant offset\n",
>> @@ -12222,6 +12227,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>>  				return ret;
>>  			break;
>>  		case KF_ARG_PTR_TO_LIST_HEAD:
>> +			if (is_tracing_prog_type(env->prog)) {
>> +				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
>> +				return -EINVAL;
>> +			}
>>  			if (reg->type != PTR_TO_MAP_VALUE &&
>>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>>  				verbose(env, "%s expected pointer to map value or allocated object\n",
>> @@ -12238,6 +12247,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>>  				return ret;
>>  			break;
>>  		case KF_ARG_PTR_TO_RB_ROOT:
>> +			if (is_tracing_prog_type(env->prog)) {
>> +				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
>> +				return -EINVAL;
>> +			}
>>  			if (reg->type != PTR_TO_MAP_VALUE &&
>>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>>  				verbose(env, "%s expected pointer to map value or allocated object\n",
>> @@ -17664,9 +17677,11 @@ static int check_pseudo_btf_id(struct bpf_verifier_env *env,
>>  	return __add_used_btf(env, btf);
>>  }
>>
>> -static bool is_tracing_prog_type(enum bpf_prog_type type)
>> +static bool is_tracing_prog_type(const struct bpf_prog *prog)
>>  {
>> -	switch (type) {
>> +	switch (resolve_prog_type(prog)) {
>> +	case BPF_PROG_TYPE_TRACING:
>> +		return prog->expected_attach_type != BPF_TRACE_ITER;
> Seems reasonable to exclude this attach type as they run always in task
> context as stated. This seems like the safest approach.
>
> In practice, *most* tracing type programs will be unreachable from
> NMI. But the edge cases are difficult to enumerate...
>
> fentry / fexit are nearly impossible to exclude, since there's no
> good way to answer "can this function be called from an nmi handler"
> at verification time. Such a list would be unmaintainable.
>
> Raw tracepoints are similar. Indeed, it would be *possible* to
> specifically exclude the tracepoints that cannot be called from NMI. But
> this would require maintaining a large list and would be brittle.
>
> fmodify_return attach types are whitelisted to LSM hooks and
> ALLOW_ERROR_INJECTION functions. A quick grep for ALLOW_ERROR_INJECTION
> doesn't bring anything that looks obviously NMI-relevant, but again, I
> think allowing it would be asking for trouble later.
>
> So I think this is a good balance...
>>  	case BPF_PROG_TYPE_KPROBE:
>>  	case BPF_PROG_TYPE_TRACEPOINT:
>>  	case BPF_PROG_TYPE_PERF_EVENT:
>> @@ -17697,24 +17712,16 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>>  		return -EACCES;
>>  	}
>>
>> -	if (btf_record_has_field(map->record, BPF_LIST_HEAD) ||
>> -	    btf_record_has_field(map->record, BPF_RB_ROOT)) {
>> -		if (is_tracing_prog_type(prog_type)) {
>> -			verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
>> -			return -EINVAL;
>> -		}
>> -	}
>> -
>>  	if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
>>  		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
>>  			verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
>>  			return -EINVAL;
>>  		}
>> -
>> -		if (is_tracing_prog_type(prog_type)) {
>> -			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
>> -			return -EINVAL;
>> -		}
>> +		/*
>> +		 * Rejecting tracing progs accessing maps with bpf_spin_lock in
>> +		 * them here would be too conservative; let's defer rejection
>> +		 * until seeing first use.
>> +		 */
>
> In practice, I think most programs using those spinlocks w/ fentry
> weren't broken.
>
> They were saved only by nature of the fact that the particular
> functions that attached to were uncallable / never were
> called from NMI. Which is the case for 99% of those programs.
>
> So this case is where fixing the other 1% of cases are going
> to have consequences for the remainder.
>
> This is a tough decision to have to make. There's options, but all of
> them come with downsides.
>
> 1. (this patch) break tracing programs using bpf_spin_lock.
> 2. Somehow prevent NMI reentrancy of tracing progs on the same cpu. probably
> by just ignoring the reentrant calls, much how we handle the recursive
> tracepoint prevention for bpf. (complicated, may have side effects)
> 3. Inject a fixup runtime check for in_nmi for every tracing program
> and tail callable program, exiting before executing anything
> (still breakage, just at runtime now)
>
> If breakage is acceptible here, the approach this patch takes is the
> most *correct* way to do it, and the way it should have been done from
> day one, but having users relying on the "broken" behavior makes it
> harder to fix now.
>
> This seems like a maintainer call.

Yeah, you outlined the tradeoffs perfectly. What I'm leaning toward is splitting
this one out of the series, focus on the other two for now in v2, and send a new
set with this fix + opening up bpf_res_spin_lock() for use as a replacement. We
can then emit some guidance for the user to switch when their program breaks to
a working replacement which hopefully allows a reasonable migration path while
reducing this technical debt.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-06-08 18:53 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-08 14:48 [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi
2026-06-08 14:48 ` [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing Kumar Kartikeya Dwivedi
2026-06-08 14:51   ` Kumar Kartikeya Dwivedi
2026-06-08 15:13   ` sashiko-bot
2026-06-08 15:44   ` bot+bpf-ci
2026-06-08 17:47   ` Justin Suess
2026-06-08 18:53     ` Kumar Kartikeya Dwivedi
2026-06-08 14:48 ` [PATCH bpf-next v1 2/5] bpf: Reject bpf_obj_drop() from tracing progs Kumar Kartikeya Dwivedi
2026-06-08 15:40   ` sashiko-bot
2026-06-08 14:48 ` [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle Kumar Kartikeya Dwivedi
2026-06-08 15:44   ` bot+bpf-ci
2026-06-08 15:56   ` sashiko-bot
2026-06-08 18:01   ` Justin Suess
2026-06-08 18:50     ` Kumar Kartikeya Dwivedi
2026-06-08 14:48 ` [PATCH bpf-next v1 4/5] selftests/bpf: Exercise unsafe obj drops from tracing progs Kumar Kartikeya Dwivedi
2026-06-08 16:16   ` sashiko-bot
2026-06-08 14:48 ` [PATCH bpf-next v1 5/5] selftests/bpf: Exercise kptr map update lifetime Kumar Kartikeya Dwivedi
2026-06-08 16:40   ` sashiko-bot
2026-06-08 14:58 ` [PATCH bpf-next v1 0/5] Fix kptr dtor deadlock Kumar Kartikeya Dwivedi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox