BPF List
* [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution
@ 2025-09-05 16:44 Mykyta Yatsenko
  2025-09-05 16:44 ` [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection Mykyta Yatsenko
                   ` (6 more replies)
  0 siblings, 7 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-05 16:44 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

This patch series introduces a new mechanism for BPF programs to schedule
deferred execution in the context of a specific task using the kernel’s
task_work infrastructure.

The new bpf_task_work interface enables BPF use cases that require
sleepable subprogram execution within task context, for example
scheduling a sleepable function from a context that does not allow
sleeping, such as NMI.

The kfuncs bpf_task_work_schedule_signal() and
bpf_task_work_schedule_resume() schedule BPF callbacks and correspond to
the different notification modes used by task_work (TWA_SIGNAL and
TWA_RESUME, respectively).

The implementation manages scheduling state via metadata objects (struct
bpf_task_work_ctx). Pointers to bpf_task_work_ctx are stored in BPF map
values. State transitions are handled via an atomic state machine
(enum bpf_task_work_state) to ensure correctness under concurrent usage
and deletion; lifetime is guarded by refcounting and RCU Tasks Trace.
The kfuncs call task_work_add() indirectly via irq_work to avoid taking
locks in potentially NMI context.
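
To make the intended usage concrete, here is a minimal BPF-side sketch
(modelled on the selftests in patch 7; the map, program and callback
names are made up for illustration, and the __ksym extern declaration of
the kfunc is omitted):

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  struct elem {
  	char data[128];
  	struct bpf_task_work tw;	/* kernel-managed bookkeeping */
  };

  struct {
  	__uint(type, BPF_MAP_TYPE_HASH);
  	__uint(max_entries, 100);
  	__type(key, int);
  	__type(value, struct elem);
  } hmap SEC(".maps");

  /* Runs later, in the context of the target task, via task_work */
  static void process_work(struct bpf_map *map, void *key, void *value)
  {
  }

  SEC("perf_event")
  int schedule_work(void *ctx)
  {
  	struct task_struct *task = bpf_get_current_task_btf();
  	struct elem *val;
  	int key = 0;

  	val = bpf_map_lookup_elem(&hmap, &key);
  	if (val)
  		bpf_task_work_schedule_signal(task, &val->tw,
  					      (struct bpf_map *)&hmap,
  					      process_work, NULL);
  	return 0;
  }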

v2 -> v3
 * Introduce refcounting
 * Add patches with minor verifier and btf.c refactorings to avoid code
   duplication
 * Rework initiation of the task work scheduling to handle a race with
   map usercnt dropping to zero

Mykyta Yatsenko (7):
  bpf: refactor special field-type detection
  bpf: extract generic helper from process_timer_func()
  bpf: htab: extract helper for freeing special structs
  bpf: bpf task work plumbing
  bpf: extract map key pointer calculation
  bpf: task work scheduling kfuncs
  selftests/bpf: BPF task work scheduling tests

 include/linux/bpf.h                           |  11 +
 include/uapi/linux/bpf.h                      |   4 +
 kernel/bpf/arraymap.c                         |   8 +-
 kernel/bpf/btf.c                              |  63 ++-
 kernel/bpf/hashtab.c                          |  43 +-
 kernel/bpf/helpers.c                          | 386 +++++++++++++++++-
 kernel/bpf/syscall.c                          |  16 +-
 kernel/bpf/verifier.c                         | 136 +++++-
 tools/include/uapi/linux/bpf.h                |   4 +
 .../selftests/bpf/prog_tests/test_task_work.c | 149 +++++++
 tools/testing/selftests/bpf/progs/task_work.c | 108 +++++
 .../selftests/bpf/progs/task_work_fail.c      |  98 +++++
 12 files changed, 932 insertions(+), 94 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_work.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c

-- 
2.51.0



* [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection
  2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
@ 2025-09-05 16:44 ` Mykyta Yatsenko
  2025-09-05 19:36   ` Eduard Zingerman
  2025-09-05 21:29   ` Andrii Nakryiko
  2025-09-05 16:45 ` [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func() Mykyta Yatsenko
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-05 16:44 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Reduce code duplication in the detection of known special field types in
map values. This refactoring avoids copying a chunk of code in the next
patch of the series.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/btf.c | 56 +++++++++++++++++-------------------------------
 1 file changed, 20 insertions(+), 36 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 64739308902f..a1a9bc589518 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3488,44 +3488,28 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
 			      u32 field_mask, u32 *seen_mask,
 			      int *align, int *sz)
 {
-	int type = 0;
+	const struct {
+		enum btf_field_type type;
+		const char *const name;
+	} field_types[] = { { BPF_SPIN_LOCK, "bpf_spin_lock" },
+			    { BPF_RES_SPIN_LOCK, "bpf_res_spin_lock" },
+			    { BPF_TIMER, "bpf_timer" },
+			    { BPF_WORKQUEUE, "bpf_wq" }};
+	int type = 0, i;
 	const char *name = __btf_name_by_offset(btf, var_type->name_off);
+	const char *field_type_name;
+	enum btf_field_type field_type;
 
-	if (field_mask & BPF_SPIN_LOCK) {
-		if (!strcmp(name, "bpf_spin_lock")) {
-			if (*seen_mask & BPF_SPIN_LOCK)
-				return -E2BIG;
-			*seen_mask |= BPF_SPIN_LOCK;
-			type = BPF_SPIN_LOCK;
-			goto end;
-		}
-	}
-	if (field_mask & BPF_RES_SPIN_LOCK) {
-		if (!strcmp(name, "bpf_res_spin_lock")) {
-			if (*seen_mask & BPF_RES_SPIN_LOCK)
-				return -E2BIG;
-			*seen_mask |= BPF_RES_SPIN_LOCK;
-			type = BPF_RES_SPIN_LOCK;
-			goto end;
-		}
-	}
-	if (field_mask & BPF_TIMER) {
-		if (!strcmp(name, "bpf_timer")) {
-			if (*seen_mask & BPF_TIMER)
-				return -E2BIG;
-			*seen_mask |= BPF_TIMER;
-			type = BPF_TIMER;
-			goto end;
-		}
-	}
-	if (field_mask & BPF_WORKQUEUE) {
-		if (!strcmp(name, "bpf_wq")) {
-			if (*seen_mask & BPF_WORKQUEUE)
-				return -E2BIG;
-			*seen_mask |= BPF_WORKQUEUE;
-			type = BPF_WORKQUEUE;
-			goto end;
-		}
+	for (i = 0; i < ARRAY_SIZE(field_types); ++i) {
+		field_type = field_types[i].type;
+		field_type_name = field_types[i].name;
+		if (!(field_mask & field_type) || strcmp(name, field_type_name))
+			continue;
+		if (*seen_mask & field_type)
+			return -E2BIG;
+		*seen_mask |= field_type;
+		type = field_type;
+		goto end;
 	}
 	field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head");
 	field_mask_test_name(BPF_LIST_NODE, "bpf_list_node");
-- 
2.51.0



* [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func()
  2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
  2025-09-05 16:44 ` [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection Mykyta Yatsenko
@ 2025-09-05 16:45 ` Mykyta Yatsenko
  2025-09-05 21:15   ` Eduard Zingerman
                     ` (2 more replies)
  2025-09-05 16:45 ` [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs Mykyta Yatsenko
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-05 16:45 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Refactor the verifier by pulling the common logic from
process_timer_func() into a dedicated helper. This allows reusing the
process_async_func() helper to verify the bpf_task_work struct later in
the series.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/verifier.c | 39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b9394f8fac0e..a5d19a01d488 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8520,43 +8520,52 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno, int flags)
 	return 0;
 }
 
-static int process_timer_func(struct bpf_verifier_env *env, int regno,
-			      struct bpf_call_arg_meta *meta)
+static int process_async_func(struct bpf_verifier_env *env, int regno, struct bpf_map **map_ptr,
+			      int *map_uid, u32 rec_off, enum btf_field_type field_type,
+			      const char *struct_name)
 {
 	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
 	bool is_const = tnum_is_const(reg->var_off);
 	struct bpf_map *map = reg->map_ptr;
 	u64 val = reg->var_off.value;
+	int *struct_off = (void *)map->record + rec_off;
 
 	if (!is_const) {
 		verbose(env,
-			"R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
-			regno);
+			"R%d doesn't have constant offset. %s has to be at the constant offset\n",
+			regno, struct_name);
 		return -EINVAL;
 	}
 	if (!map->btf) {
-		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
-			map->name);
+		verbose(env, "map '%s' has to have BTF in order to use %s\n", map->name,
+			struct_name);
 		return -EINVAL;
 	}
-	if (!btf_record_has_field(map->record, BPF_TIMER)) {
-		verbose(env, "map '%s' has no valid bpf_timer\n", map->name);
+	if (!btf_record_has_field(map->record, field_type)) {
+		verbose(env, "map '%s' has no valid %s\n", map->name, struct_name);
 		return -EINVAL;
 	}
-	if (map->record->timer_off != val + reg->off) {
-		verbose(env, "off %lld doesn't point to 'struct bpf_timer' that is at %d\n",
-			val + reg->off, map->record->timer_off);
+	if (*struct_off != val + reg->off) {
+		verbose(env, "off %lld doesn't point to 'struct %s' that is at %d\n",
+			val + reg->off, struct_name, *struct_off);
 		return -EINVAL;
 	}
-	if (meta->map_ptr) {
-		verifier_bug(env, "Two map pointers in a timer helper");
+	if (*map_ptr) {
+		verifier_bug(env, "Two map pointers in a %s helper", struct_name);
 		return -EFAULT;
 	}
-	meta->map_uid = reg->map_uid;
-	meta->map_ptr = map;
+	*map_uid = reg->map_uid;
+	*map_ptr = map;
 	return 0;
 }
 
+static int process_timer_func(struct bpf_verifier_env *env, int regno,
+			      struct bpf_call_arg_meta *meta)
+{
+	return process_async_func(env, regno, &meta->map_ptr, &meta->map_uid,
+				  offsetof(struct btf_record, timer_off), BPF_TIMER, "bpf_timer");
+}
+
 static int process_wq_func(struct bpf_verifier_env *env, int regno,
 			   struct bpf_kfunc_call_arg_meta *meta)
 {
-- 
2.51.0



* [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs
  2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
  2025-09-05 16:44 ` [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection Mykyta Yatsenko
  2025-09-05 16:45 ` [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func() Mykyta Yatsenko
@ 2025-09-05 16:45 ` Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-05 21:31   ` Eduard Zingerman
  2025-09-05 16:45 ` [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing Mykyta Yatsenko
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-05 16:45 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Extract the cleanup of known embedded structs into a dedicated helper.
This removes duplication and introduces a single source of truth for
freeing special embedded structs in hashtab.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/hashtab.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 71f9931ac64c..2319f8f8fa3e 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -215,6 +215,16 @@ static bool htab_has_extra_elems(struct bpf_htab *htab)
 	return !htab_is_percpu(htab) && !htab_is_lru(htab) && !is_fd_htab(htab);
 }
 
+static void htab_free_internal_structs(struct bpf_htab *htab, struct htab_elem *elem)
+{
+	if (btf_record_has_field(htab->map.record, BPF_TIMER))
+		bpf_obj_free_timer(htab->map.record,
+				   htab_elem_value(elem, htab->map.key_size));
+	if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
+		bpf_obj_free_workqueue(htab->map.record,
+				       htab_elem_value(elem, htab->map.key_size));
+}
+
 static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
 {
 	u32 num_entries = htab->map.max_entries;
@@ -227,12 +237,7 @@ static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
 		struct htab_elem *elem;
 
 		elem = get_htab_elem(htab, i);
-		if (btf_record_has_field(htab->map.record, BPF_TIMER))
-			bpf_obj_free_timer(htab->map.record,
-					   htab_elem_value(elem, htab->map.key_size));
-		if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
-			bpf_obj_free_workqueue(htab->map.record,
-					       htab_elem_value(elem, htab->map.key_size));
+		htab_free_internal_structs(htab, elem);
 		cond_resched();
 	}
 }
@@ -1502,12 +1507,7 @@ static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab)
 
 		hlist_nulls_for_each_entry(l, n, head, hash_node) {
 			/* We only free timer on uref dropping to zero */
-			if (btf_record_has_field(htab->map.record, BPF_TIMER))
-				bpf_obj_free_timer(htab->map.record,
-						   htab_elem_value(l, htab->map.key_size));
-			if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
-				bpf_obj_free_workqueue(htab->map.record,
-						       htab_elem_value(l, htab->map.key_size));
+			htab_free_internal_structs(htab, l);
 		}
 		cond_resched_rcu();
 	}
-- 
2.51.0



* [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing
  2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
                   ` (2 preceding siblings ...)
  2025-09-05 16:45 ` [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs Mykyta Yatsenko
@ 2025-09-05 16:45 ` Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-05 23:09   ` Eduard Zingerman
  2025-09-05 16:45 ` [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation Mykyta Yatsenko
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-05 16:45 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

This patch adds the necessary plumbing in the verifier, syscall and map
code to support the new bpf_task_work_schedule kfuncs and the
bpf_task_work kernel structure. The idea is similar to how bpf_wq and
bpf_timer are already handled.
The verifier changes validate calls to bpf_task_work_schedule to make
sure they are safe and the expected invariants hold.
The BTF part is required to detect the bpf_task_work structure inside a
map value and store its offset, which is later used to calculate key and
value addresses.
The arraymap and hashtab changes are needed to handle freeing of
bpf_task_work: run the code needed to deinitialize it, for example
cancelling the task_work callback if possible.
The actual use of bpf_task_work and the full kfunc implementation are
introduced later in the series.
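
For reference, the callback contract that the verifier sets up here (via
set_task_work_schedule_callback_state()) mirrors the timer/wq callbacks;
a BPF-supplied callback would roughly look like this hypothetical sketch:

  /* Argument registers as prepared by the verifier:
   * R1 = CONST_PTR_TO_MAP, R2 = PTR_TO_MAP_KEY, R3 = PTR_TO_MAP_VALUE
   */
  static void my_task_work_cb(struct bpf_map *map, void *key, void *value)
  {
  	/* invoked from the target task's context once the task_work fires */
  }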

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 include/linux/bpf.h            | 11 ++++
 include/uapi/linux/bpf.h       |  4 ++
 kernel/bpf/arraymap.c          |  8 +--
 kernel/bpf/btf.c               |  9 +++-
 kernel/bpf/hashtab.c           | 19 ++++---
 kernel/bpf/helpers.c           | 40 ++++++++++++++
 kernel/bpf/syscall.c           | 16 +++++-
 kernel/bpf/verifier.c          | 97 +++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h |  4 ++
 9 files changed, 193 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8f6e87f0f3a8..febb4ca68401 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -206,6 +206,7 @@ enum btf_field_type {
 	BPF_WORKQUEUE  = (1 << 10),
 	BPF_UPTR       = (1 << 11),
 	BPF_RES_SPIN_LOCK = (1 << 12),
+	BPF_TASK_WORK  = (1 << 13),
 };
 
 enum bpf_cgroup_storage_type {
@@ -259,6 +260,7 @@ struct btf_record {
 	int timer_off;
 	int wq_off;
 	int refcount_off;
+	int task_work_off;
 	struct btf_field fields[];
 };
 
@@ -358,6 +360,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 		return "bpf_rb_node";
 	case BPF_REFCOUNT:
 		return "bpf_refcount";
+	case BPF_TASK_WORK:
+		return "bpf_task_work";
 	default:
 		WARN_ON_ONCE(1);
 		return "unknown";
@@ -396,6 +400,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 		return sizeof(struct bpf_rb_node);
 	case BPF_REFCOUNT:
 		return sizeof(struct bpf_refcount);
+	case BPF_TASK_WORK:
+		return sizeof(struct bpf_task_work);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -428,6 +434,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 		return __alignof__(struct bpf_rb_node);
 	case BPF_REFCOUNT:
 		return __alignof__(struct bpf_refcount);
+	case BPF_TASK_WORK:
+		return __alignof__(struct bpf_task_work);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -459,6 +467,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
 	case BPF_UPTR:
+	case BPF_TASK_WORK:
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -595,6 +604,7 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
 void bpf_timer_cancel_and_free(void *timer);
 void bpf_wq_cancel_and_free(void *timer);
+void bpf_task_work_cancel_and_free(void *timer);
 void bpf_list_head_free(const struct btf_field *field, void *list_head,
 			struct bpf_spin_lock *spin_lock);
 void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
@@ -2417,6 +2427,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec);
 bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *rec_b);
 void bpf_obj_free_timer(const struct btf_record *rec, void *obj);
 void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj);
+void bpf_obj_free_task_work(const struct btf_record *rec, void *obj);
 void bpf_obj_free_fields(const struct btf_record *rec, void *obj);
 void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 233de8677382..5b42faff9aeb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7418,6 +7418,10 @@ struct bpf_timer {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
 
+struct bpf_task_work {
+	__u64 __opaque;
+} __attribute__((aligned(8)));
+
 struct bpf_wq {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 3d080916faf9..4130d8e76dff 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -431,7 +431,7 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
 	return (void *)round_down((unsigned long)array, PAGE_SIZE);
 }
 
-static void array_map_free_timers_wq(struct bpf_map *map)
+static void array_map_free_internal_structs(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
@@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
 	/* We don't reset or free fields other than timer and workqueue
 	 * on uref dropping to zero.
 	 */
-	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
+	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
 		for (i = 0; i < array->map.max_entries; i++) {
 			if (btf_record_has_field(map->record, BPF_TIMER))
 				bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
 			if (btf_record_has_field(map->record, BPF_WORKQUEUE))
 				bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
+			if (btf_record_has_field(map->record, BPF_TASK_WORK))
+				bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
 		}
 	}
 }
@@ -783,7 +785,7 @@ const struct bpf_map_ops array_map_ops = {
 	.map_alloc = array_map_alloc,
 	.map_free = array_map_free,
 	.map_get_next_key = array_map_get_next_key,
-	.map_release_uref = array_map_free_timers_wq,
+	.map_release_uref = array_map_free_internal_structs,
 	.map_lookup_elem = array_map_lookup_elem,
 	.map_update_elem = array_map_update_elem,
 	.map_delete_elem = array_map_delete_elem,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a1a9bc589518..73ca21911b30 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3494,7 +3494,8 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
 	} field_types[] = { { BPF_SPIN_LOCK, "bpf_spin_lock" },
 			    { BPF_RES_SPIN_LOCK, "bpf_res_spin_lock" },
 			    { BPF_TIMER, "bpf_timer" },
-			    { BPF_WORKQUEUE, "bpf_wq" }};
+			    { BPF_WORKQUEUE, "bpf_wq" },
+			    { BPF_TASK_WORK, "bpf_task_work" } };
 	int type = 0, i;
 	const char *name = __btf_name_by_offset(btf, var_type->name_off);
 	const char *field_type_name;
@@ -3677,6 +3678,7 @@ static int btf_find_field_one(const struct btf *btf,
 	case BPF_LIST_NODE:
 	case BPF_RB_NODE:
 	case BPF_REFCOUNT:
+	case BPF_TASK_WORK:
 		ret = btf_find_struct(btf, var_type, off, sz, field_type,
 				      info_cnt ? &info[0] : &tmp);
 		if (ret < 0)
@@ -3969,6 +3971,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 	rec->timer_off = -EINVAL;
 	rec->wq_off = -EINVAL;
 	rec->refcount_off = -EINVAL;
+	rec->task_work_off = -EINVAL;
 	for (i = 0; i < cnt; i++) {
 		field_type_size = btf_field_type_size(info_arr[i].type);
 		if (info_arr[i].off + field_type_size > value_size) {
@@ -4034,6 +4037,10 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 			break;
+		case BPF_TASK_WORK:
+			WARN_ON_ONCE(rec->task_work_off >= 0);
+			rec->task_work_off = rec->fields[i].offset;
+			break;
 		default:
 			ret = -EFAULT;
 			goto end;
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 2319f8f8fa3e..c2fcd0cd51e5 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -223,9 +223,12 @@ static void htab_free_internal_structs(struct bpf_htab *htab, struct htab_elem *
 	if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
 		bpf_obj_free_workqueue(htab->map.record,
 				       htab_elem_value(elem, htab->map.key_size));
+	if (btf_record_has_field(htab->map.record, BPF_TASK_WORK))
+		bpf_obj_free_task_work(htab->map.record,
+				       htab_elem_value(elem, htab->map.key_size));
 }
 
-static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
+static void htab_free_prealloced_internal_structs(struct bpf_htab *htab)
 {
 	u32 num_entries = htab->map.max_entries;
 	int i;
@@ -1495,7 +1498,7 @@ static void delete_all_elements(struct bpf_htab *htab)
 	}
 }
 
-static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab)
+static void htab_free_malloced_internal_structs(struct bpf_htab *htab)
 {
 	int i;
 
@@ -1514,16 +1517,16 @@ static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab)
 	rcu_read_unlock();
 }
 
-static void htab_map_free_timers_and_wq(struct bpf_map *map)
+static void htab_map_free_internal_structs(struct bpf_map *map)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 
 	/* We only free timer and workqueue on uref dropping to zero */
-	if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE)) {
+	if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
 		if (!htab_is_prealloc(htab))
-			htab_free_malloced_timers_and_wq(htab);
+			htab_free_malloced_internal_structs(htab);
 		else
-			htab_free_prealloced_timers_and_wq(htab);
+			htab_free_prealloced_internal_structs(htab);
 	}
 }
 
@@ -2255,7 +2258,7 @@ const struct bpf_map_ops htab_map_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
-	.map_release_uref = htab_map_free_timers_and_wq,
+	.map_release_uref = htab_map_free_internal_structs,
 	.map_lookup_elem = htab_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem,
 	.map_update_elem = htab_map_update_elem,
@@ -2276,7 +2279,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
-	.map_release_uref = htab_map_free_timers_and_wq,
+	.map_release_uref = htab_map_free_internal_structs,
 	.map_lookup_elem = htab_lru_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_lru_map_lookup_and_delete_elem,
 	.map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 588bc7e36436..89a5d8808ce8 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3738,8 +3738,48 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
 	return bpf_strnstr(s1__ign, s2__ign, XATTR_SIZE_MAX);
 }
 
+typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
+
+/**
+ * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
+ * @task: Task struct for which callback should be scheduled
+ * @tw: Pointer to struct bpf_task_work in BPF map value for internal bookkeeping
+ * @map__map: bpf_map that embeds struct bpf_task_work in the values
+ * @callback: pointer to BPF subprogram to call
+ * @aux__prog: user should pass NULL
+ *
+ * Return: 0 if task work has been scheduled successfully, negative error code otherwise
+ */
+__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
+					      struct bpf_map *map__map,
+					      bpf_task_work_callback_t callback, void *aux__prog)
+{
+	return 0;
+}
+
+/**
+ * bpf_task_work_schedule_resume - Schedule BPF callback using task_work_add with TWA_RESUME mode
+ * @task: Task struct for which callback should be scheduled
+ * @tw: Pointer to struct bpf_task_work in BPF map value for internal bookkeeping
+ * @map__map: bpf_map that embeds struct bpf_task_work in the values
+ * @callback: pointer to BPF subprogram to call
+ * @aux__prog: user should pass NULL
+ *
+ * Return: 0 if task work has been scheduled successfully, negative error code otherwise
+ */
+__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
+					      struct bpf_map *map__map,
+					      bpf_task_work_callback_t callback, void *aux__prog)
+{
+	return 0;
+}
+
 __bpf_kfunc_end_defs();
 
+void bpf_task_work_cancel_and_free(void *val)
+{
+}
+
 BTF_KFUNCS_START(generic_btf_ids)
 #ifdef CONFIG_CRASH_DUMP
 BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fbfa8532c39..7da1ca893dfe 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -672,6 +672,7 @@ void btf_record_free(struct btf_record *rec)
 		case BPF_TIMER:
 		case BPF_REFCOUNT:
 		case BPF_WORKQUEUE:
+		case BPF_TASK_WORK:
 			/* Nothing to release */
 			break;
 		default:
@@ -725,6 +726,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 		case BPF_TIMER:
 		case BPF_REFCOUNT:
 		case BPF_WORKQUEUE:
+		case BPF_TASK_WORK:
 			/* Nothing to acquire */
 			break;
 		default:
@@ -783,6 +785,13 @@ void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj)
 	bpf_wq_cancel_and_free(obj + rec->wq_off);
 }
 
+void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
+{
+	if (WARN_ON_ONCE(!btf_record_has_field(rec, BPF_TASK_WORK)))
+		return;
+	bpf_task_work_cancel_and_free(obj + rec->task_work_off);
+}
+
 void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 {
 	const struct btf_field *fields;
@@ -840,6 +849,9 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 				continue;
 			bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
 			break;
+		case BPF_TASK_WORK:
+			bpf_task_work_cancel_and_free(field_ptr);
+			break;
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 		case BPF_REFCOUNT:
@@ -1237,7 +1249,8 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 
 	map->record = btf_parse_fields(btf, value_type,
 				       BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
-				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR,
+				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR |
+				       BPF_TASK_WORK,
 				       map->value_size);
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
@@ -1269,6 +1282,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 				break;
 			case BPF_TIMER:
 			case BPF_WORKQUEUE:
+			case BPF_TASK_WORK:
 				if (map->map_type != BPF_MAP_TYPE_HASH &&
 				    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
 				    map->map_type != BPF_MAP_TYPE_ARRAY) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a5d19a01d488..6152536a834f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2240,6 +2240,8 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
 				reg->map_uid = reg->id;
 			if (btf_record_has_field(map->inner_map_meta->record, BPF_WORKQUEUE))
 				reg->map_uid = reg->id;
+			if (btf_record_has_field(map->inner_map_meta->record, BPF_TASK_WORK))
+				reg->map_uid = reg->id;
 		} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
 			reg->type = PTR_TO_XDP_SOCK;
 		} else if (map->map_type == BPF_MAP_TYPE_SOCKMAP ||
@@ -8583,6 +8585,14 @@ static int process_wq_func(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_task_work_func(struct bpf_verifier_env *env, int regno,
+				  struct bpf_kfunc_call_arg_meta *meta)
+{
+	return process_async_func(env, regno, &meta->map.ptr, &meta->map.uid,
+				  offsetof(struct btf_record, task_work_off), BPF_TASK_WORK,
+				  "bpf_task_work");
+}
+
 static int process_kptr_func(struct bpf_verifier_env *env, int regno,
 			     struct bpf_call_arg_meta *meta)
 {
@@ -10412,6 +10422,8 @@ typedef int (*set_callee_state_fn)(struct bpf_verifier_env *env,
 				   struct bpf_func_state *callee,
 				   int insn_idx);
 
+static bool is_task_work_add_kfunc(u32 func_id);
+
 static int set_callee_state(struct bpf_verifier_env *env,
 			    struct bpf_func_state *caller,
 			    struct bpf_func_state *callee, int insn_idx);
@@ -10630,7 +10642,8 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins
 		env->subprog_info[subprog].is_async_cb = true;
 		async_cb = push_async_cb(env, env->subprog_info[subprog].start,
 					 insn_idx, subprog,
-					 is_bpf_wq_set_callback_impl_kfunc(insn->imm));
+					 is_bpf_wq_set_callback_impl_kfunc(insn->imm) ||
+					 is_task_work_add_kfunc(insn->imm));
 		if (!async_cb)
 			return -EFAULT;
 		callee = async_cb->frame[0];
@@ -10943,6 +10956,35 @@ static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_task_work_schedule_callback_state(struct bpf_verifier_env *env,
+						 struct bpf_func_state *caller,
+						 struct bpf_func_state *callee,
+						 int insn_idx)
+{
+	struct bpf_map *map_ptr = caller->regs[BPF_REG_3].map_ptr;
+
+	/*
+	 * callback_fn(struct bpf_map *map, void *key, void *value);
+	 */
+	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
+	callee->regs[BPF_REG_3].map_ptr = map_ptr;
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static bool is_rbtree_lock_required_kfunc(u32 btf_id);
 
 /* Are we currently verifying the callback for a rbtree helper that must
@@ -12074,6 +12116,7 @@ enum {
 	KF_ARG_RB_NODE_ID,
 	KF_ARG_WORKQUEUE_ID,
 	KF_ARG_RES_SPIN_LOCK_ID,
+	KF_ARG_TASK_WORK_ID,
 };
 
 BTF_ID_LIST(kf_arg_btf_ids)
@@ -12084,6 +12127,7 @@ BTF_ID(struct, bpf_rb_root)
 BTF_ID(struct, bpf_rb_node)
 BTF_ID(struct, bpf_wq)
 BTF_ID(struct, bpf_res_spin_lock)
+BTF_ID(struct, bpf_task_work)
 
 static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
 				    const struct btf_param *arg, int type)
@@ -12132,6 +12176,11 @@ static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg)
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID);
 }
 
+static bool is_kfunc_arg_task_work(const struct btf *btf, const struct btf_param *arg)
+{
+	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TASK_WORK_ID);
+}
+
 static bool is_kfunc_arg_res_spin_lock(const struct btf *btf, const struct btf_param *arg)
 {
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RES_SPIN_LOCK_ID);
@@ -12219,6 +12268,7 @@ enum kfunc_ptr_arg_type {
 	KF_ARG_PTR_TO_WORKQUEUE,
 	KF_ARG_PTR_TO_IRQ_FLAG,
 	KF_ARG_PTR_TO_RES_SPIN_LOCK,
+	KF_ARG_PTR_TO_TASK_WORK,
 };
 
 enum special_kfunc_type {
@@ -12268,6 +12318,8 @@ enum special_kfunc_type {
 	KF_bpf_res_spin_lock_irqsave,
 	KF_bpf_res_spin_unlock_irqrestore,
 	KF___bpf_trap,
+	KF_bpf_task_work_schedule_signal,
+	KF_bpf_task_work_schedule_resume,
 };
 
 BTF_ID_LIST(special_kfunc_list)
@@ -12336,6 +12388,14 @@ BTF_ID(func, bpf_res_spin_unlock)
 BTF_ID(func, bpf_res_spin_lock_irqsave)
 BTF_ID(func, bpf_res_spin_unlock_irqrestore)
 BTF_ID(func, __bpf_trap)
+BTF_ID(func, bpf_task_work_schedule_signal)
+BTF_ID(func, bpf_task_work_schedule_resume)
+
+static bool is_task_work_add_kfunc(u32 func_id)
+{
+	return func_id == special_kfunc_list[KF_bpf_task_work_schedule_signal] ||
+	       func_id == special_kfunc_list[KF_bpf_task_work_schedule_resume];
+}
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -12426,6 +12486,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (is_kfunc_arg_wq(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_WORKQUEUE;
 
+	if (is_kfunc_arg_task_work(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_TASK_WORK;
+
 	if (is_kfunc_arg_irq_flag(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_IRQ_FLAG;
 
@@ -12769,7 +12832,8 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id)
 
 static bool is_async_callback_calling_kfunc(u32 btf_id)
 {
-	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
+	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl] ||
+	       is_task_work_add_kfunc(btf_id);
 }
 
 static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
@@ -13171,6 +13235,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 					return -EINVAL;
 				}
 			}
+			if (meta->map.ptr && reg->map_ptr->record->task_work_off >= 0) {
+				if (meta->map.ptr != reg->map_ptr ||
+				    meta->map.uid != reg->map_uid) {
+					verbose(env,
+						"bpf_task_work pointer in R2 map_uid=%d doesn't match map pointer in R3 map_uid=%d\n",
+						meta->map.uid, reg->map_uid);
+					return -EINVAL;
+				}
+			}
 			meta->map.ptr = reg->map_ptr;
 			meta->map.uid = reg->map_uid;
 			fallthrough;
@@ -13203,6 +13276,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 		case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
 		case KF_ARG_PTR_TO_CONST_STR:
 		case KF_ARG_PTR_TO_WORKQUEUE:
+		case KF_ARG_PTR_TO_TASK_WORK:
 		case KF_ARG_PTR_TO_IRQ_FLAG:
 		case KF_ARG_PTR_TO_RES_SPIN_LOCK:
 			break;
@@ -13496,6 +13570,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			if (ret < 0)
 				return ret;
 			break;
+		case KF_ARG_PTR_TO_TASK_WORK:
+			if (reg->type != PTR_TO_MAP_VALUE) {
+				verbose(env, "arg#%d doesn't point to a map value\n", i);
+				return -EINVAL;
+			}
+			ret = process_task_work_func(env, regno, meta);
+			if (ret < 0)
+				return ret;
+			break;
 		case KF_ARG_PTR_TO_IRQ_FLAG:
 			if (reg->type != PTR_TO_STACK) {
 				verbose(env, "arg#%d doesn't point to an irq flag on stack\n", i);
@@ -13862,6 +13945,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		}
 	}
 
+	if (is_task_work_add_kfunc(meta.func_id)) {
+		err = push_callback_call(env, insn, insn_idx, meta.subprogno,
+					 set_task_work_schedule_callback_state);
+		if (err) {
+			verbose(env, "kfunc %s#%d failed callback verification\n",
+				func_name, meta.func_id);
+			return err;
+		}
+	}
+
 	rcu_lock = is_kfunc_bpf_rcu_read_lock(&meta);
 	rcu_unlock = is_kfunc_bpf_rcu_read_unlock(&meta);
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 233de8677382..5b42faff9aeb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7418,6 +7418,10 @@ struct bpf_timer {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
 
+struct bpf_task_work {
+	__u64 __opaque;
+} __attribute__((aligned(8)));
+
 struct bpf_wq {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
-- 
2.51.0



* [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation
  2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
                   ` (3 preceding siblings ...)
  2025-09-05 16:45 ` [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing Mykyta Yatsenko
@ 2025-09-05 16:45 ` Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-05 23:19   ` Eduard Zingerman
  2025-09-05 16:45 ` [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs Mykyta Yatsenko
  2025-09-05 16:45 ` [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
  6 siblings, 2 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-05 16:45 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Calculation of the BPF map key, given a pointer to the map value, is
already duplicated in a couple of places in helpers.c, and the next
patch introduces another use case.
This patch extracts that functionality into a separate function.
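
For context, the key recovery relies on the existing element layouts;
restating the logic being factored out:

  /*
   * Array maps: the key is the element index, recomputed from the value
   * pointer:
   *	idx = ((char *)value - array->value) / array->elem_size;
   *
   * Hash/LRU maps: the key is stored immediately before the value,
   * padded to 8 bytes:
   *	key = value - round_up(map->key_size, 8);
   */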

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/helpers.c | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 89a5d8808ce8..109cb249e88c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1081,6 +1081,18 @@ const struct bpf_func_proto bpf_snprintf_proto = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
+static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
+{
+	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+		struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+		*arr_idx = ((char *)value - array->value) / array->elem_size;
+		return arr_idx;
+	}
+	BUG_ON(map->map_type != BPF_MAP_TYPE_HASH && map->map_type != BPF_MAP_TYPE_LRU_HASH);
+	return (void *)value - round_up(map->key_size, 8);
+}
+
 struct bpf_async_cb {
 	struct bpf_map *map;
 	struct bpf_prog *prog;
@@ -1163,15 +1175,8 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
 	 * bpf_map_delete_elem() on the same timer.
 	 */
 	this_cpu_write(hrtimer_running, t);
-	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
-		struct bpf_array *array = container_of(map, struct bpf_array, map);
 
-		/* compute the key */
-		idx = ((char *)value - array->value) / array->elem_size;
-		key = &idx;
-	} else { /* hash or lru */
-		key = value - round_up(map->key_size, 8);
-	}
+	key = map_key_from_value(map, value, &idx);
 
 	callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
 	/* The verifier checked that return value is zero. */
@@ -1197,15 +1202,7 @@ static void bpf_wq_work(struct work_struct *work)
 	if (!callback_fn)
 		return;
 
-	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
-		struct bpf_array *array = container_of(map, struct bpf_array, map);
-
-		/* compute the key */
-		idx = ((char *)value - array->value) / array->elem_size;
-		key = &idx;
-	} else { /* hash or lru */
-		key = value - round_up(map->key_size, 8);
-	}
+	key = map_key_from_value(map, value, &idx);
 
         rcu_read_lock_trace();
         migrate_disable();
-- 
2.51.0



* [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
                   ` (4 preceding siblings ...)
  2025-09-05 16:45 ` [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation Mykyta Yatsenko
@ 2025-09-05 16:45 ` Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
                     ` (2 more replies)
  2025-09-05 16:45 ` [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
  6 siblings, 3 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-05 16:45 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Implement the new bpf_task_work_schedule kfuncs, which let a BPF program
schedule task_work callbacks for a target task:
 * bpf_task_work_schedule_signal() → schedules with TWA_SIGNAL
 * bpf_task_work_schedule_resume() → schedules with TWA_RESUME

Each map value should embed a struct bpf_task_work, which the kernel
side pairs with struct bpf_task_work_kern, containing a pointer to
struct bpf_task_work_ctx that maintains the metadata relevant to the
concrete callback scheduling.

A small state machine and refcounting scheme ensure safe reuse and
teardown:
 STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY

A FREED terminal state coordinates with map-value
deletion (bpf_task_work_cancel_and_free()).

Scheduling itself is deferred via irq_work to keep the kfunc callable
from NMI context.

Lifetime is guarded with refcount_t + RCU Tasks Trace.

Main components:
 * struct bpf_task_work_ctx – Metadata and state management per task
 work.
 * enum bpf_task_work_state – A state machine to serialize work
 scheduling and execution.
 * bpf_task_work_schedule() – The central helper that initiates
 scheduling.
 * bpf_task_work_acquire_ctx() – Attempts to take ownership of the
 context pointed to by the passed struct bpf_task_work, allocating a new
 context if none exists yet.
 * bpf_task_work_callback() – Invoked when the actual task_work runs.
 * bpf_task_work_irq() – An intermediate step (runs in softirq context)
 to enqueue task work.
 * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map
 entries.

Flow of successful task work scheduling:
 1) bpf_task_work_schedule_*() is called from BPF code.
 2) Transition state from STANDBY to PENDING, which marks the context as
 owned by this scheduling attempt.
 3) irq_work_queue() schedules bpf_task_work_irq().
 4) Transition state from PENDING to SCHEDULING.
 5) bpf_task_work_irq() attempts task_work_add(). If successful, state
 transitions to SCHEDULED.
 6) Task work calls bpf_task_work_callback(), which transitions state to
 RUNNING.
 7) The BPF callback is executed.
 8) The context is cleaned up, refcounts are released, and the state is
 set back to STANDBY.
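
For orientation, the per-function state transitions condense to the
following (a summary of the code in this patch, not additional
behavior):

  /*
   * bpf_task_work_acquire_ctx(): STANDBY    -> PENDING    (take ownership)
   * bpf_task_work_irq():         PENDING    -> SCHEDULING (before task_work_add())
   *                              SCHEDULING -> SCHEDULED  (task_work_add() succeeded)
   * bpf_task_work_callback():    SCHEDULING or SCHEDULED -> RUNNING
   *                              RUNNING    -> STANDBY    (callback done, reusable)
   *
   * bpf_task_work_cancel_and_free() moves any state to FREED; whichever
   * side observes FREED drops its reference(s) and backs off.
   */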

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 317 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 109cb249e88c..418a0a211699 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -25,6 +25,8 @@
 #include <linux/kasan.h>
 #include <linux/bpf_verifier.h>
 #include <linux/uaccess.h>
+#include <linux/task_work.h>
+#include <linux/irq_work.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -3737,6 +3739,292 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
 
 typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
 
+enum bpf_task_work_state {
+	/* bpf_task_work is ready to be used */
+	BPF_TW_STANDBY = 0,
+	/* irq work scheduling in progress */
+	BPF_TW_PENDING,
+	/* task work scheduling in progress */
+	BPF_TW_SCHEDULING,
+	/* task work is scheduled successfully */
+	BPF_TW_SCHEDULED,
+	/* callback is running */
+	BPF_TW_RUNNING,
+	/* associated BPF map value is deleted */
+	BPF_TW_FREED,
+};
+
+struct bpf_task_work_ctx {
+	enum bpf_task_work_state state;
+	refcount_t refcnt;
+	struct callback_head work;
+	struct irq_work irq_work;
+	/* bpf_prog that schedules task work */
+	struct bpf_prog *prog;
+	/* task for which callback is scheduled */
+	struct task_struct *task;
+	/* the map and map value associated with this context */
+	struct bpf_map *map;
+	void *map_val;
+	enum task_work_notify_mode mode;
+	bpf_task_work_callback_t callback_fn;
+	struct rcu_head rcu;
+} __aligned(8);
+
+/* Actual type for struct bpf_task_work */
+struct bpf_task_work_kern {
+	struct bpf_task_work_ctx *ctx;
+};
+
+static void bpf_task_work_ctx_free_rcu_gp(struct rcu_head *rcu)
+{
+	struct bpf_task_work_ctx *ctx = container_of(rcu, struct bpf_task_work_ctx, rcu);
+
+	/* bpf_mem_free expects migration to be disabled */
+	migrate_disable();
+	bpf_mem_free(&bpf_global_ma, ctx);
+	migrate_enable();
+}
+
+static void bpf_task_work_ctx_free_mult_rcu_gp(struct rcu_head *rcu)
+{
+	if (rcu_trace_implies_rcu_gp())
+		bpf_task_work_ctx_free_rcu_gp(rcu);
+	else
+		call_rcu(rcu, bpf_task_work_ctx_free_rcu_gp);
+}
+
+static void bpf_task_work_ctx_reset(struct bpf_task_work_ctx *ctx)
+{
+	if (ctx->prog) {
+		bpf_prog_put(ctx->prog);
+		ctx->prog = NULL;
+	}
+	if (ctx->task) {
+		bpf_task_release(ctx->task);
+		ctx->task = NULL;
+	}
+}
+
+static bool bpf_task_work_ctx_tryget(struct bpf_task_work_ctx *ctx)
+{
+	return refcount_inc_not_zero(&ctx->refcnt);
+}
+
+static void bpf_task_work_ctx_put(struct bpf_task_work_ctx *ctx)
+{
+	if (!refcount_dec_and_test(&ctx->refcnt))
+		return;
+
+	bpf_task_work_ctx_reset(ctx);
+	call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_ctx_free_mult_rcu_gp);
+}
+
+static bool task_work_match(struct callback_head *head, void *data)
+{
+	struct bpf_task_work_ctx *ctx = container_of(head, struct bpf_task_work_ctx, work);
+
+	return ctx == data;
+}
+
+static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
+{
+	/*
+	 * Scheduled task_work callback holds ctx ref, so if we successfully
+	 * cancelled, we put that ref on callback's behalf. If we couldn't
+	 * cancel, callback is inevitably run or has already completed
+	 * running, and it would have taken care of its ctx ref itself.
+	 */
+	if (task_work_cancel_match(ctx->task, task_work_match, ctx))
+		bpf_task_work_ctx_put(ctx);
+}
+
+static void bpf_task_work_callback(struct callback_head *cb)
+{
+	struct bpf_task_work_ctx *ctx = container_of(cb, struct bpf_task_work_ctx, work);
+	enum bpf_task_work_state state;
+	u32 idx;
+	void *key;
+
+	/* Read lock is needed to protect ctx and map key/value access */
+	guard(rcu_tasks_trace)();
+	/*
+	 * This callback may start running before bpf_task_work_irq() switched to
+	 * SCHEDULED state, so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
+	 */
+	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
+	if (state == BPF_TW_SCHEDULED)
+		state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
+	if (state == BPF_TW_FREED) {
+		bpf_task_work_ctx_put(ctx);
+		return;
+	}
+
+	key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
+
+	migrate_disable();
+	ctx->callback_fn(ctx->map, key, ctx->map_val);
+	migrate_enable();
+
+	bpf_task_work_ctx_reset(ctx);
+	(void)cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
+
+	bpf_task_work_ctx_put(ctx);
+}
+
+static void bpf_task_work_irq(struct irq_work *irq_work)
+{
+	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
+	enum bpf_task_work_state state;
+	int err;
+
+	guard(rcu_tasks_trace)();
+
+	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
+		bpf_task_work_ctx_put(ctx);
+		return;
+	}
+
+	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
+	if (err) {
+		bpf_task_work_ctx_reset(ctx);
+		/*
+		 * try to switch back to STANDBY for another task_work reuse, but we might have
+		 * gone to FREED already, which is fine as we already cleaned up after ourselves
+		 */
+		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
+
+		/* we don't have RCU protection, so put after switching state */
+		bpf_task_work_ctx_put(ctx);
+	}
+
+	/*
+	 * It's technically possible for just scheduled task_work callback to
+	 * complete running by now, going SCHEDULING -> RUNNING and then
+	 * dropping its ctx refcount. Instead of capturing extra ref just to
+	 * protected below ctx->state access, we rely on RCU protection to
+	 * perform below SCHEDULING -> SCHEDULED attempt.
+	 */
+	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
+	if (state == BPF_TW_FREED)
+		bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
+}
+
+static struct bpf_task_work_ctx *bpf_task_work_fetch_ctx(struct bpf_task_work *tw,
+							 struct bpf_map *map)
+{
+	struct bpf_task_work_kern *twk = (void *)tw;
+	struct bpf_task_work_ctx *ctx, *old_ctx;
+
+	ctx = READ_ONCE(twk->ctx);
+	if (ctx)
+		return ctx;
+
+	ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_ctx));
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	memset(ctx, 0, sizeof(*ctx));
+	refcount_set(&ctx->refcnt, 1); /* map's own ref */
+	ctx->state = BPF_TW_STANDBY;
+
+	old_ctx = cmpxchg(&twk->ctx, NULL, ctx);
+	if (old_ctx) {
+		/*
+		 * tw->ctx is set by concurrent BPF program, release allocated
+		 * memory and try to reuse already set context.
+		 */
+		bpf_mem_free(&bpf_global_ma, ctx);
+		return old_ctx;
+	}
+
+	return ctx; /* Success */
+}
+
+static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
+							   struct bpf_map *map)
+{
+	struct bpf_task_work_ctx *ctx;
+
+	/* early check to avoid any work, we'll double check at the end again */
+	if (!atomic64_read(&map->usercnt))
+		return ERR_PTR(-EBUSY);
+
+	ctx = bpf_task_work_fetch_ctx(tw, map);
+	if (IS_ERR(ctx))
+		return ctx;
+
+	/* try to get ref for task_work callback to hold */
+	if (!bpf_task_work_ctx_tryget(ctx))
+		return ERR_PTR(-EBUSY);
+
+	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
+		/* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
+		bpf_task_work_ctx_put(ctx);
+		return ERR_PTR(-EBUSY);
+	}
+
+	/*
+	 * Double check that map->usercnt wasn't dropped while we were
+	 * preparing context, and if it was, we need to clean up as if
+	 * map_release_uref() was called; bpf_task_work_cancel_and_free()
+	 * is safe to be called twice on the same task work
+	 */
+	if (!atomic64_read(&map->usercnt)) {
+		/* drop ref we just got for task_work callback itself */
+		bpf_task_work_ctx_put(ctx);
+		/* transfer map's ref into cancel_and_free() */
+		bpf_task_work_cancel_and_free(tw);
+		return ERR_PTR(-EBUSY);
+	}
+
+	return ctx;
+}
+
+static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
+				  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
+				  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
+{
+	struct bpf_prog *prog;
+	struct bpf_task_work_ctx *ctx;
+	int err;
+
+	BTF_TYPE_EMIT(struct bpf_task_work);
+
+	prog = bpf_prog_inc_not_zero(aux->prog);
+	if (IS_ERR(prog))
+		return -EBADF;
+	task = bpf_task_acquire(task);
+	if (!task) {
+		err = -EPERM;
+		goto release_prog;
+	}
+
+	ctx = bpf_task_work_acquire_ctx(tw, map);
+	if (IS_ERR(ctx)) {
+		err = PTR_ERR(ctx);
+		goto release_all;
+	}
+
+	ctx->task = task;
+	ctx->callback_fn = callback_fn;
+	ctx->prog = prog;
+	ctx->mode = mode;
+	ctx->map = map;
+	ctx->map_val = (void *)tw - map->record->task_work_off;
+	init_task_work(&ctx->work, bpf_task_work_callback);
+	init_irq_work(&ctx->irq_work, bpf_task_work_irq);
+
+	irq_work_queue(&ctx->irq_work);
+	return 0;
+
+release_all:
+	bpf_task_release(task);
+release_prog:
+	bpf_prog_put(prog);
+	return err;
+}
+
 /**
  * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
  * @task: Task struct for which callback should be scheduled
@@ -3751,7 +4039,7 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct b
 					      struct bpf_map *map__map,
 					      bpf_task_work_callback_t callback, void *aux__prog)
 {
-	return 0;
+	return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
 }
 
 /**
@@ -3768,13 +4056,38 @@ __bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct b
 					      struct bpf_map *map__map,
 					      bpf_task_work_callback_t callback, void *aux__prog)
 {
-	return 0;
+	return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_RESUME);
 }
 
 __bpf_kfunc_end_defs();
 
+static void bpf_task_work_cancel_scheduled(struct irq_work *irq_work)
+{
+	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
+
+	bpf_task_work_cancel(ctx); /* this might put task_work callback's ref */
+	bpf_task_work_ctx_put(ctx); /* and here we put map's own ref that was transferred to us */
+}
+
 void bpf_task_work_cancel_and_free(void *val)
 {
+	struct bpf_task_work_kern *twk = val;
+	struct bpf_task_work_ctx *ctx;
+	enum bpf_task_work_state state;
+
+	ctx = xchg(&twk->ctx, NULL);
+	if (!ctx)
+		return;
+
+	state = xchg(&ctx->state, BPF_TW_FREED);
+	if (state == BPF_TW_SCHEDULED) {
+		/* run in irq_work to avoid locks in NMI */
+		init_irq_work(&ctx->irq_work, bpf_task_work_cancel_scheduled);
+		irq_work_queue(&ctx->irq_work);
+		return;
+	}
+
+	bpf_task_work_ctx_put(ctx); /* put bpf map's ref */
 }
 
 BTF_KFUNCS_START(generic_btf_ids)
@@ -3911,6 +4224,8 @@ BTF_ID_FLAGS(func, bpf_strnstr);
 BTF_ID_FLAGS(func, bpf_cgroup_read_xattr, KF_RCU)
 #endif
 BTF_ID_FLAGS(func, bpf_stream_vprintk, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {
-- 
2.51.0



* [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests
  2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
                   ` (5 preceding siblings ...)
  2025-09-05 16:45 ` [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs Mykyta Yatsenko
@ 2025-09-05 16:45 ` Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
                     ` (2 more replies)
  6 siblings, 3 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-05 16:45 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Introduce selftests that check the BPF task work scheduling mechanism
and validate that the verifier does not accept incorrect calls to the
bpf_task_work_schedule kfuncs.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 .../selftests/bpf/prog_tests/test_task_work.c | 149 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/task_work.c | 108 +++++++++++++
 .../selftests/bpf/progs/task_work_fail.c      |  98 ++++++++++++
 3 files changed, 355 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_work.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_task_work.c b/tools/testing/selftests/bpf/prog_tests/test_task_work.c
new file mode 100644
index 000000000000..9c3c7a46a827
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_task_work.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <string.h>
+#include <stdio.h>
+#include "task_work.skel.h"
+#include "task_work_fail.skel.h"
+#include <linux/bpf.h>
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <time.h>
+
+static int perf_event_open(__u32 type, __u64 config, int pid)
+{
+	struct perf_event_attr attr = {
+		.type = type,
+		.config = config,
+		.size = sizeof(struct perf_event_attr),
+		.sample_period = 100000,
+	};
+
+	return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
+}
+
+struct elem {
+	char data[128];
+	struct bpf_task_work tw;
+};
+
+static int verify_map(struct bpf_map *map, const char *expected_data)
+{
+	int err;
+	struct elem value;
+	int processed_values = 0;
+	int k, sz;
+
+	sz = bpf_map__max_entries(map);
+	for (k = 0; k < sz; ++k) {
+		err = bpf_map__lookup_elem(map, &k, sizeof(int), &value, sizeof(struct elem), 0);
+		if (err)
+			continue;
+		if (!ASSERT_EQ(strcmp(expected_data, value.data), 0, "map data")) {
+			fprintf(stderr, "expected '%s', found '%s' in %s map", expected_data,
+				value.data, bpf_map__name(map));
+			return 2;
+		}
+		processed_values++;
+	}
+
+	return processed_values == 0;
+}
+
+static void task_work_run(const char *prog_name, const char *map_name)
+{
+	struct task_work *skel;
+	struct bpf_program *prog;
+	struct bpf_map *map;
+	struct bpf_link *link;
+	int err, pe_fd = -1, pid, status, pipefd[2];
+	char user_string[] = "hello world";
+
+	if (!ASSERT_NEQ(pipe(pipefd), -1, "pipe"))
+		return;
+
+	pid = fork();
+	if (pid == 0) {
+		__u64 num = 1;
+		int i;
+		char buf;
+
+		close(pipefd[1]);
+		read(pipefd[0], &buf, sizeof(buf));
+		close(pipefd[0]);
+
+		for (i = 0; i < 10000; ++i)
+			num *= time(0) % 7;
+		(void)num;
+		exit(0);
+	}
+	skel = task_work__open();
+	if (!ASSERT_OK_PTR(skel, "task_work__open"))
+		return;
+
+	bpf_object__for_each_program(prog, skel->obj) {
+		bpf_program__set_autoload(prog, false);
+	}
+
+	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+	if (!ASSERT_OK_PTR(prog, "prog_name"))
+		goto cleanup;
+	bpf_program__set_autoload(prog, true);
+	bpf_program__set_type(prog, BPF_PROG_TYPE_PERF_EVENT);
+	skel->bss->user_ptr = (char *)user_string;
+
+	err = task_work__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pe_fd = perf_event_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, pid);
+	if (pe_fd == -1 && (errno == ENOENT || errno == EOPNOTSUPP)) {
+		printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n", __func__);
+		test__skip();
+		goto cleanup;
+	}
+	if (!ASSERT_NEQ(pe_fd, -1, "pe_fd")) {
+		fprintf(stderr, "perf_event_open errno: %d, pid: %d\n", errno, pid);
+		goto cleanup;
+	}
+
+	link = bpf_program__attach_perf_event(prog, pe_fd);
+	if (!ASSERT_OK_PTR(link, "attach_perf_event"))
+		goto cleanup;
+
+	close(pipefd[0]);
+	write(pipefd[1], user_string, 1);
+	close(pipefd[1]);
+	/* Wait to collect some samples */
+	waitpid(pid, &status, 0);
+	pid = 0;
+	map = bpf_object__find_map_by_name(skel->obj, map_name);
+	if (!ASSERT_OK_PTR(map, "find map_name"))
+		goto cleanup;
+	if (!ASSERT_OK(verify_map(map, user_string), "verify map"))
+		goto cleanup;
+cleanup:
+	if (pe_fd >= 0)
+		close(pe_fd);
+	task_work__destroy(skel);
+	if (pid) {
+		close(pipefd[0]);
+		write(pipefd[1], user_string, 1);
+		close(pipefd[1]);
+		waitpid(pid, &status, 0);
+	}
+}
+
+void test_task_work(void)
+{
+	if (test__start_subtest("test_task_work_hash_map"))
+		task_work_run("oncpu_hash_map", "hmap");
+
+	if (test__start_subtest("test_task_work_array_map"))
+		task_work_run("oncpu_array_map", "arrmap");
+
+	if (test__start_subtest("test_task_work_lru_map"))
+		task_work_run("oncpu_lru_map", "lrumap");
+
+	RUN_TESTS(task_work_fail);
+}
diff --git a/tools/testing/selftests/bpf/progs/task_work.c b/tools/testing/selftests/bpf/progs/task_work.c
new file mode 100644
index 000000000000..5e761b4a5fd1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_work.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <string.h>
+#include <stdbool.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "errno.h"
+
+char _license[] SEC("license") = "GPL";
+
+const void *user_ptr = NULL;
+
+struct elem {
+	char data[128];
+	struct bpf_task_work tw;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} hmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} arrmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_LRU_HASH);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} lrumap SEC(".maps");
+
+static void process_work(struct bpf_map *map, void *key, void *value)
+{
+	struct elem *work = value;
+
+	bpf_copy_from_user_str(work->data, sizeof(work->data), (const void *)user_ptr, 0);
+}
+
+int key = 0;
+
+SEC("perf_event")
+int oncpu_hash_map(struct pt_regs *args)
+{
+	struct elem empty_work = { .data = { 0 } };
+	struct elem *work;
+	struct task_struct *task;
+	int err;
+
+	task = bpf_get_current_task_btf();
+	err = bpf_map_update_elem(&hmap, &key, &empty_work, BPF_NOEXIST);
+	if (err)
+		return 0;
+	work = bpf_map_lookup_elem(&hmap, &key);
+	if (!work)
+		return 0;
+
+	bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&hmap, process_work, NULL);
+	return 0;
+}
+
+SEC("perf_event")
+int oncpu_array_map(struct pt_regs *args)
+{
+	struct elem *work;
+	struct task_struct *task;
+
+	task = bpf_get_current_task_btf();
+	work = bpf_map_lookup_elem(&arrmap, &key);
+	if (!work)
+		return 0;
+	bpf_task_work_schedule_signal(task, &work->tw, (struct bpf_map *)&arrmap, process_work,
+				      NULL);
+	return 0;
+}
+
+SEC("perf_event")
+int oncpu_lru_map(struct pt_regs *args)
+{
+	struct elem empty_work = { .data = { 0 } };
+	struct elem *work;
+	struct task_struct *task;
+	int err;
+
+	task = bpf_get_current_task_btf();
+	work = bpf_map_lookup_elem(&lrumap, &key);
+	if (work)
+		return 0;
+	err = bpf_map_update_elem(&lrumap, &key, &empty_work, BPF_NOEXIST);
+	if (err)
+		return 0;
+	work = bpf_map_lookup_elem(&lrumap, &key);
+	if (!work || work->data[0])
+		return 0;
+	bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&lrumap, process_work,
+				      NULL);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/task_work_fail.c b/tools/testing/selftests/bpf/progs/task_work_fail.c
new file mode 100644
index 000000000000..fca7052b805e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_work_fail.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <string.h>
+#include <stdbool.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+const void *user_ptr = NULL;
+
+struct elem {
+	char data[128];
+	struct bpf_task_work tw;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} hmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} arrmap SEC(".maps");
+
+static void process_work(struct bpf_map *map, void *key, void *value)
+{
+	struct elem *work = value;
+
+	bpf_copy_from_user_str(work->data, sizeof(work->data), (const void *)user_ptr, 0);
+}
+
+int key = 0;
+
+SEC("perf_event")
+__failure __msg("doesn't match map pointer in R3")
+int mismatch_map(struct pt_regs *args)
+{
+	struct elem *work;
+	struct task_struct *task;
+
+	task = bpf_get_current_task_btf();
+	work = bpf_map_lookup_elem(&arrmap, &key);
+	if (!work)
+		return 0;
+	bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&hmap,
+				      process_work, NULL);
+	return 0;
+}
+
+SEC("perf_event")
+__failure __msg("arg#1 doesn't point to a map value")
+int no_map_task_work(struct pt_regs *args)
+{
+	struct task_struct *task;
+	struct bpf_task_work tw;
+
+	task = bpf_get_current_task_btf();
+	bpf_task_work_schedule_resume(task, &tw, (struct bpf_map *)&hmap,
+				      process_work, NULL);
+	return 0;
+}
+
+SEC("perf_event")
+__failure __msg("Possibly NULL pointer passed to trusted arg1")
+int task_work_null(struct pt_regs *args)
+{
+	struct task_struct *task;
+
+	task = bpf_get_current_task_btf();
+	bpf_task_work_schedule_resume(task, NULL, (struct bpf_map *)&hmap,
+				      process_work, NULL);
+	return 0;
+}
+
+SEC("perf_event")
+__failure __msg("Possibly NULL pointer passed to trusted arg2")
+int map_null(struct pt_regs *args)
+{
+	struct elem *work;
+	struct task_struct *task;
+
+	task = bpf_get_current_task_btf();
+	work = bpf_map_lookup_elem(&arrmap, &key);
+	if (!work)
+		return 0;
+	bpf_task_work_schedule_resume(task, &work->tw, NULL, process_work, NULL);
+	return 0;
+}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection
  2025-09-05 16:44 ` [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection Mykyta Yatsenko
@ 2025-09-05 19:36   ` Eduard Zingerman
  2025-09-05 21:29   ` Andrii Nakryiko
  1 sibling, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-05 19:36 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-09-05 at 17:44 +0100, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@meta.com>
> 
> Reduce code duplication in detection of the known special field types in
> map values. This refactoring helps to avoid copying a chunk of code in
> the next patch of the series.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func()
  2025-09-05 16:45 ` [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func() Mykyta Yatsenko
@ 2025-09-05 21:15   ` Eduard Zingerman
  2025-09-05 21:28   ` Eduard Zingerman
  2025-09-05 21:29   ` Andrii Nakryiko
  2 siblings, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-05 21:15 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

[-- Attachment #1: Type: text/plain, Size: 1005 bytes --]

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@meta.com>
> 
> Refactor the verifier by pulling the common logic from
> process_timer_func() into a dedicated helper. This allows reusing
> process_async_func() helper for verifying bpf_task_work struct in the
> next patch.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

The discrepancy between bpf_call_arg_meta and bpf_kfunc_call_arg_meta
is unfortunate. Maybe a bigger refactoring is possible.
Wdyt about avoiding pointers to pointers and accepting some code
duplication for now? E.g. rename process_async_func:

  /* Check if @regno is a pointer to a specific field in a map value */
  static int check_map_field_pointer(struct bpf_verifier_env *env,
                                     u32 regno,
                                     enum btf_field_type field_type,
                                     u32 field_off)

And proceed as in the attached patch?

[...]

[-- Attachment #2: check_map_field_pointer.diff --]
[-- Type: text/x-patch, Size: 3254 bytes --]

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6152536a834f..76dad7e0db5f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8522,15 +8522,17 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno, int flags)
 	return 0;
 }
 
-static int process_async_func(struct bpf_verifier_env *env, int regno, struct bpf_map **map_ptr,
-			      int *map_uid, u32 rec_off, enum btf_field_type field_type,
-			      const char *struct_name)
+/* Check if @regno is a pointer to a specific field in a map value */
+static int check_map_field_pointer(struct bpf_verifier_env *env,
+				   u32 regno,
+				   enum btf_field_type field_type,
+				   u32 field_off)
 {
 	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	const char *struct_name = btf_field_type_name(field_type);
 	bool is_const = tnum_is_const(reg->var_off);
 	struct bpf_map *map = reg->map_ptr;
 	u64 val = reg->var_off.value;
-	int *struct_off = (void *)map->record + rec_off;
 
 	if (!is_const) {
 		verbose(env,
@@ -8547,25 +8549,30 @@ static int process_async_func(struct bpf_verifier_env *env, int regno, struct bp
 		verbose(env, "map '%s' has no valid %s\n", map->name, struct_name);
 		return -EINVAL;
 	}
-	if (*struct_off != val + reg->off) {
+	if (field_off != val + reg->off) {
 		verbose(env, "off %lld doesn't point to 'struct %s' that is at %d\n",
-			val + reg->off, struct_name, *struct_off);
+			val + reg->off, struct_name, field_off);
 		return -EINVAL;
 	}
-	if (*map_ptr) {
-		verifier_bug(env, "Two map pointers in a %s helper", struct_name);
-		return -EFAULT;
-	}
-	*map_uid = reg->map_uid;
-	*map_ptr = map;
 	return 0;
 }
 
 static int process_timer_func(struct bpf_verifier_env *env, int regno,
 			      struct bpf_call_arg_meta *meta)
 {
-	return process_async_func(env, regno, &meta->map_ptr, &meta->map_uid,
-				  offsetof(struct btf_record, timer_off), BPF_TIMER, "bpf_timer");
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	int err;
+
+	err = check_map_field_pointer(env, regno, BPF_TIMER, offsetof(struct btf_record, timer_off));
+	if (err)
+		return err;
+
+	if (verifier_bug_if(meta->map_ptr, env, "Two map pointers in a bpf_timer helper"))
+		return -EFAULT;
+
+	meta->map_ptr = reg->map_ptr;
+	meta->map_uid = reg->map_uid;
+	return 0;
 }
 
 static int process_wq_func(struct bpf_verifier_env *env, int regno,
@@ -8588,9 +8595,20 @@ static int process_wq_func(struct bpf_verifier_env *env, int regno,
 static int process_task_work_func(struct bpf_verifier_env *env, int regno,
 				  struct bpf_kfunc_call_arg_meta *meta)
 {
-	return process_async_func(env, regno, &meta->map.ptr, &meta->map.uid,
-				  offsetof(struct btf_record, task_work_off), BPF_TASK_WORK,
-				  "bpf_task_work");
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	int err;
+
+	err = check_map_field_pointer(env, regno, BPF_TASK_WORK,
+				      offsetof(struct btf_record, task_work_off));
+	if (err)
+		return err;
+
+	if (verifier_bug_if(meta->map.ptr, env, "Two map pointers in a bpf_task_work helper"))
+		return -EFAULT;
+
+	meta->map.ptr = reg->map_ptr;
+	meta->map.uid = reg->map_uid;
+	return 0;
 }
 
 static int process_kptr_func(struct bpf_verifier_env *env, int regno,

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func()
  2025-09-05 16:45 ` [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func() Mykyta Yatsenko
  2025-09-05 21:15   ` Eduard Zingerman
@ 2025-09-05 21:28   ` Eduard Zingerman
  2025-09-05 21:31     ` Andrii Nakryiko
  2025-09-05 21:29   ` Andrii Nakryiko
  2 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-05 21:28 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@meta.com>
> 
> Refactor the verifier by pulling the common logic from
> process_timer_func() into a dedicated helper. This allows reusing
> process_async_func() helper for verifying bpf_task_work struct in the
> next patch.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/verifier.c | 39 ++++++++++++++++++++++++---------------
>  1 file changed, 24 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index b9394f8fac0e..a5d19a01d488 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8520,43 +8520,52 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno, int flags)
>  	return 0;
>  }
>  
> -static int process_timer_func(struct bpf_verifier_env *env, int regno,
> -			      struct bpf_call_arg_meta *meta)
> +static int process_async_func(struct bpf_verifier_env *env, int regno, struct bpf_map **map_ptr,
> +			      int *map_uid, u32 rec_off, enum btf_field_type field_type,
> +			      const char *struct_name)

Also, it appears that process_wq_func() needs to have the same checks
as in process_async_func(). Maybe add it as a separate commit?

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection
  2025-09-05 16:44 ` [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection Mykyta Yatsenko
  2025-09-05 19:36   ` Eduard Zingerman
@ 2025-09-05 21:29   ` Andrii Nakryiko
  1 sibling, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-05 21:29 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Sep 5, 2025 at 9:45 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Reduce code duplication in detection of the known special field types in
> map values. This refactoring helps to avoid copying a chunk of code in
> the next patch of the series.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/btf.c | 56 +++++++++++++++++-------------------------------
>  1 file changed, 20 insertions(+), 36 deletions(-)
>

lgtm

Acked-by: Andrii Nakryiko <andrii@kernel.org>

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func()
  2025-09-05 16:45 ` [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func() Mykyta Yatsenko
  2025-09-05 21:15   ` Eduard Zingerman
  2025-09-05 21:28   ` Eduard Zingerman
@ 2025-09-05 21:29   ` Andrii Nakryiko
  2 siblings, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-05 21:29 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Sep 5, 2025 at 9:45 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Refactor the verifier by pulling the common logic from
> process_timer_func() into a dedicated helper. This allows reusing
> process_async_func() helper for verifying bpf_task_work struct in the
> next patch.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/verifier.c | 39 ++++++++++++++++++++++++---------------
>  1 file changed, 24 insertions(+), 15 deletions(-)
>

lgtm, that process_wq_func()'s lack of use of process_async_func() is
suspicious, but not really something we have to solve right in this
patch set

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index b9394f8fac0e..a5d19a01d488 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8520,43 +8520,52 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno, int flags)
>         return 0;
>  }
>
> -static int process_timer_func(struct bpf_verifier_env *env, int regno,
> -                             struct bpf_call_arg_meta *meta)
> +static int process_async_func(struct bpf_verifier_env *env, int regno, struct bpf_map **map_ptr,
> +                             int *map_uid, u32 rec_off, enum btf_field_type field_type,
> +                             const char *struct_name)
>  {
>         struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
>         bool is_const = tnum_is_const(reg->var_off);
>         struct bpf_map *map = reg->map_ptr;
>         u64 val = reg->var_off.value;
> +       int *struct_off = (void *)map->record + rec_off;
>

no need for keeping this as a pointer, just fetch the value here and
keep the rest of the logic a bit cleaner?


>         if (!is_const) {
>                 verbose(env,
> -                       "R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
> -                       regno);
> +                       "R%d doesn't have constant offset. %s has to be at the constant offset\n",
> +                       regno, struct_name);
>                 return -EINVAL;
>         }
>         if (!map->btf) {
> -               verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
> -                       map->name);
> +               verbose(env, "map '%s' has to have BTF in order to use %s\n", map->name,
> +                       struct_name);
>                 return -EINVAL;
>         }
> -       if (!btf_record_has_field(map->record, BPF_TIMER)) {
> -               verbose(env, "map '%s' has no valid bpf_timer\n", map->name);
> +       if (!btf_record_has_field(map->record, field_type)) {
> +               verbose(env, "map '%s' has no valid %s\n", map->name, struct_name);
>                 return -EINVAL;
>         }
> -       if (map->record->timer_off != val + reg->off) {
> -               verbose(env, "off %lld doesn't point to 'struct bpf_timer' that is at %d\n",
> -                       val + reg->off, map->record->timer_off);
> +       if (*struct_off != val + reg->off) {
> +               verbose(env, "off %lld doesn't point to 'struct %s' that is at %d\n",
> +                       val + reg->off, struct_name, *struct_off);
>                 return -EINVAL;
>         }
> -       if (meta->map_ptr) {
> -               verifier_bug(env, "Two map pointers in a timer helper");
> +       if (*map_ptr) {
> +               verifier_bug(env, "Two map pointers in a %s helper", struct_name);
>                 return -EFAULT;
>         }
> -       meta->map_uid = reg->map_uid;
> -       meta->map_ptr = map;
> +       *map_uid = reg->map_uid;
> +       *map_ptr = map;
>         return 0;
>  }
>
> +static int process_timer_func(struct bpf_verifier_env *env, int regno,
> +                             struct bpf_call_arg_meta *meta)
> +{
> +       return process_async_func(env, regno, &meta->map_ptr, &meta->map_uid,
> +                                 offsetof(struct btf_record, timer_off), BPF_TIMER, "bpf_timer");
> +}
> +
>  static int process_wq_func(struct bpf_verifier_env *env, int regno,
>                            struct bpf_kfunc_call_arg_meta *meta)

question more to Eduard and/or Alexei: why process_wq_func() checks
are so much more lax compared to timer's?... I'd expect them to be the
same. Is this just an omission or intentional?

>  {

> --
> 2.51.0
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func()
  2025-09-05 21:28   ` Eduard Zingerman
@ 2025-09-05 21:31     ` Andrii Nakryiko
  2025-09-05 21:32       ` Eduard Zingerman
  0 siblings, 1 reply; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-05 21:31 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Fri, Sep 5, 2025 at 2:28 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> > From: Mykyta Yatsenko <yatsenko@meta.com>
> >
> > Refactor the verifier by pulling the common logic from
> > process_timer_func() into a dedicated helper. This allows reusing
> > process_async_func() helper for verifying bpf_task_work struct in the
> > next patch.
> >
> > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > ---
> >  kernel/bpf/verifier.c | 39 ++++++++++++++++++++++++---------------
> >  1 file changed, 24 insertions(+), 15 deletions(-)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index b9394f8fac0e..a5d19a01d488 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -8520,43 +8520,52 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno, int flags)
> >       return 0;
> >  }
> >
> > -static int process_timer_func(struct bpf_verifier_env *env, int regno,
> > -                           struct bpf_call_arg_meta *meta)
> > +static int process_async_func(struct bpf_verifier_env *env, int regno, struct bpf_map **map_ptr,
> > +                           int *map_uid, u32 rec_off, enum btf_field_type field_type,
> > +                           const char *struct_name)
>
> Also, it appears that process_wq_func() needs to have the same checks
> as in process_async_func(). Maybe add it as a separate commit?

heh, we raced, I was asking the same question.

But let's do any extra refactorings and fixes to pre-existing code as
a follow up, ok?

>
> [...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs
  2025-09-05 16:45 ` [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs Mykyta Yatsenko
@ 2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-05 21:31   ` Eduard Zingerman
  1 sibling, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-05 21:31 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Sep 5, 2025 at 9:45 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Extract the cleanup of known embedded structs into the dedicated helper.
> Remove duplication and introduce a single source of truth for freeing
> special embedded structs in hashtab.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/hashtab.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>

lgtm

Acked-by: Andrii Nakryiko <andrii@kernel.org>


> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 71f9931ac64c..2319f8f8fa3e 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -215,6 +215,16 @@ static bool htab_has_extra_elems(struct bpf_htab *htab)
>         return !htab_is_percpu(htab) && !htab_is_lru(htab) && !is_fd_htab(htab);
>  }
>
> +static void htab_free_internal_structs(struct bpf_htab *htab, struct htab_elem *elem)
> +{
> +       if (btf_record_has_field(htab->map.record, BPF_TIMER))
> +               bpf_obj_free_timer(htab->map.record,
> +                                  htab_elem_value(elem, htab->map.key_size));
> +       if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
> +               bpf_obj_free_workqueue(htab->map.record,
> +                                      htab_elem_value(elem, htab->map.key_size));
> +}
> +
>  static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
>  {
>         u32 num_entries = htab->map.max_entries;
> @@ -227,12 +237,7 @@ static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
>                 struct htab_elem *elem;
>
>                 elem = get_htab_elem(htab, i);
> -               if (btf_record_has_field(htab->map.record, BPF_TIMER))
> -                       bpf_obj_free_timer(htab->map.record,
> -                                          htab_elem_value(elem, htab->map.key_size));
> -               if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
> -                       bpf_obj_free_workqueue(htab->map.record,
> -                                              htab_elem_value(elem, htab->map.key_size));
> +               htab_free_internal_structs(htab, elem);
>                 cond_resched();
>         }
>  }
> @@ -1502,12 +1507,7 @@ static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab)
>
>                 hlist_nulls_for_each_entry(l, n, head, hash_node) {
>                         /* We only free timer on uref dropping to zero */
> -                       if (btf_record_has_field(htab->map.record, BPF_TIMER))
> -                               bpf_obj_free_timer(htab->map.record,
> -                                                  htab_elem_value(l, htab->map.key_size));
> -                       if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
> -                               bpf_obj_free_workqueue(htab->map.record,
> -                                                      htab_elem_value(l, htab->map.key_size));
> +                       htab_free_internal_structs(htab, l);
>                 }
>                 cond_resched_rcu();
>         }
> --
> 2.51.0
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing
  2025-09-05 16:45 ` [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing Mykyta Yatsenko
@ 2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-05 23:09   ` Eduard Zingerman
  1 sibling, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-05 21:31 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Sep 5, 2025 at 9:45 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> This patch adds necessary plumbing in verifier, syscall and maps to
> support handling new kfunc bpf_task_work_schedule and kernel structure
> bpf_task_work. The idea is similar to how we already handle bpf_wq and
> bpf_timer.
> verifier changes validate calls to bpf_task_work_schedule to make sure
> it is safe and expected invariants hold.
> btf part is required to detect bpf_task_work structure inside map value
> and store its offset, which will be used in the next patch to calculate
> key and value addresses.
> arraymap and hashtab changes are needed to handle freeing of the
> bpf_task_work: run code needed to deinitialize it, for example cancel
> task_work callback if possible.
> The use of bpf_task_work and proper implementation for kfuncs are
> introduced in the next patch.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  include/linux/bpf.h            | 11 ++++
>  include/uapi/linux/bpf.h       |  4 ++
>  kernel/bpf/arraymap.c          |  8 +--
>  kernel/bpf/btf.c               |  9 +++-
>  kernel/bpf/hashtab.c           | 19 ++++---
>  kernel/bpf/helpers.c           | 40 ++++++++++++++
>  kernel/bpf/syscall.c           | 16 +++++-
>  kernel/bpf/verifier.c          | 97 +++++++++++++++++++++++++++++++++-
>  tools/include/uapi/linux/bpf.h |  4 ++
>  9 files changed, 193 insertions(+), 15 deletions(-)
>

LGMT

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8f6e87f0f3a8..febb4ca68401 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -206,6 +206,7 @@ enum btf_field_type {
>         BPF_WORKQUEUE  = (1 << 10),
>         BPF_UPTR       = (1 << 11),
>         BPF_RES_SPIN_LOCK = (1 << 12),
> +       BPF_TASK_WORK  = (1 << 13),
>  };
>

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation
  2025-09-05 16:45 ` [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation Mykyta Yatsenko
@ 2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-05 23:19   ` Eduard Zingerman
  1 sibling, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-05 21:31 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Sep 5, 2025 at 9:45 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Calculation of the BPF map key, given the pointer to a value is
> duplicated in a couple of places in helpers already, in the next patch
> another use case is introduced as well.
> This patch extracts that functionality into a separate function.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  kernel/bpf/helpers.c | 31 ++++++++++++++-----------------
>  1 file changed, 14 insertions(+), 17 deletions(-)
>

lgtm

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 89a5d8808ce8..109cb249e88c 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1081,6 +1081,18 @@ const struct bpf_func_proto bpf_snprintf_proto = {
>         .arg5_type      = ARG_CONST_SIZE_OR_ZERO,
>  };
>

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs
  2025-09-05 16:45 ` [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
@ 2025-09-05 21:31   ` Eduard Zingerman
  1 sibling, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-05 21:31 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@meta.com>
> 
> Extract the cleanup of known embedded structs into the dedicated helper.
> Remove duplication and introduce a single source of truth for freeing
> special embedded structs in hashtab.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-05 16:45 ` [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs Mykyta Yatsenko
@ 2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-06 20:22   ` Eduard Zingerman
  2025-09-09 17:49   ` Chris Mason
  2 siblings, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-05 21:31 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Sep 5, 2025 at 9:45 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Implementation of the new bpf_task_work_schedule kfuncs, that let a BPF
> program schedule task_work callbacks for a target task:
>  * bpf_task_work_schedule_signal() → schedules with TWA_SIGNAL
>  * bpf_task_work_schedule_resume() → schedules with TWA_RESUME
>
> Each map value should embed a struct bpf_task_work, which the kernel
> side pairs with struct bpf_task_work_kern, containing a pointer to
> struct bpf_task_work_ctx, that maintains metadata relevant for the
> concrete callback scheduling.
>
> A small state machine and refcounting scheme ensures safe reuse and
> teardown:
>  STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
>
> A FREED terminal state coordinates with map-value
> deletion (bpf_task_work_cancel_and_free()).

I'd mention that the FREED state can be switched into from any point in
the above linear sequence of states, and that it is always handled in a
wait-free fashion. In all cases except SCHEDULED (where there might not
be any actively participating side besides the one doing the FREED
handling, as we are just waiting, potentially for a while, for the task
work callback to execute), if any cleanup is necessary (cancellation,
putting references, etc.), the actively participating side will notice
the transition to FREED and ensure the cleanup.
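In other words, every participant advances the state with a single
cmpxchg and, if it observes that cancel_and_free() already moved ctx to
FREED, takes over the cleanup itself; e.g. for the SCHEDULING ->
SCHEDULED transition (same code as quoted further below):

  state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
  if (state == BPF_TW_FREED)
          bpf_task_work_cancel(ctx); /* we saw FREED, so we do the cleanup */

so no side ever has to block waiting for another one.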

>
> Scheduling itself is deferred via irq_work to keep the kfunc callable
> from NMI context.
>
> Lifetime is guarded with refcount_t + RCU Tasks Trace.
>
> Main components:
>  * struct bpf_task_work_context – Metadata and state management per task
> work.
>  * enum bpf_task_work_state – A state machine to serialize work
>  scheduling and execution.
>  * bpf_task_work_schedule() – The central helper that initiates
> scheduling.
>  * bpf_task_work_acquire_ctx() - Attempts to take ownership of the context,
>  pointed by passed struct bpf_task_work, allocates new context if none
>  exists yet.
>  * bpf_task_work_callback() – Invoked when the actual task_work runs.
>  * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> to enqueue task work.
>  * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
>
> Flow of successful task work scheduling
>  1) bpf_task_work_schedule_* is called from BPF code.
>  2) Transition state from STANDBY to PENDING, marks context is owned by

typo: mark context as owned?

>  this task work scheduler
>  3) irq_work_queue() schedules bpf_task_work_irq().
>  4) Transition state from PENDING to SCHEDULING.
>  4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>  transitions to SCHEDULED.
>  5) Task work calls bpf_task_work_callback(), which transition state to
>  RUNNING.
>  6) BPF callback is executed
>  7) Context is cleaned up, refcounts released, context state set back to
>  STANDBY.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 317 insertions(+), 2 deletions(-)
>

I don't see any more problems, it looks good to me!

Reviewed-by: Andrii Nakryiko <andrii@kernel.org>

> +
> +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
> +{
> +       /*
> +        * Scheduled task_work callback holds ctx ref, so if we successfully
> +        * cancelled, we put that ref on callback's behalf. If we couldn't
> +        * cancel, callback is inevitably run or has already completed

typo: will inevitably run

> +        * running, and it would have taken care of its ctx ref itself.
> +        */
> +       if (task_work_cancel_match(ctx->task, task_work_match, ctx))
> +               bpf_task_work_ctx_put(ctx);
> +}
> +

[...]

> +       err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> +       if (err) {
> +               bpf_task_work_ctx_reset(ctx);
> +               /*
> +                * try to switch back to STANDBY for another task_work reuse, but we might have
> +                * gone to FREED already, which is fine as we already cleaned up after ourselves
> +                */
> +               (void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> +
> +               /* we don't have RCU protection, so put after switching state */

heh, I guess we do have RCU protection (that guard above), but yeah,
putting after all the manipulations with ctx makes most sense, let's
fix up the comment here

> +               bpf_task_work_ctx_put(ctx);
> +       }
> +
> +       /*
> +        * It's technically possible for just scheduled task_work callback to
> +        * complete running by now, going SCHEDULING -> RUNNING and then
> +        * dropping its ctx refcount. Instead of capturing extra ref just to
> +        * protected below ctx->state access, we rely on RCU protection to
> +        * perform below SCHEDULING -> SCHEDULED attempt.
> +        */
> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> +       if (state == BPF_TW_FREED)
> +               bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests
  2025-09-05 16:45 ` [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
@ 2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-08  7:43   ` Eduard Zingerman
  2025-09-08 18:23   ` Eduard Zingerman
  2 siblings, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-05 21:31 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Sep 5, 2025 at 9:45 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Introducing selftests that check the BPF task work scheduling mechanism.
> Validate that the verifier does not accept incorrect calls to the
> bpf_task_work_schedule kfuncs.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  .../selftests/bpf/prog_tests/test_task_work.c | 149 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/task_work.c | 108 +++++++++++++
>  .../selftests/bpf/progs/task_work_fail.c      |  98 ++++++++++++
>  3 files changed, 355 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c
>  create mode 100644 tools/testing/selftests/bpf/progs/task_work.c
>  create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c
>

[...]

> +SEC("perf_event")
> +int oncpu_hash_map(struct pt_regs *args)
> +{
> +       struct elem empty_work = { .data = { 0 } };
> +       struct elem *work;
> +       struct task_struct *task;
> +       int err;
> +
> +       task = bpf_get_current_task_btf();
> +       err = bpf_map_update_elem(&hmap, &key, &empty_work, BPF_NOEXIST);
> +       if (err)
> +               return 0;
> +       work = bpf_map_lookup_elem(&hmap, &key);
> +       if (!work)
> +               return 0;
> +
> +       bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&hmap, process_work, NULL);

oh, that struct bpf_map * cast is horrible UX, please just mark that
argument as void * in definitions of
bpf_task_work_schedule_{resume,signal}() (as a follow up, IMO, unless
something needs fixing and a new revision)
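I.e. with a void * argument (hypothetical follow-up, exact signature to
be decided), the BPF side would simply be:

  bpf_task_work_schedule_resume(task, &work->tw, &hmap, process_work, NULL);

without any cast.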

> +       return 0;
> +}
> +
> +SEC("perf_event")
> +int oncpu_array_map(struct pt_regs *args)
> +{
> +       struct elem *work;
> +       struct task_struct *task;
> +
> +       task = bpf_get_current_task_btf();
> +       work = bpf_map_lookup_elem(&arrmap, &key);
> +       if (!work)
> +               return 0;
> +       bpf_task_work_schedule_signal(task, &work->tw, (struct bpf_map *)&arrmap, process_work,
> +                                     NULL);
> +       return 0;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func()
  2025-09-05 21:31     ` Andrii Nakryiko
@ 2025-09-05 21:32       ` Eduard Zingerman
  0 siblings, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-05 21:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Fri, 2025-09-05 at 14:31 -0700, Andrii Nakryiko wrote:

[...]

> > > -static int process_timer_func(struct bpf_verifier_env *env, int regno,
> > > -                           struct bpf_call_arg_meta *meta)
> > > +static int process_async_func(struct bpf_verifier_env *env, int regno, struct bpf_map **map_ptr,
> > > +                           int *map_uid, u32 rec_off, enum btf_field_type field_type,
> > > +                           const char *struct_name)
> > 
> > Also, it appears that process_wq_func() needs to have the same checks
> > as in process_async_func(). Maybe add it as a separate commit?
> 
> heh, we raced, I was asking the same question.
> 
> But let's do any extra refactorings and fixes to pre-existing code as
> a follow up, ok?

Sure.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing
  2025-09-05 16:45 ` [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
@ 2025-09-05 23:09   ` Eduard Zingerman
  2025-09-15 15:59     ` Mykyta Yatsenko
  1 sibling, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-05 23:09 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:

[...]

> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 3d080916faf9..4130d8e76dff 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c

[...]

> @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
>  	/* We don't reset or free fields other than timer and workqueue
>  	 * on uref dropping to zero.
>  	 */
> -	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
> +	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {

I think that hashtab.c:htab_free_internal_structs needs to be renamed
and called here, thus avoiding code duplication.
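Something along these lines (untested sketch, the name is just a
suggestion), so that both arraymap.c and hashtab.c can share it:

  static void bpf_map_free_internal_structs(struct bpf_map *map, void *val)
  {
          if (btf_record_has_field(map->record, BPF_TIMER))
                  bpf_obj_free_timer(map->record, val);
          if (btf_record_has_field(map->record, BPF_WORKQUEUE))
                  bpf_obj_free_workqueue(map->record, val);
          if (btf_record_has_field(map->record, BPF_TASK_WORK))
                  bpf_obj_free_task_work(map->record, val);
  }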

>  		for (i = 0; i < array->map.max_entries; i++) {
>  			if (btf_record_has_field(map->record, BPF_TIMER))
>  				bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
>  			if (btf_record_has_field(map->record, BPF_WORKQUEUE))
>  				bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
> +			if (btf_record_has_field(map->record, BPF_TASK_WORK))
> +				bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
>  		}
>  	}
>  }

[...]

> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index a1a9bc589518..73ca21911b30 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c

[...]

> @@ -4034,6 +4037,10 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
>  		case BPF_LIST_NODE:
>  		case BPF_RB_NODE:
>  			break;
> +		case BPF_TASK_WORK:
> +			WARN_ON_ONCE(rec->task_work_off >= 0);
> +			rec->task_work_off = rec->fields[i].offset;
> +			break;

Nit: let's move this case up to BPF_WORKQUEUE or BPF_REFCOUNT,
     so that similar cases are grouped together.

>  		default:
>  			ret = -EFAULT;
>  			goto end;

[...]

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0fbfa8532c39..7da1ca893dfe 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c

[...]

> @@ -840,6 +849,9 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  				continue;
>  			bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
>  			break;
> +		case BPF_TASK_WORK:
> +			bpf_task_work_cancel_and_free(field_ptr);
> +			break;

Nit: same here, let's keep similar cases together.

>  		case BPF_LIST_NODE:
>  		case BPF_RB_NODE:
>  		case BPF_REFCOUNT:

[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index a5d19a01d488..6152536a834f 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2240,6 +2240,8 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
>  				reg->map_uid = reg->id;
>  			if (btf_record_has_field(map->inner_map_meta->record, BPF_WORKQUEUE))
>  				reg->map_uid = reg->id;
> +			if (btf_record_has_field(map->inner_map_meta->record, BPF_TASK_WORK))
> +				reg->map_uid = reg->id;

Nit: this can be shorter:

			if (btf_record_has_field(map->inner_map_meta->record,
						 BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK))
				reg->map_uid = reg->id;


>  		} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
>  			reg->type = PTR_TO_XDP_SOCK;
>  		} else if (map->map_type == BPF_MAP_TYPE_SOCKMAP ||

[...]

> @@ -10943,6 +10956,35 @@ static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
>  	return 0;
>  }
>  
> +static int set_task_work_schedule_callback_state(struct bpf_verifier_env *env,
> +						 struct bpf_func_state *caller,
> +						 struct bpf_func_state *callee,
> +						 int insn_idx)
> +{
> +	struct bpf_map *map_ptr = caller->regs[BPF_REG_3].map_ptr;
> +
> +	/*
> +	 * callback_fn(struct bpf_map *map, void *key, void *value);
> +	 */
> +	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
> +	callee->regs[BPF_REG_1].map_ptr = map_ptr;
> +
> +	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
> +	callee->regs[BPF_REG_2].map_ptr = map_ptr;
> +
> +	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
> +	callee->regs[BPF_REG_3].map_ptr = map_ptr;
> +
> +	/* unused */
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
> +	callee->in_callback_fn = true;

This should be `callee->in_async_callback_fn = true;` to avoid
triggering the infinite loop check in is_state_visited() in some cases.

> +	return 0;
> +}
> +
>  static bool is_rbtree_lock_required_kfunc(u32 btf_id);
>  
>  /* Are we currently verifying the callback for a rbtree helper that must

[...]

> @@ -13171,6 +13235,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>  					return -EINVAL;
>  				}
>  			}
> +			if (meta->map.ptr && reg->map_ptr->record->task_work_off >= 0) {
> +				if (meta->map.ptr != reg->map_ptr ||
> +				    meta->map.uid != reg->map_uid) {
> +					verbose(env,
> +						"bpf_task_work pointer in R2 map_uid=%d doesn't match map pointer in R3 map_uid=%d\n",
> +						meta->map.uid, reg->map_uid);
> +					return -EINVAL;
> +				}
> +			}

Please merge this with the case for wq_off above.

>  			meta->map.ptr = reg->map_ptr;
>  			meta->map.uid = reg->map_uid;
>  			fallthrough;

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation
  2025-09-05 16:45 ` [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
@ 2025-09-05 23:19   ` Eduard Zingerman
  2025-09-08 13:39     ` Mykyta Yatsenko
  1 sibling, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-05 23:19 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:

[...]

> +static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)

`arr_idx` is unused at every call site of this function.
And even if it was used, why both set through pointer and return same value?

> +{
> +	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> +		struct bpf_array *array = container_of(map, struct bpf_array, map);
> +
> +		*arr_idx = ((char *)value - array->value) / array->elem_size;
> +		return arr_idx;
> +	}
> +	BUG_ON(map->map_type != BPF_MAP_TYPE_HASH && map->map_type != BPF_MAP_TYPE_LRU_HASH);
> +	return (void *)value - round_up(map->key_size, 8);
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-05 16:45 ` [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
@ 2025-09-06 20:22   ` Eduard Zingerman
  2025-09-08 13:13     ` Mykyta Yatsenko
  2025-09-09 17:49   ` Chris Mason
  2 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-06 20:22 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:

[...]

> A small state machine and refcounting scheme ensures safe reuse and
> teardown:
>  STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY

Nit: state machine is actually a bit more complex:

  digraph G {
    scheduling  -> running    [label="callback 1"];
    scheduled   -> running    [label="callback 2"];
    running     -> standby    [label="callback 3"];
    pending     -> scheduling [label="irq 1"];
    scheduling  -> standby    [label="irq 2"];
    scheduling  -> scheduled  [label="irq 3"];
    standby     -> pending    [label="acquire_ctx"];
  
    freed      -> freed [label="cancel_and_free"];
    pending    -> freed [label="cancel_and_free"];
    running    -> freed [label="cancel_and_free"];
    scheduled  -> freed [label="cancel_and_free"];
    scheduling -> freed [label="cancel_and_free"];
    standby    -> freed [label="cancel_and_free"];
  }

[...]

> Flow of successful task work scheduling
>  1) bpf_task_work_schedule_* is called from BPF code.
>  2) Transition state from STANDBY to PENDING, marks context is owned by
>  this task work scheduler
>  3) irq_work_queue() schedules bpf_task_work_irq().
>  4) Transition state from PENDING to SCHEDULING.
>  4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>  transitions to SCHEDULED.

Nit: "4" repeated two times.

>  5) Task work calls bpf_task_work_callback(), which transition state to
>  RUNNING.
>  6) BPF callback is executed
>  7) Context is cleaned up, refcounts released, context state set back to
>  STANDBY.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 317 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 109cb249e88c..418a0a211699 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c

[...]

> +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
> +{
> +	/*
> +	 * Scheduled task_work callback holds ctx ref, so if we successfully
> +	 * cancelled, we put that ref on callback's behalf. If we couldn't
> +	 * cancel, callback is inevitably run or has already completed
> +	 * running, and it would have taken care of its ctx ref itself.
> +	 */
> +	if (task_work_cancel_match(ctx->task, task_work_match, ctx))

Will `task_work_cancel(ctx->task, ctx->work)` do the same thing here?

> +		bpf_task_work_ctx_put(ctx);
> +}

[...]

> +static void bpf_task_work_irq(struct irq_work *irq_work)
> +{
> +	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> +	enum bpf_task_work_state state;
> +	int err;
> +
> +	guard(rcu_tasks_trace)();
> +
> +	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> +		bpf_task_work_ctx_put(ctx);
> +		return;
> +	}

Why are separate PENDING and SCHEDULING states needed?
Both indicate that the work has not yet been, but is ready to be,
submitted to task_work_add(). So, at first glance it seems that
merging the two won't change the behaviour; what do I miss?

> +	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> +	if (err) {
> +		bpf_task_work_ctx_reset(ctx);
> +		/*
> +		 * try to switch back to STANDBY for another task_work reuse, but we might have
> +		 * gone to FREED already, which is fine as we already cleaned up after ourselves
> +		 */
> +		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> +
> +		/* we don't have RCU protection, so put after switching state */
> +		bpf_task_work_ctx_put(ctx);
> +	}
> +
> +	/*
> +	 * It's technically possible for just scheduled task_work callback to
> +	 * complete running by now, going SCHEDULING -> RUNNING and then
> +	 * dropping its ctx refcount. Instead of capturing extra ref just to
> +	 * protected below ctx->state access, we rely on RCU protection to
> +	 * perform below SCHEDULING -> SCHEDULED attempt.
> +	 */
> +	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> +	if (state == BPF_TW_FREED)
> +		bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> +}

[...]

> +static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
> +							   struct bpf_map *map)
> +{
> +	struct bpf_task_work_ctx *ctx;
> +
> +	/* early check to avoid any work, we'll double check at the end again */
> +	if (!atomic64_read(&map->usercnt))
> +		return ERR_PTR(-EBUSY);
> +
> +	ctx = bpf_task_work_fetch_ctx(tw, map);
> +	if (IS_ERR(ctx))
> +		return ctx;
> +
> +	/* try to get ref for task_work callback to hold */
> +	if (!bpf_task_work_ctx_tryget(ctx))
> +		return ERR_PTR(-EBUSY);
> +
> +	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> +		/* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
> +		bpf_task_work_ctx_put(ctx);
> +		return ERR_PTR(-EBUSY);
> +	}
> +
> +	/*
> +	 * Double check that map->usercnt wasn't dropped while we were
> +	 * preparing context, and if it was, we need to clean up as if
> +	 * map_release_uref() was called; bpf_task_work_cancel_and_free()
> +	 * is safe to be called twice on the same task work
> +	 */
> +	if (!atomic64_read(&map->usercnt)) {
> +		/* drop ref we just got for task_work callback itself */
> +		bpf_task_work_ctx_put(ctx);
> +		/* transfer map's ref into cancel_and_free() */
> +		bpf_task_work_cancel_and_free(tw);
> +		return ERR_PTR(-EBUSY);
> +	}

I don't understand how the above check is useful.
Is map->usercnt protected from being changed during execution of
bpf_task_work_schedule()?
There are two such checks in this function, so apparently it is not.
Then what's the point of checking usercnt value if it can be
immediately changed after the check?

> +
> +	return ctx;
> +}
> +
> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> +				  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> +				  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> +{
> +	struct bpf_prog *prog;
> +	struct bpf_task_work_ctx *ctx;
> +	int err;
> +
> +	BTF_TYPE_EMIT(struct bpf_task_work);
> +
> +	prog = bpf_prog_inc_not_zero(aux->prog);
> +	if (IS_ERR(prog))
> +		return -EBADF;
> +	task = bpf_task_acquire(task);
> +	if (!task) {
> +		err = -EPERM;

Nit: Why -EPERM? bpf_task_acquire() returns NULL if task->rcu_users
     is zero, does not seem to be permission related.

> +		goto release_prog;
> +	}
> +
> +	ctx = bpf_task_work_acquire_ctx(tw, map);
> +	if (IS_ERR(ctx)) {
> +		err = PTR_ERR(ctx);
> +		goto release_all;
> +	}
> +
> +	ctx->task = task;
> +	ctx->callback_fn = callback_fn;
> +	ctx->prog = prog;
> +	ctx->mode = mode;
> +	ctx->map = map;
> +	ctx->map_val = (void *)tw - map->record->task_work_off;
> +	init_task_work(&ctx->work, bpf_task_work_callback);
> +	init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> +
> +	irq_work_queue(&ctx->irq_work);
> +	return 0;
> +
> +release_all:
> +	bpf_task_release(task);
> +release_prog:
> +	bpf_prog_put(prog);
> +	return err;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests
  2025-09-05 16:45 ` [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
@ 2025-09-08  7:43   ` Eduard Zingerman
  2025-09-08 13:21     ` Mykyta Yatsenko
  2025-09-08 18:23   ` Eduard Zingerman
  2 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-08  7:43 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@meta.com>
> 
> Introducing selftests that check BPF task work scheduling mechanism.
> Validate that verifier does not accepts incorrect calls to
> bpf_task_work_schedule kfunc.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

The test cases in this patch check functional correctness, but there
is no attempt to do some stress testing of the state machine.
E.g. how hard/feasible would it be to construct a test that attempts
to exercise both branches of the (state == BPF_TW_SCHEDULED) in the
bpf_task_work_cancel_and_free()?

>  .../selftests/bpf/prog_tests/test_task_work.c | 149 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/task_work.c | 108 +++++++++++++
>  .../selftests/bpf/progs/task_work_fail.c      |  98 ++++++++++++
>  3 files changed, 355 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c
>  create mode 100644 tools/testing/selftests/bpf/progs/task_work.c
>  create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_task_work.c b/tools/testing/selftests/bpf/prog_tests/test_task_work.c
> new file mode 100644
> index 000000000000..9c3c7a46a827
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_task_work.c
> @@ -0,0 +1,149 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#include <test_progs.h>
> +#include <string.h>
> +#include <stdio.h>
> +#include "task_work.skel.h"
> +#include "task_work_fail.skel.h"
> +#include <linux/bpf.h>
> +#include <linux/perf_event.h>
> +#include <sys/syscall.h>
> +#include <time.h>
> +
> +static int perf_event_open(__u32 type, __u64 config, int pid)
> +{
> +	struct perf_event_attr attr = {
> +		.type = type,
> +		.config = config,
> +		.size = sizeof(struct perf_event_attr),
> +		.sample_period = 100000,
> +	};
> +
> +	return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
> +}
> +
> +struct elem {
> +	char data[128];
> +	struct bpf_task_work tw;
> +};
> +
> +static int verify_map(struct bpf_map *map, const char *expected_data)
> +{
> +	int err;
> +	struct elem value;
> +	int processed_values = 0;
> +	int k, sz;
> +
> +	sz = bpf_map__max_entries(map);
> +	for (k = 0; k < sz; ++k) {
> +		err = bpf_map__lookup_elem(map, &k, sizeof(int), &value, sizeof(struct elem), 0);
> +		if (err)
> +			continue;
> +		if (!ASSERT_EQ(strcmp(expected_data, value.data), 0, "map data")) {
> +			fprintf(stderr, "expected '%s', found '%s' in %s map", expected_data,
> +				value.data, bpf_map__name(map));
> +			return 2;
> +		}
> +		processed_values++;
> +	}
> +
> +	return processed_values == 0;

Nit: check for exact number of expected values here?
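(E.g., just as a sketch: pass an expected count in, say `expected_cnt`, and
return `processed_values != expected_cnt` instead of only checking for zero.)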

> +}
> +
> +static void task_work_run(const char *prog_name, const char *map_name)
> +{
> +	struct task_work *skel;
> +	struct bpf_program *prog;
> +	struct bpf_map *map;
> +	struct bpf_link *link;
> +	int err, pe_fd = 0, pid, status, pipefd[2];
> +	char user_string[] = "hello world";
> +
> +	if (!ASSERT_NEQ(pipe(pipefd), -1, "pipe"))
> +		return;
> +
> +	pid = fork();

Nit: check for negative return value?
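(E.g., assuming the usual test_progs ASSERT macros, something like
`if (!ASSERT_GE(pid, 0, "fork"))` before proceeding.)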

> +	if (pid == 0) {
> +		__u64 num = 1;
> +		int i;
> +		char buf;
> +
> +		close(pipefd[1]);
> +		read(pipefd[0], &buf, sizeof(buf));
> +		close(pipefd[0]);
> +
> +		for (i = 0; i < 10000; ++i)
> +			num *= time(0) % 7;
> +		(void)num;
> +		exit(0);
> +	}
> +	skel = task_work__open();
> +	if (!ASSERT_OK_PTR(skel, "task_work__open"))
> +		return;
> +
> +	bpf_object__for_each_program(prog, skel->obj) {
> +		bpf_program__set_autoload(prog, false);
> +	}
> +
> +	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
> +	if (!ASSERT_OK_PTR(prog, "prog_name"))
> +		goto cleanup;
> +	bpf_program__set_autoload(prog, true);
> +	bpf_program__set_type(prog, BPF_PROG_TYPE_PERF_EVENT);

Nit: this is not really necessary, the programs are already defined as
     SEC("perf_event").

> +	skel->bss->user_ptr = (char *)user_string;
> +
> +	err = task_work__load(skel);
> +	if (!ASSERT_OK(err, "skel_load"))
> +		goto cleanup;

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-06 20:22   ` Eduard Zingerman
@ 2025-09-08 13:13     ` Mykyta Yatsenko
  2025-09-08 17:38       ` Eduard Zingerman
  2025-09-09  3:33       ` Andrii Nakryiko
  0 siblings, 2 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-08 13:13 UTC (permalink / raw)
  To: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On 9/6/25 21:22, Eduard Zingerman wrote:
> On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
>
> [...]
>
>> A small state machine and refcounting scheme ensures safe reuse and
>> teardown:
>>   STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
> Nit: state machine is actually a bit more complex:
>
>    digraph G {
>      scheduling  -> running    [label="callback 1"];
>      scheduled   -> running    [label="callback 2"];
>      running     -> standby    [label="callback 3"];
>      pending     -> scheduling [label="irq 1"];
>      scheduling  -> standby    [label="irq 2"];
>      scheduling  -> scheduled  [label="irq 3"];
>      standby     -> pending    [label="acquire_ctx"];
>    
>      freed      -> freed [label="cancel_and_free"];
>      pending    -> freed [label="cancel_and_free"];
>      running    -> freed [label="cancel_and_free"];
>      scheduled  -> freed [label="cancel_and_free"];
>      scheduling -> freed [label="cancel_and_free"];
>      standby    -> freed [label="cancel_and_free"];
>    }
>
> [...]
>
I'll update the description to contain a proper graph.
>> Flow of successful task work scheduling
>>   1) bpf_task_work_schedule_* is called from BPF code.
>>   2) Transition state from STANDBY to PENDING, marks context is owned by
>>   this task work scheduler
>>   3) irq_work_queue() schedules bpf_task_work_irq().
>>   4) Transition state from PENDING to SCHEDULING.
>>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>>   transitions to SCHEDULED.
> Nit: "4" repeated two times.
>
>>   5) Task work calls bpf_task_work_callback(), which transition state to
>>   RUNNING.
>>   6) BPF callback is executed
>>   7) Context is cleaned up, refcounts released, context state set back to
>>   STANDBY.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
>>   kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 317 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 109cb249e88c..418a0a211699 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
> [...]
>
>> +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
>> +{
>> +	/*
>> +	 * Scheduled task_work callback holds ctx ref, so if we successfully
>> +	 * cancelled, we put that ref on callback's behalf. If we couldn't
>> +	 * cancel, callback is inevitably run or has already completed
>> +	 * running, and it would have taken care of its ctx ref itself.
>> +	 */
>> +	if (task_work_cancel_match(ctx->task, task_work_match, ctx))
> Will `task_work_cancel(ctx->task, ctx->work)` do the same thing here?
I think so, yes, thanks for checking.
>
>> +		bpf_task_work_ctx_put(ctx);
>> +}
> [...]
>
>> +static void bpf_task_work_irq(struct irq_work *irq_work)
>> +{
>> +	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
>> +	enum bpf_task_work_state state;
>> +	int err;
>> +
>> +	guard(rcu_tasks_trace)();
>> +
>> +	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
>> +		bpf_task_work_ctx_put(ctx);
>> +		return;
>> +	}
> Why are separate PENDING and SCHEDULING states needed?
> Both indicate that the task had not been yet but is ready to be
> submitted to task_work_add(). So, on a first glance it seems that
> merging the two won't change the behaviour, what do I miss?
Yes, this is right, we may drop the SCHEDULING state; it does not change
any behavior compared to PENDING.
The state check before task_work_add() is needed anyway, so we won't
remove much code here.
I kept it just to be more consistent: with every state check we also
transition the state machine forward.
>
>> +	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
>> +	if (err) {
>> +		bpf_task_work_ctx_reset(ctx);
>> +		/*
>> +		 * try to switch back to STANDBY for another task_work reuse, but we might have
>> +		 * gone to FREED already, which is fine as we already cleaned up after ourselves
>> +		 */
>> +		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
>> +
>> +		/* we don't have RCU protection, so put after switching state */
>> +		bpf_task_work_ctx_put(ctx);
>> +	}
>> +
>> +	/*
>> +	 * It's technically possible for just scheduled task_work callback to
>> +	 * complete running by now, going SCHEDULING -> RUNNING and then
>> +	 * dropping its ctx refcount. Instead of capturing extra ref just to
>> +	 * protected below ctx->state access, we rely on RCU protection to
>> +	 * perform below SCHEDULING -> SCHEDULED attempt.
>> +	 */
>> +	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
>> +	if (state == BPF_TW_FREED)
>> +		bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
>> +}
> [...]
>
>> +static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
>> +							   struct bpf_map *map)
>> +{
>> +	struct bpf_task_work_ctx *ctx;
>> +
>> +	/* early check to avoid any work, we'll double check at the end again */
>> +	if (!atomic64_read(&map->usercnt))
>> +		return ERR_PTR(-EBUSY);
>> +
>> +	ctx = bpf_task_work_fetch_ctx(tw, map);
>> +	if (IS_ERR(ctx))
>> +		return ctx;
>> +
>> +	/* try to get ref for task_work callback to hold */
>> +	if (!bpf_task_work_ctx_tryget(ctx))
>> +		return ERR_PTR(-EBUSY);
>> +
>> +	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
>> +		/* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
>> +		bpf_task_work_ctx_put(ctx);
>> +		return ERR_PTR(-EBUSY);
>> +	}
>> +
>> +	/*
>> +	 * Double check that map->usercnt wasn't dropped while we were
>> +	 * preparing context, and if it was, we need to clean up as if
>> +	 * map_release_uref() was called; bpf_task_work_cancel_and_free()
>> +	 * is safe to be called twice on the same task work
>> +	 */
>> +	if (!atomic64_read(&map->usercnt)) {
>> +		/* drop ref we just got for task_work callback itself */
>> +		bpf_task_work_ctx_put(ctx);
>> +		/* transfer map's ref into cancel_and_free() */
>> +		bpf_task_work_cancel_and_free(tw);
>> +		return ERR_PTR(-EBUSY);
>> +	}
> I don't understand how the above check is useful.
> Is map->usercnt protected from being changed during execution of
> bpf_task_work_schedule()?
> There are two such checks in this function, so apparently it is not.
> Then what's the point of checking usercnt value if it can be
> immediately changed after the check?
The BPF map implementation calls bpf_task_work_cancel_and_free() for each
value when map->usercnt drops to 0. We need to make sure that after
mutating the map value (attaching a ctx, setting its state and refcount)
we do not leak the newly allocated ctx.
If bpf_task_work_cancel_and_free() runs concurrently with
bpf_task_work_acquire_ctx(), there is a chance that the map cleans up the
value first and we then attach a ctx with refcnt=2; that memory would
leak. Alternatively, if map->usercnt drops to 0 right after this check,
we are guaranteed to have the initialized context attached already, so
the refcounts will be properly decremented (once by
bpf_task_work_cancel_and_free() and once by bpf_task_work_irq()) and
cleanup is safe.

In other words, initialization of the ctx in struct bpf_task_work is a
multi-step operation, and those steps can be interleaved with
cancel_and_free(); in that case the value may leak the ctx. Checking
map->usercnt == 0 after initialization forces the correct cleanup and
prevents the leak. Calling cancel_and_free() for the same value twice is
safe.
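
To make the interleaving concrete, here is a sketch of the race that the
second check closes (hypothetical timeline, not taken from the patch):

	1. acquire_ctx(): map->usercnt != 0, early check passes
	2. map_release_uref(): usercnt drops to 0,
	   bpf_task_work_cancel_and_free(tw) runs, finds no ctx attached
	   yet, does nothing
	3. acquire_ctx(): allocates a ctx, attaches it to tw, takes the
	   callback ref and moves STANDBY -> PENDING (refcnt == 2)
	4. acquire_ctx(): second usercnt check sees 0, puts the callback
	   ref and calls bpf_task_work_cancel_and_free(tw) itself, so the
	   ctx is freed and nothing leaks

Without step 4, the ctx from step 3 would keep its map-owned reference
forever (cancel_and_free() has already run and found nothing), i.e. the
memory would leak.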
>
>> +
>> +	return ctx;
>> +}
>> +
>> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
>> +				  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
>> +				  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
>> +{
>> +	struct bpf_prog *prog;
>> +	struct bpf_task_work_ctx *ctx;
>> +	int err;
>> +
>> +	BTF_TYPE_EMIT(struct bpf_task_work);
>> +
>> +	prog = bpf_prog_inc_not_zero(aux->prog);
>> +	if (IS_ERR(prog))
>> +		return -EBADF;
>> +	task = bpf_task_acquire(task);
>> +	if (!task) {
>> +		err = -EPERM;
> Nit: Why -EPERM? bpf_task_acquire() returns NULL if task->rcu_users
>       is zero, does not seem to be permission related.
Right, this probably should be -EBADF.
>> +		goto release_prog;
>> +	}
>> +
>> +	ctx = bpf_task_work_acquire_ctx(tw, map);
>> +	if (IS_ERR(ctx)) {
>> +		err = PTR_ERR(ctx);
>> +		goto release_all;
>> +	}
>> +
>> +	ctx->task = task;
>> +	ctx->callback_fn = callback_fn;
>> +	ctx->prog = prog;
>> +	ctx->mode = mode;
>> +	ctx->map = map;
>> +	ctx->map_val = (void *)tw - map->record->task_work_off;
>> +	init_task_work(&ctx->work, bpf_task_work_callback);
>> +	init_irq_work(&ctx->irq_work, bpf_task_work_irq);
>> +
>> +	irq_work_queue(&ctx->irq_work);
>> +	return 0;
>> +
>> +release_all:
>> +	bpf_task_release(task);
>> +release_prog:
>> +	bpf_prog_put(prog);
>> +	return err;
>> +}
>> +
> [...]


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests
  2025-09-08  7:43   ` Eduard Zingerman
@ 2025-09-08 13:21     ` Mykyta Yatsenko
  2025-09-08 18:23       ` Eduard Zingerman
  0 siblings, 1 reply; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-08 13:21 UTC (permalink / raw)
  To: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On 9/8/25 08:43, Eduard Zingerman wrote:
> On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Introducing selftests that check BPF task work scheduling mechanism.
>> Validate that verifier does not accepts incorrect calls to
>> bpf_task_work_schedule kfunc.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
> The test cases in this patch check functional correctness, but there
> is no attempt to do some stress testing of the state machine.
> E.g. how hard/feasible would it be to construct a test that attempts
> to exercise both branches of the (state == BPF_TW_SCHEDULED) in the
> bpf_task_work_cancel_and_free()?
Good point. I have a stress test, but did not include it in the patches,
as it takes longer to run.
I had to add logs in the kernel code to make sure the cancellation/freeing
branches are hit.
https://github.com/kernel-patches/bpf/commit/86408b074ab0a2d290977846c3e99a07443ac604
>
>>   .../selftests/bpf/prog_tests/test_task_work.c | 149 ++++++++++++++++++
>>   tools/testing/selftests/bpf/progs/task_work.c | 108 +++++++++++++
>>   .../selftests/bpf/progs/task_work_fail.c      |  98 ++++++++++++
>>   3 files changed, 355 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/task_work.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_task_work.c b/tools/testing/selftests/bpf/prog_tests/test_task_work.c
>> new file mode 100644
>> index 000000000000..9c3c7a46a827
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/test_task_work.c
>> @@ -0,0 +1,149 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
>> +#include <test_progs.h>
>> +#include <string.h>
>> +#include <stdio.h>
>> +#include "task_work.skel.h"
>> +#include "task_work_fail.skel.h"
>> +#include <linux/bpf.h>
>> +#include <linux/perf_event.h>
>> +#include <sys/syscall.h>
>> +#include <time.h>
>> +
>> +static int perf_event_open(__u32 type, __u64 config, int pid)
>> +{
>> +	struct perf_event_attr attr = {
>> +		.type = type,
>> +		.config = config,
>> +		.size = sizeof(struct perf_event_attr),
>> +		.sample_period = 100000,
>> +	};
>> +
>> +	return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
>> +}
>> +
>> +struct elem {
>> +	char data[128];
>> +	struct bpf_task_work tw;
>> +};
>> +
>> +static int verify_map(struct bpf_map *map, const char *expected_data)
>> +{
>> +	int err;
>> +	struct elem value;
>> +	int processed_values = 0;
>> +	int k, sz;
>> +
>> +	sz = bpf_map__max_entries(map);
>> +	for (k = 0; k < sz; ++k) {
>> +		err = bpf_map__lookup_elem(map, &k, sizeof(int), &value, sizeof(struct elem), 0);
>> +		if (err)
>> +			continue;
>> +		if (!ASSERT_EQ(strcmp(expected_data, value.data), 0, "map data")) {
>> +			fprintf(stderr, "expected '%s', found '%s' in %s map", expected_data,
>> +				value.data, bpf_map__name(map));
>> +			return 2;
>> +		}
>> +		processed_values++;
>> +	}
>> +
>> +	return processed_values == 0;
> Nit: check for exact number of expected values here?
>
>> +}
>> +
>> +static void task_work_run(const char *prog_name, const char *map_name)
>> +{
>> +	struct task_work *skel;
>> +	struct bpf_program *prog;
>> +	struct bpf_map *map;
>> +	struct bpf_link *link;
>> +	int err, pe_fd = 0, pid, status, pipefd[2];
>> +	char user_string[] = "hello world";
>> +
>> +	if (!ASSERT_NEQ(pipe(pipefd), -1, "pipe"))
>> +		return;
>> +
>> +	pid = fork();
> Nit: check for negative return value?
>
>> +	if (pid == 0) {
>> +		__u64 num = 1;
>> +		int i;
>> +		char buf;
>> +
>> +		close(pipefd[1]);
>> +		read(pipefd[0], &buf, sizeof(buf));
>> +		close(pipefd[0]);
>> +
>> +		for (i = 0; i < 10000; ++i)
>> +			num *= time(0) % 7;
>> +		(void)num;
>> +		exit(0);
>> +	}
>> +	skel = task_work__open();
>> +	if (!ASSERT_OK_PTR(skel, "task_work__open"))
>> +		return;
>> +
>> +	bpf_object__for_each_program(prog, skel->obj) {
>> +		bpf_program__set_autoload(prog, false);
>> +	}
>> +
>> +	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
>> +	if (!ASSERT_OK_PTR(prog, "prog_name"))
>> +		goto cleanup;
>> +	bpf_program__set_autoload(prog, true);
>> +	bpf_program__set_type(prog, BPF_PROG_TYPE_PERF_EVENT);
> Nit: this is not really necessary, the programs are already defined as
>       SEC("perf_event").
>
>> +	skel->bss->user_ptr = (char *)user_string;
>> +
>> +	err = task_work__load(skel);
>> +	if (!ASSERT_OK(err, "skel_load"))
>> +		goto cleanup;
> [...]


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation
  2025-09-05 23:19   ` Eduard Zingerman
@ 2025-09-08 13:39     ` Mykyta Yatsenko
  2025-09-08 17:18       ` Eduard Zingerman
  0 siblings, 1 reply; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-08 13:39 UTC (permalink / raw)
  To: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On 9/6/25 00:19, Eduard Zingerman wrote:
> On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
>
> [...]
>
>> +static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
> `arr_idx` is unused at every call site of this function.
> And even if it was used, why both set through pointer and return same value?
This function returns a pointer to the map key. In the case of an array
map, the map key is the element index, which is not stored anywhere in
the map explicitly; arr_idx is the container for that key, and we need to
pass it in from the outside so that its lifetime is long enough.
In the case of a hash map, we return a pointer to the actual key stored
in the map, so arr_idx is not needed.
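
A minimal caller-side sketch, just to illustrate the lifetime point
(hypothetical usage, not the exact patch code):

	u32 arr_idx;	/* caller-provided storage, must outlive the use of key */
	void *key = map_key_from_value(map, value, &arr_idx);

	/*
	 * BPF_MAP_TYPE_ARRAY: key points at arr_idx, which now holds the
	 * element index recomputed from the value pointer.
	 * BPF_MAP_TYPE_HASH/LRU_HASH: key points at the real key stored
	 * just before the value; arr_idx is left untouched.
	 */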
>
>> +{
>> +	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
>> +		struct bpf_array *array = container_of(map, struct bpf_array, map);
>> +
>> +		*arr_idx = ((char *)value - array->value) / array->elem_size;
>> +		return arr_idx;
>> +	}
>> +	BUG_ON(map->map_type != BPF_MAP_TYPE_HASH && map->map_type != BPF_MAP_TYPE_LRU_HASH);
>> +	return (void *)value - round_up(map->key_size, 8);
>> +}
>> +
> [...]


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation
  2025-09-08 13:39     ` Mykyta Yatsenko
@ 2025-09-08 17:18       ` Eduard Zingerman
  0 siblings, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-08 17:18 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Mon, 2025-09-08 at 14:39 +0100, Mykyta Yatsenko wrote:

[...]

> this function returns the pointer to map key; in case of array map, map 
> key is an array index,
> which is not stored anywhere in the map explicitly.
> arr_idx is a container for the array map key, we need to pass it from 
> the outside so that the
> lifetime is long enough.
> In case of hash map, we return the pointer to the actual key, stored in 
> the map,
> arr_idx is not needed then.

Oh, I get it, `arr_idx` is temporary storage. Sorry for the noise.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-08 13:13     ` Mykyta Yatsenko
@ 2025-09-08 17:38       ` Eduard Zingerman
  2025-09-09  3:42         ` Andrii Nakryiko
  2025-09-09  3:33       ` Andrii Nakryiko
  1 sibling, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-08 17:38 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Mon, 2025-09-08 at 14:13 +0100, Mykyta Yatsenko wrote:

[...]

> > > +static void bpf_task_work_irq(struct irq_work *irq_work)
> > > +{
> > > +	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> > > +	enum bpf_task_work_state state;
> > > +	int err;
> > > +
> > > +	guard(rcu_tasks_trace)();
> > > +
> > > +	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> > > +		bpf_task_work_ctx_put(ctx);
> > > +		return;
> > > +	}
> > Why are separate PENDING and SCHEDULING states needed?
> > Both indicate that the task had not been yet but is ready to be
> > submitted to task_work_add(). So, on a first glance it seems that
> > merging the two won't change the behaviour, what do I miss?

> Yes, this is right, we may drop SCHEDULING state, it does not change any 
> behavior compared to PENDING.
> The state check before task_work_add is needed anyway, so we won't 
> remove much code here.
> I kept it just to be more consistent: with every state check we also 
> transition state machine forward.

Why is state check before task_work_add() mandatory?
You check for FREED in both branches of task_work_add(),
so there seem to be no issues with leaking ctx.

> > > +	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> > > +	if (err) {
> > > +		bpf_task_work_ctx_reset(ctx);
> > > +		/*
> > > +		 * try to switch back to STANDBY for another task_work reuse, but we might have
> > > +		 * gone to FREED already, which is fine as we already cleaned up after ourselves
> > > +		 */
> > > +		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> > > +
> > > +		/* we don't have RCU protection, so put after switching state */
> > > +		bpf_task_work_ctx_put(ctx);
> > > +	}
> > > +
> > > +	/*
> > > +	 * It's technically possible for just scheduled task_work callback to
> > > +	 * complete running by now, going SCHEDULING -> RUNNING and then
> > > +	 * dropping its ctx refcount. Instead of capturing extra ref just to
> > > +	 * protected below ctx->state access, we rely on RCU protection to
> > > +	 * perform below SCHEDULING -> SCHEDULED attempt.
> > > +	 */
> > > +	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> > > +	if (state == BPF_TW_FREED)
> > > +		bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> > > +}
> > [...]
> > 
> > > +static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
> > > +							   struct bpf_map *map)
> > > +{
> > > +	struct bpf_task_work_ctx *ctx;
> > > +
> > > +	/* early check to avoid any work, we'll double check at the end again */
> > > +	if (!atomic64_read(&map->usercnt))
> > > +		return ERR_PTR(-EBUSY);
> > > +
> > > +	ctx = bpf_task_work_fetch_ctx(tw, map);
> > > +	if (IS_ERR(ctx))
> > > +		return ctx;
> > > +
> > > +	/* try to get ref for task_work callback to hold */
> > > +	if (!bpf_task_work_ctx_tryget(ctx))
> > > +		return ERR_PTR(-EBUSY);
> > > +
> > > +	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> > > +		/* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
> > > +		bpf_task_work_ctx_put(ctx);
> > > +		return ERR_PTR(-EBUSY);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Double check that map->usercnt wasn't dropped while we were
> > > +	 * preparing context, and if it was, we need to clean up as if
> > > +	 * map_release_uref() was called; bpf_task_work_cancel_and_free()
> > > +	 * is safe to be called twice on the same task work
> > > +	 */
> > > +	if (!atomic64_read(&map->usercnt)) {
> > > +		/* drop ref we just got for task_work callback itself */
> > > +		bpf_task_work_ctx_put(ctx);
> > > +		/* transfer map's ref into cancel_and_free() */
> > > +		bpf_task_work_cancel_and_free(tw);
> > > +		return ERR_PTR(-EBUSY);
> > > +	}
> > I don't understand how the above check is useful.
> > Is map->usercnt protected from being changed during execution of
> > bpf_task_work_schedule()?
> > There are two such checks in this function, so apparently it is not.
> > Then what's the point of checking usercnt value if it can be
> > immediately changed after the check?

> BPF map implementation calls bpf_task_work_cancel_and_free() for each 
> value when map->usercnt goes to 0.
> We need to make sure that after mutating map value (attaching a ctx, 
> setting state and refcnt), we do not
> leak memory to a newly allocated ctx.
> If bpf_task_work_cancel_and_free() runs concurrently with 
> bpf_task_work_acquire_ctx(), there is a chance that map cleans up the 
> value first and then we attach a ctx with refcnt=2, memory will leak. 
> Alternatively, if map->usercnt is set to 0 right after this check, we 
> are guaranteed to have the initialized context attached already, so the 
> refcnts will be properly decremented (once by 
> bpf_task_work_cancel_and_free()
> and once by bpf_task_work_irq() and clean up is safe).
> 
> In other words, initialization of the ctx in struct bpf_task_work is 
> multi-step operation, those steps could be
> interleaved with cancel_and_free(), in such case the value may leak the 
> ctx. Check map->usercnt==0 after initialization,
> to force correct cleanup preventing the leak. Calling cancel_and_free() 
> for the same value twice is safe.

Ack, thank you for explaining.

> > 
> > > +
> > > +	return ctx;
> > > +}

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests
  2025-09-08 13:21     ` Mykyta Yatsenko
@ 2025-09-08 18:23       ` Eduard Zingerman
  2025-09-09  3:44         ` Andrii Nakryiko
  0 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-08 18:23 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Mon, 2025-09-08 at 14:21 +0100, Mykyta Yatsenko wrote:
> On 9/8/25 08:43, Eduard Zingerman wrote:
> > On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> > > From: Mykyta Yatsenko <yatsenko@meta.com>
> > > 
> > > Introducing selftests that check BPF task work scheduling mechanism.
> > > Validate that verifier does not accepts incorrect calls to
> > > bpf_task_work_schedule kfunc.
> > > 
> > > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > > ---
> > The test cases in this patch check functional correctness, but there
> > is no attempt to do some stress testing of the state machine.
> > E.g. how hard/feasible would it be to construct a test that attempts
> > to exercise both branches of the (state == BPF_TW_SCHEDULED) in the
> > bpf_task_work_cancel_and_free()?

> Good point, I have stress test, but did not include it in the patches, 
> as it takes longer to run.
> I had to add logs in the kernel code to make sure cancellation/freeing 
> branches are hit.
> https://github.com/kernel-patches/bpf/commit/86408b074ab0a2d290977846c3e99a07443ac604

Ack. I see no harm in having such test run for a couple of seconds on
CI, but up to you.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests
  2025-09-05 16:45 ` [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-08  7:43   ` Eduard Zingerman
@ 2025-09-08 18:23   ` Eduard Zingerman
  2 siblings, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-08 18:23 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@meta.com>
> 
> Introducing selftests that check BPF task work scheduling mechanism.
> Validate that verifier does not accepts incorrect calls to
> bpf_task_work_schedule kfunc.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-08 13:13     ` Mykyta Yatsenko
  2025-09-08 17:38       ` Eduard Zingerman
@ 2025-09-09  3:33       ` Andrii Nakryiko
  2025-09-09  4:05         ` Eduard Zingerman
  1 sibling, 1 reply; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-09  3:33 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Mon, Sep 8, 2025 at 9:13 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 9/6/25 21:22, Eduard Zingerman wrote:
> > On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> >
> > [...]
> >
> >> A small state machine and refcounting scheme ensures safe reuse and
> >> teardown:
> >>   STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
> > Nit: state machine is actually a bit more complex:
> >
> >    digraph G {
> >      scheduling  -> running    [label="callback 1"];
> >      scheduled   -> running    [label="callback 2"];
> >      running     -> standby    [label="callback 3"];
> >      pending     -> scheduling [label="irq 1"];
> >      scheduling  -> standby    [label="irq 2"];
> >      scheduling  -> scheduled  [label="irq 3"];
> >      standby     -> pending    [label="acquire_ctx"];
> >
> >      freed      -> freed [label="cancel_and_free"];
> >      pending    -> freed [label="cancel_and_free"];
> >      running    -> freed [label="cancel_and_free"];
> >      scheduled  -> freed [label="cancel_and_free"];
> >      scheduling -> freed [label="cancel_and_free"];
> >      standby    -> freed [label="cancel_and_free"];
> >    }
> >
> > [...]
> >
> I'll update the description to contain proper graph.

Hm... I like the main linear chain of state transitions as is, tbh.
It's fundamentally simple and helps get the general picture. Sure,
there are important details, but I don't think we should overwhelm
anyone reading with all of this upfront.

In the above, "callback 1" and so on is not really helpful for
understanding, IMO.

I'd just add the note that a) each state can transition to FREED and
b) with tiny probability we might skip SCHEDULED and go SCHEDULING ->
RUNNING (extremely unlikely if at all possible, tbh).

In short, let's not go too detailed here.
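
Something along these lines, roughly (a sketch of the wording, not a
final version):

  STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY

  Any state can additionally transition to FREED via
  bpf_task_work_cancel_and_free(), and in rare cases SCHEDULED may be
  skipped, going straight from SCHEDULING to RUNNING.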

> >> Flow of successful task work scheduling
> >>   1) bpf_task_work_schedule_* is called from BPF code.
> >>   2) Transition state from STANDBY to PENDING, marks context is owned by
> >>   this task work scheduler
> >>   3) irq_work_queue() schedules bpf_task_work_irq().
> >>   4) Transition state from PENDING to SCHEDULING.
> >>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> >>   transitions to SCHEDULED.
> > Nit: "4" repeated two times.
> >
> >>   5) Task work calls bpf_task_work_callback(), which transition state to
> >>   RUNNING.
> >>   6) BPF callback is executed
> >>   7) Context is cleaned up, refcounts released, context state set back to
> >>   STANDBY.
> >>
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> >> ---
> >>   kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
> >>   1 file changed, 317 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index 109cb249e88c..418a0a211699 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> > [...]
> >
> >> +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
> >> +{
> >> +    /*
> >> +     * Scheduled task_work callback holds ctx ref, so if we successfully
> >> +     * cancelled, we put that ref on callback's behalf. If we couldn't
> >> +     * cancel, callback is inevitably run or has already completed
> >> +     * running, and it would have taken care of its ctx ref itself.
> >> +     */
> >> +    if (task_work_cancel_match(ctx->task, task_work_match, ctx))
> > Will `task_work_cancel(ctx->task, ctx->work)` do the same thing here?
> I think so, yes, thanks for checking.
> >
> >> +            bpf_task_work_ctx_put(ctx);
> >> +}
> > [...]
> >
> >> +static void bpf_task_work_irq(struct irq_work *irq_work)
> >> +{
> >> +    struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> >> +    enum bpf_task_work_state state;
> >> +    int err;
> >> +
> >> +    guard(rcu_tasks_trace)();
> >> +
> >> +    if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> >> +            bpf_task_work_ctx_put(ctx);
> >> +            return;
> >> +    }
> > Why are separate PENDING and SCHEDULING states needed?
> > Both indicate that the task had not been yet but is ready to be
> > submitted to task_work_add(). So, on a first glance it seems that
> > merging the two won't change the behaviour, what do I miss?
> Yes, this is right, we may drop SCHEDULING state, it does not change any
> behavior compared to PENDING.
> The state check before task_work_add is needed anyway, so we won't
> remove much code here.
> I kept it just to be more consistent: with every state check we also
> transition state machine forward.

Yeah, I like this property as well, I think it makes it easier to
reason about all this. I'd keep the PENDING and SCHEDULING
distinction, unless there is a strong reason not to.

It also gives us a natural point to check for FREED before doing
unnecessary task_work scheduling + cancelling (if we were already in
FREED). It doesn't seem like we'll simplify anything by dropping the
SCHEDULING (or PENDING) state.

> >
> >> +    err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> >> +    if (err) {
> >> +            bpf_task_work_ctx_reset(ctx);
> >> +            /*
> >> +             * try to switch back to STANDBY for another task_work reuse, but we might have
> >> +             * gone to FREED already, which is fine as we already cleaned up after ourselves
> >> +             */
> >> +            (void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> >> +
> >> +            /* we don't have RCU protection, so put after switching state */
> >> +            bpf_task_work_ctx_put(ctx);
> >> +    }
> >> +

[...]

> >> +
> >> +    return ctx;
> >> +}
> >> +
> >> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> >> +                              struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> >> +                              struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> >> +{
> >> +    struct bpf_prog *prog;
> >> +    struct bpf_task_work_ctx *ctx;
> >> +    int err;
> >> +
> >> +    BTF_TYPE_EMIT(struct bpf_task_work);
> >> +
> >> +    prog = bpf_prog_inc_not_zero(aux->prog);
> >> +    if (IS_ERR(prog))
> >> +            return -EBADF;
> >> +    task = bpf_task_acquire(task);
> >> +    if (!task) {
> >> +            err = -EPERM;
> > Nit: Why -EPERM? bpf_task_acquire() returns NULL if task->rcu_users
> >       is zero, does not seem to be permission related.
> Right, this probably should be -EBADF.

timer and wq (and now task_work) return -EPERM for map->usercnt==0
check, but we have -EBADF for prog refcount being zero. It's a bit all
over the place... I don't have a strong preference, but it would be
nice to stay more or less consistent for all these "it's too late"
conditions, IMO.

> >> +            goto release_prog;
> >> +    }
> >> +
> >> +    ctx = bpf_task_work_acquire_ctx(tw, map);
> >> +    if (IS_ERR(ctx)) {
> >> +            err = PTR_ERR(ctx);
> >> +            goto release_all;
> >> +    }
> >> +
> >> +    ctx->task = task;
> >> +    ctx->callback_fn = callback_fn;
> >> +    ctx->prog = prog;
> >> +    ctx->mode = mode;
> >> +    ctx->map = map;
> >> +    ctx->map_val = (void *)tw - map->record->task_work_off;
> >> +    init_task_work(&ctx->work, bpf_task_work_callback);
> >> +    init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> >> +
> >> +    irq_work_queue(&ctx->irq_work);
> >> +    return 0;
> >> +
> >> +release_all:
> >> +    bpf_task_release(task);
> >> +release_prog:
> >> +    bpf_prog_put(prog);
> >> +    return err;
> >> +}
> >> +
> > [...]
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-08 17:38       ` Eduard Zingerman
@ 2025-09-09  3:42         ` Andrii Nakryiko
  2025-09-09  4:15           ` Eduard Zingerman
  0 siblings, 1 reply; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-09  3:42 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Mon, Sep 8, 2025 at 1:39 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2025-09-08 at 14:13 +0100, Mykyta Yatsenko wrote:
>
> [...]
>
> > > > +static void bpf_task_work_irq(struct irq_work *irq_work)
> > > > +{
> > > > + struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> > > > + enum bpf_task_work_state state;
> > > > + int err;
> > > > +
> > > > + guard(rcu_tasks_trace)();
> > > > +
> > > > + if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> > > > +         bpf_task_work_ctx_put(ctx);
> > > > +         return;
> > > > + }
> > > Why are separate PENDING and SCHEDULING states needed?
> > > Both indicate that the task had not been yet but is ready to be
> > > submitted to task_work_add(). So, on a first glance it seems that
> > > merging the two won't change the behaviour, what do I miss?
>
> > Yes, this is right, we may drop SCHEDULING state, it does not change any
> > behavior compared to PENDING.
> > The state check before task_work_add is needed anyway, so we won't
> > remove much code here.
> > I kept it just to be more consistent: with every state check we also
> > transition state machine forward.
>
> Why is state check before task_work_add() mandatory?
> You check for FREED in both branches of task_work_add(),
> so there seem to be no issues with leaking ctx.

Not really mandatory, but I think it is good to avoid even attempting
to schedule task_work if the map element was already deleted?
Technically, even that last FREED check in bpf_task_work_irq is not
strictly necessary, we could have always let task_work callback
execute and bail all the way there, but that seems too extreme (and
task_work can be delayed by a lot for some special task states, I
think).

Also, keep in mind, this same state machine will be used for timers
and wqs (at least we should try), and so, in general, being diligent
about not doing doomed-to-be-failed-or-cancelled work is a good
property.

>
> > > > + err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> > > > + if (err) {
> > > > +         bpf_task_work_ctx_reset(ctx);
> > > > +         /*
> > > > +          * try to switch back to STANDBY for another task_work reuse, but we might have
> > > > +          * gone to FREED already, which is fine as we already cleaned up after ourselves
> > > > +          */
> > > > +         (void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> > > > +
> > > > +         /* we don't have RCU protection, so put after switching state */
> > > > +         bpf_task_work_ctx_put(ctx);
> > > > + }
> > > > +
> > > > + /*
> > > > +  * It's technically possible for just scheduled task_work callback to
> > > > +  * complete running by now, going SCHEDULING -> RUNNING and then
> > > > +  * dropping its ctx refcount. Instead of capturing extra ref just to
> > > > +  * protected below ctx->state access, we rely on RCU protection to
> > > > +  * perform below SCHEDULING -> SCHEDULED attempt.
> > > > +  */
> > > > + state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> > > > + if (state == BPF_TW_FREED)
> > > > +         bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> > > > +}
> > > [...]
> > >
> > > > +static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
> > > > +                                                    struct bpf_map *map)
> > > > +{
> > > > + struct bpf_task_work_ctx *ctx;
> > > > +
> > > > + /* early check to avoid any work, we'll double check at the end again */
> > > > + if (!atomic64_read(&map->usercnt))
> > > > +         return ERR_PTR(-EBUSY);
> > > > +
> > > > + ctx = bpf_task_work_fetch_ctx(tw, map);
> > > > + if (IS_ERR(ctx))
> > > > +         return ctx;
> > > > +
> > > > + /* try to get ref for task_work callback to hold */
> > > > + if (!bpf_task_work_ctx_tryget(ctx))
> > > > +         return ERR_PTR(-EBUSY);
> > > > +
> > > > + if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> > > > +         /* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
> > > > +         bpf_task_work_ctx_put(ctx);
> > > > +         return ERR_PTR(-EBUSY);
> > > > + }
> > > > +
> > > > + /*
> > > > +  * Double check that map->usercnt wasn't dropped while we were
> > > > +  * preparing context, and if it was, we need to clean up as if
> > > > +  * map_release_uref() was called; bpf_task_work_cancel_and_free()
> > > > +  * is safe to be called twice on the same task work
> > > > +  */
> > > > + if (!atomic64_read(&map->usercnt)) {
> > > > +         /* drop ref we just got for task_work callback itself */
> > > > +         bpf_task_work_ctx_put(ctx);
> > > > +         /* transfer map's ref into cancel_and_free() */
> > > > +         bpf_task_work_cancel_and_free(tw);
> > > > +         return ERR_PTR(-EBUSY);
> > > > + }
> > > I don't understand how the above check is useful.
> > > Is map->usercnt protected from being changed during execution of
> > > bpf_task_work_schedule()?
> > > There are two such checks in this function, so apparently it is not.
> > > Then what's the point of checking usercnt value if it can be
> > > immediately changed after the check?
>
> > BPF map implementation calls bpf_task_work_cancel_and_free() for each
> > value when map->usercnt goes to 0.
> > We need to make sure that after mutating map value (attaching a ctx,
> > setting state and refcnt), we do not
> > leak memory to a newly allocated ctx.
> > If bpf_task_work_cancel_and_free() runs concurrently with
> > bpf_task_work_acquire_ctx(), there is a chance that map cleans up the
> > value first and then we attach a ctx with refcnt=2, memory will leak.
> > Alternatively, if map->usercnt is set to 0 right after this check, we
> > are guaranteed to have the initialized context attached already, so the
> > refcnts will be properly decremented (once by
> > bpf_task_work_cancel_and_free()
> > and once by bpf_task_work_irq() and clean up is safe).
> >
> > In other words, initialization of the ctx in struct bpf_task_work is
> > multi-step operation, those steps could be
> > interleaved with cancel_and_free(), in such case the value may leak the
> > ctx. Check map->usercnt==0 after initialization,
> > to force correct cleanup preventing the leak. Calling cancel_and_free()
> > for the same value twice is safe.
>
> Ack, thank you for explaining.

btw, one can argue that the first map->usercnt==0 check is also not
mandatory: we could always go through allocating/getting the context and
only then check map->usercnt. But, see above, I think we should be
diligent and avoid useless work.

>
> > >
> > > > +
> > > > + return ctx;
> > > > +}
>
> [...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests
  2025-09-08 18:23       ` Eduard Zingerman
@ 2025-09-09  3:44         ` Andrii Nakryiko
  0 siblings, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-09  3:44 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Mon, Sep 8, 2025 at 2:23 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2025-09-08 at 14:21 +0100, Mykyta Yatsenko wrote:
> > On 9/8/25 08:43, Eduard Zingerman wrote:
> > > On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> > > > From: Mykyta Yatsenko <yatsenko@meta.com>
> > > >
> > > > Introducing selftests that check BPF task work scheduling mechanism.
> > > > Validate that verifier does not accepts incorrect calls to
> > > > bpf_task_work_schedule kfunc.
> > > >
> > > > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > > > ---
> > > The test cases in this patch check functional correctness, but there
> > > is no attempt to do some stress testing of the state machine.
> > > E.g. how hard/feasible would it be to construct a test that attempts
> > > to exercise both branches of the (state == BPF_TW_SCHEDULED) in the
> > > bpf_task_work_cancel_and_free()?
>
> > Good point, I have stress test, but did not include it in the patches,
> > as it takes longer to run.
> > I had to add logs in the kernel code to make sure cancellation/freeing
> > branches are hit.
> > https://github.com/kernel-patches/bpf/commit/86408b074ab0a2d290977846c3e99a07443ac604
>
> Ack. I see no harm in having such test run for a couple of seconds on
> CI, but up to you.

Well, we generally push back against long running tests. We have tons
of them and maintainers still often run all of the tests locally (and
even with -j parallelization they run sloooowww), so I'd avoid adding
a test that will run for a few seconds hoping to trigger some unlikely
condition.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-09  3:33       ` Andrii Nakryiko
@ 2025-09-09  4:05         ` Eduard Zingerman
  2025-09-10 14:14           ` Andrii Nakryiko
  0 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-09  4:05 UTC (permalink / raw)
  To: Andrii Nakryiko, Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor,
	Mykyta Yatsenko

On Mon, 2025-09-08 at 23:33 -0400, Andrii Nakryiko wrote:
> On Mon, Sep 8, 2025 at 9:13 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
> > 
> > On 9/6/25 21:22, Eduard Zingerman wrote:
> > > On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> > > 
> > > [...]
> > > 
> > > > A small state machine and refcounting scheme ensures safe reuse and
> > > > teardown:
> > > >   STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
> > > Nit: state machine is actually a bit more complex:
> > > 
> > >    digraph G {
> > >      scheduling  -> running    [label="callback 1"];
> > >      scheduled   -> running    [label="callback 2"];
> > >      running     -> standby    [label="callback 3"];
> > >      pending     -> scheduling [label="irq 1"];
> > >      scheduling  -> standby    [label="irq 2"];
> > >      scheduling  -> scheduled  [label="irq 3"];
> > >      standby     -> pending    [label="acquire_ctx"];
> > > 
> > >      freed      -> freed [label="cancel_and_free"];
> > >      pending    -> freed [label="cancel_and_free"];
> > >      running    -> freed [label="cancel_and_free"];
> > >      scheduled  -> freed [label="cancel_and_free"];
> > >      scheduling -> freed [label="cancel_and_free"];
> > >      standby    -> freed [label="cancel_and_free"];
> > >    }
> > > 
> > > [...]
> > > 
> > I'll update the description to contain proper graph.
> 
> Hm... I like the main linear chain of state transitions as is, tbh.
> It's fundamentally simple and helps get the general picture. Sure,
> there are important details, but I don't think we should overwhelm
> anyone reading with all of this upfront.
> 
> In the above, "callback 1" and so on is not really helpful for
> understanding, IMO.

I'm not suggesting taking the above graphviz spec as the description,
just pointing out that the current message is misleading.

> I'd just add the note that a) each state can transition to FREED and
> b) with tiny probability we might skip SCHEDULED and go SCHEDULING ->
> RUNNING (extremely unlikely if at all possible, tbh).
> 
> In short, let's not go too detailed here.

As you see fit. Transition graph is easier to read for me, but maybe
other people are better at reading text.

> 
> > > > Flow of successful task work scheduling
> > > >   1) bpf_task_work_schedule_* is called from BPF code.
> > > >   2) Transition state from STANDBY to PENDING, marks context is owned by
> > > >   this task work scheduler
> > > >   3) irq_work_queue() schedules bpf_task_work_irq().
> > > >   4) Transition state from PENDING to SCHEDULING.
> > > >   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> > > >   transitions to SCHEDULED.
> > > Nit: "4" repeated two times.
> > > 
> > > >   5) Task work calls bpf_task_work_callback(), which transition state to
> > > >   RUNNING.
> > > >   6) BPF callback is executed
> > > >   7) Context is cleaned up, refcounts released, context state set back to
> > > >   STANDBY.
> > > > 
> > > > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > > > ---
> > > >   kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
> > > >   1 file changed, 317 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > > index 109cb249e88c..418a0a211699 100644
> > > > --- a/kernel/bpf/helpers.c
> > > > +++ b/kernel/bpf/helpers.c
> > > [...]
> > > 
> > > > +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
> > > > +{
> > > > +    /*
> > > > +     * Scheduled task_work callback holds ctx ref, so if we successfully
> > > > +     * cancelled, we put that ref on callback's behalf. If we couldn't
> > > > +     * cancel, callback is inevitably run or has already completed
> > > > +     * running, and it would have taken care of its ctx ref itself.
> > > > +     */
> > > > +    if (task_work_cancel_match(ctx->task, task_work_match, ctx))
> > > Will `task_work_cancel(ctx->task, ctx->work)` do the same thing here?
> > I think so, yes, thanks for checking.
> > > 
> > > > +            bpf_task_work_ctx_put(ctx);
> > > > +}
> > > [...]
> > > 
> > > > +static void bpf_task_work_irq(struct irq_work *irq_work)
> > > > +{
> > > > +    struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> > > > +    enum bpf_task_work_state state;
> > > > +    int err;
> > > > +
> > > > +    guard(rcu_tasks_trace)();
> > > > +
> > > > +    if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> > > > +            bpf_task_work_ctx_put(ctx);
> > > > +            return;
> > > > +    }
> > > Why are separate PENDING and SCHEDULING states needed?
> > > Both indicate that the task had not been yet but is ready to be
> > > submitted to task_work_add(). So, on a first glance it seems that
> > > merging the two won't change the behaviour, what do I miss?
> > Yes, this is right, we may drop SCHEDULING state, it does not change any
> > behavior compared to PENDING.
> > The state check before task_work_add is needed anyway, so we won't
> > remove much code here.
> > I kept it just to be more consistent: with every state check we also
> > transition state machine forward.
> 
> Yeah, I like this property as well, I think it makes it easier to
> reason about all this. I'd keep the PENDING and SCHEDULING
> distinction, unless there is a strong reason not to.
> 
> It also gives us a natural point to check for FREED before doing
> unnecessary task_work scheduling + cancelling (if we were already in
> FREED). It doesn't seem like we'll simplify anything by SCHEDULING (or
> PENDING) state.

Again, people are probably different, but it took me some time to figure
out whether I was missing some detail or SCHEDULING is there just for the
sake of it.

> > > 
> > > > +    err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> > > > +    if (err) {
> > > > +            bpf_task_work_ctx_reset(ctx);
> > > > +            /*
> > > > +             * try to switch back to STANDBY for another task_work reuse, but we might have
> > > > +             * gone to FREED already, which is fine as we already cleaned up after ourselves
> > > > +             */
> > > > +            (void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> > > > +
> > > > +            /* we don't have RCU protection, so put after switching state */
> > > > +            bpf_task_work_ctx_put(ctx);
> > > > +    }
> > > > +
> 
> [...]
> 
> > > > +
> > > > +    return ctx;
> > > > +}
> > > > +
> > > > +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> > > > +                              struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> > > > +                              struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> > > > +{
> > > > +    struct bpf_prog *prog;
> > > > +    struct bpf_task_work_ctx *ctx;
> > > > +    int err;
> > > > +
> > > > +    BTF_TYPE_EMIT(struct bpf_task_work);
> > > > +
> > > > +    prog = bpf_prog_inc_not_zero(aux->prog);
> > > > +    if (IS_ERR(prog))
> > > > +            return -EBADF;
> > > > +    task = bpf_task_acquire(task);
> > > > +    if (!task) {
> > > > +            err = -EPERM;
> > > Nit: Why -EPERM? bpf_task_acquire() returns NULL if task->rcu_users
> > >       is zero, does not seem to be permission related.
> > Right, this probably should be -EBADF.
> 
> timer and wq (and now task_work) return -EPERM for map->usercnt==0
> check, but we have -EBADF for prog refcount being zero. It's a bit all
> over the place... I don't have a strong preference, but it would be
> nice to stay more or less consistent for all these "it's too late"
> conditions, IMO.

Ok, probably being consistently wrong is the better option here;
let's stick with -EPERM.

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-09  3:42         ` Andrii Nakryiko
@ 2025-09-09  4:15           ` Eduard Zingerman
  0 siblings, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-09-09  4:15 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Mon, 2025-09-08 at 23:42 -0400, Andrii Nakryiko wrote:
> On Mon, Sep 8, 2025 at 1:39 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > 
> > On Mon, 2025-09-08 at 14:13 +0100, Mykyta Yatsenko wrote:
> > 
> > [...]
> > 
> > > > > +static void bpf_task_work_irq(struct irq_work *irq_work)
> > > > > +{
> > > > > + struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> > > > > + enum bpf_task_work_state state;
> > > > > + int err;
> > > > > +
> > > > > + guard(rcu_tasks_trace)();
> > > > > +
> > > > > + if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> > > > > +         bpf_task_work_ctx_put(ctx);
> > > > > +         return;
> > > > > + }
> > > > Why are separate PENDING and SCHEDULING states needed?
> > > > Both indicate that the work has not yet been submitted to
> > > > task_work_add() but is ready to be. So, at first glance it seems
> > > > that merging the two won't change the behaviour; what do I miss?
> > 
> > > Yes, this is right, we may drop SCHEDULING state, it does not change any
> > > behavior compared to PENDING.
> > > The state check before task_work_add is needed anyway, so we won't
> > > remove much code here.
> > > I kept it just to be more consistent: with every state check we also
> > > transition the state machine forward.
> > 
> > Why is state check before task_work_add() mandatory?
> > You check for FREED in both branches of task_work_add(),
> > so there seem to be no issues with leaking ctx.
> 
> Not really mandatory, but I think it is good to avoid even attempting
> to schedule task_work if the map element was already deleted?
> Technically, even that last FREED check in bpf_task_work_irq is not
> strictly necessary, we could have always let task_work callback
> execute and bail all the way there, but that seems too extreme (and
> task_work can be delayed by a lot for some special task states, I
> think).
>
> Also, keep in mind, this same state machine will be used for timers
> and wqs (at least we should try), and so, in general, being diligent
> about not doing doomed-to-be-failed-or-cancelled work is a good
> property.

Ack, thank you for explaining.

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-05 16:45 ` [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs Mykyta Yatsenko
  2025-09-05 21:31   ` Andrii Nakryiko
  2025-09-06 20:22   ` Eduard Zingerman
@ 2025-09-09 17:49   ` Chris Mason
  2025-09-09 18:59     ` Mykyta Yatsenko
  2 siblings, 1 reply; 44+ messages in thread
From: Chris Mason @ 2025-09-09 17:49 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, memxor
  Cc: Mykyta Yatsenko

On 9/5/25 12:45 PM, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@meta.com>
> 
> Implementation of the new bpf_task_work_schedule kfuncs, that let a BPF
> program schedule task_work callbacks for a target task:
>  * bpf_task_work_schedule_signal() → schedules with TWA_SIGNAL
>  * bpf_task_work_schedule_resume() → schedules with TWA_RESUME
> 
> Each map value should embed a struct bpf_task_work, which the kernel
> side pairs with struct bpf_task_work_kern, containing a pointer to
> struct bpf_task_work_ctx, that maintains metadata relevant for the
> concrete callback scheduling.
> 
> A small state machine and refcounting scheme ensures safe reuse and
> teardown:
>  STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
> 
> A FREED terminal state coordinates with map-value
> deletion (bpf_task_work_cancel_and_free()).
> 
> Scheduling itself is deferred via irq_work to keep the kfunc callable
> from NMI context.
> 
> Lifetime is guarded with refcount_t + RCU Tasks Trace.
> 
> Main components:
>  * struct bpf_task_work_context – Metadata and state management per task
> work.
>  * enum bpf_task_work_state – A state machine to serialize work
>  scheduling and execution.
>  * bpf_task_work_schedule() – The central helper that initiates
> scheduling.
>  * bpf_task_work_acquire_ctx() - Attempts to take ownership of the context,
>  pointed by passed struct bpf_task_work, allocates new context if none
>  exists yet.
>  * bpf_task_work_callback() – Invoked when the actual task_work runs.
>  * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> to enqueue task work.
>  * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> 
> Flow of successful task work scheduling
>  1) bpf_task_work_schedule_* is called from BPF code.
>  2) Transition state from STANDBY to PENDING, marks context is owned by
>  this task work scheduler
>  3) irq_work_queue() schedules bpf_task_work_irq().
>  4) Transition state from PENDING to SCHEDULING.
>  4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>  transitions to SCHEDULED.
>  5) Task work calls bpf_task_work_callback(), which transition state to
>  RUNNING.
>  6) BPF callback is executed
>  7) Context is cleaned up, refcounts released, context state set back to
>  STANDBY.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 317 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 109cb249e88c..418a0a211699 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c

[ ... ]

> +static void bpf_task_work_irq(struct irq_work *irq_work)
> +{
> +	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> +	enum bpf_task_work_state state;
> +	int err;
> +
> +	guard(rcu_tasks_trace)();
> +
> +	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> +		bpf_task_work_ctx_put(ctx);
> +		return;
> +	}
> +
> +	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> +	if (err) {
> +		bpf_task_work_ctx_reset(ctx);
> +		/*
> +		 * try to switch back to STANDBY for another task_work reuse, but we might have
> +		 * gone to FREED already, which is fine as we already cleaned up after ourselves
> +		 */
> +		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> +
> +		/* we don't have RCU protection, so put after switching state */
> +		bpf_task_work_ctx_put(ctx);
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Do we want to return here?  I didn't follow all of the references, but
even if this isn't the last reference, it looks like the rest of the
function isn't meant to work on the ctx after this point.

> +	}
> +
> +	/*
> +	 * It's technically possible for just scheduled task_work callback to
> +	 * complete running by now, going SCHEDULING -> RUNNING and then
> +	 * dropping its ctx refcount. Instead of capturing extra ref just to
> +	 * protected below ctx->state access, we rely on RCU protection to
> +	 * perform below SCHEDULING -> SCHEDULED attempt.
> +	 */
> +	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> +	if (state == BPF_TW_FREED)
> +		bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> +}
-chris

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-09 17:49   ` Chris Mason
@ 2025-09-09 18:59     ` Mykyta Yatsenko
  0 siblings, 0 replies; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-09 18:59 UTC (permalink / raw)
  To: Chris Mason, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, memxor
  Cc: Mykyta Yatsenko

On 9/9/25 18:49, Chris Mason wrote:
> On 9/5/25 12:45 PM, Mykyta Yatsenko wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Implementation of the new bpf_task_work_schedule kfuncs, that let a BPF
>> program schedule task_work callbacks for a target task:
>>   * bpf_task_work_schedule_signal() → schedules with TWA_SIGNAL
>>   * bpf_task_work_schedule_resume() → schedules with TWA_RESUME
>>
>> Each map value should embed a struct bpf_task_work, which the kernel
>> side pairs with struct bpf_task_work_kern, containing a pointer to
>> struct bpf_task_work_ctx, that maintains metadata relevant for the
>> concrete callback scheduling.
>>
>> A small state machine and refcounting scheme ensures safe reuse and
>> teardown:
>>   STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
>>
>> A FREED terminal state coordinates with map-value
>> deletion (bpf_task_work_cancel_and_free()).
>>
>> Scheduling itself is deferred via irq_work to keep the kfunc callable
>> from NMI context.
>>
>> Lifetime is guarded with refcount_t + RCU Tasks Trace.
>>
>> Main components:
>>   * struct bpf_task_work_context – Metadata and state management per task
>> work.
>>   * enum bpf_task_work_state – A state machine to serialize work
>>   scheduling and execution.
>>   * bpf_task_work_schedule() – The central helper that initiates
>> scheduling.
>>   * bpf_task_work_acquire_ctx() - Attempts to take ownership of the context,
>>   pointed by passed struct bpf_task_work, allocates new context if none
>>   exists yet.
>>   * bpf_task_work_callback() – Invoked when the actual task_work runs.
>>   * bpf_task_work_irq() – An intermediate step (runs in softirq context)
>> to enqueue task work.
>>   * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
>>
>> Flow of successful task work scheduling
>>   1) bpf_task_work_schedule_* is called from BPF code.
>>   2) Transition state from STANDBY to PENDING, marks context is owned by
>>   this task work scheduler
>>   3) irq_work_queue() schedules bpf_task_work_irq().
>>   4) Transition state from PENDING to SCHEDULING.
>>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>>   transitions to SCHEDULED.
>>   5) Task work calls bpf_task_work_callback(), which transition state to
>>   RUNNING.
>>   6) BPF callback is executed
>>   7) Context is cleaned up, refcounts released, context state set back to
>>   STANDBY.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
>>   kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 317 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 109cb249e88c..418a0a211699 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
> [ ... ]
>
>> +static void bpf_task_work_irq(struct irq_work *irq_work)
>> +{
>> +	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
>> +	enum bpf_task_work_state state;
>> +	int err;
>> +
>> +	guard(rcu_tasks_trace)();
>> +
>> +	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
>> +		bpf_task_work_ctx_put(ctx);
>> +		return;
>> +	}
>> +
>> +	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
>> +	if (err) {
>> +		bpf_task_work_ctx_reset(ctx);
>> +		/*
>> +		 * try to switch back to STANDBY for another task_work reuse, but we might have
>> +		 * gone to FREED already, which is fine as we already cleaned up after ourselves
>> +		 */
>> +		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
>> +
>> +		/* we don't have RCU protection, so put after switching state */
>> +		bpf_task_work_ctx_put(ctx);
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Do we want to return here?  I didn't follow all of the references, but
> even if this isn't the last reference, it looks like the rest of the
> function isn't meant to work on the ctx after this point.
Thanks for taking a look! That's right, we should return there.
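
For reference, the fixed error branch would look roughly like this (a
sketch against the hunk quoted above, not the exact v4 code):

	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
	if (err) {
		bpf_task_work_ctx_reset(ctx);
		/*
		 * try to switch back to STANDBY for another task_work reuse, but we
		 * might have gone to FREED already, which is fine as we already
		 * cleaned up after ourselves
		 */
		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);

		/* we don't have RCU protection, so put after switching state */
		bpf_task_work_ctx_put(ctx);
		/* ctx may be gone once that ref is dropped, don't touch it below */
		return;
	}
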
>
>> +	}
>> +
>> +	/*
>> +	 * It's technically possible for just scheduled task_work callback to
>> +	 * complete running by now, going SCHEDULING -> RUNNING and then
>> +	 * dropping its ctx refcount. Instead of capturing extra ref just to
>> +	 * protected below ctx->state access, we rely on RCU protection to
>> +	 * perform below SCHEDULING -> SCHEDULED attempt.
>> +	 */
>> +	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
>> +	if (state == BPF_TW_FREED)
>> +		bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
>> +}
> -chris


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
  2025-09-09  4:05         ` Eduard Zingerman
@ 2025-09-10 14:14           ` Andrii Nakryiko
  0 siblings, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-10 14:14 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Tue, Sep 9, 2025 at 12:05 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2025-09-08 at 23:33 -0400, Andrii Nakryiko wrote:
> > On Mon, Sep 8, 2025 at 9:13 AM Mykyta Yatsenko
> > <mykyta.yatsenko5@gmail.com> wrote:
> > >
> > > On 9/6/25 21:22, Eduard Zingerman wrote:
> > > > On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> > > >
> > > > [...]
> > > >
> > > > > A small state machine and refcounting scheme ensures safe reuse and
> > > > > teardown:
> > > > >   STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
> > > > Nit: state machine is actually a bit more complex:
> > > >
> > > >    digraph G {
> > > >      scheduling  -> running    [label="callback 1"];
> > > >      scheduled   -> running    [label="callback 2"];
> > > >      running     -> standby    [label="callback 3"];
> > > >      pending     -> scheduling [label="irq 1"];
> > > >      scheduling  -> standby    [label="irq 2"];
> > > >      scheduling  -> scheduled  [label="irq 3"];
> > > >      standby     -> pending    [label="acquire_ctx"];
> > > >
> > > >      freed      -> freed [label="cancel_and_free"];
> > > >      pending    -> freed [label="cancel_and_free"];
> > > >      running    -> freed [label="cancel_and_free"];
> > > >      scheduled  -> freed [label="cancel_and_free"];
> > > >      scheduling -> freed [label="cancel_and_free"];
> > > >      standby    -> freed [label="cancel_and_free"];
> > > >    }
> > > >
> > > > [...]
> > > >
> > > I'll update the description to contain proper graph.
> >
> > Hm... I like the main linear chain of state transitions as is, tbh.
> > It's fundamentally simple and helps get the general picture. Sure,
> > there are important details, but I don't think we should overwhelm
> > anyone reading with all of this upfront.
> >
> > In the above, "callback 1" and so on is not really helpful for
> > understanding, IMO.
>
> I'm not suggesting to take the above graphviz spec as a description.
> Just pointing out that the current message is misleading.
>
> > I'd just add the note that a) each state can transition to FREED and
> > b) with tiny probability we might skip SCHEDULED and go SCHEDULING ->
> > RUNNING (extremely unlikely if at all possible, tbh).
> >
> > In short, let's not go too detailed here.
>
> As you see fit. Transition graph is easier to read for me, but maybe
> other people are better at reading text.
>
> >
> > > > > Flow of successful task work scheduling
> > > > >   1) bpf_task_work_schedule_* is called from BPF code.
> > > > >   2) Transition state from STANDBY to PENDING, marks context is owned by
> > > > >   this task work scheduler
> > > > >   3) irq_work_queue() schedules bpf_task_work_irq().
> > > > >   4) Transition state from PENDING to SCHEDULING.
> > > > >   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> > > > >   transitions to SCHEDULED.
> > > > Nit: "4" repeated two times.
> > > >
> > > > >   5) Task work calls bpf_task_work_callback(), which transition state to
> > > > >   RUNNING.
> > > > >   6) BPF callback is executed
> > > > >   7) Context is cleaned up, refcounts released, context state set back to
> > > > >   STANDBY.
> > > > >
> > > > > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > > > > ---
> > > > >   kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
> > > > >   1 file changed, 317 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > > > index 109cb249e88c..418a0a211699 100644
> > > > > --- a/kernel/bpf/helpers.c
> > > > > +++ b/kernel/bpf/helpers.c
> > > > [...]
> > > >
> > > > > +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
> > > > > +{
> > > > > +    /*
> > > > > +     * Scheduled task_work callback holds ctx ref, so if we successfully
> > > > > +     * cancelled, we put that ref on callback's behalf. If we couldn't
> > > > > +     * cancel, callback is inevitably run or has already completed
> > > > > +     * running, and it would have taken care of its ctx ref itself.
> > > > > +     */
> > > > > +    if (task_work_cancel_match(ctx->task, task_work_match, ctx))
> > > > Will `task_work_cancel(ctx->task, ctx->work)` do the same thing here?
> > > I think so, yes, thanks for checking.
> > > >
> > > > > +            bpf_task_work_ctx_put(ctx);
> > > > > +}
> > > > [...]
> > > >
> > > > > +static void bpf_task_work_irq(struct irq_work *irq_work)
> > > > > +{
> > > > > +    struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> > > > > +    enum bpf_task_work_state state;
> > > > > +    int err;
> > > > > +
> > > > > +    guard(rcu_tasks_trace)();
> > > > > +
> > > > > +    if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> > > > > +            bpf_task_work_ctx_put(ctx);
> > > > > +            return;
> > > > > +    }
> > > > Why are separate PENDING and SCHEDULING states needed?
> > > > Both indicate that the work has not yet been submitted to
> > > > task_work_add() but is ready to be. So, at first glance it seems
> > > > that merging the two won't change the behaviour; what do I miss?
> > > Yes, this is right, we may drop SCHEDULING state, it does not change any
> > > behavior compared to PENDING.
> > > The state check before task_work_add is needed anyway, so we won't
> > > remove much code here.
> > > I kept it just to be more consistent: with every state check we also
> > > transition the state machine forward.
> >
> > Yeah, I like this property as well, I think it makes it easier to
> > reason about all this. I'd keep the PENDING and SCHEDULING
> > distinction, unless there is a strong reason not to.
> >
> > It also gives us a natural point to check for FREED before doing
> > unnecessary task_work scheduling + cancelling (if we were already in
> > FREED). It doesn't seem like we'll simplify anything by SCHEDULING (or
> > PENDING) state.
>
> Again, people read these things differently, but it took me some time
> to figure out whether I was missing some detail or SCHEDULING is there
> just for the sake of it.

Another reason I realized this morning, besides just a subjective "it
feels right", is that we'll want to optimize away the need for
irq_work (for timers it's going to be important) if we are in a more
permissive context (not NMI, IRQs not disabled, stuff like that), and
for that probably the simplest way would be to switch state into
SCHEDULING directly, bypassing PENDING altogether. So we'll need to
have two distinct states to keep the code simple.
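
To illustrate what I mean (a rough sketch only, refcounting elided; the
function name and the bpf_task_work_can_skip_irq_work() predicate are
made up, standing in for "not NMI, IRQs not disabled, etc.", the rest
reuses names from this patch):

	static int bpf_task_work_try_schedule(struct bpf_task_work_ctx *ctx)
	{
		if (!bpf_task_work_can_skip_irq_work()) {
			/* restricted context: keep the current STANDBY -> PENDING + irq_work hop */
			if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY)
				return -EBUSY;
			irq_work_queue(&ctx->irq_work);
			return 0;
		}

		/* permissive context: claim ctx and go straight to SCHEDULING, no irq_work */
		if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_SCHEDULING) != BPF_TW_STANDBY)
			return -EBUSY;
		return task_work_add(ctx->task, &ctx->work, ctx->mode);
	}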

>
> > > >
> > > > > +    err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> > > > > +    if (err) {
> > > > > +            bpf_task_work_ctx_reset(ctx);
> > > > > +            /*
> > > > > +             * try to switch back to STANDBY for another task_work reuse, but we might have
> > > > > +             * gone to FREED already, which is fine as we already cleaned up after ourselves
> > > > > +             */
> > > > > +            (void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> > > > > +
> > > > > +            /* we don't have RCU protection, so put after switching state */
> > > > > +            bpf_task_work_ctx_put(ctx);
> > > > > +    }
> > > > > +
> >
> > [...]
> >
> > > > > +
> > > > > +    return ctx;
> > > > > +}
> > > > > +
> > > > > +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> > > > > +                              struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> > > > > +                              struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> > > > > +{
> > > > > +    struct bpf_prog *prog;
> > > > > +    struct bpf_task_work_ctx *ctx;
> > > > > +    int err;
> > > > > +
> > > > > +    BTF_TYPE_EMIT(struct bpf_task_work);
> > > > > +
> > > > > +    prog = bpf_prog_inc_not_zero(aux->prog);
> > > > > +    if (IS_ERR(prog))
> > > > > +            return -EBADF;
> > > > > +    task = bpf_task_acquire(task);
> > > > > +    if (!task) {
> > > > > +            err = -EPERM;
> > > > Nit: Why -EPERM? bpf_task_acquire() returns NULL if task->rcu_users
> > > >       is zero, does not seem to be permission related.
> > > Right, this probably should be -EBADF.
> >
> > timer and wq (and now task_work) return -EPERM for map->usercnt==0
> > check, but we have -EBADF for prog refcount being zero. It's a bit all
> > over the place... I don't have a strong preference, but it would be
> > nice to stay more or less consistent for all these "it's too late"
> > conditions, IMO.
>
> Ok, being consistently wrong is probably the better option here;
> let's stick with -EPERM.
>
> [...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing
  2025-09-05 23:09   ` Eduard Zingerman
@ 2025-09-15 15:59     ` Mykyta Yatsenko
  2025-09-15 20:12       ` Andrii Nakryiko
  0 siblings, 1 reply; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-15 15:59 UTC (permalink / raw)
  To: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On 9/6/25 00:09, Eduard Zingerman wrote:
> On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
>
> [...]
>
>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
>> index 3d080916faf9..4130d8e76dff 100644
>> --- a/kernel/bpf/arraymap.c
>> +++ b/kernel/bpf/arraymap.c
> [...]
>
>> @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
>>   	/* We don't reset or free fields other than timer and workqueue
>>   	 * on uref dropping to zero.
>>   	 */
>> -	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
>> +	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
> I think that hashtab.c:htab_free_internal_structs needs to be renamed
> and called here, thus avoiding code duplication.
Sorry for the delayed follow-up on this, I was just trying to do it.
I'm not sure it is possible to reuse anything from hashtab in arraymap
at the moment, as there is no header file for hashtab.
If we are going to introduce a new file to facilitate code reuse
between maps, maybe we should go for map_intern_helpers.c/h or
something like that. WDYT?
>
>>   		for (i = 0; i < array->map.max_entries; i++) {
>>   			if (btf_record_has_field(map->record, BPF_TIMER))
>>   				bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
>>   			if (btf_record_has_field(map->record, BPF_WORKQUEUE))
>>   				bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
>> +			if (btf_record_has_field(map->record, BPF_TASK_WORK))
>> +				bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
>>   		}
>>   	}
>>   }
> [...]
>
>> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
>> index a1a9bc589518..73ca21911b30 100644
>> --- a/kernel/bpf/btf.c
>> +++ b/kernel/bpf/btf.c
> [...]
>
>> @@ -4034,6 +4037,10 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
>>   		case BPF_LIST_NODE:
>>   		case BPF_RB_NODE:
>>   			break;
>> +		case BPF_TASK_WORK:
>> +			WARN_ON_ONCE(rec->task_work_off >= 0);
>> +			rec->task_work_off = rec->fields[i].offset;
>> +			break;
> Nit: let's move this case up to BPF_WORKQUEUE or BPF_REFCOUNT,
>       so that similar cases are grouped together.
>
>>   		default:
>>   			ret = -EFAULT;
>>   			goto end;
> [...]
>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 0fbfa8532c39..7da1ca893dfe 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
> [...]
>
>> @@ -840,6 +849,9 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>   				continue;
>>   			bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
>>   			break;
>> +		case BPF_TASK_WORK:
>> +			bpf_task_work_cancel_and_free(field_ptr);
>> +			break;
> Nit: same here, let's keep similar cases together.
>
>>   		case BPF_LIST_NODE:
>>   		case BPF_RB_NODE:
>>   		case BPF_REFCOUNT:
> [...]
>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index a5d19a01d488..6152536a834f 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -2240,6 +2240,8 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
>>   				reg->map_uid = reg->id;
>>   			if (btf_record_has_field(map->inner_map_meta->record, BPF_WORKQUEUE))
>>   				reg->map_uid = reg->id;
>> +			if (btf_record_has_field(map->inner_map_meta->record, BPF_TASK_WORK))
>> +				reg->map_uid = reg->id;
> Nit: this can be shorter:
>
> 			if (btf_record_has_field(map->inner_map_meta->record,
> 						 BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK))
> 				reg->map_uid = reg->id;
>
>
>>   		} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
>>   			reg->type = PTR_TO_XDP_SOCK;
>>   		} else if (map->map_type == BPF_MAP_TYPE_SOCKMAP ||
> [...]
>
>> @@ -10943,6 +10956,35 @@ static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
>>   	return 0;
>>   }
>>   
>> +static int set_task_work_schedule_callback_state(struct bpf_verifier_env *env,
>> +						 struct bpf_func_state *caller,
>> +						 struct bpf_func_state *callee,
>> +						 int insn_idx)
>> +{
>> +	struct bpf_map *map_ptr = caller->regs[BPF_REG_3].map_ptr;
>> +
>> +	/*
>> +	 * callback_fn(struct bpf_map *map, void *key, void *value);
>> +	 */
>> +	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
>> +	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
>> +	callee->regs[BPF_REG_1].map_ptr = map_ptr;
>> +
>> +	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
>> +	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
>> +	callee->regs[BPF_REG_2].map_ptr = map_ptr;
>> +
>> +	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
>> +	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
>> +	callee->regs[BPF_REG_3].map_ptr = map_ptr;
>> +
>> +	/* unused */
>> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
>> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
>> +	callee->in_callback_fn = true;
> This should be `callee->in_async_callback_fn = true;` to avoid an
> infinite loop check in is_state_visited() in some cases.
>
>> +	return 0;
>> +}
>> +
>>   static bool is_rbtree_lock_required_kfunc(u32 btf_id);
>>   
>>   /* Are we currently verifying the callback for a rbtree helper that must
> [...]
>
>> @@ -13171,6 +13235,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>>   					return -EINVAL;
>>   				}
>>   			}
>> +			if (meta->map.ptr && reg->map_ptr->record->task_work_off >= 0) {
>> +				if (meta->map.ptr != reg->map_ptr ||
>> +				    meta->map.uid != reg->map_uid) {
>> +					verbose(env,
>> +						"bpf_task_work pointer in R2 map_uid=%d doesn't match map pointer in R3 map_uid=%d\n",
>> +						meta->map.uid, reg->map_uid);
>> +					return -EINVAL;
>> +				}
>> +			}
> Please merge this with the case for wq_off above.
>
>>   			meta->map.ptr = reg->map_ptr;
>>   			meta->map.uid = reg->map_uid;
>>   			fallthrough;
> [...]


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing
  2025-09-15 15:59     ` Mykyta Yatsenko
@ 2025-09-15 20:12       ` Andrii Nakryiko
  2025-09-15 20:20         ` Mykyta Yatsenko
  0 siblings, 1 reply; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-15 20:12 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Mon, Sep 15, 2025 at 8:59 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 9/6/25 00:09, Eduard Zingerman wrote:
> > On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> >
> > [...]
> >
> >> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> >> index 3d080916faf9..4130d8e76dff 100644
> >> --- a/kernel/bpf/arraymap.c
> >> +++ b/kernel/bpf/arraymap.c
> > [...]
> >
> >> @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
> >>      /* We don't reset or free fields other than timer and workqueue
> >>       * on uref dropping to zero.
> >>       */
> >> -    if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
> >> +    if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
> > I think that hashtab.c:htab_free_internal_structs needs to be renamed
> > and called here, thus avoiding code duplication.
> Sorry for the delayed follow-up on this, I was just trying to do it.
> I'm not sure it is possible to reuse anything from hashtab in arraymap
> at the moment, as there is no header file for hashtab.
> If we are going to introduce a new file to facilitate code reuse
> between maps, maybe we should go for map_intern_helpers.c/h or
> something like that. WDYT?

no need for new files, just use include/linux/bpf.h (internal header)
and kernel/bpf/helpers.c or kernel/bpf/syscall.c (whichever makes more
sense and contains other map-related helpers)
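
Something along these lines is what I have in mind (the helper name is
made up, just to show the shape; the body is the same set of per-element
frees both maps already do, including the bpf_obj_free_task_work() added
by this series):

	/* e.g. in kernel/bpf/helpers.c, declared in include/linux/bpf.h */
	void bpf_map_free_internal_structs(struct bpf_map *map, void *elem)
	{
		if (btf_record_has_field(map->record, BPF_TIMER))
			bpf_obj_free_timer(map->record, elem);
		if (btf_record_has_field(map->record, BPF_WORKQUEUE))
			bpf_obj_free_workqueue(map->record, elem);
		if (btf_record_has_field(map->record, BPF_TASK_WORK))
			bpf_obj_free_task_work(map->record, elem);
	}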

> >
> >>              for (i = 0; i < array->map.max_entries; i++) {
> >>                      if (btf_record_has_field(map->record, BPF_TIMER))
> >>                              bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
> >>                      if (btf_record_has_field(map->record, BPF_WORKQUEUE))
> >>                              bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
> >> +                    if (btf_record_has_field(map->record, BPF_TASK_WORK))
> >> +                            bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
> >>              }
> >>      }
> >>   }
> > [...]
> >

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing
  2025-09-15 20:12       ` Andrii Nakryiko
@ 2025-09-15 20:20         ` Mykyta Yatsenko
  2025-09-15 20:28           ` Andrii Nakryiko
  0 siblings, 1 reply; 44+ messages in thread
From: Mykyta Yatsenko @ 2025-09-15 20:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On 9/15/25 21:12, Andrii Nakryiko wrote:
> On Mon, Sep 15, 2025 at 8:59 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> On 9/6/25 00:09, Eduard Zingerman wrote:
>>> On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
>>>
>>> [...]
>>>
>>>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
>>>> index 3d080916faf9..4130d8e76dff 100644
>>>> --- a/kernel/bpf/arraymap.c
>>>> +++ b/kernel/bpf/arraymap.c
>>> [...]
>>>
>>>> @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
>>>>       /* We don't reset or free fields other than timer and workqueue
>>>>        * on uref dropping to zero.
>>>>        */
>>>> -    if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
>>>> +    if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
>>> I think that hashtab.c:htab_free_internal_structs needs to be renamed
>>> and called here, thus avoiding code duplication.
>> Sorry for the delayed follow-up on this, I was just trying to do it.
>> I'm not sure it is possible to reuse anything from hashtab in arraymap
>> at the moment, as there is no header file for hashtab.
>> If we are going to introduce a new file to facilitate code reuse
>> between maps, maybe we should go for map_intern_helpers.c/h or
>> something like that. WDYT?
> no need for new files, just use include/linux/bpf.h (internal header)
> and kernel/bpf/helpers.c or kernel/bpf/syscall.c (whichever makes more
> sense and contains other map-related helpers)
Thanks, I've just sent v4 without this. Will include in v5 or send 
refactoring in a separate patch.
>
>>>>               for (i = 0; i < array->map.max_entries; i++) {
>>>>                       if (btf_record_has_field(map->record, BPF_TIMER))
>>>>                               bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
>>>>                       if (btf_record_has_field(map->record, BPF_WORKQUEUE))
>>>>                               bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
>>>> +                    if (btf_record_has_field(map->record, BPF_TASK_WORK))
>>>> +                            bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
>>>>               }
>>>>       }
>>>>    }
>>> [...]
>>>
> [...]


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing
  2025-09-15 20:20         ` Mykyta Yatsenko
@ 2025-09-15 20:28           ` Andrii Nakryiko
  0 siblings, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-15 20:28 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, Mykyta Yatsenko

On Mon, Sep 15, 2025 at 1:20 PM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 9/15/25 21:12, Andrii Nakryiko wrote:
> > On Mon, Sep 15, 2025 at 8:59 AM Mykyta Yatsenko
> > <mykyta.yatsenko5@gmail.com> wrote:
> >> On 9/6/25 00:09, Eduard Zingerman wrote:
> >>> On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> >>>
> >>> [...]
> >>>
> >>>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> >>>> index 3d080916faf9..4130d8e76dff 100644
> >>>> --- a/kernel/bpf/arraymap.c
> >>>> +++ b/kernel/bpf/arraymap.c
> >>> [...]
> >>>
> >>>> @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
> >>>>       /* We don't reset or free fields other than timer and workqueue
> >>>>        * on uref dropping to zero.
> >>>>        */
> >>>> -    if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
> >>>> +    if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
> >>> I think that hashtab.c:htab_free_internal_structs needs to be renamed
> >>> and called here, thus avoiding code duplication.
> >> Sorry for the delayed follow-up on this, I was just trying to do it.
> >> I'm not sure it is possible to reuse anything from hashtab in arraymap
> >> at the moment, as there is no header file for hashtab.
> >> If we are going to introduce a new file to facilitate code reuse
> >> between maps, maybe we should go for map_intern_helpers.c/h or
> >> something like that. WDYT?
> > no need for new files, just use include/linux/bpf.h (internal header)
> > and kernel/bpf/helpers.c or kernel/bpf/syscall.c (whichever makes more
> > sense and contains other map-related helpers)
> Thanks, I've just sent v4 without this. Will include in v5 or send
> refactoring in a separate patch.

I'd suggest a follow-up patch with more refactoring; let's give people
a chance to review (and hopefully ack and land) v4 as is, thanks!

> >
> >>>>               for (i = 0; i < array->map.max_entries; i++) {
> >>>>                       if (btf_record_has_field(map->record, BPF_TIMER))
> >>>>                               bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
> >>>>                       if (btf_record_has_field(map->record, BPF_WORKQUEUE))
> >>>>                               bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
> >>>> +                    if (btf_record_has_field(map->record, BPF_TASK_WORK))
> >>>> +                            bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
> >>>>               }
> >>>>       }
> >>>>    }
> >>> [...]
> >>>
> > [...]
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread

Thread overview: 44+ messages
2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
2025-09-05 16:44 ` [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection Mykyta Yatsenko
2025-09-05 19:36   ` Eduard Zingerman
2025-09-05 21:29   ` Andrii Nakryiko
2025-09-05 16:45 ` [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func() Mykyta Yatsenko
2025-09-05 21:15   ` Eduard Zingerman
2025-09-05 21:28   ` Eduard Zingerman
2025-09-05 21:31     ` Andrii Nakryiko
2025-09-05 21:32       ` Eduard Zingerman
2025-09-05 21:29   ` Andrii Nakryiko
2025-09-05 16:45 ` [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-05 21:31   ` Eduard Zingerman
2025-09-05 16:45 ` [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-05 23:09   ` Eduard Zingerman
2025-09-15 15:59     ` Mykyta Yatsenko
2025-09-15 20:12       ` Andrii Nakryiko
2025-09-15 20:20         ` Mykyta Yatsenko
2025-09-15 20:28           ` Andrii Nakryiko
2025-09-05 16:45 ` [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-05 23:19   ` Eduard Zingerman
2025-09-08 13:39     ` Mykyta Yatsenko
2025-09-08 17:18       ` Eduard Zingerman
2025-09-05 16:45 ` [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-06 20:22   ` Eduard Zingerman
2025-09-08 13:13     ` Mykyta Yatsenko
2025-09-08 17:38       ` Eduard Zingerman
2025-09-09  3:42         ` Andrii Nakryiko
2025-09-09  4:15           ` Eduard Zingerman
2025-09-09  3:33       ` Andrii Nakryiko
2025-09-09  4:05         ` Eduard Zingerman
2025-09-10 14:14           ` Andrii Nakryiko
2025-09-09 17:49   ` Chris Mason
2025-09-09 18:59     ` Mykyta Yatsenko
2025-09-05 16:45 ` [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-08  7:43   ` Eduard Zingerman
2025-09-08 13:21     ` Mykyta Yatsenko
2025-09-08 18:23       ` Eduard Zingerman
2025-09-09  3:44         ` Andrii Nakryiko
2025-09-08 18:23   ` Eduard Zingerman
