bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v2 0/4] bpf: Introduce deferred task context execution
@ 2025-08-15 19:21 Mykyta Yatsenko
  2025-08-15 19:21 ` [PATCH bpf-next v2 1/4] bpf: bpf task work plumbing Mykyta Yatsenko
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Mykyta Yatsenko @ 2025-08-15 19:21 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

This patch series introduces a new mechanism for BPF programs to
schedule deferred execution in the context of a specific task using the
kernel’s task_work infrastructure.

The new bpf_task_work interface enables BPF use cases that require
sleepable subprogram execution within task context, for example
scheduling a sleepable callback from a context that does not allow
sleeping, such as NMI.

The new kfuncs bpf_task_work_schedule_signal() and
bpf_task_work_schedule_resume() schedule BPF callbacks and correspond
to the different notification modes used by task_work (TWA_SIGNAL and
TWA_RESUME/TWA_NMI_CURRENT respectively).

The implementation manages scheduling state via metadata objects
(struct bpf_task_work_context). Pointers to bpf_task_work_context are
stored in BPF map values. State transitions are handled via an atomic
state machine (enum bpf_task_work_state) to ensure correctness under
concurrent usage and deletion.
The kfuncs call task_work_add() indirectly via irq_work to avoid taking
locks in a potentially NMI context.
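
For reference, below is a minimal usage sketch from the BPF side,
condensed from the selftests in patch 4. It assumes a vmlinux.h
generated from a kernel with this series applied; map, program and
callback names are illustrative:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct elem {
	char data[128];
	struct bpf_task_work tw;	/* kernel-managed bookkeeping */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct elem);
} arrmap SEC(".maps");

extern int bpf_task_work_schedule_resume(struct task_struct *task,
					 struct bpf_task_work *tw,
					 struct bpf_map *map,
					 void (*callback)(struct bpf_map *map,
							  void *key, void *value),
					 void *aux__prog) __ksym;

/* Runs later in the context of the target task */
static void process_work(struct bpf_map *map, void *key, void *value)
{
	struct elem *work = value;

	/* sleepable work, e.g. copying from user memory, can go here */
	work->data[0] = 1;
}

int key = 0;

SEC("perf_event")
int schedule_task_work(struct pt_regs *regs)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct elem *work = bpf_map_lookup_elem(&arrmap, &key);

	if (!work)
		return 0;
	bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&arrmap,
				      process_work, NULL);
	return 0;
}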

Mykyta Yatsenko (4):
  bpf: bpf task work plumbing
  bpf: extract map key pointer calculation
  bpf: task work scheduling kfuncs
  selftests/bpf: BPF task work scheduling tests

 include/linux/bpf.h                           |  11 +
 include/uapi/linux/bpf.h                      |   4 +
 kernel/bpf/arraymap.c                         |   8 +-
 kernel/bpf/btf.c                              |  15 +
 kernel/bpf/hashtab.c                          |  22 +-
 kernel/bpf/helpers.c                          | 325 +++++++++++++++++-
 kernel/bpf/syscall.c                          |  23 +-
 kernel/bpf/verifier.c                         | 127 ++++++-
 tools/include/uapi/linux/bpf.h                |   4 +
 .../selftests/bpf/prog_tests/test_task_work.c | 149 ++++++++
 tools/testing/selftests/bpf/progs/task_work.c | 108 ++++++
 .../selftests/bpf/progs/task_work_fail.c      |  98 ++++++
 12 files changed, 863 insertions(+), 31 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_work.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c

-- 
2.50.1



* [PATCH bpf-next v2 1/4] bpf: bpf task work plumbing
  2025-08-15 19:21 [PATCH bpf-next v2 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko
@ 2025-08-15 19:21 ` Mykyta Yatsenko
  2025-08-19  1:34   ` Eduard Zingerman
  2025-08-15 19:21 ` [PATCH bpf-next v2 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Mykyta Yatsenko @ 2025-08-15 19:21 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

This patch adds the necessary plumbing in the verifier, syscall and map
code to support the new bpf_task_work_schedule kfuncs and the kernel
structure bpf_task_work. The idea is similar to how we already handle
bpf_wq and bpf_timer.
The verifier changes validate calls to bpf_task_work_schedule to make
sure they are safe and the expected invariants hold.
The BTF changes are required to detect the bpf_task_work structure
inside a map value and store its offset, which is used later in the
series to calculate key and value addresses.
The arraymap and hashtab changes are needed to handle freeing of
bpf_task_work: run the code needed to deinitialize it, for example
cancelling the task_work callback if possible.
The actual use of bpf_task_work and the full implementation of the
kfuncs are introduced later in the series.
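
For example, given a map value like the one used in the selftests later
in the series, BTF parsing records the offset of the embedded struct
bpf_task_work in the map's btf_record (task_work_off); the verifier and
the later kfunc implementation rely on that offset:

struct elem {
	char data[128];
	struct bpf_task_work tw; /* offset stored in map->record->task_work_off */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct elem);
} hmap SEC(".maps");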

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 include/linux/bpf.h            |  11 +++
 include/uapi/linux/bpf.h       |   4 ++
 kernel/bpf/arraymap.c          |   8 ++-
 kernel/bpf/btf.c               |  15 ++++
 kernel/bpf/hashtab.c           |  22 +++---
 kernel/bpf/helpers.c           |  45 ++++++++++++
 kernel/bpf/syscall.c           |  23 +++++-
 kernel/bpf/verifier.c          | 127 ++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h |   4 ++
 9 files changed, 245 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e7ee089e8a31..97ce8d08edb6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -206,6 +206,7 @@ enum btf_field_type {
 	BPF_WORKQUEUE  = (1 << 10),
 	BPF_UPTR       = (1 << 11),
 	BPF_RES_SPIN_LOCK = (1 << 12),
+	BPF_TASK_WORK  = (1 << 13),
 };
 
 enum bpf_cgroup_storage_type {
@@ -259,6 +260,7 @@ struct btf_record {
 	int timer_off;
 	int wq_off;
 	int refcount_off;
+	int task_work_off;
 	struct btf_field fields[];
 };
 
@@ -358,6 +360,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 		return "bpf_rb_node";
 	case BPF_REFCOUNT:
 		return "bpf_refcount";
+	case BPF_TASK_WORK:
+		return "bpf_task_work";
 	default:
 		WARN_ON_ONCE(1);
 		return "unknown";
@@ -396,6 +400,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 		return sizeof(struct bpf_rb_node);
 	case BPF_REFCOUNT:
 		return sizeof(struct bpf_refcount);
+	case BPF_TASK_WORK:
+		return sizeof(struct bpf_task_work);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -428,6 +434,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 		return __alignof__(struct bpf_rb_node);
 	case BPF_REFCOUNT:
 		return __alignof__(struct bpf_refcount);
+	case BPF_TASK_WORK:
+		return __alignof__(struct bpf_task_work);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -459,6 +467,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
 	case BPF_UPTR:
+	case BPF_TASK_WORK:
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -595,6 +604,7 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
 void bpf_timer_cancel_and_free(void *timer);
 void bpf_wq_cancel_and_free(void *timer);
+void bpf_task_work_cancel_and_free(void *timer);
 void bpf_list_head_free(const struct btf_field *field, void *list_head,
 			struct bpf_spin_lock *spin_lock);
 void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
@@ -2412,6 +2422,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec);
 bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *rec_b);
 void bpf_obj_free_timer(const struct btf_record *rec, void *obj);
 void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj);
+void bpf_obj_free_task_work(const struct btf_record *rec, void *obj);
 void bpf_obj_free_fields(const struct btf_record *rec, void *obj);
 void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 233de8677382..3856a71cb7db 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7418,6 +7418,10 @@ struct bpf_timer {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
 
+struct bpf_task_work {
+	__u64 ctx;
+} __attribute__((aligned(8)));
+
 struct bpf_wq {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 3d080916faf9..4130d8e76dff 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -431,7 +431,7 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
 	return (void *)round_down((unsigned long)array, PAGE_SIZE);
 }
 
-static void array_map_free_timers_wq(struct bpf_map *map)
+static void array_map_free_internal_structs(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
@@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
 	/* We don't reset or free fields other than timer and workqueue
 	 * on uref dropping to zero.
 	 */
-	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
+	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
 		for (i = 0; i < array->map.max_entries; i++) {
 			if (btf_record_has_field(map->record, BPF_TIMER))
 				bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
 			if (btf_record_has_field(map->record, BPF_WORKQUEUE))
 				bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
+			if (btf_record_has_field(map->record, BPF_TASK_WORK))
+				bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
 		}
 	}
 }
@@ -783,7 +785,7 @@ const struct bpf_map_ops array_map_ops = {
 	.map_alloc = array_map_alloc,
 	.map_free = array_map_free,
 	.map_get_next_key = array_map_get_next_key,
-	.map_release_uref = array_map_free_timers_wq,
+	.map_release_uref = array_map_free_internal_structs,
 	.map_lookup_elem = array_map_lookup_elem,
 	.map_update_elem = array_map_update_elem,
 	.map_delete_elem = array_map_delete_elem,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 64739308902f..378f260235dd 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3527,6 +3527,15 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
 			goto end;
 		}
 	}
+	if (field_mask & BPF_TASK_WORK) {
+		if (!strcmp(name, "bpf_task_work")) {
+			if (*seen_mask & BPF_TASK_WORK)
+				return -E2BIG;
+			*seen_mask |= BPF_TASK_WORK;
+			type = BPF_TASK_WORK;
+			goto end;
+		}
+	}
 	field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head");
 	field_mask_test_name(BPF_LIST_NODE, "bpf_list_node");
 	field_mask_test_name(BPF_RB_ROOT,   "bpf_rb_root");
@@ -3693,6 +3702,7 @@ static int btf_find_field_one(const struct btf *btf,
 	case BPF_LIST_NODE:
 	case BPF_RB_NODE:
 	case BPF_REFCOUNT:
+	case BPF_TASK_WORK:
 		ret = btf_find_struct(btf, var_type, off, sz, field_type,
 				      info_cnt ? &info[0] : &tmp);
 		if (ret < 0)
@@ -3985,6 +3995,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 	rec->timer_off = -EINVAL;
 	rec->wq_off = -EINVAL;
 	rec->refcount_off = -EINVAL;
+	rec->task_work_off = -EINVAL;
 	for (i = 0; i < cnt; i++) {
 		field_type_size = btf_field_type_size(info_arr[i].type);
 		if (info_arr[i].off + field_type_size > value_size) {
@@ -4050,6 +4061,10 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 			break;
+		case BPF_TASK_WORK:
+			WARN_ON_ONCE(rec->task_work_off >= 0);
+			rec->task_work_off = rec->fields[i].offset;
+			break;
 		default:
 			ret = -EFAULT;
 			goto end;
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 71f9931ac64c..207ad4823b5b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -215,7 +215,7 @@ static bool htab_has_extra_elems(struct bpf_htab *htab)
 	return !htab_is_percpu(htab) && !htab_is_lru(htab) && !is_fd_htab(htab);
 }
 
-static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
+static void htab_free_prealloced_internal_structs(struct bpf_htab *htab)
 {
 	u32 num_entries = htab->map.max_entries;
 	int i;
@@ -233,6 +233,9 @@ static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
 		if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
 			bpf_obj_free_workqueue(htab->map.record,
 					       htab_elem_value(elem, htab->map.key_size));
+		if (btf_record_has_field(htab->map.record, BPF_TASK_WORK))
+			bpf_obj_free_task_work(htab->map.record,
+					       htab_elem_value(elem, htab->map.key_size));
 		cond_resched();
 	}
 }
@@ -1490,7 +1493,7 @@ static void delete_all_elements(struct bpf_htab *htab)
 	}
 }
 
-static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab)
+static void htab_free_malloced_internal_structs(struct bpf_htab *htab)
 {
 	int i;
 
@@ -1508,22 +1511,25 @@ static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab)
 			if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
 				bpf_obj_free_workqueue(htab->map.record,
 						       htab_elem_value(l, htab->map.key_size));
+			if (btf_record_has_field(htab->map.record, BPF_TASK_WORK))
+				bpf_obj_free_task_work(htab->map.record,
+						       htab_elem_value(l, htab->map.key_size));
 		}
 		cond_resched_rcu();
 	}
 	rcu_read_unlock();
 }
 
-static void htab_map_free_timers_and_wq(struct bpf_map *map)
+static void htab_map_free_internal_structs(struct bpf_map *map)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 
 	/* We only free timer and workqueue on uref dropping to zero */
-	if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE)) {
+	if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
 		if (!htab_is_prealloc(htab))
-			htab_free_malloced_timers_and_wq(htab);
+			htab_free_malloced_internal_structs(htab);
 		else
-			htab_free_prealloced_timers_and_wq(htab);
+			htab_free_prealloced_internal_structs(htab);
 	}
 }
 
@@ -2255,7 +2261,7 @@ const struct bpf_map_ops htab_map_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
-	.map_release_uref = htab_map_free_timers_and_wq,
+	.map_release_uref = htab_map_free_internal_structs,
 	.map_lookup_elem = htab_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem,
 	.map_update_elem = htab_map_update_elem,
@@ -2276,7 +2282,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
-	.map_release_uref = htab_map_free_timers_and_wq,
+	.map_release_uref = htab_map_free_internal_structs,
 	.map_lookup_elem = htab_lru_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_lru_map_lookup_and_delete_elem,
 	.map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 6b4877e85a68..144877df6d02 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3703,8 +3703,53 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
 	return bpf_strnstr(s1__ign, s2__ign, XATTR_SIZE_MAX);
 }
 
+typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
+
+/**
+ * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
+ * @task: Task struct for which callback should be scheduled
+ * @tw: Pointer to the bpf_task_work struct, to use by kernel internally for bookkeeping
+ * @map__map: bpf_map which contains bpf_task_work in one of the values
+ * @callback: pointer to BPF subprogram to call
+ * @aux__prog: user should pass NULL
+ *
+ * Return: 0 if task work has been scheduled successfully, negative error code otherwise
+ */
+__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
+					      struct bpf_task_work *tw,
+					      struct bpf_map *map__map,
+					      bpf_task_work_callback_t callback,
+					      void *aux__prog)
+{
+	return 0;
+}
+
+/**
+ * bpf_task_work_schedule_resume - Schedule BPF callback using task_work_add with TWA_RESUME or
+ * TWA_NMI_CURRENT mode if scheduling for the current task in the NMI
+ * @task: Task struct for which callback should be scheduled
+ * @tw: Pointer to the bpf_task_work struct, to use by kernel internally for bookkeeping
+ * @map__map: bpf_map which contains bpf_task_work in one of the values
+ * @callback: pointer to BPF subprogram to call
+ * @aux__prog: user should pass NULL
+ *
+ * Return: 0 if task work has been scheduled successfully, negative error code otherwise
+ */
+__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
+					      struct bpf_task_work *tw,
+					      struct bpf_map *map__map,
+					      bpf_task_work_callback_t callback,
+					      void *aux__prog)
+{
+	return 0;
+}
+
 __bpf_kfunc_end_defs();
 
+void bpf_task_work_cancel_and_free(void *val)
+{
+}
+
 BTF_KFUNCS_START(generic_btf_ids)
 #ifdef CONFIG_CRASH_DUMP
 BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fbfa8532c39..108d86f7eeaf 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -672,6 +672,7 @@ void btf_record_free(struct btf_record *rec)
 		case BPF_TIMER:
 		case BPF_REFCOUNT:
 		case BPF_WORKQUEUE:
+		case BPF_TASK_WORK:
 			/* Nothing to release */
 			break;
 		default:
@@ -725,6 +726,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 		case BPF_TIMER:
 		case BPF_REFCOUNT:
 		case BPF_WORKQUEUE:
+		case BPF_TASK_WORK:
 			/* Nothing to acquire */
 			break;
 		default:
@@ -783,6 +785,13 @@ void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj)
 	bpf_wq_cancel_and_free(obj + rec->wq_off);
 }
 
+void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
+{
+	if (WARN_ON_ONCE(!btf_record_has_field(rec, BPF_TASK_WORK)))
+		return;
+	bpf_task_work_cancel_and_free(obj + rec->task_work_off);
+}
+
 void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 {
 	const struct btf_field *fields;
@@ -840,6 +849,9 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 				continue;
 			bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
 			break;
+		case BPF_TASK_WORK:
+			bpf_task_work_cancel_and_free(field_ptr);
+			break;
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 		case BPF_REFCOUNT:
@@ -1237,7 +1249,8 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 
 	map->record = btf_parse_fields(btf, value_type,
 				       BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
-				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR,
+				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR |
+				       BPF_TASK_WORK,
 				       map->value_size);
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
@@ -1309,6 +1322,14 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 					goto free_map_tab;
 				}
 				break;
+			case BPF_TASK_WORK:
+				if (map->map_type != BPF_MAP_TYPE_HASH &&
+				    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+				    map->map_type != BPF_MAP_TYPE_ARRAY) {
+					ret = -EOPNOTSUPP;
+					goto free_map_tab;
+				}
+				break;
 			default:
 				/* Fail if map_type checks are missing for a field type */
 				ret = -EOPNOTSUPP;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a61d57996692..be7a744c7917 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2236,6 +2236,8 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
 				reg->map_uid = reg->id;
 			if (btf_record_has_field(map->inner_map_meta->record, BPF_WORKQUEUE))
 				reg->map_uid = reg->id;
+			if (btf_record_has_field(map->inner_map_meta->record, BPF_TASK_WORK))
+				reg->map_uid = reg->id;
 		} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
 			reg->type = PTR_TO_XDP_SOCK;
 		} else if (map->map_type == BPF_MAP_TYPE_SOCKMAP ||
@@ -8569,6 +8571,44 @@ static int process_wq_func(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_task_work_func(struct bpf_verifier_env *env, int regno,
+				  struct bpf_kfunc_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	struct bpf_map *map = reg->map_ptr;
+	bool is_const = tnum_is_const(reg->var_off);
+	u64 val = reg->var_off.value;
+
+	if (!map->btf) {
+		verbose(env, "map '%s' has to have BTF in order to use bpf_task_work\n",
+			map->name);
+		return -EINVAL;
+	}
+	if (!btf_record_has_field(map->record, BPF_TASK_WORK)) {
+		verbose(env, "map '%s' has no valid bpf_task_work\n", map->name);
+		return -EINVAL;
+	}
+	if (!is_const) {
+		verbose(env,
+			"bpf_task_work has to be at the constant offset\n");
+		return -EINVAL;
+	}
+	if (map->record->task_work_off != val + reg->off) {
+		verbose(env,
+			"off %lld doesn't point to 'struct bpf_task_work' that is at %d\n",
+			val + reg->off, map->record->task_work_off);
+		return -EINVAL;
+	}
+	if (meta->map.ptr) {
+		verifier_bug(env, "Two map pointers in a bpf_task_work kfunc");
+		return -EFAULT;
+	}
+
+	meta->map.uid = reg->map_uid;
+	meta->map.ptr = map;
+	return 0;
+}
+
 static int process_kptr_func(struct bpf_verifier_env *env, int regno,
 			     struct bpf_call_arg_meta *meta)
 {
@@ -10398,6 +10438,8 @@ typedef int (*set_callee_state_fn)(struct bpf_verifier_env *env,
 				   struct bpf_func_state *callee,
 				   int insn_idx);
 
+static bool is_task_work_add_kfunc(u32 func_id);
+
 static int set_callee_state(struct bpf_verifier_env *env,
 			    struct bpf_func_state *caller,
 			    struct bpf_func_state *callee, int insn_idx);
@@ -10616,7 +10658,8 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins
 		env->subprog_info[subprog].is_async_cb = true;
 		async_cb = push_async_cb(env, env->subprog_info[subprog].start,
 					 insn_idx, subprog,
-					 is_bpf_wq_set_callback_impl_kfunc(insn->imm));
+					 is_bpf_wq_set_callback_impl_kfunc(insn->imm) ||
+					 is_task_work_add_kfunc(insn->imm));
 		if (!async_cb)
 			return -EFAULT;
 		callee = async_cb->frame[0];
@@ -10929,6 +10972,35 @@ static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_task_work_schedule_callback_state(struct bpf_verifier_env *env,
+						 struct bpf_func_state *caller,
+						 struct bpf_func_state *callee,
+						 int insn_idx)
+{
+	struct bpf_map *map_ptr = caller->regs[BPF_REG_3].map_ptr;
+
+	/*
+	 * callback_fn(struct bpf_map *map, void *key, void *value);
+	 */
+	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
+	callee->regs[BPF_REG_3].map_ptr = map_ptr;
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static bool is_rbtree_lock_required_kfunc(u32 btf_id);
 
 /* Are we currently verifying the callback for a rbtree helper that must
@@ -12059,6 +12131,7 @@ enum {
 	KF_ARG_RB_NODE_ID,
 	KF_ARG_WORKQUEUE_ID,
 	KF_ARG_RES_SPIN_LOCK_ID,
+	KF_ARG_TASK_WORK_ID,
 };
 
 BTF_ID_LIST(kf_arg_btf_ids)
@@ -12069,6 +12142,7 @@ BTF_ID(struct, bpf_rb_root)
 BTF_ID(struct, bpf_rb_node)
 BTF_ID(struct, bpf_wq)
 BTF_ID(struct, bpf_res_spin_lock)
+BTF_ID(struct, bpf_task_work)
 
 static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
 				    const struct btf_param *arg, int type)
@@ -12117,6 +12191,11 @@ static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg)
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID);
 }
 
+static bool is_kfunc_arg_task_work(const struct btf *btf, const struct btf_param *arg)
+{
+	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TASK_WORK_ID);
+}
+
 static bool is_kfunc_arg_res_spin_lock(const struct btf *btf, const struct btf_param *arg)
 {
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RES_SPIN_LOCK_ID);
@@ -12204,6 +12283,7 @@ enum kfunc_ptr_arg_type {
 	KF_ARG_PTR_TO_WORKQUEUE,
 	KF_ARG_PTR_TO_IRQ_FLAG,
 	KF_ARG_PTR_TO_RES_SPIN_LOCK,
+	KF_ARG_PTR_TO_TASK_WORK,
 };
 
 enum special_kfunc_type {
@@ -12252,6 +12332,8 @@ enum special_kfunc_type {
 	KF_bpf_res_spin_lock_irqsave,
 	KF_bpf_res_spin_unlock_irqrestore,
 	KF___bpf_trap,
+	KF_bpf_task_work_schedule_signal,
+	KF_bpf_task_work_schedule_resume,
 };
 
 BTF_ID_LIST(special_kfunc_list)
@@ -12318,6 +12400,14 @@ BTF_ID(func, bpf_res_spin_unlock)
 BTF_ID(func, bpf_res_spin_lock_irqsave)
 BTF_ID(func, bpf_res_spin_unlock_irqrestore)
 BTF_ID(func, __bpf_trap)
+BTF_ID(func, bpf_task_work_schedule_signal)
+BTF_ID(func, bpf_task_work_schedule_resume)
+
+static bool is_task_work_add_kfunc(u32 func_id)
+{
+	return func_id == special_kfunc_list[KF_bpf_task_work_schedule_signal] ||
+	       func_id == special_kfunc_list[KF_bpf_task_work_schedule_resume];
+}
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -12408,6 +12498,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (is_kfunc_arg_wq(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_WORKQUEUE;
 
+	if (is_kfunc_arg_task_work(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_TASK_WORK;
+
 	if (is_kfunc_arg_irq_flag(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_IRQ_FLAG;
 
@@ -12751,7 +12844,8 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id)
 
 static bool is_async_callback_calling_kfunc(u32 btf_id)
 {
-	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
+	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl] ||
+	       is_task_work_add_kfunc(btf_id);
 }
 
 static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
@@ -13153,6 +13247,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 					return -EINVAL;
 				}
 			}
+			if (meta->map.ptr && reg->map_ptr->record->task_work_off >= 0) {
+				if (meta->map.ptr != reg->map_ptr ||
+				    meta->map.uid != reg->map_uid) {
+					verbose(env,
+						"bpf_task_work pointer in R2 map_uid=%d doesn't match map pointer in R3 map_uid=%d\n",
+						meta->map.uid, reg->map_uid);
+					return -EINVAL;
+				}
+			}
 			meta->map.ptr = reg->map_ptr;
 			meta->map.uid = reg->map_uid;
 			fallthrough;
@@ -13185,6 +13288,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 		case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
 		case KF_ARG_PTR_TO_CONST_STR:
 		case KF_ARG_PTR_TO_WORKQUEUE:
+		case KF_ARG_PTR_TO_TASK_WORK:
 		case KF_ARG_PTR_TO_IRQ_FLAG:
 		case KF_ARG_PTR_TO_RES_SPIN_LOCK:
 			break;
@@ -13476,6 +13580,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			if (ret < 0)
 				return ret;
 			break;
+		case KF_ARG_PTR_TO_TASK_WORK:
+			if (reg->type != PTR_TO_MAP_VALUE) {
+				verbose(env, "arg#%d doesn't point to a map value\n", i);
+				return -EINVAL;
+			}
+			ret = process_task_work_func(env, regno, meta);
+			if (ret < 0)
+				return ret;
+			break;
 		case KF_ARG_PTR_TO_IRQ_FLAG:
 			if (reg->type != PTR_TO_STACK) {
 				verbose(env, "arg#%d doesn't point to an irq flag on stack\n", i);
@@ -13842,6 +13955,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		}
 	}
 
+	if (is_task_work_add_kfunc(meta.func_id)) {
+		err = push_callback_call(env, insn, insn_idx, meta.subprogno,
+					 set_task_work_schedule_callback_state);
+		if (err) {
+			verbose(env, "kfunc %s#%d failed callback verification\n",
+				func_name, meta.func_id);
+			return err;
+		}
+	}
+
 	rcu_lock = is_kfunc_bpf_rcu_read_lock(&meta);
 	rcu_unlock = is_kfunc_bpf_rcu_read_unlock(&meta);
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 233de8677382..3856a71cb7db 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7418,6 +7418,10 @@ struct bpf_timer {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
 
+struct bpf_task_work {
+	__u64 ctx;
+} __attribute__((aligned(8)));
+
 struct bpf_wq {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
-- 
2.50.1



* [PATCH bpf-next v2 2/4] bpf: extract map key pointer calculation
  2025-08-15 19:21 [PATCH bpf-next v2 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko
  2025-08-15 19:21 ` [PATCH bpf-next v2 1/4] bpf: bpf task work plumbing Mykyta Yatsenko
@ 2025-08-15 19:21 ` Mykyta Yatsenko
  2025-08-19 11:05   ` Kumar Kartikeya Dwivedi
  2025-08-19 20:50   ` Andrii Nakryiko
  2025-08-15 19:21 ` [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko
  2025-08-15 19:21 ` [PATCH bpf-next v2 4/4] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
  3 siblings, 2 replies; 23+ messages in thread
From: Mykyta Yatsenko @ 2025-08-15 19:21 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Calculation of the BPF map key, given a pointer to a value, is already
duplicated in a couple of places in helpers.c, and the next patch
introduces yet another use case.
This patch extracts that functionality into a separate function.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 144877df6d02..d2f88a9bc47b 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1084,6 +1084,17 @@ const struct bpf_func_proto bpf_snprintf_proto = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
+static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
+{
+	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+		struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+		*arr_idx = ((char *)value - array->value) / array->elem_size;
+		return arr_idx;
+	}
+	return (void *)value - round_up(map->key_size, 8);
+}
+
 struct bpf_async_cb {
 	struct bpf_map *map;
 	struct bpf_prog *prog;
@@ -1166,15 +1177,8 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
 	 * bpf_map_delete_elem() on the same timer.
 	 */
 	this_cpu_write(hrtimer_running, t);
-	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
-		struct bpf_array *array = container_of(map, struct bpf_array, map);
 
-		/* compute the key */
-		idx = ((char *)value - array->value) / array->elem_size;
-		key = &idx;
-	} else { /* hash or lru */
-		key = value - round_up(map->key_size, 8);
-	}
+	key = map_key_from_value(map, value, &idx);
 
 	callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
 	/* The verifier checked that return value is zero. */
@@ -1200,15 +1204,7 @@ static void bpf_wq_work(struct work_struct *work)
 	if (!callback_fn)
 		return;
 
-	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
-		struct bpf_array *array = container_of(map, struct bpf_array, map);
-
-		/* compute the key */
-		idx = ((char *)value - array->value) / array->elem_size;
-		key = &idx;
-	} else { /* hash or lru */
-		key = value - round_up(map->key_size, 8);
-	}
+	key = map_key_from_value(map, value, &idx);
 
         rcu_read_lock_trace();
         migrate_disable();
-- 
2.50.1



* [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-15 19:21 [PATCH bpf-next v2 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko
  2025-08-15 19:21 ` [PATCH bpf-next v2 1/4] bpf: bpf task work plumbing Mykyta Yatsenko
  2025-08-15 19:21 ` [PATCH bpf-next v2 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko
@ 2025-08-15 19:21 ` Mykyta Yatsenko
  2025-08-15 22:00   ` Jiri Olsa
                     ` (3 more replies)
  2025-08-15 19:21 ` [PATCH bpf-next v2 4/4] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
  3 siblings, 4 replies; 23+ messages in thread
From: Mykyta Yatsenko @ 2025-08-15 19:21 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Implementation of the bpf_task_work_schedule kfuncs.

Main components:
 * struct bpf_task_work_context – Metadata and state management per
   task work.
 * enum bpf_task_work_state – A state machine to serialize work
   scheduling and execution.
 * bpf_task_work_schedule() – The central helper that initiates
   scheduling.
 * bpf_task_work_context_acquire() – Attempts to take ownership of the
   context pointed to by the passed struct bpf_task_work, allocating a
   new context if none exists yet.
 * bpf_task_work_callback() – Invoked when the actual task_work runs.
 * bpf_task_work_irq() – An intermediate step (runs in softirq context)
   to enqueue the task work.
 * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map
   entries.

Flow of successful task work scheduling:
 1) bpf_task_work_schedule_*() is called from BPF code.
 2) The state transitions from STANDBY to PENDING, marking the context
    as owned by this task work scheduler.
 3) irq_work_queue() schedules bpf_task_work_irq().
 4) bpf_task_work_irq() transitions the state from PENDING to
    SCHEDULING.
 5) bpf_task_work_irq() attempts task_work_add(). If successful, the
    state transitions to SCHEDULED.
 6) The task work runs bpf_task_work_callback(), which transitions the
    state to RUNNING.
 7) The BPF callback is executed.
 8) The context is cleaned up, refcounts are released, and the state is
    set back to STANDBY.

bpf_task_work_context handling:
The context pointer is stored in the bpf_task_work ctx field (u64) but
treated as an __rcu pointer via casts.
bpf_task_work_context_acquire() publishes a new bpf_task_work_context
by cmpxchg with an RCU initializer.
The pointer is read under the RCU lock only in
bpf_task_work_context_acquire() when ownership is contended.
Upon deletion of the map value, bpf_task_work_cancel_and_free() detaches
the context pointer from struct bpf_task_work and releases resources if
the scheduler does not own the context or the work can be canceled
(state == STANDBY, or state == SCHEDULED and the callback was
canceled). If the task work scheduler owns the context, its state is
set to FREED and the scheduler is expected to clean up on the next
state transition.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 260 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index d2f88a9bc47b..346ae8fd3ada 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -25,6 +25,8 @@
 #include <linux/kasan.h>
 #include <linux/bpf_verifier.h>
 #include <linux/uaccess.h>
+#include <linux/task_work.h>
+#include <linux/irq_work.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
 
 typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
 
+enum bpf_task_work_state {
+	/* bpf_task_work is ready to be used */
+	BPF_TW_STANDBY = 0,
+	/* irq work scheduling in progress */
+	BPF_TW_PENDING,
+	/* task work scheduling in progress */
+	BPF_TW_SCHEDULING,
+	/* task work is scheduled successfully */
+	BPF_TW_SCHEDULED,
+	/* callback is running */
+	BPF_TW_RUNNING,
+	/* associated BPF map value is deleted */
+	BPF_TW_FREED,
+};
+
+struct bpf_task_work_context {
+	/* the map and map value associated with this context */
+	struct bpf_map *map;
+	void *map_val;
+	/* bpf_prog that schedules task work */
+	struct bpf_prog *prog;
+	/* task for which callback is scheduled */
+	struct task_struct *task;
+	enum task_work_notify_mode mode;
+	enum bpf_task_work_state state;
+	bpf_task_work_callback_t callback_fn;
+	struct callback_head work;
+	struct irq_work irq_work;
+	struct rcu_head rcu;
+} __aligned(8);
+
+static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
+{
+	struct bpf_task_work_context *ctx;
+
+	ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
+	if (ctx)
+		memset(ctx, 0, sizeof(*ctx));
+	return ctx;
+}
+
+static void bpf_task_work_context_free(struct rcu_head *rcu)
+{
+	struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
+	/* bpf_mem_free expects migration to be disabled */
+	migrate_disable();
+	bpf_mem_free(&bpf_global_ma, ctx);
+	migrate_enable();
+}
+
+static bool task_work_match(struct callback_head *head, void *data)
+{
+	struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
+
+	return ctx == data;
+}
+
+static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
+{
+	bpf_prog_put(ctx->prog);
+	bpf_task_release(ctx->task);
+}
+
+static void bpf_task_work_callback(struct callback_head *cb)
+{
+	enum bpf_task_work_state state;
+	struct bpf_task_work_context *ctx;
+	u32 idx;
+	void *key;
+
+	ctx = container_of(cb, struct bpf_task_work_context, work);
+
+	/*
+	 * Read lock is needed to protect map key and value access below, it has to be done before
+	 * the state transition
+	 */
+	rcu_read_lock_trace();
+	/*
+	 * This callback may start running before bpf_task_work_irq() switched the state to
+	 * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
+	 */
+	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
+	if (state == BPF_TW_SCHEDULED)
+		state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
+	if (state == BPF_TW_FREED) {
+		rcu_read_unlock_trace();
+		bpf_task_work_context_reset(ctx);
+		call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
+		return;
+	}
+
+	key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
+	migrate_disable();
+	ctx->callback_fn(ctx->map, key, ctx->map_val);
+	migrate_enable();
+	rcu_read_unlock_trace();
+	/* State is running or freed, either way reset. */
+	bpf_task_work_context_reset(ctx);
+	state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
+	if (state == BPF_TW_FREED)
+		call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
+}
+
+static void bpf_task_work_irq(struct irq_work *irq_work)
+{
+	struct bpf_task_work_context *ctx;
+	enum bpf_task_work_state state;
+	int err;
+
+	ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
+
+	state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
+	if (state == BPF_TW_FREED)
+		goto free_context;
+
+	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
+	if (err) {
+		bpf_task_work_context_reset(ctx);
+		state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
+		if (state == BPF_TW_FREED)
+			call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
+		return;
+	}
+	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
+	if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
+		goto free_context; /* successful cancellation, release and free ctx */
+	return;
+
+free_context:
+	bpf_task_work_context_reset(ctx);
+	call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
+}
+
+static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
+								   struct bpf_map *map)
+{
+	struct bpf_task_work_context *ctx, *old_ctx;
+	enum bpf_task_work_state state;
+	struct bpf_task_work_context __force __rcu **ppc =
+		(struct bpf_task_work_context __force __rcu **)&tw->ctx;
+
+	/* ctx pointer is RCU protected */
+	rcu_read_lock_trace();
+	ctx = rcu_dereference(*ppc);
+	if (!ctx) {
+		ctx = bpf_task_work_context_alloc();
+		if (!ctx) {
+			rcu_read_unlock_trace();
+			return ERR_PTR(-ENOMEM);
+		}
+		old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
+		/*
+		 * If ctx is set by another CPU, release allocated memory.
+		 * Do not fail, though, attempt stealing the work
+		 */
+		if (old_ctx) {
+			bpf_mem_free(&bpf_global_ma, ctx);
+			ctx = old_ctx;
+		}
+	}
+	state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
+	/*
+	 * We can unlock RCU, because task work scheduler (this codepath)
+	 * now owns the ctx or returning an error
+	 */
+	rcu_read_unlock_trace();
+	if (state != BPF_TW_STANDBY)
+		return ERR_PTR(-EBUSY);
+	return ctx;
+}
+
+static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
+				  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
+				  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
+{
+	struct bpf_prog *prog;
+	struct bpf_task_work_context *ctx = NULL;
+	int err;
+
+	BTF_TYPE_EMIT(struct bpf_task_work);
+
+	prog = bpf_prog_inc_not_zero(aux->prog);
+	if (IS_ERR(prog))
+		return -EBADF;
+
+	if (!atomic64_read(&map->usercnt)) {
+		err = -EBADF;
+		goto release_prog;
+	}
+	task = bpf_task_acquire(task);
+	if (!task) {
+		err = -EPERM;
+		goto release_prog;
+	}
+	ctx = bpf_task_work_context_acquire(tw, map);
+	if (IS_ERR(ctx)) {
+		err = PTR_ERR(ctx);
+		goto release_all;
+	}
+
+	ctx->task = task;
+	ctx->callback_fn = callback_fn;
+	ctx->prog = prog;
+	ctx->mode = mode;
+	ctx->map = map;
+	ctx->map_val = (void *)tw - map->record->task_work_off;
+	init_irq_work(&ctx->irq_work, bpf_task_work_irq);
+	init_task_work(&ctx->work, bpf_task_work_callback);
+
+	irq_work_queue(&ctx->irq_work);
+
+	return 0;
+
+release_all:
+	bpf_task_release(task);
+release_prog:
+	bpf_prog_put(prog);
+	return err;
+}
+
 /**
  * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
  * @task: Task struct for which callback should be scheduled
@@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
  *
  * Return: 0 if task work has been scheduled successfully, negative error code otherwise
  */
-__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
-					      struct bpf_task_work *tw,
+__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
 					      struct bpf_map *map__map,
-					      bpf_task_work_callback_t callback,
-					      void *aux__prog)
+					      bpf_task_work_callback_t callback, void *aux__prog)
 {
-	return 0;
+	return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
 }
 
 /**
@@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
  *
  * Return: 0 if task work has been scheduled successfully, negative error code otherwise
  */
-__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
-					      struct bpf_task_work *tw,
+__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
 					      struct bpf_map *map__map,
-					      bpf_task_work_callback_t callback,
-					      void *aux__prog)
+					      bpf_task_work_callback_t callback, void *aux__prog)
 {
-	return 0;
+	enum task_work_notify_mode mode;
+
+	mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
+	return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
 }
 
 __bpf_kfunc_end_defs();
 
 void bpf_task_work_cancel_and_free(void *val)
 {
+	struct bpf_task_work *tw = val;
+	struct bpf_task_work_context *ctx;
+	enum bpf_task_work_state state;
+
+	/* No need do rcu_read_lock as no other codepath can reset this pointer */
+	ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
+	if (!ctx)
+		return;
+	state = xchg(&ctx->state, BPF_TW_FREED);
+
+	switch (state) {
+	case BPF_TW_SCHEDULED:
+		/* If we can't cancel task work, rely on task work callback to free the context */
+		if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
+			break;
+		bpf_task_work_context_reset(ctx);
+		fallthrough;
+	case BPF_TW_STANDBY:
+		call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
+		break;
+	/* In all below cases scheduling logic should detect context state change and cleanup */
+	case BPF_TW_SCHEDULING:
+	case BPF_TW_PENDING:
+	case BPF_TW_RUNNING:
+	default:
+		break;
+	}
 }
 
 BTF_KFUNCS_START(generic_btf_ids)
@@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
 
 #ifdef CONFIG_CGROUPS
 BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
-- 
2.50.1



* [PATCH bpf-next v2 4/4] selftests/bpf: BPF task work scheduling tests
  2025-08-15 19:21 [PATCH bpf-next v2 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko
                   ` (2 preceding siblings ...)
  2025-08-15 19:21 ` [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko
@ 2025-08-15 19:21 ` Mykyta Yatsenko
  3 siblings, 0 replies; 23+ messages in thread
From: Mykyta Yatsenko @ 2025-08-15 19:21 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Introduce selftests that exercise the BPF task work scheduling
mechanism and validate that the verifier does not accept incorrect
calls to the bpf_task_work_schedule kfuncs.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 .../selftests/bpf/prog_tests/test_task_work.c | 149 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/task_work.c | 108 +++++++++++++
 .../selftests/bpf/progs/task_work_fail.c      |  98 ++++++++++++
 3 files changed, 355 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_work.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_task_work.c b/tools/testing/selftests/bpf/prog_tests/test_task_work.c
new file mode 100644
index 000000000000..9c3c7a46a827
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_task_work.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <string.h>
+#include <stdio.h>
+#include "task_work.skel.h"
+#include "task_work_fail.skel.h"
+#include <linux/bpf.h>
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <time.h>
+
+static int perf_event_open(__u32 type, __u64 config, int pid)
+{
+	struct perf_event_attr attr = {
+		.type = type,
+		.config = config,
+		.size = sizeof(struct perf_event_attr),
+		.sample_period = 100000,
+	};
+
+	return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
+}
+
+struct elem {
+	char data[128];
+	struct bpf_task_work tw;
+};
+
+static int verify_map(struct bpf_map *map, const char *expected_data)
+{
+	int err;
+	struct elem value;
+	int processed_values = 0;
+	int k, sz;
+
+	sz = bpf_map__max_entries(map);
+	for (k = 0; k < sz; ++k) {
+		err = bpf_map__lookup_elem(map, &k, sizeof(int), &value, sizeof(struct elem), 0);
+		if (err)
+			continue;
+		if (!ASSERT_EQ(strcmp(expected_data, value.data), 0, "map data")) {
+			fprintf(stderr, "expected '%s', found '%s' in %s map", expected_data,
+				value.data, bpf_map__name(map));
+			return 2;
+		}
+		processed_values++;
+	}
+
+	return processed_values == 0;
+}
+
+static void task_work_run(const char *prog_name, const char *map_name)
+{
+	struct task_work *skel;
+	struct bpf_program *prog;
+	struct bpf_map *map;
+	struct bpf_link *link;
+	int err, pe_fd = 0, pid, status, pipefd[2];
+	char user_string[] = "hello world";
+
+	if (!ASSERT_NEQ(pipe(pipefd), -1, "pipe"))
+		return;
+
+	pid = fork();
+	if (pid == 0) {
+		__u64 num = 1;
+		int i;
+		char buf;
+
+		close(pipefd[1]);
+		read(pipefd[0], &buf, sizeof(buf));
+		close(pipefd[0]);
+
+		for (i = 0; i < 10000; ++i)
+			num *= time(0) % 7;
+		(void)num;
+		exit(0);
+	}
+	skel = task_work__open();
+	if (!ASSERT_OK_PTR(skel, "task_work__open"))
+		return;
+
+	bpf_object__for_each_program(prog, skel->obj) {
+		bpf_program__set_autoload(prog, false);
+	}
+
+	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+	if (!ASSERT_OK_PTR(prog, "prog_name"))
+		goto cleanup;
+	bpf_program__set_autoload(prog, true);
+	bpf_program__set_type(prog, BPF_PROG_TYPE_PERF_EVENT);
+	skel->bss->user_ptr = (char *)user_string;
+
+	err = task_work__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pe_fd = perf_event_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, pid);
+	if (pe_fd == -1 && (errno == ENOENT || errno == EOPNOTSUPP)) {
+		printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n", __func__);
+		test__skip();
+		goto cleanup;
+	}
+	if (!ASSERT_NEQ(pe_fd, -1, "pe_fd")) {
+		fprintf(stderr, "perf_event_open errno: %d, pid: %d\n", errno, pid);
+		goto cleanup;
+	}
+
+	link = bpf_program__attach_perf_event(prog, pe_fd);
+	if (!ASSERT_OK_PTR(link, "attach_perf_event"))
+		goto cleanup;
+
+	close(pipefd[0]);
+	write(pipefd[1], user_string, 1);
+	close(pipefd[1]);
+	/* Wait to collect some samples */
+	waitpid(pid, &status, 0);
+	pid = 0;
+	map = bpf_object__find_map_by_name(skel->obj, map_name);
+	if (!ASSERT_OK_PTR(map, "find map_name"))
+		goto cleanup;
+	if (!ASSERT_OK(verify_map(map, user_string), "verify map"))
+		goto cleanup;
+cleanup:
+	if (pe_fd >= 0)
+		close(pe_fd);
+	task_work__destroy(skel);
+	if (pid) {
+		close(pipefd[0]);
+		write(pipefd[1], user_string, 1);
+		close(pipefd[1]);
+		waitpid(pid, &status, 0);
+	}
+}
+
+void test_task_work(void)
+{
+	if (test__start_subtest("test_task_work_hash_map"))
+		task_work_run("oncpu_hash_map", "hmap");
+
+	if (test__start_subtest("test_task_work_array_map"))
+		task_work_run("oncpu_array_map", "arrmap");
+
+	if (test__start_subtest("test_task_work_lru_map"))
+		task_work_run("oncpu_lru_map", "lrumap");
+
+	RUN_TESTS(task_work_fail);
+}
diff --git a/tools/testing/selftests/bpf/progs/task_work.c b/tools/testing/selftests/bpf/progs/task_work.c
new file mode 100644
index 000000000000..5e761b4a5fd1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_work.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <string.h>
+#include <stdbool.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "errno.h"
+
+char _license[] SEC("license") = "GPL";
+
+const void *user_ptr = NULL;
+
+struct elem {
+	char data[128];
+	struct bpf_task_work tw;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} hmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} arrmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_LRU_HASH);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} lrumap SEC(".maps");
+
+static void process_work(struct bpf_map *map, void *key, void *value)
+{
+	struct elem *work = value;
+
+	bpf_copy_from_user_str(work->data, sizeof(work->data), (const void *)user_ptr, 0);
+}
+
+int key = 0;
+
+SEC("perf_event")
+int oncpu_hash_map(struct pt_regs *args)
+{
+	struct elem empty_work = { .data = { 0 } };
+	struct elem *work;
+	struct task_struct *task;
+	int err;
+
+	task = bpf_get_current_task_btf();
+	err = bpf_map_update_elem(&hmap, &key, &empty_work, BPF_NOEXIST);
+	if (err)
+		return 0;
+	work = bpf_map_lookup_elem(&hmap, &key);
+	if (!work)
+		return 0;
+
+	bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&hmap, process_work, NULL);
+	return 0;
+}
+
+SEC("perf_event")
+int oncpu_array_map(struct pt_regs *args)
+{
+	struct elem *work;
+	struct task_struct *task;
+
+	task = bpf_get_current_task_btf();
+	work = bpf_map_lookup_elem(&arrmap, &key);
+	if (!work)
+		return 0;
+	bpf_task_work_schedule_signal(task, &work->tw, (struct bpf_map *)&arrmap, process_work,
+				      NULL);
+	return 0;
+}
+
+SEC("perf_event")
+int oncpu_lru_map(struct pt_regs *args)
+{
+	struct elem empty_work = { .data = { 0 } };
+	struct elem *work;
+	struct task_struct *task;
+	int err;
+
+	task = bpf_get_current_task_btf();
+	work = bpf_map_lookup_elem(&lrumap, &key);
+	if (work)
+		return 0;
+	err = bpf_map_update_elem(&lrumap, &key, &empty_work, BPF_NOEXIST);
+	if (err)
+		return 0;
+	work = bpf_map_lookup_elem(&lrumap, &key);
+	if (!work || work->data[0])
+		return 0;
+	bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&lrumap, process_work,
+				      NULL);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/task_work_fail.c b/tools/testing/selftests/bpf/progs/task_work_fail.c
new file mode 100644
index 000000000000..fca7052b805e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_work_fail.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <string.h>
+#include <stdbool.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+const void *user_ptr = NULL;
+
+struct elem {
+	char data[128];
+	struct bpf_task_work tw;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} hmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct elem);
+} arrmap SEC(".maps");
+
+static void process_work(struct bpf_map *map, void *key, void *value)
+{
+	struct elem *work = value;
+
+	bpf_copy_from_user_str(work->data, sizeof(work->data), (const void *)user_ptr, 0);
+}
+
+int key = 0;
+
+SEC("perf_event")
+__failure __msg("doesn't match map pointer in R3")
+int mismatch_map(struct pt_regs *args)
+{
+	struct elem *work;
+	struct task_struct *task;
+
+	task = bpf_get_current_task_btf();
+	work = bpf_map_lookup_elem(&arrmap, &key);
+	if (!work)
+		return 0;
+	bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&hmap,
+				      process_work, NULL);
+	return 0;
+}
+
+SEC("perf_event")
+__failure __msg("arg#1 doesn't point to a map value")
+int no_map_task_work(struct pt_regs *args)
+{
+	struct task_struct *task;
+	struct bpf_task_work tw;
+
+	task = bpf_get_current_task_btf();
+	bpf_task_work_schedule_resume(task, &tw, (struct bpf_map *)&hmap,
+				      process_work, NULL);
+	return 0;
+}
+
+SEC("perf_event")
+__failure __msg("Possibly NULL pointer passed to trusted arg1")
+int task_work_null(struct pt_regs *args)
+{
+	struct task_struct *task;
+
+	task = bpf_get_current_task_btf();
+	bpf_task_work_schedule_resume(task, NULL, (struct bpf_map *)&hmap,
+				      process_work, NULL);
+	return 0;
+}
+
+SEC("perf_event")
+__failure __msg("Possibly NULL pointer passed to trusted arg2")
+int map_null(struct pt_regs *args)
+{
+	struct elem *work;
+	struct task_struct *task;
+
+	task = bpf_get_current_task_btf();
+	work = bpf_map_lookup_elem(&arrmap, &key);
+	if (!work)
+		return 0;
+	bpf_task_work_schedule_resume(task, &work->tw, NULL, process_work, NULL);
+	return 0;
+}
-- 
2.50.1



* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-15 19:21 ` [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko
@ 2025-08-15 22:00   ` Jiri Olsa
  2025-08-18 13:36     ` Mykyta Yatsenko
  2025-08-16 18:14   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Jiri Olsa @ 2025-08-15 22:00 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Aug 15, 2025 at 08:21:55PM +0100, Mykyta Yatsenko wrote:

SNIP

>  void bpf_task_work_cancel_and_free(void *val)
>  {
> +	struct bpf_task_work *tw = val;
> +	struct bpf_task_work_context *ctx;
> +	enum bpf_task_work_state state;
> +
> +	/* No need do rcu_read_lock as no other codepath can reset this pointer */
> +	ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
> +	if (!ctx)
> +		return;
> +	state = xchg(&ctx->state, BPF_TW_FREED);
> +
> +	switch (state) {
> +	case BPF_TW_SCHEDULED:
> +		/* If we can't cancel task work, rely on task work callback to free the context */
> +		if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
> +			break;
> +		bpf_task_work_context_reset(ctx);
> +		fallthrough;
> +	case BPF_TW_STANDBY:
> +		call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +		break;
> +	/* In all below cases scheduling logic should detect context state change and cleanup */
> +	case BPF_TW_SCHEDULING:
> +	case BPF_TW_PENDING:
> +	case BPF_TW_RUNNING:
> +	default:
> +		break;
> +	}
>  }
>  
>  BTF_KFUNCS_START(generic_btf_ids)
> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)

hi,
I'd like to use that with uprobes, could we add it to common_kfunc_set?
I tried it with uprobe and it seems to work nicely

thanks,
jirka


----
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 346ae8fd3ada..b5d52168ba77 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -4129,6 +4129,8 @@ BTF_ID_FLAGS(func, bpf_strnstr);
 BTF_ID_FLAGS(func, bpf_cgroup_read_xattr, KF_RCU)
 #endif
 BTF_ID_FLAGS(func, bpf_stream_vprintk, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-15 19:21 ` [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko
  2025-08-15 22:00   ` Jiri Olsa
@ 2025-08-16 18:14   ` kernel test robot
  2025-08-19 14:18   ` Kumar Kartikeya Dwivedi
  2025-08-27 21:03   ` Andrii Nakryiko
  3 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2025-08-16 18:14 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, memxor
  Cc: oe-kbuild-all, Mykyta Yatsenko

Hi Mykyta,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Mykyta-Yatsenko/bpf-bpf-task-work-plumbing/20250816-032308
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250815192156.272445-4-mykyta.yatsenko5%40gmail.com
patch subject: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
config: riscv-randconfig-r122-20250816 (https://download.01.org/0day-ci/archive/20250817/202508170152.F0lX7DFM-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 93d24b6b7b148c47a2fa228a4ef31524fa1d9f3f)
reproduce: (https://download.01.org/0day-ci/archive/20250817/202508170152.F0lX7DFM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508170152.F0lX7DFM-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/bpf/helpers.c:2471:18: sparse: sparse: symbol 'bpf_task_release_dtor' was not declared. Should it be static?
   kernel/bpf/helpers.c:2501:18: sparse: sparse: symbol 'bpf_cgroup_release_dtor' was not declared. Should it be static?
   kernel/bpf/helpers.c:3367:17: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3367:17: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3367:17: sparse:     got char *
   kernel/bpf/helpers.c:3368:17: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3368:17: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3368:17: sparse:     got char *
   kernel/bpf/helpers.c:3407:17: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3407:17: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3407:17: sparse:     got char *
   kernel/bpf/helpers.c:3461:17: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3461:17: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3461:17: sparse:     got char *
   kernel/bpf/helpers.c:3493:17: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3493:17: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3493:17: sparse:     got char *
   kernel/bpf/helpers.c:3526:17: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3526:17: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3526:17: sparse:     got char *
   kernel/bpf/helpers.c:3576:17: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3576:17: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3576:17: sparse:     got char *
   kernel/bpf/helpers.c:3580:25: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3580:25: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3580:25: sparse:     got char *
   kernel/bpf/helpers.c:3620:17: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3620:17: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3620:17: sparse:     got char *
   kernel/bpf/helpers.c:3624:25: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3624:25: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3624:25: sparse:     got char *
   kernel/bpf/helpers.c:3666:25: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3666:25: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3666:25: sparse:     got char *
   kernel/bpf/helpers.c:3669:25: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char * @@
   kernel/bpf/helpers.c:3669:25: sparse:     expected void const [noderef] __user *from
   kernel/bpf/helpers.c:3669:25: sparse:     got char *
>> kernel/bpf/helpers.c:3856:27: sparse: sparse: cast removes address space '__rcu' of expression
>> kernel/bpf/helpers.c:3856:27: sparse: sparse: cast removes address space '__rcu' of expression
>> kernel/bpf/helpers.c:3856:27: sparse: sparse: cast removes address space '__rcu' of expression
>> kernel/bpf/helpers.c:3856:27: sparse: sparse: cast removes address space '__rcu' of expression
>> kernel/bpf/helpers.c:3856:27: sparse: sparse: cast removes address space '__rcu' of expression
>> kernel/bpf/helpers.c:3856:27: sparse: sparse: cast removes address space '__rcu' of expression
>> kernel/bpf/helpers.c:3856:27: sparse: sparse: cast removes address space '__rcu' of expression
   kernel/bpf/helpers.c:2965:18: sparse: sparse: context imbalance in 'bpf_rcu_read_lock' - wrong count at exit
   kernel/bpf/helpers.c:2970:18: sparse: sparse: context imbalance in 'bpf_rcu_read_unlock' - unexpected unlock

vim +/__rcu +3856 kernel/bpf/helpers.c

  3838	
  3839	static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
  3840									   struct bpf_map *map)
  3841	{
  3842		struct bpf_task_work_context *ctx, *old_ctx;
  3843		enum bpf_task_work_state state;
  3844		struct bpf_task_work_context __force __rcu **ppc =
  3845			(struct bpf_task_work_context __force __rcu **)&tw->ctx;
  3846	
  3847		/* ctx pointer is RCU protected */
  3848		rcu_read_lock_trace();
  3849		ctx = rcu_dereference(*ppc);
  3850		if (!ctx) {
  3851			ctx = bpf_task_work_context_alloc();
  3852			if (!ctx) {
  3853				rcu_read_unlock_trace();
  3854				return ERR_PTR(-ENOMEM);
  3855			}
> 3856			old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
  3857			/*
  3858			 * If ctx is set by another CPU, release allocated memory.
  3859			 * Do not fail, though, attempt stealing the work
  3860			 */
  3861			if (old_ctx) {
  3862				bpf_mem_free(&bpf_global_ma, ctx);
  3863				ctx = old_ctx;
  3864			}
  3865		}
  3866		state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
  3867		/*
  3868		 * We can unlock RCU, because task work scheduler (this codepath)
  3869		 * now owns the ctx or returning an error
  3870		 */
  3871		rcu_read_unlock_trace();
  3872		if (state != BPF_TW_STANDBY)
  3873			return ERR_PTR(-EBUSY);
  3874		return ctx;
  3875	}
  3876	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-15 22:00   ` Jiri Olsa
@ 2025-08-18 13:36     ` Mykyta Yatsenko
  0 siblings, 0 replies; 23+ messages in thread
From: Mykyta Yatsenko @ 2025-08-18 13:36 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On 8/15/25 23:00, Jiri Olsa wrote:
> On Fri, Aug 15, 2025 at 08:21:55PM +0100, Mykyta Yatsenko wrote:
>
> SNIP
>
>>   void bpf_task_work_cancel_and_free(void *val)
>>   {
>> +	struct bpf_task_work *tw = val;
>> +	struct bpf_task_work_context *ctx;
>> +	enum bpf_task_work_state state;
>> +
>> +	/* No need do rcu_read_lock as no other codepath can reset this pointer */
>> +	ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
>> +	if (!ctx)
>> +		return;
>> +	state = xchg(&ctx->state, BPF_TW_FREED);
>> +
>> +	switch (state) {
>> +	case BPF_TW_SCHEDULED:
>> +		/* If we can't cancel task work, rely on task work callback to free the context */
>> +		if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
>> +			break;
>> +		bpf_task_work_context_reset(ctx);
>> +		fallthrough;
>> +	case BPF_TW_STANDBY:
>> +		call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +		break;
>> +	/* In all below cases scheduling logic should detect context state change and cleanup */
>> +	case BPF_TW_SCHEDULING:
>> +	case BPF_TW_PENDING:
>> +	case BPF_TW_RUNNING:
>> +	default:
>> +		break;
>> +	}
>>   }
>>   
>>   BTF_KFUNCS_START(generic_btf_ids)
>> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
>> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
>> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
> hi,
> I'd like to use that with uprobes, could we add it to common_kfunc_set?
> I tried it with uprobe and it seems to work nicely
>
> thanks,
> jirka
I'll move these kfuncs into common_kfunc_set in the next version.
>
> ----
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 346ae8fd3ada..b5d52168ba77 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -4129,6 +4129,8 @@ BTF_ID_FLAGS(func, bpf_strnstr);
>   BTF_ID_FLAGS(func, bpf_cgroup_read_xattr, KF_RCU)
>   #endif
>   BTF_ID_FLAGS(func, bpf_stream_vprintk, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
>   BTF_KFUNCS_END(common_btf_ids)
>   
>   static const struct btf_kfunc_id_set common_kfunc_set = {


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 1/4] bpf: bpf task work plumbing
  2025-08-15 19:21 ` [PATCH bpf-next v2 1/4] bpf: bpf task work plumbing Mykyta Yatsenko
@ 2025-08-19  1:34   ` Eduard Zingerman
  2025-08-29 15:23     ` Mykyta Yatsenko
  0 siblings, 1 reply; 23+ messages in thread
From: Eduard Zingerman @ 2025-08-19  1:34 UTC (permalink / raw)
  To: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On Fri, 2025-08-15 at 20:21 +0100, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@meta.com>
> 
> This patch adds necessary plumbing in verifier, syscall and maps to
> support handling new kfunc bpf_task_work_schedule and kernel structure
> bpf_task_work. The idea is similar to how we already handle bpf_wq and
> bpf_timer.
> verifier changes validate calls to bpf_task_work_schedule to make sure
> it is safe and expected invariants hold.
> btf part is required to detect bpf_task_work structure inside map value
> and store its offset, which will be used in the next patch to calculate
> key and value addresses.
> arraymap and hashtab changes are needed to handle freeing of the
> bpf_task_work: run code needed to deinitialize it, for example cancel
> task_work callback if possible.
> The use of bpf_task_work and proper implementation for kfuncs are
> introduced in the next patch.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

The amount of copy-paste necessary for dealing with BTF-managed objects
is saddening.
This patch follows the current approach and seems to do it correctly.

[...]

> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -431,7 +431,7 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
>  	return (void *)round_down((unsigned long)array, PAGE_SIZE);
>  }
>  
> -static void array_map_free_timers_wq(struct bpf_map *map)
> +static void array_map_free_internal_structs(struct bpf_map *map)
>  {
>  	struct bpf_array *array = container_of(map, struct bpf_array, map);
>  	int i;
> @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
>  	/* We don't reset or free fields other than timer and workqueue
>  	 * on uref dropping to zero.
>  	 */
> -	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
> +	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {

Is there a way to share this code between array map and hash map?

>  		for (i = 0; i < array->map.max_entries; i++) {
>  			if (btf_record_has_field(map->record, BPF_TIMER))
>  				bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
>  			if (btf_record_has_field(map->record, BPF_WORKQUEUE))
>  				bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
> +			if (btf_record_has_field(map->record, BPF_TASK_WORK))
> +				bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
>  		}
>  	}
>  }
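For example, a shared per-element helper along these lines could cover
both map types (sketch only, the helper name is made up):

static void bpf_obj_free_internal_structs(const struct btf_record *rec, void *obj)
{
	if (btf_record_has_field(rec, BPF_TIMER))
		bpf_obj_free_timer(rec, obj);
	if (btf_record_has_field(rec, BPF_WORKQUEUE))
		bpf_obj_free_workqueue(rec, obj);
	if (btf_record_has_field(rec, BPF_TASK_WORK))
		bpf_obj_free_task_work(rec, obj);
}

arraymap would pass array_map_elem_ptr(array, i) and hashtab
htab_elem_value(elem, htab->map.key_size), so only the iteration would
stay per map type.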

[...]

> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 64739308902f..378f260235dd 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3527,6 +3527,15 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
>  			goto end;
>  		}
>  	}
> +	if (field_mask & BPF_TASK_WORK) {
> +		if (!strcmp(name, "bpf_task_work")) {
> +			if (*seen_mask & BPF_TASK_WORK)
> +				return -E2BIG;
> +			*seen_mask |= BPF_TASK_WORK;
> +			type = BPF_TASK_WORK;
> +			goto end;
> +		}
> +	}

Nit: extract this and the ifs before it into a loop over an array
     of name/flag pairs?
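Something like this, for example (sketch; the field names for the
existing cases are from memory, please double-check them):

	static const struct {
		const char *name;
		u32 field;
	} named_fields[] = {
		{ "bpf_timer",     BPF_TIMER },
		{ "bpf_wq",        BPF_WORKQUEUE },
		{ "bpf_task_work", BPF_TASK_WORK },
	};
	int i;

	for (i = 0; i < ARRAY_SIZE(named_fields); i++) {
		if (!(field_mask & named_fields[i].field))
			continue;
		if (strcmp(name, named_fields[i].name))
			continue;
		if (*seen_mask & named_fields[i].field)
			return -E2BIG;
		*seen_mask |= named_fields[i].field;
		type = named_fields[i].field;
		goto end;
	}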

>  	field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head");
>  	field_mask_test_name(BPF_LIST_NODE, "bpf_list_node");
>  	field_mask_test_name(BPF_RB_ROOT,   "bpf_rb_root");

[...]

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 71f9931ac64c..207ad4823b5b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c

[...]

> @@ -233,6 +233,9 @@ static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
>  		if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
>  			bpf_obj_free_workqueue(htab->map.record,
>  					       htab_elem_value(elem, htab->map.key_size));
> +		if (btf_record_has_field(htab->map.record, BPF_TASK_WORK))
> +			bpf_obj_free_task_work(htab->map.record,
> +					       htab_elem_value(elem, htab->map.key_size));

If there is no generic way to share this code with array maps, please,
at least within hashtab.c, extract these "if (btf_record_has_field(...)) {...}"
groups so that there is no duplication between
htab_free_{malloced,preallocated}_internal_structs(htab).

>  		cond_resched();
>  	}
>  }

[...]

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0fbfa8532c39..108d86f7eeaf 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c

[...]

> @@ -1309,6 +1322,14 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>  					goto free_map_tab;
>  				}
>  				break;
> +			case BPF_TASK_WORK:

This can be added to the group with BPF_TIMER and BPF_WORKQUEUE just above.
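i.e., assuming the existing BPF_TIMER/BPF_WORKQUEUE group has the same
map-type check, simply:

	case BPF_TIMER:
	case BPF_WORKQUEUE:
	case BPF_TASK_WORK:
		if (map->map_type != BPF_MAP_TYPE_HASH &&
		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
		    map->map_type != BPF_MAP_TYPE_ARRAY) {
			ret = -EOPNOTSUPP;
			goto free_map_tab;
		}
		break;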

> +				if (map->map_type != BPF_MAP_TYPE_HASH &&
> +				    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> +				    map->map_type != BPF_MAP_TYPE_ARRAY) {
> +					ret = -EOPNOTSUPP;
> +					goto free_map_tab;
> +				}
> +				break;
>  			default:
>  				/* Fail if map_type checks are missing for a field type */
>  				ret = -EOPNOTSUPP;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index a61d57996692..be7a744c7917 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c

[...]

This function repeats process_timer_func() almost verbatim.

> +{
> +	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> +	struct bpf_map *map = reg->map_ptr;
> +	bool is_const = tnum_is_const(reg->var_off);
> +	u64 val = reg->var_off.value;
> +
> +	if (!map->btf) {
> +		verbose(env, "map '%s' has to have BTF in order to use bpf_task_work\n",
> +			map->name);
> +		return -EINVAL;
> +	}
> +	if (!btf_record_has_field(map->record, BPF_TASK_WORK)) {
> +		verbose(env, "map '%s' has no valid bpf_task_work\n", map->name);
> +		return -EINVAL;
> +	}
> +	if (!is_const) {
> +		verbose(env,
> +			"bpf_task_work has to be at the constant offset\n");
> +		return -EINVAL;
> +	}
> +	if (map->record->task_work_off != val + reg->off) {
> +		verbose(env,
> +			"off %lld doesn't point to 'struct bpf_task_work' that is at %d\n",
> +			val + reg->off, map->record->task_work_off);
> +		return -EINVAL;
> +	}
> +	if (meta->map.ptr) {
> +		verifier_bug(env, "Two map pointers in a bpf_task_work kfunc");
> +		return -EFAULT;
> +	}
> +
> +	meta->map.uid = reg->map_uid;
> +	meta->map.ptr = map;
> +	return 0;
> +}
> +
>  static int process_kptr_func(struct bpf_verifier_env *env, int regno,
>  			     struct bpf_call_arg_meta *meta)
>  {

[...]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 2/4] bpf: extract map key pointer calculation
  2025-08-15 19:21 ` [PATCH bpf-next v2 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko
@ 2025-08-19 11:05   ` Kumar Kartikeya Dwivedi
  2025-08-19 20:50   ` Andrii Nakryiko
  1 sibling, 0 replies; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-19 11:05 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Calculation of the BPF map key, given the pointer to a value is
> duplicated in a couple of places in helpers already, in the next patch
> another use case is introduced as well.
> This patch extracts that functionality into a separate function.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

Please carry acks from v1.

>  [...]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-15 19:21 ` [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko
  2025-08-15 22:00   ` Jiri Olsa
  2025-08-16 18:14   ` kernel test robot
@ 2025-08-19 14:18   ` Kumar Kartikeya Dwivedi
  2025-08-19 18:13     ` Mykyta Yatsenko
  2025-08-27 21:03   ` Andrii Nakryiko
  3 siblings, 1 reply; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-19 14:18 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Implementation of the bpf_task_work_schedule kfuncs.
>
> Main components:
>  * struct bpf_task_work_context – Metadata and state management per task
> work.
>  * enum bpf_task_work_state – A state machine to serialize work
>  scheduling and execution.
>  * bpf_task_work_schedule() – The central helper that initiates
> scheduling.
>  * bpf_task_work_acquire() - Attempts to take ownership of the context,
>  pointed by passed struct bpf_task_work, allocates new context if none
>  exists yet.
>  * bpf_task_work_callback() – Invoked when the actual task_work runs.
>  * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> to enqueue task work.
>  * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.

Can you elaborate on why the bouncing through irq_work context is necessary?
I think we should have this info in the commit log.
Is it to avoid deadlocks with task_work locks and/or task->pi_lock?

>
> Flow of successful task work scheduling
>  1) bpf_task_work_schedule_* is called from BPF code.
>  2) Transition state from STANDBY to PENDING, marks context is owned by
>  this task work scheduler
>  3) irq_work_queue() schedules bpf_task_work_irq().
>  4) Transition state from PENDING to SCHEDULING.
>  4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>  transitions to SCHEDULED.
>  5) Task work calls bpf_task_work_callback(), which transition state to
>  RUNNING.
>  6) BPF callback is executed
>  7) Context is cleaned up, refcounts released, context state set back to
>  STANDBY.
>
> bpf_task_work_context handling
> The context pointer is stored in bpf_task_work ctx field (u64) but
> treated as an __rcu pointer via casts.
> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
> with RCU initializer.
> Read under the RCU lock only in bpf_task_work_acquire() when ownership
> is contended.
> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
> context pointer from struct bpf_task_work and releases resources
> if scheduler does not own the context or can be canceled (state ==
> STANDBY or state == SCHEDULED and callback canceled). If task work
> scheduler owns the context, its state is set to FREED and scheduler is
> expected to cleanup on the next state transition.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

This is much better now, with clear ownership between the free path and
the scheduling path. I mostly have a few more comments on the current
implementation, plus one potential bug.

However, the more time I spend on this, the more I feel we should
unify all this with the two other bpf async work execution mechanisms
(timers and wq), and simplify and deduplicate a lot of this under the
serialized async->lock. I know NMI execution is probably critical for
this primitive, but we can replace async->lock with rqspinlock to
address that, so that it becomes safe to serialize in any context.
Apart from that, I don't see anything that would preclude reworking all
this as a case of BPF_TASK_WORK for bpf_async_kern, modulo the internal
task_work locks that have trouble with NMI execution (see later
comments).
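Very roughly, the schedule path after such a rework could look like this
(a sketch only: bpf_task_work_kern, the async->tw member and the
conversion of async->lock to rqspinlock_t are all hypothetical, and
refcounting/irq_work details are elided):

static int bpf_task_work_schedule(struct bpf_async_kern *async,
				  struct task_struct *task,
				  enum task_work_notify_mode mode)
{
	struct bpf_task_work_kern *tw;
	unsigned long flags;
	int err;

	/* rqspinlock fails instead of deadlocking, so this is callable
	 * from any context, including NMI
	 */
	err = raw_res_spin_lock_irqsave(&async->lock, flags);
	if (err)
		return err;
	tw = async->tw;
	if (!tw || !tw->callback_fn) {
		err = -EINVAL;
		goto out;
	}
	if (tw->state != BPF_TW_STANDBY) {
		err = -EBUSY;
		goto out;
	}
	tw->state = BPF_TW_PENDING;
	tw->task = task;	/* acquire/refcount elided in this sketch */
	tw->mode = mode;
	irq_work_queue(&tw->irq_work);
	err = 0;
out:
	raw_res_spin_unlock_irqrestore(&async->lock, flags);
	return err;
}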

I also feel like it would be cleaner if we split the API into 3 steps:
init(), set_callback(), schedule(), like the other cases; I don't see
why we necessarily need to diverge, and it simplifies some of the
logic in schedule().
Once every state update is protected by a lock, all of the state
transitions are done automatically and a lot of the extra races are
eliminated.
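From the BPF program side that would look something like this (none of
these kfuncs or flags exist, the names are purely to illustrate the
shape, reusing the selftest map/callback names):

	/* hypothetical 3-step API, mirroring bpf_timer/bpf_wq */
	if (bpf_task_work_init(&work->tw, &arrmap, 0))
		return 0;
	if (bpf_task_work_set_callback(&work->tw, process_work))
		return 0;
	bpf_task_work_schedule(&work->tw, task, BPF_TW_F_SIGNAL);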

I think we should discuss whether this was considered and why you
discarded this approach; otherwise the code is pretty complex, with
little upside.
Maybe I'm missing something obvious and you'd know more since you've
thought about all this longer.

>  kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 260 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index d2f88a9bc47b..346ae8fd3ada 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -25,6 +25,8 @@
>  #include <linux/kasan.h>
>  #include <linux/bpf_verifier.h>
>  #include <linux/uaccess.h>
> +#include <linux/task_work.h>
> +#include <linux/irq_work.h>
>
>  #include "../../lib/kstrtox.h"
>
> @@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
>
>  typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
>
> +enum bpf_task_work_state {
> +       /* bpf_task_work is ready to be used */
> +       BPF_TW_STANDBY = 0,
> +       /* irq work scheduling in progress */
> +       BPF_TW_PENDING,
> +       /* task work scheduling in progress */
> +       BPF_TW_SCHEDULING,
> +       /* task work is scheduled successfully */
> +       BPF_TW_SCHEDULED,
> +       /* callback is running */
> +       BPF_TW_RUNNING,
> +       /* associated BPF map value is deleted */
> +       BPF_TW_FREED,
> +};
> +
> +struct bpf_task_work_context {
> +       /* the map and map value associated with this context */
> +       struct bpf_map *map;
> +       void *map_val;
> +       /* bpf_prog that schedules task work */
> +       struct bpf_prog *prog;
> +       /* task for which callback is scheduled */
> +       struct task_struct *task;
> +       enum task_work_notify_mode mode;
> +       enum bpf_task_work_state state;
> +       bpf_task_work_callback_t callback_fn;
> +       struct callback_head work;
> +       struct irq_work irq_work;
> +       struct rcu_head rcu;
> +} __aligned(8);
> +
> +static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
> +{
> +       struct bpf_task_work_context *ctx;
> +
> +       ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
> +       if (ctx)
> +               memset(ctx, 0, sizeof(*ctx));
> +       return ctx;
> +}
> +
> +static void bpf_task_work_context_free(struct rcu_head *rcu)
> +{
> +       struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
> +       /* bpf_mem_free expects migration to be disabled */
> +       migrate_disable();
> +       bpf_mem_free(&bpf_global_ma, ctx);
> +       migrate_enable();
> +}
> +
> +static bool task_work_match(struct callback_head *head, void *data)
> +{
> +       struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
> +
> +       return ctx == data;
> +}
> +
> +static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
> +{
> +       bpf_prog_put(ctx->prog);
> +       bpf_task_release(ctx->task);
> +}
> +
> +static void bpf_task_work_callback(struct callback_head *cb)
> +{
> +       enum bpf_task_work_state state;
> +       struct bpf_task_work_context *ctx;
> +       u32 idx;
> +       void *key;
> +
> +       ctx = container_of(cb, struct bpf_task_work_context, work);
> +
> +       /*
> +        * Read lock is needed to protect map key and value access below, it has to be done before
> +        * the state transition
> +        */
> +       rcu_read_lock_trace();
> +       /*
> +        * This callback may start running before bpf_task_work_irq() switched the state to
> +        * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
> +        */
> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
> +       if (state == BPF_TW_SCHEDULED)

... and let's say we have a concurrent cancel_and_free here, which marks
the state BPF_TW_FREED.

> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
> +       if (state == BPF_TW_FREED) {

... and notice it here now ...

> +               rcu_read_unlock_trace();
> +               bpf_task_work_context_reset(ctx);
> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);

... then I presume this is ok, because the cancel of tw in
cancel_and_free will fail?
Maybe add a comment here that it's interlocked with the free path.
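E.g. something along the lines of:

	/*
	 * BPF_TW_FREED: the map value was deleted concurrently and
	 * cancel_and_free() could not (or did not) cancel this task_work,
	 * so freeing the context is handed off to this callback.
	 */

(wording is just a suggestion).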

> +               return;
> +       }
> +
> +       key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
> +       migrate_disable();
> +       ctx->callback_fn(ctx->map, key, ctx->map_val);
> +       migrate_enable();
> +       rcu_read_unlock_trace();
> +       /* State is running or freed, either way reset. */
> +       bpf_task_work_context_reset(ctx);
> +       state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
> +       if (state == BPF_TW_FREED)
> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +}
> +
> +static void bpf_task_work_irq(struct irq_work *irq_work)
> +{
> +       struct bpf_task_work_context *ctx;
> +       enum bpf_task_work_state state;
> +       int err;
> +
> +       ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
> +
> +       state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
> +       if (state == BPF_TW_FREED)
> +               goto free_context;
> +
> +       err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> +       if (err) {
> +               bpf_task_work_context_reset(ctx);
> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> +               if (state == BPF_TW_FREED)
> +                       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +               return;
> +       }
> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> +       if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
> +               goto free_context; /* successful cancellation, release and free ctx */
> +       return;
> +
> +free_context:
> +       bpf_task_work_context_reset(ctx);
> +       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +}
> +
> +static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
> +                                                                  struct bpf_map *map)
> +{
> +       struct bpf_task_work_context *ctx, *old_ctx;
> +       enum bpf_task_work_state state;
> +       struct bpf_task_work_context __force __rcu **ppc =
> +               (struct bpf_task_work_context __force __rcu **)&tw->ctx;
> +
> +       /* ctx pointer is RCU protected */
> +       rcu_read_lock_trace();
> +       ctx = rcu_dereference(*ppc);
> +       if (!ctx) {
> +               ctx = bpf_task_work_context_alloc();
> +               if (!ctx) {
> +                       rcu_read_unlock_trace();
> +                       return ERR_PTR(-ENOMEM);
> +               }
> +               old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
> +               /*
> +                * If ctx is set by another CPU, release allocated memory.
> +                * Do not fail, though, attempt stealing the work
> +                */
> +               if (old_ctx) {
> +                       bpf_mem_free(&bpf_global_ma, ctx);
> +                       ctx = old_ctx;
> +               }
> +       }
> +       state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
> +       /*
> +        * We can unlock RCU, because task work scheduler (this codepath)
> +        * now owns the ctx or returning an error
> +        */
> +       rcu_read_unlock_trace();
> +       if (state != BPF_TW_STANDBY)
> +               return ERR_PTR(-EBUSY);
> +       return ctx;
> +}
> +
> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> +                                 struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> +                                 struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> +{
> +       struct bpf_prog *prog;
> +       struct bpf_task_work_context *ctx = NULL;
> +       int err;
> +
> +       BTF_TYPE_EMIT(struct bpf_task_work);
> +
> +       prog = bpf_prog_inc_not_zero(aux->prog);
> +       if (IS_ERR(prog))
> +               return -EBADF;
> +
> +       if (!atomic64_read(&map->usercnt)) {
> +               err = -EBADF;
> +               goto release_prog;
> +       }

Please add a comment on why the lack of ordering between the load of
usercnt and the load of tw->ctx is safe in the presence of a parallel
usercnt dec_and_test and ctx xchg.
See __bpf_async_init for a similar race.

> +       task = bpf_task_acquire(task);
> +       if (!task) {
> +               err = -EPERM;
> +               goto release_prog;
> +       }
> +       ctx = bpf_task_work_context_acquire(tw, map);
> +       if (IS_ERR(ctx)) {
> +               err = PTR_ERR(ctx);
> +               goto release_all;
> +       }
> +
> +       ctx->task = task;
> +       ctx->callback_fn = callback_fn;
> +       ctx->prog = prog;
> +       ctx->mode = mode;
> +       ctx->map = map;
> +       ctx->map_val = (void *)tw - map->record->task_work_off;
> +       init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> +       init_task_work(&ctx->work, bpf_task_work_callback);
> +
> +       irq_work_queue(&ctx->irq_work);
> +
> +       return 0;
> +
> +release_all:
> +       bpf_task_release(task);
> +release_prog:
> +       bpf_prog_put(prog);
> +       return err;
> +}
> +
>  /**
>   * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
>   * @task: Task struct for which callback should be scheduled
> @@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
>   *
>   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
>   */
> -__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> -                                             struct bpf_task_work *tw,
> +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
>                                               struct bpf_map *map__map,
> -                                             bpf_task_work_callback_t callback,
> -                                             void *aux__prog)
> +                                             bpf_task_work_callback_t callback, void *aux__prog)
>  {
> -       return 0;
> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
>  }
>
>  /**
> @@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
>   *
>   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
>   */
> -__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
> -                                             struct bpf_task_work *tw,
> +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
>                                               struct bpf_map *map__map,
> -                                             bpf_task_work_callback_t callback,
> -                                             void *aux__prog)
> +                                             bpf_task_work_callback_t callback, void *aux__prog)
>  {
> -       return 0;
> +       enum task_work_notify_mode mode;
> +
> +       mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
>  }
>
>  __bpf_kfunc_end_defs();
>
>  void bpf_task_work_cancel_and_free(void *val)
>  {
> +       struct bpf_task_work *tw = val;
> +       struct bpf_task_work_context *ctx;
> +       enum bpf_task_work_state state;
> +
> +       /* No need do rcu_read_lock as no other codepath can reset this pointer */
> +       ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
> +       if (!ctx)
> +               return;
> +       state = xchg(&ctx->state, BPF_TW_FREED);
> +
> +       switch (state) {
> +       case BPF_TW_SCHEDULED:
> +               /* If we can't cancel task work, rely on task work callback to free the context */
> +               if (!task_work_cancel_match(ctx->task, task_work_match, ctx))

This part looks broken to me.
You are calling this path
(update->obj_free_fields->cancel_and_free->cancel_and_match) in
possibly NMI context, which means we can deadlock if we hit the
NMI-context prog in the middle of a task->pi_lock critical section.
That lock is taken in the task_work functions;
task_work_cancel_match() takes pi_lock.

> +                       break;
> +               bpf_task_work_context_reset(ctx);
> +               fallthrough;
> +       case BPF_TW_STANDBY:
> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +               break;
> +       /* In all below cases scheduling logic should detect context state change and cleanup */
> +       case BPF_TW_SCHEDULING:
> +       case BPF_TW_PENDING:
> +       case BPF_TW_RUNNING:
> +       default:
> +               break;
> +       }
>  }
>
>  BTF_KFUNCS_START(generic_btf_ids)
> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
>
>  #ifdef CONFIG_CGROUPS
>  BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> --
> 2.50.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-19 14:18   ` Kumar Kartikeya Dwivedi
@ 2025-08-19 18:13     ` Mykyta Yatsenko
  2025-08-19 19:27       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 23+ messages in thread
From: Mykyta Yatsenko @ 2025-08-19 18:13 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Implementation of the bpf_task_work_schedule kfuncs.
>>
>> Main components:
>>   * struct bpf_task_work_context – Metadata and state management per task
>> work.
>>   * enum bpf_task_work_state – A state machine to serialize work
>>   scheduling and execution.
>>   * bpf_task_work_schedule() – The central helper that initiates
>> scheduling.
>>   * bpf_task_work_acquire() - Attempts to take ownership of the context,
>>   pointed by passed struct bpf_task_work, allocates new context if none
>>   exists yet.
>>   * bpf_task_work_callback() – Invoked when the actual task_work runs.
>>   * bpf_task_work_irq() – An intermediate step (runs in softirq context)
>> to enqueue task work.
>>   * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> Can you elaborate on why the bouncing through irq_work context is necessary?
> I think we should have this info in the commit log.
> Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
Yes, mainly to avoid taking locks in NMI context.
>
>> Flow of successful task work scheduling
>>   1) bpf_task_work_schedule_* is called from BPF code.
>>   2) Transition state from STANDBY to PENDING, marks context is owned by
>>   this task work scheduler
>>   3) irq_work_queue() schedules bpf_task_work_irq().
>>   4) Transition state from PENDING to SCHEDULING.
>>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>>   transitions to SCHEDULED.
>>   5) Task work calls bpf_task_work_callback(), which transition state to
>>   RUNNING.
>>   6) BPF callback is executed
>>   7) Context is cleaned up, refcounts released, context state set back to
>>   STANDBY.
>>
>> bpf_task_work_context handling
>> The context pointer is stored in bpf_task_work ctx field (u64) but
>> treated as an __rcu pointer via casts.
>> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
>> with RCU initializer.
>> Read under the RCU lock only in bpf_task_work_acquire() when ownership
>> is contended.
>> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
>> context pointer from struct bpf_task_work and releases resources
>> if scheduler does not own the context or can be canceled (state ==
>> STANDBY or state == SCHEDULED and callback canceled). If task work
>> scheduler owns the context, its state is set to FREED and scheduler is
>> expected to cleanup on the next state transition.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
> This is much better now, with clear ownership between free path and
> scheduling path, I mostly have a few more comments on the current
> implementation, plus one potential bug.
>
> However, the more time I spend on this, the more I feel we should
> unify all this with the two other bpf async work execution mechanisms
> (timers and wq), and simplify and deduplicate a lot of this under the
> serialized async->lock. I know NMI execution is probably critical for
> this primitive, but we can replace async->lock with rqspinlock to
> address that, so that it becomes safe to serialize in any context.
> Apart from that, I don't see anything that would negate reworking all
> this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> task_work locks that have trouble with NMI execution (see later
> comments).
>
> I also feel like it would be cleaner if we split the API into 3 steps:
> init(), set_callback(), schedule() like the other cases, I don't see
> why we necessarily need to diverge, and it simplifies some of the
> logic in schedule().
> Once every state update is protected by a lock, all of the state
> transitions are done automatically and a lot of the extra races are
> eliminated.
>
> I think we should discuss whether this was considered and why you
> discarded this approach, otherwise the code is pretty complex, with
> little upside.
> Maybe I'm missing something obvious and you'd know more since you've
> thought about all this longer.
As for the API, I think having one function for scheduling a callback is
cleaner than having three which are always called in the same order
anyway. Most of the complexity comes from synchronization, not logic, so
not having to do the same synchronization in init(), set_callback() and
schedule() seems like a benefit to me.
Let me check if using rqspinlock is going to make things simpler. We
still need states to at least know whether cancellation is possible and
to flag deletion to the scheduler, but using a lock will make the code
easier to understand.
>
>>   kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 260 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index d2f88a9bc47b..346ae8fd3ada 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -25,6 +25,8 @@
>>   #include <linux/kasan.h>
>>   #include <linux/bpf_verifier.h>
>>   #include <linux/uaccess.h>
>> +#include <linux/task_work.h>
>> +#include <linux/irq_work.h>
>>
>>   #include "../../lib/kstrtox.h"
>>
>> @@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
>>
>>   typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
>>
>> +enum bpf_task_work_state {
>> +       /* bpf_task_work is ready to be used */
>> +       BPF_TW_STANDBY = 0,
>> +       /* irq work scheduling in progress */
>> +       BPF_TW_PENDING,
>> +       /* task work scheduling in progress */
>> +       BPF_TW_SCHEDULING,
>> +       /* task work is scheduled successfully */
>> +       BPF_TW_SCHEDULED,
>> +       /* callback is running */
>> +       BPF_TW_RUNNING,
>> +       /* associated BPF map value is deleted */
>> +       BPF_TW_FREED,
>> +};
>> +
>> +struct bpf_task_work_context {
>> +       /* the map and map value associated with this context */
>> +       struct bpf_map *map;
>> +       void *map_val;
>> +       /* bpf_prog that schedules task work */
>> +       struct bpf_prog *prog;
>> +       /* task for which callback is scheduled */
>> +       struct task_struct *task;
>> +       enum task_work_notify_mode mode;
>> +       enum bpf_task_work_state state;
>> +       bpf_task_work_callback_t callback_fn;
>> +       struct callback_head work;
>> +       struct irq_work irq_work;
>> +       struct rcu_head rcu;
>> +} __aligned(8);
>> +
>> +static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
>> +{
>> +       struct bpf_task_work_context *ctx;
>> +
>> +       ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
>> +       if (ctx)
>> +               memset(ctx, 0, sizeof(*ctx));
>> +       return ctx;
>> +}
>> +
>> +static void bpf_task_work_context_free(struct rcu_head *rcu)
>> +{
>> +       struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
>> +       /* bpf_mem_free expects migration to be disabled */
>> +       migrate_disable();
>> +       bpf_mem_free(&bpf_global_ma, ctx);
>> +       migrate_enable();
>> +}
>> +
>> +static bool task_work_match(struct callback_head *head, void *data)
>> +{
>> +       struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
>> +
>> +       return ctx == data;
>> +}
>> +
>> +static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
>> +{
>> +       bpf_prog_put(ctx->prog);
>> +       bpf_task_release(ctx->task);
>> +}
>> +
>> +static void bpf_task_work_callback(struct callback_head *cb)
>> +{
>> +       enum bpf_task_work_state state;
>> +       struct bpf_task_work_context *ctx;
>> +       u32 idx;
>> +       void *key;
>> +
>> +       ctx = container_of(cb, struct bpf_task_work_context, work);
>> +
>> +       /*
>> +        * Read lock is needed to protect map key and value access below, it has to be done before
>> +        * the state transition
>> +        */
>> +       rcu_read_lock_trace();
>> +       /*
>> +        * This callback may start running before bpf_task_work_irq() switched the state to
>> +        * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
>> +        */
>> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
>> +       if (state == BPF_TW_SCHEDULED)
> ... and let's say we have concurrent cancel_and_free here, we mark
> state BPF_TW_FREED.
>
>> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
>> +       if (state == BPF_TW_FREED) {
> ... and notice it here now ...
>
>> +               rcu_read_unlock_trace();
>> +               bpf_task_work_context_reset(ctx);
>> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> ... then I presume this is ok, because the cancel of tw in
> cancel_and_free will fail?
Yes, if cancellation succeeds, the callback will not be called.
If it fails, cancel_and_free does not do anything except change the
state, and the callback does the cleanup.
> Maybe add a comment here that it's interlocked with the free path.
>
>> +               return;
>> +       }
>> +
>> +       key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
>> +       migrate_disable();
>> +       ctx->callback_fn(ctx->map, key, ctx->map_val);
>> +       migrate_enable();
>> +       rcu_read_unlock_trace();
>> +       /* State is running or freed, either way reset. */
>> +       bpf_task_work_context_reset(ctx);
>> +       state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
>> +       if (state == BPF_TW_FREED)
>> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +}
>> +
>> +static void bpf_task_work_irq(struct irq_work *irq_work)
>> +{
>> +       struct bpf_task_work_context *ctx;
>> +       enum bpf_task_work_state state;
>> +       int err;
>> +
>> +       ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
>> +
>> +       state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
>> +       if (state == BPF_TW_FREED)
>> +               goto free_context;
>> +
>> +       err = task_work_add(ctx->task, &ctx->work, ctx->mode);
>> +       if (err) {
>> +               bpf_task_work_context_reset(ctx);
>> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
>> +               if (state == BPF_TW_FREED)
>> +                       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +               return;
>> +       }
>> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
>> +       if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
>> +               goto free_context; /* successful cancellation, release and free ctx */
>> +       return;
>> +
>> +free_context:
>> +       bpf_task_work_context_reset(ctx);
>> +       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +}
>> +
>> +static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
>> +                                                                  struct bpf_map *map)
>> +{
>> +       struct bpf_task_work_context *ctx, *old_ctx;
>> +       enum bpf_task_work_state state;
>> +       struct bpf_task_work_context __force __rcu **ppc =
>> +               (struct bpf_task_work_context __force __rcu **)&tw->ctx;
>> +
>> +       /* ctx pointer is RCU protected */
>> +       rcu_read_lock_trace();
>> +       ctx = rcu_dereference(*ppc);
>> +       if (!ctx) {
>> +               ctx = bpf_task_work_context_alloc();
>> +               if (!ctx) {
>> +                       rcu_read_unlock_trace();
>> +                       return ERR_PTR(-ENOMEM);
>> +               }
>> +               old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
>> +               /*
>> +                * If ctx is set by another CPU, release allocated memory.
>> +                * Do not fail, though, attempt stealing the work
>> +                */
>> +               if (old_ctx) {
>> +                       bpf_mem_free(&bpf_global_ma, ctx);
>> +                       ctx = old_ctx;
>> +               }
>> +       }
>> +       state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
>> +       /*
>> +        * We can unlock RCU, because task work scheduler (this codepath)
>> +        * now owns the ctx or returning an error
>> +        */
>> +       rcu_read_unlock_trace();
>> +       if (state != BPF_TW_STANDBY)
>> +               return ERR_PTR(-EBUSY);
>> +       return ctx;
>> +}
>> +
>> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
>> +                                 struct bpf_map *map, bpf_task_work_callback_t callback_fn,
>> +                                 struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
>> +{
>> +       struct bpf_prog *prog;
>> +       struct bpf_task_work_context *ctx = NULL;
>> +       int err;
>> +
>> +       BTF_TYPE_EMIT(struct bpf_task_work);
>> +
>> +       prog = bpf_prog_inc_not_zero(aux->prog);
>> +       if (IS_ERR(prog))
>> +               return -EBADF;
>> +
>> +       if (!atomic64_read(&map->usercnt)) {
>> +               err = -EBADF;
>> +               goto release_prog;
>> +       }
> Please add a comment on why lack of ordering between load of usercnt
> and load of tw->ctx is safe, in presence of a parallel usercnt
> dec_and_test and ctx xchg.
> See __bpf_async_init for similar race.
I think I see what you mean; let me double-check this.
>
>> +       task = bpf_task_acquire(task);
>> +       if (!task) {
>> +               err = -EPERM;
>> +               goto release_prog;
>> +       }
>> +       ctx = bpf_task_work_context_acquire(tw, map);
>> +       if (IS_ERR(ctx)) {
>> +               err = PTR_ERR(ctx);
>> +               goto release_all;
>> +       }
>> +
>> +       ctx->task = task;
>> +       ctx->callback_fn = callback_fn;
>> +       ctx->prog = prog;
>> +       ctx->mode = mode;
>> +       ctx->map = map;
>> +       ctx->map_val = (void *)tw - map->record->task_work_off;
>> +       init_irq_work(&ctx->irq_work, bpf_task_work_irq);
>> +       init_task_work(&ctx->work, bpf_task_work_callback);
>> +
>> +       irq_work_queue(&ctx->irq_work);
>> +
>> +       return 0;
>> +
>> +release_all:
>> +       bpf_task_release(task);
>> +release_prog:
>> +       bpf_prog_put(prog);
>> +       return err;
>> +}
>> +
>>   /**
>>    * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
>>    * @task: Task struct for which callback should be scheduled
>> @@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
>>    *
>>    * Return: 0 if task work has been scheduled successfully, negative error code otherwise
>>    */
>> -__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
>> -                                             struct bpf_task_work *tw,
>> +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
>>                                                struct bpf_map *map__map,
>> -                                             bpf_task_work_callback_t callback,
>> -                                             void *aux__prog)
>> +                                             bpf_task_work_callback_t callback, void *aux__prog)
>>   {
>> -       return 0;
>> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
>>   }
>>
>>   /**
>> @@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
>>    *
>>    * Return: 0 if task work has been scheduled successfully, negative error code otherwise
>>    */
>> -__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
>> -                                             struct bpf_task_work *tw,
>> +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
>>                                                struct bpf_map *map__map,
>> -                                             bpf_task_work_callback_t callback,
>> -                                             void *aux__prog)
>> +                                             bpf_task_work_callback_t callback, void *aux__prog)
>>   {
>> -       return 0;
>> +       enum task_work_notify_mode mode;
>> +
>> +       mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
>> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
>>   }
>>
>>   __bpf_kfunc_end_defs();
>>
>>   void bpf_task_work_cancel_and_free(void *val)
>>   {
>> +       struct bpf_task_work *tw = val;
>> +       struct bpf_task_work_context *ctx;
>> +       enum bpf_task_work_state state;
>> +
>> +       /* No need do rcu_read_lock as no other codepath can reset this pointer */
>> +       ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
>> +       if (!ctx)
>> +               return;
>> +       state = xchg(&ctx->state, BPF_TW_FREED);
>> +
>> +       switch (state) {
>> +       case BPF_TW_SCHEDULED:
>> +               /* If we can't cancel task work, rely on task work callback to free the context */
>> +               if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
> This part looks broken to me.
> You are calling this path
> (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> possibly NMI context.
> Which means we can deadlock if we hit the NMI context prog in the
> middle of task->pi_lock critical section.
> That's taken in task_work functions
> The task_work_cancel_match takes the pi_lock.
Good point, thanks. I think this could be solved in two ways:
  * Don't cancel; rely on the callback dropping the work
  * Cancel from another irq_work
I'll probably go with the second one, roughly as sketched below.
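(Sketch only; the handler name is made up, and whether ctx->irq_work can
be reused for this or needs a dedicated irq_work still needs care.)

static void bpf_task_work_cancel_irq(struct irq_work *irq_work)
{
	struct bpf_task_work_context *ctx =
		container_of(irq_work, struct bpf_task_work_context, irq_work);

	/* Not in NMI anymore, so task_work_cancel_match() taking
	 * task->pi_lock is fine here. State is already BPF_TW_FREED.
	 */
	if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
		return;	/* the callback will see BPF_TW_FREED and free ctx */
	bpf_task_work_context_reset(ctx);
	call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
}

bpf_task_work_cancel_and_free() would then queue this for the SCHEDULED
case instead of calling task_work_cancel_match() directly.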
>
>> +                       break;
>> +               bpf_task_work_context_reset(ctx);
>> +               fallthrough;
>> +       case BPF_TW_STANDBY:
>> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +               break;
>> +       /* In all below cases scheduling logic should detect context state change and cleanup */
>> +       case BPF_TW_SCHEDULING:
>> +       case BPF_TW_PENDING:
>> +       case BPF_TW_RUNNING:
>> +       default:
>> +               break;
>> +       }
>>   }
>>
>>   BTF_KFUNCS_START(generic_btf_ids)
>> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
>> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
>> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
>>
>>   #ifdef CONFIG_CGROUPS
>>   BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
>> --
>> 2.50.1
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-19 18:13     ` Mykyta Yatsenko
@ 2025-08-19 19:27       ` Kumar Kartikeya Dwivedi
  2025-08-19 20:49         ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-19 19:27 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On Tue, 19 Aug 2025 at 20:16, Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> > On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> > <mykyta.yatsenko5@gmail.com> wrote:
> >> From: Mykyta Yatsenko <yatsenko@meta.com>
> >>
> >> Implementation of the bpf_task_work_schedule kfuncs.
> >>
> >> Main components:
> >>   * struct bpf_task_work_context – Metadata and state management per task
> >> work.
> >>   * enum bpf_task_work_state – A state machine to serialize work
> >>   scheduling and execution.
> >>   * bpf_task_work_schedule() – The central helper that initiates
> >> scheduling.
> >>   * bpf_task_work_acquire() - Attempts to take ownership of the context,
> >>   pointed by passed struct bpf_task_work, allocates new context if none
> >>   exists yet.
> >>   * bpf_task_work_callback() – Invoked when the actual task_work runs.
> >>   * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> >> to enqueue task work.
> >>   * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> > Can you elaborate on why the bouncing through irq_work context is necessary?
> > I think we should have this info in the commit log.
> > Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
> yes, mainly to avoid locks in NMI.
> >
> >> Flow of successful task work scheduling
> >>   1) bpf_task_work_schedule_* is called from BPF code.
> >>   2) Transition state from STANDBY to PENDING, marks context is owned by
> >>   this task work scheduler
> >>   3) irq_work_queue() schedules bpf_task_work_irq().
> >>   4) Transition state from PENDING to SCHEDULING.
> >>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> >>   transitions to SCHEDULED.
> >>   5) Task work calls bpf_task_work_callback(), which transition state to
> >>   RUNNING.
> >>   6) BPF callback is executed
> >>   7) Context is cleaned up, refcounts released, context state set back to
> >>   STANDBY.
> >>
> >> bpf_task_work_context handling
> >> The context pointer is stored in bpf_task_work ctx field (u64) but
> >> treated as an __rcu pointer via casts.
> >> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
> >> with RCU initializer.
> >> Read under the RCU lock only in bpf_task_work_acquire() when ownership
> >> is contended.
> >> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
> >> context pointer from struct bpf_task_work and releases resources
> >> if scheduler does not own the context or can be canceled (state ==
> >> STANDBY or state == SCHEDULED and callback canceled). If task work
> >> scheduler owns the context, its state is set to FREED and scheduler is
> >> expected to cleanup on the next state transition.
> >>
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> >> ---
> > This is much better now, with clear ownership between free path and
> > scheduling path, I mostly have a few more comments on the current
> > implementation, plus one potential bug.
> >
> > However, the more time I spend on this, the more I feel we should
> > unify all this with the two other bpf async work execution mechanisms
> > (timers and wq), and simplify and deduplicate a lot of this under the
> > serialized async->lock. I know NMI execution is probably critical for
> > this primitive, but we can replace async->lock with rqspinlock to
> > address that, so that it becomes safe to serialize in any context.
> > Apart from that, I don't see anything that would negate reworking all
> > this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> > task_work locks that have trouble with NMI execution (see later
> > comments).
> >
> > I also feel like it would be cleaner if we split the API into 3 steps:
> > init(), set_callback(), schedule() like the other cases, I don't see
> > why we necessarily need to diverge, and it simplifies some of the
> > logic in schedule().
> > Once every state update is protected by a lock, all of the state
> > transitions are done automatically and a lot of the extra races are
> > eliminated.
> >
> > I think we should discuss whether this was considered and why you
> > discarded this approach, otherwise the code is pretty complex, with
> > little upside.
> > Maybe I'm missing something obvious and you'd know more since you've
> > thought about all this longer.
> As for API, I think having 1 function for scheduling callback is cleaner
> then having 3 which are always called in the same order anyway. Most of
> the complexity
> comes from synchronization, not logic, so not having to do the same
> synchronization in
> init(), set_callback() and schedule() seems like a benefit to me.

Well, if you were to reuse bpf_async_kern, all of that extra logic is
already taken care of, or can be easily shared.
If you look closely you'll see that a lot of what you're doing is a
repetition of what timers and bpf_wq have.
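To make that concrete, the shape I have in mind is roughly the following (a
sketch from memory of the current bpf_async_kern layout in
kernel/bpf/helpers.c; the field and enum names on the task_work side are
assumptions, not a worked-out proposal):

enum bpf_async_type {
	BPF_ASYNC_TYPE_TIMER = 0,
	BPF_ASYNC_TYPE_WQ,
	BPF_ASYNC_TYPE_TASK_WORK,		/* new */
};

/* analogous to struct bpf_hrtimer / struct bpf_work */
struct bpf_task_work_impl {
	struct bpf_async_cb cb;
	struct task_struct *task;
	enum task_work_notify_mode mode;
	struct callback_head work;
	struct irq_work irq_work;
};

struct bpf_async_kern {
	union {
		struct bpf_async_cb *cb;
		struct bpf_hrtimer *timer;
		struct bpf_work *work;
		struct bpf_task_work_impl *tw;	/* new */
	};
	struct bpf_spin_lock lock;	/* or rqspinlock, per this discussion */
} __aligned(8);

with the intent that the __bpf_async_init()/__bpf_async_set_callback()
plumbing (and the cancel_and_free path) is shared, and only the final "kick"
step differs per type.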

> Let me check if using rqspinlock going to make things simpler. We still
> need states to at least know if cancellation is possible and to flag
> deletion to scheduler, but using a lock will make code easier to
> understand.

Yeah, I think for all of this, using lockless updates is not really
worth it; let's just serialize using a lock.

> >
> >>   kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
> >>   1 file changed, 260 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index d2f88a9bc47b..346ae8fd3ada 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> >> @@ -25,6 +25,8 @@
> >>   #include <linux/kasan.h>
> >>   #include <linux/bpf_verifier.h>
> >>   #include <linux/uaccess.h>
> >> +#include <linux/task_work.h>
> >> +#include <linux/irq_work.h>
> >>
> >>   #include "../../lib/kstrtox.h"
> >>
> >> @@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
> >>
> >>   typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
> >>
> >> +enum bpf_task_work_state {
> >> +       /* bpf_task_work is ready to be used */
> >> +       BPF_TW_STANDBY = 0,
> >> +       /* irq work scheduling in progress */
> >> +       BPF_TW_PENDING,
> >> +       /* task work scheduling in progress */
> >> +       BPF_TW_SCHEDULING,
> >> +       /* task work is scheduled successfully */
> >> +       BPF_TW_SCHEDULED,
> >> +       /* callback is running */
> >> +       BPF_TW_RUNNING,
> >> +       /* associated BPF map value is deleted */
> >> +       BPF_TW_FREED,
> >> +};
> >> +
> >> +struct bpf_task_work_context {
> >> +       /* the map and map value associated with this context */
> >> +       struct bpf_map *map;
> >> +       void *map_val;
> >> +       /* bpf_prog that schedules task work */
> >> +       struct bpf_prog *prog;
> >> +       /* task for which callback is scheduled */
> >> +       struct task_struct *task;
> >> +       enum task_work_notify_mode mode;
> >> +       enum bpf_task_work_state state;
> >> +       bpf_task_work_callback_t callback_fn;
> >> +       struct callback_head work;
> >> +       struct irq_work irq_work;
> >> +       struct rcu_head rcu;
> >> +} __aligned(8);
> >> +
> >> +static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
> >> +{
> >> +       struct bpf_task_work_context *ctx;
> >> +
> >> +       ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
> >> +       if (ctx)
> >> +               memset(ctx, 0, sizeof(*ctx));
> >> +       return ctx;
> >> +}
> >> +
> >> +static void bpf_task_work_context_free(struct rcu_head *rcu)
> >> +{
> >> +       struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
> >> +       /* bpf_mem_free expects migration to be disabled */
> >> +       migrate_disable();
> >> +       bpf_mem_free(&bpf_global_ma, ctx);
> >> +       migrate_enable();
> >> +}
> >> +
> >> +static bool task_work_match(struct callback_head *head, void *data)
> >> +{
> >> +       struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
> >> +
> >> +       return ctx == data;
> >> +}
> >> +
> >> +static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
> >> +{
> >> +       bpf_prog_put(ctx->prog);
> >> +       bpf_task_release(ctx->task);
> >> +}
> >> +
> >> +static void bpf_task_work_callback(struct callback_head *cb)
> >> +{
> >> +       enum bpf_task_work_state state;
> >> +       struct bpf_task_work_context *ctx;
> >> +       u32 idx;
> >> +       void *key;
> >> +
> >> +       ctx = container_of(cb, struct bpf_task_work_context, work);
> >> +
> >> +       /*
> >> +        * Read lock is needed to protect map key and value access below, it has to be done before
> >> +        * the state transition
> >> +        */
> >> +       rcu_read_lock_trace();
> >> +       /*
> >> +        * This callback may start running before bpf_task_work_irq() switched the state to
> >> +        * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
> >> +        */
> >> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
> >> +       if (state == BPF_TW_SCHEDULED)
> > ... and let's say we have concurrent cancel_and_free here, we mark
> > state BPF_TW_FREED.
> >
> >> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
> >> +       if (state == BPF_TW_FREED) {
> > ... and notice it here now ...
> >
> >> +               rcu_read_unlock_trace();
> >> +               bpf_task_work_context_reset(ctx);
> >> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> > ... then I presume this is ok, because the cancel of tw in
> > cancel_and_free will fail?
> yes, if cancellation succeeds, callback will not be called.
> If it fails, cancel_and_free does not do anything, except changing
> the state and callback does the cleanup.
> > Maybe add a comment here that it's interlocked with the free path.
> >
> >> +               return;
> >> +       }
> >> +
> >> +       key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
> >> +       migrate_disable();
> >> +       ctx->callback_fn(ctx->map, key, ctx->map_val);
> >> +       migrate_enable();
> >> +       rcu_read_unlock_trace();
> >> +       /* State is running or freed, either way reset. */
> >> +       bpf_task_work_context_reset(ctx);
> >> +       state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
> >> +       if (state == BPF_TW_FREED)
> >> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +}
> >> +
> >> +static void bpf_task_work_irq(struct irq_work *irq_work)
> >> +{
> >> +       struct bpf_task_work_context *ctx;
> >> +       enum bpf_task_work_state state;
> >> +       int err;
> >> +
> >> +       ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
> >> +
> >> +       state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
> >> +       if (state == BPF_TW_FREED)
> >> +               goto free_context;
> >> +
> >> +       err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> >> +       if (err) {
> >> +               bpf_task_work_context_reset(ctx);
> >> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> >> +               if (state == BPF_TW_FREED)
> >> +                       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +               return;
> >> +       }
> >> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> >> +       if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
> >> +               goto free_context; /* successful cancellation, release and free ctx */
> >> +       return;
> >> +
> >> +free_context:
> >> +       bpf_task_work_context_reset(ctx);
> >> +       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +}
> >> +
> >> +static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
> >> +                                                                  struct bpf_map *map)
> >> +{
> >> +       struct bpf_task_work_context *ctx, *old_ctx;
> >> +       enum bpf_task_work_state state;
> >> +       struct bpf_task_work_context __force __rcu **ppc =
> >> +               (struct bpf_task_work_context __force __rcu **)&tw->ctx;
> >> +
> >> +       /* ctx pointer is RCU protected */
> >> +       rcu_read_lock_trace();
> >> +       ctx = rcu_dereference(*ppc);
> >> +       if (!ctx) {
> >> +               ctx = bpf_task_work_context_alloc();
> >> +               if (!ctx) {
> >> +                       rcu_read_unlock_trace();
> >> +                       return ERR_PTR(-ENOMEM);
> >> +               }
> >> +               old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
> >> +               /*
> >> +                * If ctx is set by another CPU, release allocated memory.
> >> +                * Do not fail, though, attempt stealing the work
> >> +                */
> >> +               if (old_ctx) {
> >> +                       bpf_mem_free(&bpf_global_ma, ctx);
> >> +                       ctx = old_ctx;
> >> +               }
> >> +       }
> >> +       state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
> >> +       /*
> >> +        * We can unlock RCU, because task work scheduler (this codepath)
> >> +        * now owns the ctx or returning an error
> >> +        */
> >> +       rcu_read_unlock_trace();
> >> +       if (state != BPF_TW_STANDBY)
> >> +               return ERR_PTR(-EBUSY);
> >> +       return ctx;
> >> +}
> >> +
> >> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> >> +                                 struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> >> +                                 struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> >> +{
> >> +       struct bpf_prog *prog;
> >> +       struct bpf_task_work_context *ctx = NULL;
> >> +       int err;
> >> +
> >> +       BTF_TYPE_EMIT(struct bpf_task_work);
> >> +
> >> +       prog = bpf_prog_inc_not_zero(aux->prog);
> >> +       if (IS_ERR(prog))
> >> +               return -EBADF;
> >> +
> >> +       if (!atomic64_read(&map->usercnt)) {
> >> +               err = -EBADF;
> >> +               goto release_prog;
> >> +       }
> > Please add a comment on why lack of ordering between load of usercnt
> > and load of tw->ctx is safe, in presence of a parallel usercnt
> > dec_and_test and ctx xchg.
> > See __bpf_async_init for similar race.
> I think I see what you mean, let me double check this.
> >
> >> +       task = bpf_task_acquire(task);
> >> +       if (!task) {
> >> +               err = -EPERM;
> >> +               goto release_prog;
> >> +       }
> >> +       ctx = bpf_task_work_context_acquire(tw, map);
> >> +       if (IS_ERR(ctx)) {
> >> +               err = PTR_ERR(ctx);
> >> +               goto release_all;
> >> +       }
> >> +
> >> +       ctx->task = task;
> >> +       ctx->callback_fn = callback_fn;
> >> +       ctx->prog = prog;
> >> +       ctx->mode = mode;
> >> +       ctx->map = map;
> >> +       ctx->map_val = (void *)tw - map->record->task_work_off;
> >> +       init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> >> +       init_task_work(&ctx->work, bpf_task_work_callback);
> >> +
> >> +       irq_work_queue(&ctx->irq_work);
> >> +
> >> +       return 0;
> >> +
> >> +release_all:
> >> +       bpf_task_release(task);
> >> +release_prog:
> >> +       bpf_prog_put(prog);
> >> +       return err;
> >> +}
> >> +
> >>   /**
> >>    * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
> >>    * @task: Task struct for which callback should be scheduled
> >> @@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
> >>    *
> >>    * Return: 0 if task work has been scheduled successfully, negative error code otherwise
> >>    */
> >> -__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> >> -                                             struct bpf_task_work *tw,
> >> +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
> >>                                                struct bpf_map *map__map,
> >> -                                             bpf_task_work_callback_t callback,
> >> -                                             void *aux__prog)
> >> +                                             bpf_task_work_callback_t callback, void *aux__prog)
> >>   {
> >> -       return 0;
> >> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
> >>   }
> >>
> >>   /**
> >> @@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> >>    *
> >>    * Return: 0 if task work has been scheduled successfully, negative error code otherwise
> >>    */
> >> -__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
> >> -                                             struct bpf_task_work *tw,
> >> +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
> >>                                                struct bpf_map *map__map,
> >> -                                             bpf_task_work_callback_t callback,
> >> -                                             void *aux__prog)
> >> +                                             bpf_task_work_callback_t callback, void *aux__prog)
> >>   {
> >> -       return 0;
> >> +       enum task_work_notify_mode mode;
> >> +
> >> +       mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> >> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
> >>   }
> >>
> >>   __bpf_kfunc_end_defs();
> >>
> >>   void bpf_task_work_cancel_and_free(void *val)
> >>   {
> >> +       struct bpf_task_work *tw = val;
> >> +       struct bpf_task_work_context *ctx;
> >> +       enum bpf_task_work_state state;
> >> +
> >> +       /* No need do rcu_read_lock as no other codepath can reset this pointer */
> >> +       ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
> >> +       if (!ctx)
> >> +               return;
> >> +       state = xchg(&ctx->state, BPF_TW_FREED);
> >> +
> >> +       switch (state) {
> >> +       case BPF_TW_SCHEDULED:
> >> +               /* If we can't cancel task work, rely on task work callback to free the context */
> >> +               if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
> > This part looks broken to me.
> > You are calling this path
> > (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> > possibly NMI context.
> > Which means we can deadlock if we hit the NMI context prog in the
> > middle of task->pi_lock critical section.
> > That's taken in task_work functions
> > The task_work_cancel_match takes the pi_lock.
> Good point, thanks. I think this could be solved in 2 ways:
>   * Don't cancel, rely on callback dropping the work
>   * Cancel in another irq_work
> I'll probably go with the second one.

What about 1? It seems like we can just rely on the existing hunk to
free the callback on noticing BPF_TW_FREED?
That seems simpler to me.

> >
> >> +                       break;
> >> +               bpf_task_work_context_reset(ctx);
> >> +               fallthrough;
> >> +       case BPF_TW_STANDBY:
> >> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +               break;
> >> +       /* In all below cases scheduling logic should detect context state change and cleanup */
> >> +       case BPF_TW_SCHEDULING:
> >> +       case BPF_TW_PENDING:
> >> +       case BPF_TW_RUNNING:
> >> +       default:
> >> +               break;
> >> +       }
> >>   }
> >>
> >>   BTF_KFUNCS_START(generic_btf_ids)
> >> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
> >>   BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
> >>   BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
> >>   BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
> >> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
> >> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
> >>
> >>   #ifdef CONFIG_CGROUPS
> >>   BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> >> --
> >> 2.50.1
> >>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-19 19:27       ` Kumar Kartikeya Dwivedi
@ 2025-08-19 20:49         ` Andrii Nakryiko
  2025-08-20 16:11           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2025-08-19 20:49 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, Mykyta Yatsenko

On Tue, Aug 19, 2025 at 12:28 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 19 Aug 2025 at 20:16, Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
> >
> > On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> > > On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> > > <mykyta.yatsenko5@gmail.com> wrote:
> > >> From: Mykyta Yatsenko <yatsenko@meta.com>
> > >>
> > >> Implementation of the bpf_task_work_schedule kfuncs.
> > >>
> > >> Main components:
> > >>   * struct bpf_task_work_context – Metadata and state management per task
> > >> work.
> > >>   * enum bpf_task_work_state – A state machine to serialize work
> > >>   scheduling and execution.
> > >>   * bpf_task_work_schedule() – The central helper that initiates
> > >> scheduling.
> > >>   * bpf_task_work_acquire() - Attempts to take ownership of the context,
> > >>   pointed by passed struct bpf_task_work, allocates new context if none
> > >>   exists yet.
> > >>   * bpf_task_work_callback() – Invoked when the actual task_work runs.
> > >>   * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> > >> to enqueue task work.
> > >>   * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> > > Can you elaborate on why the bouncing through irq_work context is necessary?
> > > I think we should have this info in the commit log.
> > > Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
> > yes, mainly to avoid locks in NMI.
> > >
> > >> Flow of successful task work scheduling
> > >>   1) bpf_task_work_schedule_* is called from BPF code.
> > >>   2) Transition state from STANDBY to PENDING, marks context is owned by
> > >>   this task work scheduler
> > >>   3) irq_work_queue() schedules bpf_task_work_irq().
> > >>   4) Transition state from PENDING to SCHEDULING.
> > >>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> > >>   transitions to SCHEDULED.
> > >>   5) Task work calls bpf_task_work_callback(), which transition state to
> > >>   RUNNING.
> > >>   6) BPF callback is executed
> > >>   7) Context is cleaned up, refcounts released, context state set back to
> > >>   STANDBY.
> > >>
> > >> bpf_task_work_context handling
> > >> The context pointer is stored in bpf_task_work ctx field (u64) but
> > >> treated as an __rcu pointer via casts.
> > >> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
> > >> with RCU initializer.
> > >> Read under the RCU lock only in bpf_task_work_acquire() when ownership
> > >> is contended.
> > >> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
> > >> context pointer from struct bpf_task_work and releases resources
> > >> if scheduler does not own the context or can be canceled (state ==
> > >> STANDBY or state == SCHEDULED and callback canceled). If task work
> > >> scheduler owns the context, its state is set to FREED and scheduler is
> > >> expected to cleanup on the next state transition.
> > >>
> > >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > >> ---
> > > This is much better now, with clear ownership between free path and
> > > scheduling path, I mostly have a few more comments on the current
> > > implementation, plus one potential bug.
> > >
> > > However, the more time I spend on this, the more I feel we should
> > > unify all this with the two other bpf async work execution mechanisms
> > > (timers and wq), and simplify and deduplicate a lot of this under the
> > > serialized async->lock. I know NMI execution is probably critical for
> > > this primitive, but we can replace async->lock with rqspinlock to
> > > address that, so that it becomes safe to serialize in any context.
> > > Apart from that, I don't see anything that would negate reworking all
> > > this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> > > task_work locks that have trouble with NMI execution (see later
> > > comments).
> > >
> > > I also feel like it would be cleaner if we split the API into 3 steps:
> > > init(), set_callback(), schedule() like the other cases, I don't see
> > > why we necessarily need to diverge, and it simplifies some of the
> > > logic in schedule().
> > > Once every state update is protected by a lock, all of the state
> > > transitions are done automatically and a lot of the extra races are
> > > eliminated.
> > >
> > > I think we should discuss whether this was considered and why you
> > > discarded this approach, otherwise the code is pretty complex, with
> > > little upside.
> > > Maybe I'm missing something obvious and you'd know more since you've
> > > thought about all this longer.
> > As for API, I think having 1 function for scheduling callback is cleaner
> > then having 3 which are always called in the same order anyway. Most of
> > the complexity
> > comes from synchronization, not logic, so not having to do the same
> > synchronization in
> > init(), set_callback() and schedule() seems like a benefit to me.
>
> Well, if you were to reuse bpf_async_kern, all of that extra logic is
> already taken care of, or can be easily shared.
> If you look closely you'll see that a lot of what you're doing is a
> repetition of what timers and bpf_wq have.
>
> > Let me check if using rqspinlock going to make things simpler. We still
> > need states to at least know if cancellation is possible and to flag
> > deletion to scheduler, but using a lock will make code easier to
> > understand.
>
> Yeah I think for all of this using lockless updates is not really
> worth it, let's just serialize using a lock.

I don't think it's "just serialize".

__bpf_async_init and __bpf_async_set_callback currently have `if
(in_nmi()) return -EOPNOTSUPP;`, because of `bpf_map_kmalloc_node`
(solvable with bpf_mem_alloc, not a big deal) and then unconditional
`__bpf_spin_lock_irqsave(&async->lock);` (and maybe some other things
that can't be done in NMI).

We can't just replace __bpf_spin_lock_irqsave with rqspinlock, because
the latter can fail. So the simplicity of unconditional locking is
gone. We'd need to deal with the possibility of lock failing. It's
probably not that straightforward in the case of
__bpf_async_cancel_and_free.

On the other hand, a state machine with cmpxchg means there is always
forward progress and there is always determinism about what the last
reached state was before we went to FREED, which means we will know
who needs to cancel the callback (if at all), and who is responsible for
freeing resources.

I'm actually wondering if this state machine approach could/should be
adopted for bpf_async_cb?.. I wouldn't start there, though, and rather
finish task_work_add integration before trying to generalize.

[...]

> > > This part looks broken to me.
> > > You are calling this path
> > > (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> > > possibly NMI context.
> > > Which means we can deadlock if we hit the NMI context prog in the
> > > middle of task->pi_lock critical section.
> > > That's taken in task_work functions
> > > The task_work_cancel_match takes the pi_lock.
> > Good point, thanks. I think this could be solved in 2 ways:
> >   * Don't cancel, rely on callback dropping the work
> >   * Cancel in another irq_work
> > I'll probably go with the second one.
>
> What about 1? It seems like we can just rely on the existing hunk to
> free the callback on noticing BPF_TW_FREED?
> That seems simpler to me.
>

The callback potentially might not be called for a long time, so I'd feel
uneasy relying on it being called soon. Mykyta goes through irq_work in the
scheduling kfunc precisely because it may need to cancel task work
(which doesn't support NMI); we can probably reuse the same approach
(and the same irq_work struct) here for cancellation?


[...]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 2/4] bpf: extract map key pointer calculation
  2025-08-15 19:21 ` [PATCH bpf-next v2 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko
  2025-08-19 11:05   ` Kumar Kartikeya Dwivedi
@ 2025-08-19 20:50   ` Andrii Nakryiko
  1 sibling, 0 replies; 23+ messages in thread
From: Andrii Nakryiko @ 2025-08-19 20:50 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Aug 15, 2025 at 12:22 PM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Calculation of the BPF map key, given the pointer to a value is
> duplicated in a couple of places in helpers already, in the next patch
> another use case is introduced as well.
> This patch extracts that functionality into a separate function.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 30 +++++++++++++-----------------
>  1 file changed, 13 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 144877df6d02..d2f88a9bc47b 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1084,6 +1084,17 @@ const struct bpf_func_proto bpf_snprintf_proto = {
>         .arg5_type      = ARG_CONST_SIZE_OR_ZERO,
>  };
>
> +static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
> +{
> +       if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> +               struct bpf_array *array = container_of(map, struct bpf_array, map);
> +
> +               *arr_idx = ((char *)value - array->value) / array->elem_size;
> +               return arr_idx;
> +       }

should we add BUG here if map is not hash or lru?
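i.e. something along these lines (an untested sketch of the suggestion, with
WARN_ON_ONCE rather than BUG just for illustration; the exact set of
"expected" map types is a guess):

static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
{
	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
		struct bpf_array *array = container_of(map, struct bpf_array, map);

		*arr_idx = ((char *)value - array->value) / array->elem_size;
		return arr_idx;
	}
	/* only hash/LRU maps lay the key out right before the value */
	WARN_ON_ONCE(map->map_type != BPF_MAP_TYPE_HASH &&
		     map->map_type != BPF_MAP_TYPE_LRU_HASH);
	return (void *)value - round_up(map->key_size, 8);
}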


> +       return (void *)value - round_up(map->key_size, 8);
> +}
> +
>  struct bpf_async_cb {
>         struct bpf_map *map;
>         struct bpf_prog *prog;
> @@ -1166,15 +1177,8 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
>          * bpf_map_delete_elem() on the same timer.
>          */
>         this_cpu_write(hrtimer_running, t);
> -       if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> -               struct bpf_array *array = container_of(map, struct bpf_array, map);
>
> -               /* compute the key */
> -               idx = ((char *)value - array->value) / array->elem_size;
> -               key = &idx;
> -       } else { /* hash or lru */
> -               key = value - round_up(map->key_size, 8);
> -       }
> +       key = map_key_from_value(map, value, &idx);
>
>         callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
>         /* The verifier checked that return value is zero. */
> @@ -1200,15 +1204,7 @@ static void bpf_wq_work(struct work_struct *work)
>         if (!callback_fn)
>                 return;
>
> -       if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> -               struct bpf_array *array = container_of(map, struct bpf_array, map);
> -
> -               /* compute the key */
> -               idx = ((char *)value - array->value) / array->elem_size;
> -               key = &idx;
> -       } else { /* hash or lru */
> -               key = value - round_up(map->key_size, 8);
> -       }
> +       key = map_key_from_value(map, value, &idx);
>
>          rcu_read_lock_trace();
>          migrate_disable();
> --
> 2.50.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-19 20:49         ` Andrii Nakryiko
@ 2025-08-20 16:11           ` Kumar Kartikeya Dwivedi
  2025-08-20 18:33             ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-20 16:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, Mykyta Yatsenko

On Tue, 19 Aug 2025 at 22:49, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Aug 19, 2025 at 12:28 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Tue, 19 Aug 2025 at 20:16, Mykyta Yatsenko
> > <mykyta.yatsenko5@gmail.com> wrote:
> > >
> > > On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> > > > On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> > > > <mykyta.yatsenko5@gmail.com> wrote:
> > > >> From: Mykyta Yatsenko <yatsenko@meta.com>
> > > >>
> > > >> Implementation of the bpf_task_work_schedule kfuncs.
> > > >>
> > > >> Main components:
> > > >>   * struct bpf_task_work_context – Metadata and state management per task
> > > >> work.
> > > >>   * enum bpf_task_work_state – A state machine to serialize work
> > > >>   scheduling and execution.
> > > >>   * bpf_task_work_schedule() – The central helper that initiates
> > > >> scheduling.
> > > >>   * bpf_task_work_acquire() - Attempts to take ownership of the context,
> > > >>   pointed by passed struct bpf_task_work, allocates new context if none
> > > >>   exists yet.
> > > >>   * bpf_task_work_callback() – Invoked when the actual task_work runs.
> > > >>   * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> > > >> to enqueue task work.
> > > >>   * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> > > > Can you elaborate on why the bouncing through irq_work context is necessary?
> > > > I think we should have this info in the commit log.
> > > > Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
> > > yes, mainly to avoid locks in NMI.
> > > >
> > > >> Flow of successful task work scheduling
> > > >>   1) bpf_task_work_schedule_* is called from BPF code.
> > > >>   2) Transition state from STANDBY to PENDING, marks context is owned by
> > > >>   this task work scheduler
> > > >>   3) irq_work_queue() schedules bpf_task_work_irq().
> > > >>   4) Transition state from PENDING to SCHEDULING.
> > > >>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> > > >>   transitions to SCHEDULED.
> > > >>   5) Task work calls bpf_task_work_callback(), which transition state to
> > > >>   RUNNING.
> > > >>   6) BPF callback is executed
> > > >>   7) Context is cleaned up, refcounts released, context state set back to
> > > >>   STANDBY.
> > > >>
> > > >> bpf_task_work_context handling
> > > >> The context pointer is stored in bpf_task_work ctx field (u64) but
> > > >> treated as an __rcu pointer via casts.
> > > >> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
> > > >> with RCU initializer.
> > > >> Read under the RCU lock only in bpf_task_work_acquire() when ownership
> > > >> is contended.
> > > >> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
> > > >> context pointer from struct bpf_task_work and releases resources
> > > >> if scheduler does not own the context or can be canceled (state ==
> > > >> STANDBY or state == SCHEDULED and callback canceled). If task work
> > > >> scheduler owns the context, its state is set to FREED and scheduler is
> > > >> expected to cleanup on the next state transition.
> > > >>
> > > >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > > >> ---
> > > > This is much better now, with clear ownership between free path and
> > > > scheduling path, I mostly have a few more comments on the current
> > > > implementation, plus one potential bug.
> > > >
> > > > However, the more time I spend on this, the more I feel we should
> > > > unify all this with the two other bpf async work execution mechanisms
> > > > (timers and wq), and simplify and deduplicate a lot of this under the
> > > > serialized async->lock. I know NMI execution is probably critical for
> > > > this primitive, but we can replace async->lock with rqspinlock to
> > > > address that, so that it becomes safe to serialize in any context.
> > > > Apart from that, I don't see anything that would negate reworking all
> > > > this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> > > > task_work locks that have trouble with NMI execution (see later
> > > > comments).
> > > >
> > > > I also feel like it would be cleaner if we split the API into 3 steps:
> > > > init(), set_callback(), schedule() like the other cases, I don't see
> > > > why we necessarily need to diverge, and it simplifies some of the
> > > > logic in schedule().
> > > > Once every state update is protected by a lock, all of the state
> > > > transitions are done automatically and a lot of the extra races are
> > > > eliminated.
> > > >
> > > > I think we should discuss whether this was considered and why you
> > > > discarded this approach, otherwise the code is pretty complex, with
> > > > little upside.
> > > > Maybe I'm missing something obvious and you'd know more since you've
> > > > thought about all this longer.
> > > As for API, I think having 1 function for scheduling callback is cleaner
> > > then having 3 which are always called in the same order anyway. Most of
> > > the complexity
> > > comes from synchronization, not logic, so not having to do the same
> > > synchronization in
> > > init(), set_callback() and schedule() seems like a benefit to me.
> >
> > Well, if you were to reuse bpf_async_kern, all of that extra logic is
> > already taken care of, or can be easily shared.
> > If you look closely you'll see that a lot of what you're doing is a
> > repetition of what timers and bpf_wq have.
> >
> > > Let me check if using rqspinlock going to make things simpler. We still
> > > need states to at least know if cancellation is possible and to flag
> > > deletion to scheduler, but using a lock will make code easier to
> > > understand.
> >
> > Yeah I think for all of this using lockless updates is not really
> > worth it, let's just serialize using a lock.
>
> I don't think it's "just serialize".
>
> __bpf_async_init and __bpf_async_set_callback currently have `if
> (in_nmi()) return -EOPNOTSUPP;`, because of `bpf_map_kmalloc_node`
> (solvable with bpf_mem_alloc, not a big deal) and then unconditional
> `__bpf_spin_lock_irqsave(&async->lock);` (and maybe some other things
> that can't be done in NMI).
>
> We can't just replace __bpf_spin_lock_irqsave with rqspinlock, because
> the latter can fail. So the simplicity of unconditional locking is
> gone. We'd need to deal with the possibility of lock failing. It's
> probably not that straightforward in the case of
> __bpf_async_cancel_and_free.

We discussed converting async_cb to rqspinlock last time; the hold-up
was __bpf_async_cancel_and_free, since every other case can propagate the
error upwards, as those paths are already fallible.

The only reason I didn't move ahead was that there was no apparent use case
for timer usage in NMI (to me at least).

But I don't see why it's any less simple in the other cases: yes, you need to
return an error in case you fail to take the lock (which should not
occur in correct usage), but once you take the lock nobody
is touching the object anymore. And all those paths are already
fallible, so it's just an extra error condition.

We can then focus our effort on understanding the failure modes
where __bpf_async_cancel_and_free's lock acquisition can fail; the
last time I looked, it wasn't possible (otherwise we would already have a bug
with the existing spin lock).

That said, BPF timers cannot be invoked in NMI, and irqsave provides
interrupt exclusion; we exclude usage of maps with timers in programs
that may run in NMI context. Things will be different once that restriction is
lifted for task_work, but it just means that if the lock acquisition fails on a
single lock, a lower context we interrupted is holding it, which means
it won the claim to free the object and we don't need to do anything.
Since we have a single lock, the cases we need to actively worry about
are the reentrant ones.

I can imagine a task context program updating an array map element,
which invoked bpf_obj_free_fields, and then a perf program attempting
to do the same thing on the same element from an NMI. Fine, the lock
acquisition in free will fail, but we understand why it's ok to give up the
free in such a case.
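In code, the idea is roughly this (a sketch only, not actual code from any
series; it assumes the resilient spinlock API looks like
raw_res_spin_lock_irqsave()/raw_res_spin_unlock_irqrestore() and that
async->lock becomes an rqspinlock_t):

static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
{
	struct bpf_async_cb *cb;
	unsigned long flags;

	if (raw_res_spin_lock_irqsave(&async->lock, flags)) {
		/* The lock is held by a lower context we interrupted (the
		 * NMI reentrancy case): that context won the claim to free
		 * the object, so there is nothing left for us to do.
		 */
		return;
	}
	cb = async->cb;
	async->cb = NULL;
	raw_res_spin_unlock_irqrestore(&async->lock, flags);
	if (!cb)
		return;
	/* ... drop prog/callback references and kfree_rcu(cb) as today ... */
}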

>
> On the other hand, state machine with cmpxchg means there is always
> forward progress and there is always determinism of what was the last
> reached state before we went to FREED state, which means we will know
> who needs to cancel callback (if at all), and who is responsible for
> freeing resources.

There is forward progress regardless (now), but with a lockless state
machine, every state transition needs to consider various edges which
may have been concurrently activated by some other racing invocation.
You don't have such concerns with a lock. At least to me, it isn't clear
how the latter is worse than the former; it's fewer cases to think
about and deal with in the code.
E.g. all these "state == BPF_TW_FREED" checks would go away at various places
in the middle of various operations.

To me, after looking at this code a second time, there seems to be
little benefit. Things would be different if multiple concurrent
schedule() calls on the same map value were a real use case, such that
lock contention would quickly become a performance bottleneck, but I
don't think that's true.

>
> I'm actually wondering if this state machine approach could/should be
> adopted for bpf_async_cb?.. I wouldn't start there, though, and rather
> finish task_work_add integration before trying to generalize.

Maybe it's just me, but I feel like it's additional complexity that's
not giving us much benefit.

There are enough things to worry about even when holding a lock and
excluding NMI, as seen with various bugs over the years.
E.g. git log --oneline --grep="Fixes: b00628b1c7d5 (\"bpf: Introduce
bpf timers.\")"

It is impossible to say that we can get it right with all this in the
1st attempt, even if we hold a fallible lock to avoid deadlocks, or we
switch to this state machine approach.
The best we can do is to at least minimize the set of cases we have to
worry about.

[
   As an aside, if we intend to keep the door open on
consolidation, we probably should at least mirror the API surface.
   Maybe we made a mistake with the init+set_callback+kick style split in
the existing APIs, but it might be easier for people to understand that
all async primitives mostly follow this look and feel.
   It wouldn't be the end of the world, but there's an argument to be
made for consistency.
]



>
> [...]
>
> > > > This part looks broken to me.
> > > > You are calling this path
> > > > (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> > > > possibly NMI context.
> > > > Which means we can deadlock if we hit the NMI context prog in the
> > > > middle of task->pi_lock critical section.
> > > > That's taken in task_work functions
> > > > The task_work_cancel_match takes the pi_lock.
> > > Good point, thanks. I think this could be solved in 2 ways:
> > >   * Don't cancel, rely on callback dropping the work
> > >   * Cancel in another irq_work
> > > I'll probably go with the second one.
> >
> > What about 1? It seems like we can just rely on the existing hunk to
> > free the callback on noticing BPF_TW_FREED?
> > That seems simpler to me.
> >
>
> Callback potentially might not be called for a long time, I'd feel
> uneasy relying on it being called soon. Mykyta does irq_work in
> scheduling kfunc for the reason that it might need to cancel task work
> (because that doesn't support NMI), we can reuse the same approach
> (and same irq work struct) here for cancellation, probably?
>
>
> [...]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-20 16:11           ` Kumar Kartikeya Dwivedi
@ 2025-08-20 18:33             ` Andrii Nakryiko
  2025-08-28  1:34               ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2025-08-20 18:33 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, Mykyta Yatsenko

On Wed, Aug 20, 2025 at 9:12 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 19 Aug 2025 at 22:49, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Aug 19, 2025 at 12:28 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Tue, 19 Aug 2025 at 20:16, Mykyta Yatsenko
> > > <mykyta.yatsenko5@gmail.com> wrote:
> > > >
> > > > On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> > > > > On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> > > > > <mykyta.yatsenko5@gmail.com> wrote:
> > > > >> From: Mykyta Yatsenko <yatsenko@meta.com>
> > > > >>
> > > > >> Implementation of the bpf_task_work_schedule kfuncs.
> > > > >>
> > > > >> Main components:
> > > > >>   * struct bpf_task_work_context – Metadata and state management per task
> > > > >> work.
> > > > >>   * enum bpf_task_work_state – A state machine to serialize work
> > > > >>   scheduling and execution.
> > > > >>   * bpf_task_work_schedule() – The central helper that initiates
> > > > >> scheduling.
> > > > >>   * bpf_task_work_acquire() - Attempts to take ownership of the context,
> > > > >>   pointed by passed struct bpf_task_work, allocates new context if none
> > > > >>   exists yet.
> > > > >>   * bpf_task_work_callback() – Invoked when the actual task_work runs.
> > > > >>   * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> > > > >> to enqueue task work.
> > > > >>   * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> > > > > Can you elaborate on why the bouncing through irq_work context is necessary?
> > > > > I think we should have this info in the commit log.
> > > > > Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
> > > > yes, mainly to avoid locks in NMI.
> > > > >
> > > > >> Flow of successful task work scheduling
> > > > >>   1) bpf_task_work_schedule_* is called from BPF code.
> > > > >>   2) Transition state from STANDBY to PENDING, marks context is owned by
> > > > >>   this task work scheduler
> > > > >>   3) irq_work_queue() schedules bpf_task_work_irq().
> > > > >>   4) Transition state from PENDING to SCHEDULING.
> > > > >>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> > > > >>   transitions to SCHEDULED.
> > > > >>   5) Task work calls bpf_task_work_callback(), which transition state to
> > > > >>   RUNNING.
> > > > >>   6) BPF callback is executed
> > > > >>   7) Context is cleaned up, refcounts released, context state set back to
> > > > >>   STANDBY.
> > > > >>
> > > > >> bpf_task_work_context handling
> > > > >> The context pointer is stored in bpf_task_work ctx field (u64) but
> > > > >> treated as an __rcu pointer via casts.
> > > > >> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
> > > > >> with RCU initializer.
> > > > >> Read under the RCU lock only in bpf_task_work_acquire() when ownership
> > > > >> is contended.
> > > > >> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
> > > > >> context pointer from struct bpf_task_work and releases resources
> > > > >> if scheduler does not own the context or can be canceled (state ==
> > > > >> STANDBY or state == SCHEDULED and callback canceled). If task work
> > > > >> scheduler owns the context, its state is set to FREED and scheduler is
> > > > >> expected to cleanup on the next state transition.
> > > > >>
> > > > >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > > > >> ---
> > > > > This is much better now, with clear ownership between free path and
> > > > > scheduling path, I mostly have a few more comments on the current
> > > > > implementation, plus one potential bug.
> > > > >
> > > > > However, the more time I spend on this, the more I feel we should
> > > > > unify all this with the two other bpf async work execution mechanisms
> > > > > (timers and wq), and simplify and deduplicate a lot of this under the
> > > > > serialized async->lock. I know NMI execution is probably critical for
> > > > > this primitive, but we can replace async->lock with rqspinlock to
> > > > > address that, so that it becomes safe to serialize in any context.
> > > > > Apart from that, I don't see anything that would negate reworking all
> > > > > this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> > > > > task_work locks that have trouble with NMI execution (see later
> > > > > comments).
> > > > >
> > > > > I also feel like it would be cleaner if we split the API into 3 steps:
> > > > > init(), set_callback(), schedule() like the other cases, I don't see
> > > > > why we necessarily need to diverge, and it simplifies some of the
> > > > > logic in schedule().
> > > > > Once every state update is protected by a lock, all of the state
> > > > > transitions are done automatically and a lot of the extra races are
> > > > > eliminated.
> > > > >
> > > > > I think we should discuss whether this was considered and why you
> > > > > discarded this approach, otherwise the code is pretty complex, with
> > > > > little upside.
> > > > > Maybe I'm missing something obvious and you'd know more since you've
> > > > > thought about all this longer.
> > > > As for API, I think having 1 function for scheduling callback is cleaner
> > > > then having 3 which are always called in the same order anyway. Most of
> > > > the complexity
> > > > comes from synchronization, not logic, so not having to do the same
> > > > synchronization in
> > > > init(), set_callback() and schedule() seems like a benefit to me.
> > >
> > > Well, if you were to reuse bpf_async_kern, all of that extra logic is
> > > already taken care of, or can be easily shared.
> > > If you look closely you'll see that a lot of what you're doing is a
> > > repetition of what timers and bpf_wq have.
> > >
> > > > Let me check if using rqspinlock going to make things simpler. We still
> > > > need states to at least know if cancellation is possible and to flag
> > > > deletion to scheduler, but using a lock will make code easier to
> > > > understand.
> > >
> > > Yeah I think for all of this using lockless updates is not really
> > > worth it, let's just serialize using a lock.
> >
> > I don't think it's "just serialize".
> >
> > __bpf_async_init and __bpf_async_set_callback currently have `if
> > (in_nmi()) return -EOPNOTSUPP;`, because of `bpf_map_kmalloc_node`
> > (solvable with bpf_mem_alloc, not a big deal) and then unconditional
> > `__bpf_spin_lock_irqsave(&async->lock);` (and maybe some other things
> > that can't be done in NMI).
> >
> > We can't just replace __bpf_spin_lock_irqsave with rqspinlock, because
> > the latter can fail. So the simplicity of unconditional locking is
> > gone. We'd need to deal with the possibility of lock failing. It's
> > probably not that straightforward in the case of
> > __bpf_async_cancel_and_free.
>
> We discussed converting async_cb to rqspinlock last time, the hold up
> was __bpf_async_cancel_and_free, every other case can propagate error
> upwards since they're already fallible.
>
> The only reason I didn't move ahead was there was no apparent use case
> for timer usage in NMI (to me at least).

Scheduling some time-delayed action from perf_event/kprobe (both of which
could be in NMI) seems like a reasonable thing to expect to work. So
I'd say there is a need for NMI support even without task_work.

>
> But I don't see why it's less simpler in other cases, you need to
> return an error in case you fail to take the lock (which should not
> occur in correct usage), yes, but once you take the lock nobody
> is touching the object anymore. And all those paths are already

Either you are oversimplifying or I'm overcomplicating.. :) Even when a
BPF program has started the process of scheduling task work, which is a
multi-step process (STANDBY -> PENDING -> SCHEDULING/SCHEDULED ->
RUNNING -> STANDBY), you need to take the lock at each step. Meanwhile,
nothing prevents other BPF program executions from trying to take (and
successfully taking) the same lock (that might be a logical bug, or just how
the user decided to handle, or rather disregard, locking). While one of them
holds it, the callback might start running; it would need to take the lock
and won't get it for some time. Maybe that time is short and we won't run
into EDEADLOCK, but maybe it's not (and yes, that wouldn't be expected, but
it's not impossible either).

Similarly with cancel_and_free. That can be triggered by delete/update
which can happen simultaneously on multiple CPUs.

But even thinking through and proving that the lock in cancel_and_free
will definitely be taken successfully (while the interface itself screams
at you that it might not be) is enough of a complication that I'd rather not
have to think through and deal with.

So I still maintain that atomic state transitions are a simpler model
to prove works as expected and reliably.

When you think about it, it's really a linear transition through a few
stages (STANDBY -> PENDING -> ... -> RUNNING -> STANDBY), with the
only "interesting" interaction that we can go to FREED at any stage.
But when we do go to FREED, all participating parties know
deterministically where we were and who's responsible for clean up.
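Distilled, the recurring idiom is just the following (a sketch of the pattern
already in the patch, not new code; the helper name is made up, and the real
transitions differ in details at each step, e.g. the SCHEDULED case also
tries to cancel the queued task_work first):

static bool tw_advance_or_cleanup(struct bpf_task_work_context *ctx,
				  enum bpf_task_work_state expected,
				  enum bpf_task_work_state next)
{
	enum bpf_task_work_state old = cmpxchg(&ctx->state, expected, next);

	if (old == expected)
		return true;	/* we still own ctx, keep making progress */
	if (old == BPF_TW_FREED) {
		/* the map value was deleted under us; the party that
		 * observes FREED while owning the ctx inherits the cleanup
		 */
		bpf_task_work_context_reset(ctx);
		call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
	}
	return false;
}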

So in summary, I think it's good for you to try to switch timer and wq
to rqspinlock and work out all the kinks, but I'd prefer not to block
task_work on that work and to proceed with the state machine approach.


> fallible, so it's an extra error condition.
>
> It is possible to then focus our effort on understanding failure modes
> where __bpf_async_cancel_and_free's lock acquisition can fail, the
> last time I looked it wasn't possible (otherwise we already have a bug
> with the existing spin lock).
>
> That said, BPF timers cannot be invoked in NMI, and irqsave provides
> interrupt exclusion. We exclude usage of maps with timers in programs
> that may run in NMI context. Things will be different once that restriction is
> lifted for task_work, but it just means if the lock acquisition is failing on a
> single lock, a lower context we interrupted is holding it, which means
> it won the claim to free the object and we don't need to do anything.
> Since we have a single lock the cases we need to actively worry about
> are the reentrant ones.
>
> I can imagine a task context program updating an array map element,
> which invoked bpf_obj_free_fields, and then a perf program attempting
> to do the same thing on the same element from an NMI. Fine, the lock
> acquisition in free will fail, but we understand why it's ok to give up the
> free in such a case.
>

This "fine to give up free" doesn't sound neither obvious, nor simple,
and will require no less thinking and care than what you'd need to
understand that state machine we discussed for task_work, IMO.

> >
> > On the other hand, state machine with cmpxchg means there is always
> > forward progress and there is always determinism of what was the last
> > reached state before we went to FREED state, which means we will know
> > who needs to cancel callback (if at all), and who is responsible for
> > freeing resources.
>
> There is forward progress regardless (now), but with a lockless state
> machine, every state transition needs to consider various edges which
> may have been concurrently activated by some other racing invocation.
> You don't have such concerns with a lock. At least to me, I don't see

We still do have concurrency; a lock doesn't magically solve that for
us. While you are cancelling/freeing, the callback might be running or
about to run, and you can't really cancel it. All of that still would need
to be handled and thought through.

> how the latter is worse than the former, it's less cases to think
> about and deal with in the code.
> E.g. all these "state == BPF_TW_FREED" would go away at various places
> in the middle of various operations.

I don't think so, see above. You are oversimplifying. But again,
please try to do this conversion for timer and wq, it's worthwhile to
do regardless of task_work.

>
> To me after looking at this code the second time, there seems to be
> little benefit. Things would be different if multiple concurrent
> schedule() calls on the same map value was a real use case, such that
> lock contention would quickly become a performance bottleneck, but I
> don't think that's true.
>
> >
> > I'm actually wondering if this state machine approach could/should be
> > adopted for bpf_async_cb?.. I wouldn't start there, though, and rather
> > finish task_work_add integration before trying to generalize.
>
> Maybe it's just me, but I feel like it's additional complexity that's
> not giving us much benefit.
>
> There are enough things to worry about even when holding a lock and
> excluding NMI, as seen with various bugs over the years.
> E.g. git log --oneline --grep="Fixes: b00628b1c7d5 (\"bpf: Introduce
> bpf timers.\")"
>
> It is impossible to say that we can get it right with all this in the
> 1st attempt, even if we hold a fallible lock to avoid deadlocks, or we
> switch to this state machine approach.
> The best we can do is to at least minimize the set of cases we have to
> worry about.
>
> [
>    As an aside, if we intend on keeping the door open on
> consolidation, we probably should at least mirror the API surface.
>    Maybe we made a mistake with init+set_callback+kick style split in
> existing APIs, but it might be easier for people to understand that
> all async primitives mostly follow this look and feel.
>    It wouldn't be the end of the world, but there's an argument to be
> made for consistency.
> ]

I don't see why we can't consolidate internals of all these async
callbacks while maintaining a user-facing API that makes most sense
for each specific case. For task_work (and yes, I think for timers it
would make more sense as well), we are talking about a single
conceptual operation: just schedule a callback. So it makes sense to
have a single kfunc that expresses that.

Having a split into init, set_callback, kick is unnecessary and
cumbersome. It also adds extra mental overhead to think about the
interleaving of those three invocations from two or more concurrent
BPF programs (I'm not saying it doesn't work correctly in the current
implementation, but it's another thing to think about with three
helpers/kfuncs rather than just one, which guarantees you can't have
init done by prog A, set_callback by prog B, and the kick-off done by
either prog A or B, or even some other prog C).

I'm guessing we modeled timer in such a way because we tried to stick
to the kernel-internal API (and maybe we were trying to fit into the
5-argument limitation, not sure). This makes sense internally in the
kernel, where you have different ways to init a timer struct,
different ways to set or update expiration, etc. This is not the case
for the BPF-side API of timer (but what's done is done, we can't just
undo it).

And for task_work_add(), even kernel-internal API is a singular
function, which makes most sense, so I'd like to stick to that
simplicity.
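
To illustrate the BPF-side ergonomics, here is a rough sketch of a
program using the single kfunc (this is not the selftest from this
series; the includes, map layout, section name, and the hand-written
__ksym declaration are my assumptions):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

extern int bpf_task_work_schedule_resume(struct task_struct *task,
                                         struct bpf_task_work *tw,
                                         void *map__map,
                                         void (*callback)(struct bpf_map *map,
                                                          void *key, void *value),
                                         void *aux__prog) __ksym;

struct elem {
        struct bpf_task_work tw;
        __u64 payload;
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 128);
        __type(key, int);
        __type(value, struct elem);
} tw_map SEC(".maps");

static void tw_callback(struct bpf_map *map, void *key, void *value)
{
        /* runs later, in task context, where sleeping is allowed */
}

SEC("perf_event")
int schedule_work(void *ctx)
{
        struct task_struct *task = bpf_get_current_task_btf();
        int key = 0;
        struct elem *e;

        e = bpf_map_lookup_elem(&tw_map, &key);
        if (e)
                bpf_task_work_schedule_resume(task, &e->tw, &tw_map,
                                              tw_callback, NULL);
        return 0;
}

One conceptual operation, one call site.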

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-15 19:21 ` [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko
                     ` (2 preceding siblings ...)
  2025-08-19 14:18   ` Kumar Kartikeya Dwivedi
@ 2025-08-27 21:03   ` Andrii Nakryiko
  2025-08-28 22:29     ` Kumar Kartikeya Dwivedi
  3 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2025-08-27 21:03 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	Mykyta Yatsenko

On Fri, Aug 15, 2025 at 12:22 PM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Implementation of the bpf_task_work_schedule kfuncs.
>
> Main components:
>  * struct bpf_task_work_context – Metadata and state management per task
> work.
>  * enum bpf_task_work_state – A state machine to serialize work
>  scheduling and execution.
>  * bpf_task_work_schedule() – The central helper that initiates
> scheduling.
>  * bpf_task_work_acquire() - Attempts to take ownership of the context,
>  pointed by passed struct bpf_task_work, allocates new context if none
>  exists yet.
>  * bpf_task_work_callback() – Invoked when the actual task_work runs.
>  * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> to enqueue task work.
>  * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
>
> Flow of successful task work scheduling
>  1) bpf_task_work_schedule_* is called from BPF code.
>  2) Transition state from STANDBY to PENDING, marks context is owned by
>  this task work scheduler
>  3) irq_work_queue() schedules bpf_task_work_irq().
>  4) Transition state from PENDING to SCHEDULING.
>  4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>  transitions to SCHEDULED.
>  5) Task work calls bpf_task_work_callback(), which transition state to
>  RUNNING.
>  6) BPF callback is executed
>  7) Context is cleaned up, refcounts released, context state set back to
>  STANDBY.
>
> bpf_task_work_context handling
> The context pointer is stored in bpf_task_work ctx field (u64) but
> treated as an __rcu pointer via casts.
> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
> with RCU initializer.
> Read under the RCU lock only in bpf_task_work_acquire() when ownership
> is contended.
> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
> context pointer from struct bpf_task_work and releases resources
> if scheduler does not own the context or can be canceled (state ==
> STANDBY or state == SCHEDULED and callback canceled). If task work
> scheduler owns the context, its state is set to FREED and scheduler is
> expected to cleanup on the next state transition.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 260 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index d2f88a9bc47b..346ae8fd3ada 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -25,6 +25,8 @@
>  #include <linux/kasan.h>
>  #include <linux/bpf_verifier.h>
>  #include <linux/uaccess.h>
> +#include <linux/task_work.h>
> +#include <linux/irq_work.h>
>
>  #include "../../lib/kstrtox.h"
>
> @@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
>
>  typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
>
> +enum bpf_task_work_state {
> +       /* bpf_task_work is ready to be used */
> +       BPF_TW_STANDBY = 0,
> +       /* irq work scheduling in progress */
> +       BPF_TW_PENDING,
> +       /* task work scheduling in progress */
> +       BPF_TW_SCHEDULING,
> +       /* task work is scheduled successfully */
> +       BPF_TW_SCHEDULED,
> +       /* callback is running */
> +       BPF_TW_RUNNING,
> +       /* associated BPF map value is deleted */
> +       BPF_TW_FREED,
> +};
> +
> +struct bpf_task_work_context {
> +       /* the map and map value associated with this context */
> +       struct bpf_map *map;
> +       void *map_val;
> +       /* bpf_prog that schedules task work */
> +       struct bpf_prog *prog;
> +       /* task for which callback is scheduled */
> +       struct task_struct *task;
> +       enum task_work_notify_mode mode;
> +       enum bpf_task_work_state state;
> +       bpf_task_work_callback_t callback_fn;
> +       struct callback_head work;
> +       struct irq_work irq_work;
> +       struct rcu_head rcu;
> +} __aligned(8);
> +
> +static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
> +{
> +       struct bpf_task_work_context *ctx;
> +
> +       ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
> +       if (ctx)
> +               memset(ctx, 0, sizeof(*ctx));
> +       return ctx;
> +}
> +
> +static void bpf_task_work_context_free(struct rcu_head *rcu)
> +{
> +       struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
> +       /* bpf_mem_free expects migration to be disabled */
> +       migrate_disable();
> +       bpf_mem_free(&bpf_global_ma, ctx);
> +       migrate_enable();
> +}
> +
> +static bool task_work_match(struct callback_head *head, void *data)
> +{
> +       struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
> +
> +       return ctx == data;
> +}
> +
> +static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
> +{
> +       bpf_prog_put(ctx->prog);
> +       bpf_task_release(ctx->task);
> +}
> +
> +static void bpf_task_work_callback(struct callback_head *cb)
> +{
> +       enum bpf_task_work_state state;
> +       struct bpf_task_work_context *ctx;
> +       u32 idx;
> +       void *key;
> +
> +       ctx = container_of(cb, struct bpf_task_work_context, work);
> +
> +       /*
> +        * Read lock is needed to protect map key and value access below, it has to be done before
> +        * the state transition
> +        */
> +       rcu_read_lock_trace();
> +       /*
> +        * This callback may start running before bpf_task_work_irq() switched the state to
> +        * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
> +        */
> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
> +       if (state == BPF_TW_SCHEDULED)
> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
> +       if (state == BPF_TW_FREED) {
> +               rcu_read_unlock_trace();
> +               bpf_task_work_context_reset(ctx);
> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +               return;
> +       }
> +
> +       key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
> +       migrate_disable();
> +       ctx->callback_fn(ctx->map, key, ctx->map_val);
> +       migrate_enable();
> +       rcu_read_unlock_trace();
> +       /* State is running or freed, either way reset. */
> +       bpf_task_work_context_reset(ctx);
> +       state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
> +       if (state == BPF_TW_FREED)
> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +}
> +
> +static void bpf_task_work_irq(struct irq_work *irq_work)
> +{
> +       struct bpf_task_work_context *ctx;
> +       enum bpf_task_work_state state;
> +       int err;
> +
> +       ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
> +
> +       state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
> +       if (state == BPF_TW_FREED)
> +               goto free_context;
> +
> +       err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> +       if (err) {
> +               bpf_task_work_context_reset(ctx);
> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> +               if (state == BPF_TW_FREED)
> +                       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +               return;
> +       }
> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> +       if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
> +               goto free_context; /* successful cancellation, release and free ctx */
> +       return;
> +
> +free_context:
> +       bpf_task_work_context_reset(ctx);
> +       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +}
> +
> +static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
> +                                                                  struct bpf_map *map)
> +{
> +       struct bpf_task_work_context *ctx, *old_ctx;
> +       enum bpf_task_work_state state;
> +       struct bpf_task_work_context __force __rcu **ppc =
> +               (struct bpf_task_work_context __force __rcu **)&tw->ctx;

we discussed this with Mykyta earlier offline, but I'll note it here:
this looks pretty ugly, we should avoid these casts. We probably need
struct bpf_task_work with opaque contents for user-visible API, and
then have bpf_task_work_kern or something like that, where ctx will be
just a struct bpf_task_work_ctx pointer
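
Something like this is what I have in mind (the _kern name and layout
are tentative, just a sketch):

/* kernel-internal view of the uapi struct bpf_task_work, so that
 * helpers.c can operate on a plain pointer with READ_ONCE()/cmpxchg()
 * instead of sprinkling __force __rcu casts around
 */
struct bpf_task_work_kern {
        struct bpf_task_work_ctx *ctx;
};

and then the kfuncs would start with something like:

        struct bpf_task_work_kern *twk = (void *)tw;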

> +
> +       /* ctx pointer is RCU protected */
> +       rcu_read_lock_trace();
> +       ctx = rcu_dereference(*ppc);

and here it should be enough to do just READ_ONCE(tw->ctx), either way
rcu_dereference is wrong here because it checks for just classic
rcu_read_lock() to be "held", while we use RCU Tasks Trace flavor for
protection of the context struct

> +       if (!ctx) {
> +               ctx = bpf_task_work_context_alloc();
> +               if (!ctx) {
> +                       rcu_read_unlock_trace();
> +                       return ERR_PTR(-ENOMEM);
> +               }
> +               old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));

similarly here, just `old_ctx = cmpxchg(tw->ctx, NULL, ctx);` seems
much more preferable to me
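
With that in place, the whole acquire path could read as plainly as
(sketch; twk is the bpf_task_work_kern pointer from above, error
handling as in the patch):

        ctx = READ_ONCE(twk->ctx);
        if (!ctx) {
                ctx = bpf_task_work_context_alloc();
                if (!ctx)
                        return ERR_PTR(-ENOMEM);
                old_ctx = cmpxchg(&twk->ctx, NULL, ctx);
                if (old_ctx) {
                        /* lost the publish race, reuse the winner's ctx */
                        bpf_mem_free(&bpf_global_ma, ctx);
                        ctx = old_ctx;
                }
        }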

> +               /*
> +                * If ctx is set by another CPU, release allocated memory.
> +                * Do not fail, though, attempt stealing the work
> +                */
> +               if (old_ctx) {
> +                       bpf_mem_free(&bpf_global_ma, ctx);
> +                       ctx = old_ctx;
> +               }
> +       }
> +       state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
> +       /*
> +        * We can unlock RCU, because task work scheduler (this codepath)
> +        * now owns the ctx or returning an error
> +        */
> +       rcu_read_unlock_trace();
> +       if (state != BPF_TW_STANDBY)
> +               return ERR_PTR(-EBUSY);
> +       return ctx;
> +}
> +
> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> +                                 struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> +                                 struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> +{
> +       struct bpf_prog *prog;
> +       struct bpf_task_work_context *ctx = NULL;
> +       int err;
> +
> +       BTF_TYPE_EMIT(struct bpf_task_work);
> +
> +       prog = bpf_prog_inc_not_zero(aux->prog);
> +       if (IS_ERR(prog))
> +               return -EBADF;
> +
> +       if (!atomic64_read(&map->usercnt)) {
> +               err = -EBADF;
> +               goto release_prog;
> +       }

Kumar pointed out the race between map_release_uref() and this check.
It looks like a simple fix would be to perform
bpf_task_work_context_acquire() first and only then check
map->usercnt; if it has dropped to zero, just clean up and return
-EPERM?
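
Roughly (a sketch only; exactly how ownership is rolled back on the
usercnt failure path, including honoring a concurrent FREED, would
still need to be worked out):

        /* acquire ctx ownership before the usercnt check */
        ctx = bpf_task_work_context_acquire(tw, map);
        if (IS_ERR(ctx)) {
                err = PTR_ERR(ctx);
                goto release_all;
        }
        if (!atomic64_read(&map->usercnt)) {
                /* map uref already gone: undo the STANDBY->PENDING
                 * transition (or free the ctx if cancel_and_free set
                 * FREED meanwhile) and fail
                 */
                err = -EPERM;
                goto release_ctx; /* hypothetical cleanup label */
        }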

> +       task = bpf_task_acquire(task);
> +       if (!task) {
> +               err = -EPERM;
> +               goto release_prog;
> +       }
> +       ctx = bpf_task_work_context_acquire(tw, map);
> +       if (IS_ERR(ctx)) {
> +               err = PTR_ERR(ctx);
> +               goto release_all;
> +       }
> +
> +       ctx->task = task;
> +       ctx->callback_fn = callback_fn;
> +       ctx->prog = prog;
> +       ctx->mode = mode;
> +       ctx->map = map;
> +       ctx->map_val = (void *)tw - map->record->task_work_off;
> +       init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> +       init_task_work(&ctx->work, bpf_task_work_callback);
> +
> +       irq_work_queue(&ctx->irq_work);
> +
> +       return 0;
> +
> +release_all:
> +       bpf_task_release(task);
> +release_prog:
> +       bpf_prog_put(prog);
> +       return err;
> +}
> +
>  /**
>   * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
>   * @task: Task struct for which callback should be scheduled
> @@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
>   *
>   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
>   */
> -__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> -                                             struct bpf_task_work *tw,
> +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
>                                               struct bpf_map *map__map,
> -                                             bpf_task_work_callback_t callback,
> -                                             void *aux__prog)
> +                                             bpf_task_work_callback_t callback, void *aux__prog)

this should have been folded into patch 1?

>  {
> -       return 0;
> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
>  }
>
>  /**
> @@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
>   *
>   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
>   */
> -__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
> -                                             struct bpf_task_work *tw,
> +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
>                                               struct bpf_map *map__map,
> -                                             bpf_task_work_callback_t callback,
> -                                             void *aux__prog)
> +                                             bpf_task_work_callback_t callback, void *aux__prog)
>  {

same, rebasing/refactoring artifacts

> -       return 0;
> +       enum task_work_notify_mode mode;
> +
> +       mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
>  }
>
>  __bpf_kfunc_end_defs();
>
>  void bpf_task_work_cancel_and_free(void *val)
>  {
> +       struct bpf_task_work *tw = val;
> +       struct bpf_task_work_context *ctx;
> +       enum bpf_task_work_state state;
> +
> +       /* No need do rcu_read_lock as no other codepath can reset this pointer */

typo: to do

> +       ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
> +       if (!ctx)
> +               return;
> +       state = xchg(&ctx->state, BPF_TW_FREED);
> +
> +       switch (state) {
> +       case BPF_TW_SCHEDULED:
> +               /* If we can't cancel task work, rely on task work callback to free the context */
> +               if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
> +                       break;
> +               bpf_task_work_context_reset(ctx);

As Kumar pointed out earlier, this can be called from NMI, so we can't
just cancel here, we need to schedule irq_work to do the cancellation.

Good thing is that if we were in SCHEDULED state, then we can simply
reuse irq_work struct from task_work_ctx (please contract "context"
into "ctx", btw), and there is no change to the state machine, it's
just a slightly delayed clean up (and we'll remain in terminal FREED
state anyways).

Bad news: that extra irq_work step means we have to be careful about
ctx lifetime, because if the task_work callback couldn't be cancelled,
we might have a situation where the memory is freed by the task_work
callback itself (when it fails to transition to RUNNING) while we wait
for the irq_work callback to run, and then we get a use-after-free.

So I'm thinking that besides RCU protection we should also have
simple refcounting protection for bpf_task_work_ctx itself. I don't
think that's a problem performance-wise, and the nice thing is that
there will be fewer direct `call_rcu_tasks_trace(&ctx->rcu,
bpf_task_work_context_free);` calls sprinkled around. Instead we'll
have just bpf_task_work_put(ctx), which might call RCU-delayed
freeing.

We might want to roll bpf_task_work_context_reset logic into it,
conditionally (but we need to make sure that we reset prog and task
pointers to NULL during a non-freeing context reset).
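
Sketch of what I mean (the refcnt field and the _put() helper are
hypothetical names; the refcount would be set to 1 at alloc time and
bumped by whoever needs to keep the ctx alive across an async step):

struct bpf_task_work_ctx {
        refcount_t refcnt;
        /* ... everything else as in the current bpf_task_work_context ... */
};

static void bpf_task_work_ctx_put(struct bpf_task_work_ctx *ctx)
{
        if (!refcount_dec_and_test(&ctx->refcnt))
                return;
        /* last reference: same RCU Tasks Trace delayed free as today */
        call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
}

Then both cancel_and_free and the irq_work/task_work paths just take a
reference while they need the ctx and drop it with _put() when done.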

> +               fallthrough;
> +       case BPF_TW_STANDBY:
> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> +               break;
> +       /* In all below cases scheduling logic should detect context state change and cleanup */
> +       case BPF_TW_SCHEDULING:
> +       case BPF_TW_PENDING:
> +       case BPF_TW_RUNNING:
> +       default:
> +               break;
> +       }
>  }
>
>  BTF_KFUNCS_START(generic_btf_ids)
> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
>
>  #ifdef CONFIG_CGROUPS
>  BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> --
> 2.50.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-20 18:33             ` Andrii Nakryiko
@ 2025-08-28  1:34               ` Alexei Starovoitov
  2025-08-28 17:00                 ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2025-08-28  1:34 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Kumar Kartikeya Dwivedi, Mykyta Yatsenko, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin Lau, Kernel Team, Eduard,
	Mykyta Yatsenko

On Wed, Aug 20, 2025 at 11:33 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
>
> I don't see why we can't consolidate internals of all these async
> callbacks while maintaining a user-facing API that makes most sense
> for each specific case. For task_work (and yes, I think for timers it
> would make more sense as well), we are talking about a single
> conceptual operation: just schedule a callback. So it makes sense to
> have a single kfunc that expresses that.
>
> Having a split into init, set_callback, kick is unnecessary and
> cumbersome.

For timers the split api of init vs start_timer is mandatory,
since it's common usage to start and stop the same timer before
it fires. So moving the init operation (which allocates) and the
set_callback operation (which needs a pointer to the callback) out of
start/stop is a good thing, since the code that does start/stop
may be far away and in a different file than init/set_callback.
That's how the networking stack is written.
Where I screwed things up is when I made bpf_timer_cancel() also
clear the callback to NULL. Not sure what I was thinking.
The api wasn't modelled on the existing kernel timer api, but
on how the networking stack uses timers.
Most of the time the started timer will be cancelled before firing.
bpf_wq just followed bpf_timer api pattern.
But, unlike timers, wq and task_work actually expect the callback
to be called. It's rare to cancel wq/task_work.
So for them the single 'just schedule' kfunc that allocates,
sets callback, and schedules makes sense.
For wq there is no bpf_wq_cancel. So there is a difference already.
It's fine for bpf_timers, bpf_wq, bpf_task_work to have
different set of kfuncs, since the usage pattern is different.

Regarding state machine vs spinlock I think we should see
through this approach with state machine.
And if we convince ourselves that after all reviews it
looks to be bug free, we should try to convert timer/wq
to state machine as well to have a shared logic.
Note irq_work from nmi is mandatory for timers too
regardless of state machine or rqspinlock,
since timers/wq take more locks inside.
We cannot rqspinlock() + hrtimer_start() unconditionally.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-28  1:34               ` Alexei Starovoitov
@ 2025-08-28 17:00                 ` Andrii Nakryiko
  2025-08-28 17:38                   ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2025-08-28 17:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Mykyta Yatsenko, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin Lau, Kernel Team, Eduard,
	Mykyta Yatsenko

On Wed, Aug 27, 2025 at 6:34 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 20, 2025 at 11:33 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> >
> > I don't see why we can't consolidate internals of all these async
> > callbacks while maintaining a user-facing API that makes most sense
> > for each specific case. For task_work (and yes, I think for timers it
> > would make more sense as well), we are talking about a single
> > conceptual operation: just schedule a callback. So it makes sense to
> > have a single kfunc that expresses that.
> >
> > Having a split into init, set_callback, kick is unnecessary and
> > cumbersome.
>
> For timers the split api of init vs start_timer is mandatory,
> since it's common usage to start and stop the same timer before
> it fires. So the moving init operation (which allocates) and
> set_callback operation (that needs a pointer to callback)
> out of start/stop is a good thing, since the code that does start/stop
> may be far away and in a different file than init/set_callback.

Ok, that makes sense, thanks for filling in the context.

> That's how networking stack is written.
> Where I screwed things up is when I made bpf_timer_cancel() to also
> clear the callback to NULL. Not sure what I was thinking.
> The api wasn't modelled by existing kernel timer api, but
> by how the networking stack is using timers.
> Most of the time the started timer will be cancelled before firing.
> bpf_wq just followed bpf_timer api pattern.
> But, unlike timers, wq and task_work actually expect the callback
> to be called. It's rare to cancel wq/task_work.
> So for them the single 'just schedule' kfunc that allocates,
> sets callback, and schedules makes sense.
> For wq there is no bpf_wq_cancel. So there is a difference already.
> It's fine for bpf_timers, bpf_wq, bpf_task_work to have
> different set of kfuncs, since the usage pattern is different.
>

agreed

> Regarding state machine vs spinlock I think we should see
> through this approach with state machine.
> And if we convince ourselves that after all reviews it
> looks to be bug free, we should try to convert timer/wq
> to state machine as well to have a shared logic.

yep, completely agree

> Note irq_work from nmi is mandatory for timers too
> regardless of state machine or rqspinlock,
> since timers/wq take more locks inside,
> We cannot rqspinlock() + hrtimer_start() unconditionally.

Ack. Once we unify all this, we can invest a bit more effort trying to
optimize away the need for irq_work when not under NMI (and maybe not
in irq). For timers this seems more important than for task_work, as
timers can be used significantly more frequently.

But one step at a time, let's wait for Mykyta to come back from
vacation and update the patch set.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-28 17:00                 ` Andrii Nakryiko
@ 2025-08-28 17:38                   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-28 17:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Mykyta Yatsenko, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin Lau, Kernel Team, Eduard,
	Mykyta Yatsenko

On Thu, 28 Aug 2025 at 19:00, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Aug 27, 2025 at 6:34 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Aug 20, 2025 at 11:33 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > >
> > > I don't see why we can't consolidate internals of all these async
> > > callbacks while maintaining a user-facing API that makes most sense
> > > for each specific case. For task_work (and yes, I think for timers it
> > > would make more sense as well), we are talking about a single
> > > conceptual operation: just schedule a callback. So it makes sense to
> > > have a single kfunc that expresses that.
> > >
> > > Having a split into init, set_callback, kick is unnecessary and
> > > cumbersome.
> >
> > For timers the split api of init vs start_timer is mandatory,
> > since it's common usage to start and stop the same timer before
> > it fires. So the moving init operation (which allocates) and
> > set_callback operation (that needs a pointer to callback)
> > out of start/stop is a good thing, since the code that does start/stop
> > may be far away and in a different file than init/set_callback.
>
> Ok, that makes sense, thanks for filling in the context.
>
> > That's how networking stack is written.
> > Where I screwed things up is when I made bpf_timer_cancel() to also
> > clear the callback to NULL. Not sure what I was thinking.
> > The api wasn't modelled by existing kernel timer api, but
> > by how the networking stack is using timers.
> > Most of the time the started timer will be cancelled before firing.
> > bpf_wq just followed bpf_timer api pattern.
> > But, unlike timers, wq and task_work actually expect the callback
> > to be called. It's rare to cancel wq/task_work.
> > So for them the single 'just schedule' kfunc that allocates,
> > sets callback, and schedules makes sense.
> > For wq there is no bpf_wq_cancel. So there is a difference already.
> > It's fine for bpf_timers, bpf_wq, bpf_task_work to have
> > different set of kfuncs, since the usage pattern is different.
> >
>
> agreed
>
> > Regarding state machine vs spinlock I think we should see
> > through this approach with state machine.
> > And if we convince ourselves that after all reviews it
> > looks to be bug free, we should try to convert timer/wq
> > to state machine as well to have a shared logic.
>
> yep, completely agree
>
> > Note irq_work from nmi is mandatory for timers too
> > regardless of state machine or rqspinlock,
> > since timers/wq take more locks inside,
> > We cannot rqspinlock() + hrtimer_start() unconditionally.
>
> Ack. Once we unify all this, we can invest a bit more effort trying to
> optimize away the need for irq_work when not under NMI (and maybe not
> in irq). For timers this seems more important than for task_work, as
> timers can be used significantly more frequently.

Ok, let's go with this approach. I'll take a stab at addressing timers
later on once Mykyta's set lands.

>
> But one step at a time, let's wait for Mykyta to come back from
> vacation and update the patch set.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
  2025-08-27 21:03   ` Andrii Nakryiko
@ 2025-08-28 22:29     ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-28 22:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, Mykyta Yatsenko

On Wed, 27 Aug 2025 at 23:03, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Aug 15, 2025 at 12:22 PM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
> >
> > From: Mykyta Yatsenko <yatsenko@meta.com>
> >
> > Implementation of the bpf_task_work_schedule kfuncs.
> >
> > Main components:
> >  * struct bpf_task_work_context – Metadata and state management per task
> > work.
> >  * enum bpf_task_work_state – A state machine to serialize work
> >  scheduling and execution.
> >  * bpf_task_work_schedule() – The central helper that initiates
> > scheduling.
> >  * bpf_task_work_acquire() - Attempts to take ownership of the context,
> >  pointed by passed struct bpf_task_work, allocates new context if none
> >  exists yet.
> >  * bpf_task_work_callback() – Invoked when the actual task_work runs.
> >  * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> > to enqueue task work.
> >  * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> >
> > Flow of successful task work scheduling
> >  1) bpf_task_work_schedule_* is called from BPF code.
> >  2) Transition state from STANDBY to PENDING, marks context is owned by
> >  this task work scheduler
> >  3) irq_work_queue() schedules bpf_task_work_irq().
> >  4) Transition state from PENDING to SCHEDULING.
> >  4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> >  transitions to SCHEDULED.
> >  5) Task work calls bpf_task_work_callback(), which transition state to
> >  RUNNING.
> >  6) BPF callback is executed
> >  7) Context is cleaned up, refcounts released, context state set back to
> >  STANDBY.
> >
> > bpf_task_work_context handling
> > The context pointer is stored in bpf_task_work ctx field (u64) but
> > treated as an __rcu pointer via casts.
> > bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
> > with RCU initializer.
> > Read under the RCU lock only in bpf_task_work_acquire() when ownership
> > is contended.
> > Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
> > context pointer from struct bpf_task_work and releases resources
> > if scheduler does not own the context or can be canceled (state ==
> > STANDBY or state == SCHEDULED and callback canceled). If task work
> > scheduler owns the context, its state is set to FREED and scheduler is
> > expected to cleanup on the next state transition.
> >
> > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > ---
> >  kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 260 insertions(+), 10 deletions(-)
> >
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index d2f88a9bc47b..346ae8fd3ada 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -25,6 +25,8 @@
> >  #include <linux/kasan.h>
> >  #include <linux/bpf_verifier.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/task_work.h>
> > +#include <linux/irq_work.h>
> >
> >  #include "../../lib/kstrtox.h"
> >
> > @@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
> >
> >  typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
> >
> > +enum bpf_task_work_state {
> > +       /* bpf_task_work is ready to be used */
> > +       BPF_TW_STANDBY = 0,
> > +       /* irq work scheduling in progress */
> > +       BPF_TW_PENDING,
> > +       /* task work scheduling in progress */
> > +       BPF_TW_SCHEDULING,
> > +       /* task work is scheduled successfully */
> > +       BPF_TW_SCHEDULED,
> > +       /* callback is running */
> > +       BPF_TW_RUNNING,
> > +       /* associated BPF map value is deleted */
> > +       BPF_TW_FREED,
> > +};
> > +
> > +struct bpf_task_work_context {
> > +       /* the map and map value associated with this context */
> > +       struct bpf_map *map;
> > +       void *map_val;
> > +       /* bpf_prog that schedules task work */
> > +       struct bpf_prog *prog;
> > +       /* task for which callback is scheduled */
> > +       struct task_struct *task;
> > +       enum task_work_notify_mode mode;
> > +       enum bpf_task_work_state state;
> > +       bpf_task_work_callback_t callback_fn;
> > +       struct callback_head work;
> > +       struct irq_work irq_work;
> > +       struct rcu_head rcu;
> > +} __aligned(8);
> > +
> > +static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
> > +{
> > +       struct bpf_task_work_context *ctx;
> > +
> > +       ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
> > +       if (ctx)
> > +               memset(ctx, 0, sizeof(*ctx));
> > +       return ctx;
> > +}
> > +
> > +static void bpf_task_work_context_free(struct rcu_head *rcu)
> > +{
> > +       struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
> > +       /* bpf_mem_free expects migration to be disabled */
> > +       migrate_disable();
> > +       bpf_mem_free(&bpf_global_ma, ctx);
> > +       migrate_enable();
> > +}
> > +
> > +static bool task_work_match(struct callback_head *head, void *data)
> > +{
> > +       struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
> > +
> > +       return ctx == data;
> > +}
> > +
> > +static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
> > +{
> > +       bpf_prog_put(ctx->prog);
> > +       bpf_task_release(ctx->task);
> > +}
> > +
> > +static void bpf_task_work_callback(struct callback_head *cb)
> > +{
> > +       enum bpf_task_work_state state;
> > +       struct bpf_task_work_context *ctx;
> > +       u32 idx;
> > +       void *key;
> > +
> > +       ctx = container_of(cb, struct bpf_task_work_context, work);
> > +
> > +       /*
> > +        * Read lock is needed to protect map key and value access below, it has to be done before
> > +        * the state transition
> > +        */
> > +       rcu_read_lock_trace();
> > +       /*
> > +        * This callback may start running before bpf_task_work_irq() switched the state to
> > +        * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
> > +        */
> > +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
> > +       if (state == BPF_TW_SCHEDULED)
> > +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
> > +       if (state == BPF_TW_FREED) {
> > +               rcu_read_unlock_trace();
> > +               bpf_task_work_context_reset(ctx);
> > +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> > +               return;
> > +       }
> > +
> > +       key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
> > +       migrate_disable();
> > +       ctx->callback_fn(ctx->map, key, ctx->map_val);
> > +       migrate_enable();
> > +       rcu_read_unlock_trace();
> > +       /* State is running or freed, either way reset. */
> > +       bpf_task_work_context_reset(ctx);
> > +       state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
> > +       if (state == BPF_TW_FREED)
> > +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> > +}
> > +
> > +static void bpf_task_work_irq(struct irq_work *irq_work)
> > +{
> > +       struct bpf_task_work_context *ctx;
> > +       enum bpf_task_work_state state;
> > +       int err;
> > +
> > +       ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
> > +
> > +       state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
> > +       if (state == BPF_TW_FREED)
> > +               goto free_context;
> > +
> > +       err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> > +       if (err) {
> > +               bpf_task_work_context_reset(ctx);
> > +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> > +               if (state == BPF_TW_FREED)
> > +                       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> > +               return;
> > +       }
> > +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> > +       if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
> > +               goto free_context; /* successful cancellation, release and free ctx */
> > +       return;
> > +
> > +free_context:
> > +       bpf_task_work_context_reset(ctx);
> > +       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> > +}
> > +
> > +static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
> > +                                                                  struct bpf_map *map)
> > +{
> > +       struct bpf_task_work_context *ctx, *old_ctx;
> > +       enum bpf_task_work_state state;
> > +       struct bpf_task_work_context __force __rcu **ppc =
> > +               (struct bpf_task_work_context __force __rcu **)&tw->ctx;
>
> we discussed this with Mykyta earlier offline, but I'll not it here:
> this looks pretty ugly, we should avoid these casts. We probably need
> struct bpf_task_work with opaque contents for user-visible API, and
> then have bpf_task_work_kern or something like that, where ctx will be
> just a struct bpf_task_work_ctx pointer
>
> > +
> > +       /* ctx pointer is RCU protected */
> > +       rcu_read_lock_trace();
> > +       ctx = rcu_dereference(*ppc);
>
> and here it should be enough to do just READ_ONCE(tw->ctx), either way
> rcu_dereference is wrong here because it checks for just classic
> rcu_read_lock() to be "held", while we use RCU Tasks Trace flavor for
> protection of the context struct

Couldn't you use rcu_dereference_check() with rcu_read_lock_trace_held()?
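i.e., something like (sketch):

        ctx = rcu_dereference_check(*ppc, rcu_read_lock_trace_held());

That keeps the lockdep annotation while matching the Tasks Trace flavor.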

>
> > +       if (!ctx) {
> > +               ctx = bpf_task_work_context_alloc();
> > +               if (!ctx) {
> > +                       rcu_read_unlock_trace();
> > +                       return ERR_PTR(-ENOMEM);
> > +               }
> > +               old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
>
> similarly here, just `old_ctx = cmpxchg(tw->ctx, NULL, ctx);` seems
> much more preferable to me
>
> > +               /*
> > +                * If ctx is set by another CPU, release allocated memory.
> > +                * Do not fail, though, attempt stealing the work
> > +                */
> > +               if (old_ctx) {
> > +                       bpf_mem_free(&bpf_global_ma, ctx);
> > +                       ctx = old_ctx;
> > +               }
> > +       }
> > +       state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
> > +       /*
> > +        * We can unlock RCU, because task work scheduler (this codepath)
> > +        * now owns the ctx or returning an error
> > +        */
> > +       rcu_read_unlock_trace();
> > +       if (state != BPF_TW_STANDBY)
> > +               return ERR_PTR(-EBUSY);
> > +       return ctx;
> > +}
> > +
> > +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> > +                                 struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> > +                                 struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> > +{
> > +       struct bpf_prog *prog;
> > +       struct bpf_task_work_context *ctx = NULL;
> > +       int err;
> > +
> > +       BTF_TYPE_EMIT(struct bpf_task_work);
> > +
> > +       prog = bpf_prog_inc_not_zero(aux->prog);
> > +       if (IS_ERR(prog))
> > +               return -EBADF;
> > +
> > +       if (!atomic64_read(&map->usercnt)) {
> > +               err = -EBADF;
> > +               goto release_prog;
> > +       }
>
> Kumar pointed out that race between map_release_uref() and this check.
> It looks like a simple fix would be to perform
> bpf_task_work_context_acquire() first, and only then check for
> map->usercnt, and if it dropped to zero, just clean up and return
> -EPERM?

I believe that should work, as long as we don't create new objects to
be freed once usercnt has already dropped (and a parallel racing free
has already been done).

>
> > +       task = bpf_task_acquire(task);
> > +       if (!task) {
> > +               err = -EPERM;
> > +               goto release_prog;
> > +       }
> > +       ctx = bpf_task_work_context_acquire(tw, map);
> > +       if (IS_ERR(ctx)) {
> > +               err = PTR_ERR(ctx);
> > +               goto release_all;
> > +       }
> > +
> > +       ctx->task = task;
> > +       ctx->callback_fn = callback_fn;
> > +       ctx->prog = prog;
> > +       ctx->mode = mode;
> > +       ctx->map = map;
> > +       ctx->map_val = (void *)tw - map->record->task_work_off;
> > +       init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> > +       init_task_work(&ctx->work, bpf_task_work_callback);
> > +
> > +       irq_work_queue(&ctx->irq_work);
> > +
> > +       return 0;
> > +
> > +release_all:
> > +       bpf_task_release(task);
> > +release_prog:
> > +       bpf_prog_put(prog);
> > +       return err;
> > +}
> > +
> >  /**
> >   * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
> >   * @task: Task struct for which callback should be scheduled
> > @@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
> >   *
> >   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
> >   */
> > -__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> > -                                             struct bpf_task_work *tw,
> > +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
> >                                               struct bpf_map *map__map,
> > -                                             bpf_task_work_callback_t callback,
> > -                                             void *aux__prog)
> > +                                             bpf_task_work_callback_t callback, void *aux__prog)
>
> this should have been folded into patch 1?
>
> >  {
> > -       return 0;
> > +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
> >  }
> >
> >  /**
> > @@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> >   *
> >   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
> >   */
> > -__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
> > -                                             struct bpf_task_work *tw,
> > +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
> >                                               struct bpf_map *map__map,
> > -                                             bpf_task_work_callback_t callback,
> > -                                             void *aux__prog)
> > +                                             bpf_task_work_callback_t callback, void *aux__prog)
> >  {
>
> same, rebasing/refactoring artifacts
>
> > -       return 0;
> > +       enum task_work_notify_mode mode;
> > +
> > +       mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> > +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
> >  }
> >
> >  __bpf_kfunc_end_defs();
> >
> >  void bpf_task_work_cancel_and_free(void *val)
> >  {
> > +       struct bpf_task_work *tw = val;
> > +       struct bpf_task_work_context *ctx;
> > +       enum bpf_task_work_state state;
> > +
> > +       /* No need do rcu_read_lock as no other codepath can reset this pointer */
>
> typo: to do
>
> > +       ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
> > +       if (!ctx)
> > +               return;
> > +       state = xchg(&ctx->state, BPF_TW_FREED);
> > +
> > +       switch (state) {
> > +       case BPF_TW_SCHEDULED:
> > +               /* If we can't cancel task work, rely on task work callback to free the context */
> > +               if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
> > +                       break;
> > +               bpf_task_work_context_reset(ctx);
>
> As Kumar pointed out earlier, this can be called from NMI, so we can't
> just cancel here, we need to schedule irq_work to do the cancellation.
>
> Good thing is that if we were in SCHEDULED state, then we can simply
> reuse irq_work struct from task_work_ctx (please contract "context"
> into "ctx", btw), and there is no change to the state machine, it's
> just a slightly delayed clean up (and we'll remain in terminal FREED
> state anyways).
>
> Bad news: that extra irq_work step mean we have to be careful about
> ctx lifetime, because if task_work callback couldn't be cancelled, we
> might have a situation where memory is freed by task_work callback
> itself (when it fails to transition to RUNNING) while we wait for
> irq_work callback to be run, and then we get freed memory dereference.
>
> So I'm thinking that besides RCU protection we should also have a
> simple refcounting protection for bpf_task_work_ctx itself. I don't
> think that's a problem performance wise, and the nice thing is that
> there will be less direct `call_rcu_tasks_trace(&ctx->rcu,
> bpf_task_work_context_free);` sprinkled around. Instead we'll have
> just bpf_task_work_put(ctx), which might call RCU-delayed freeing.
>
> We might want to roll bpf_task_work_context_reset logic into it,
> conditionally (but we need to make sure that we reset prog and task
> pointers to NULL during a non-freeing context reset).

This also sounds workable. I'll take a look in v5.

>
> > +               fallthrough;
> > +       case BPF_TW_STANDBY:
> > +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> > +               break;
> > +       /* In all below cases scheduling logic should detect context state change and cleanup */
> > +       case BPF_TW_SCHEDULING:
> > +       case BPF_TW_PENDING:
> > +       case BPF_TW_RUNNING:
> > +       default:
> > +               break;
> > +       }
> >  }
> >
> >  BTF_KFUNCS_START(generic_btf_ids)
> > @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
> >  BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
> >  BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
> >  BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
> > +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
> > +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
> >
> >  #ifdef CONFIG_CGROUPS
> >  BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> > --
> > 2.50.1
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next v2 1/4] bpf: bpf task work plumbing
  2025-08-19  1:34   ` Eduard Zingerman
@ 2025-08-29 15:23     ` Mykyta Yatsenko
  0 siblings, 0 replies; 23+ messages in thread
From: Mykyta Yatsenko @ 2025-08-29 15:23 UTC (permalink / raw)
  To: Eduard Zingerman, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor
  Cc: Mykyta Yatsenko

On 8/19/25 02:34, Eduard Zingerman wrote:
> On Fri, 2025-08-15 at 20:21 +0100, Mykyta Yatsenko wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> This patch adds necessary plumbing in verifier, syscall and maps to
>> support handling new kfunc bpf_task_work_schedule and kernel structure
>> bpf_task_work. The idea is similar to how we already handle bpf_wq and
>> bpf_timer.
>> verifier changes validate calls to bpf_task_work_schedule to make sure
>> it is safe and expected invariants hold.
>> btf part is required to detect bpf_task_work structure inside map value
>> and store its offset, which will be used in the next patch to calculate
>> key and value addresses.
>> arraymap and hashtab changes are needed to handle freeing of the
>> bpf_task_work: run code needed to deinitialize it, for example cancel
>> task_work callback if possible.
>> The use of bpf_task_work and proper implementation for kfuncs are
>> introduced in the next patch.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
> Amount of copy-paste necessary for dealing with objects btf is saddening.
> This patch follows current approach and seem to do it correctly.
>
> [...]
>
>> --- a/kernel/bpf/arraymap.c
>> +++ b/kernel/bpf/arraymap.c
>> @@ -431,7 +431,7 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
>>   	return (void *)round_down((unsigned long)array, PAGE_SIZE);
>>   }
>>   
>> -static void array_map_free_timers_wq(struct bpf_map *map)
>> +static void array_map_free_internal_structs(struct bpf_map *map)
>>   {
>>   	struct bpf_array *array = container_of(map, struct bpf_array, map);
>>   	int i;
>> @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map)
>>   	/* We don't reset or free fields other than timer and workqueue
>>   	 * on uref dropping to zero.
>>   	 */
>> -	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) {
>> +	if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) {
> Is there a way to share this code between array map and hash map?
I don't see any common library used by both arraymap and hashtab
(it's likely I'm missing something).
Although this code looks similar, there are some differences, for
example the use of array_map_elem_ptr/htab_elem_value, which currently
runs only when the value has a special field; if we were to extract
the code below into a separate function, we'd have to run it
unconditionally, which would add some small cost. All this combined
makes it not very appealing to go after reusing that code. I agree
that having a more centralized place to deal with these special
structs would make things better.
>
>>   		for (i = 0; i < array->map.max_entries; i++) {
>>   			if (btf_record_has_field(map->record, BPF_TIMER))
>>   				bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
>>   			if (btf_record_has_field(map->record, BPF_WORKQUEUE))
>>   				bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
>> +			if (btf_record_has_field(map->record, BPF_TASK_WORK))
>> +				bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i));
>>   		}
>>   	}
>>   }
> [...]
>
>> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
>> index 64739308902f..378f260235dd 100644
>> --- a/kernel/bpf/btf.c
>> +++ b/kernel/bpf/btf.c
>> @@ -3527,6 +3527,15 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
>>   			goto end;
>>   		}
>>   	}
>> +	if (field_mask & BPF_TASK_WORK) {
>> +		if (!strcmp(name, "bpf_task_work")) {
>> +			if (*seen_mask & BPF_TASK_WORK)
>> +				return -E2BIG;
>> +			*seen_mask |= BPF_TASK_WORK;
>> +			type = BPF_TASK_WORK;
>> +			goto end;
>> +		}
>> +	}
> Nit: extract this and ifs before it as a loop over array
>       of name/flag pairs?
Makes sense, though I guess it'll require a separate refactoring patch.
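Something along these lines, presumably (just a sketch reusing the
existing field_mask/seen_mask logic; the table name is made up):

        static const struct {
                const char *name;
                u32 field;
        } tab[] = {
                { "bpf_timer",     BPF_TIMER },
                { "bpf_workqueue", BPF_WORKQUEUE },
                { "bpf_task_work", BPF_TASK_WORK },
        };

        for (i = 0; i < ARRAY_SIZE(tab); i++) {
                if (!(field_mask & tab[i].field) || strcmp(name, tab[i].name))
                        continue;
                if (*seen_mask & tab[i].field)
                        return -E2BIG;
                *seen_mask |= tab[i].field;
                type = tab[i].field;
                goto end;
        }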
>
>>   	field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head");
>>   	field_mask_test_name(BPF_LIST_NODE, "bpf_list_node");
>>   	field_mask_test_name(BPF_RB_ROOT,   "bpf_rb_root");
> [...]
>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 71f9931ac64c..207ad4823b5b 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
> [...]
>
>> @@ -233,6 +233,9 @@ static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab)
>>   		if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
>>   			bpf_obj_free_workqueue(htab->map.record,
>>   					       htab_elem_value(elem, htab->map.key_size));
>> +		if (btf_record_has_field(htab->map.record, BPF_TASK_WORK))
>> +			bpf_obj_free_task_work(htab->map.record,
>> +					       htab_elem_value(elem, htab->map.key_size));
> If there is no generic way to share this code with array maps,
> please, at-least within the hashmap.c extract these "if (btf_record_has_field(...)) {...}"
> groups so that there is no duplication between
> htab_free_{malloced,preallocated}_internal_structs(htab).
>
>>   		cond_resched();
>>   	}
>>   }
> [...]
>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 0fbfa8532c39..108d86f7eeaf 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
> [...]
>
>> @@ -1309,6 +1322,14 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>>   					goto free_map_tab;
>>   				}
>>   				break;
>> +			case BPF_TASK_WORK:
> This can be added to the group with BPF_TIMER and BPF_WORKQUEUE just above.
Ack.
>
>> +				if (map->map_type != BPF_MAP_TYPE_HASH &&
>> +				    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
>> +				    map->map_type != BPF_MAP_TYPE_ARRAY) {
>> +					ret = -EOPNOTSUPP;
>> +					goto free_map_tab;
>> +				}
>> +				break;
>>   			default:
>>   				/* Fail if map_type checks are missing for a field type */
>>   				ret = -EOPNOTSUPP;
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index a61d57996692..be7a744c7917 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
> [...]
>
> This function repeats process_timer_func() almost verbatim.
Right, I'll extract into a generic function.
>
>> +{
>> +	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
>> +	struct bpf_map *map = reg->map_ptr;
>> +	bool is_const = tnum_is_const(reg->var_off);
>> +	u64 val = reg->var_off.value;
>> +
>> +	if (!map->btf) {
>> +		verbose(env, "map '%s' has to have BTF in order to use bpf_task_work\n",
>> +			map->name);
>> +		return -EINVAL;
>> +	}
>> +	if (!btf_record_has_field(map->record, BPF_TASK_WORK)) {
>> +		verbose(env, "map '%s' has no valid bpf_task_work\n", map->name);
>> +		return -EINVAL;
>> +	}
>> +	if (!is_const) {
>> +		verbose(env,
>> +			"bpf_task_work has to be at the constant offset\n");
>> +		return -EINVAL;
>> +	}
>> +	if (map->record->task_work_off != val + reg->off) {
>> +		verbose(env,
>> +			"off %lld doesn't point to 'struct bpf_task_work' that is at %d\n",
>> +			val + reg->off, map->record->task_work_off);
>> +		return -EINVAL;
>> +	}
>> +	if (meta->map.ptr) {
>> +		verifier_bug(env, "Two map pointers in a bpf_task_work kfunc");
>> +		return -EFAULT;
>> +	}
>> +
>> +	meta->map.uid = reg->map_uid;
>> +	meta->map.ptr = map;
>> +	return 0;
>> +}
>> +
>>   static int process_kptr_func(struct bpf_verifier_env *env, int regno,
>>   			     struct bpf_call_arg_meta *meta)
>>   {
> [...]


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-08-29 15:23 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-15 19:21 [PATCH bpf-next v2 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko
2025-08-15 19:21 ` [PATCH bpf-next v2 1/4] bpf: bpf task work plumbing Mykyta Yatsenko
2025-08-19  1:34   ` Eduard Zingerman
2025-08-29 15:23     ` Mykyta Yatsenko
2025-08-15 19:21 ` [PATCH bpf-next v2 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko
2025-08-19 11:05   ` Kumar Kartikeya Dwivedi
2025-08-19 20:50   ` Andrii Nakryiko
2025-08-15 19:21 ` [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko
2025-08-15 22:00   ` Jiri Olsa
2025-08-18 13:36     ` Mykyta Yatsenko
2025-08-16 18:14   ` kernel test robot
2025-08-19 14:18   ` Kumar Kartikeya Dwivedi
2025-08-19 18:13     ` Mykyta Yatsenko
2025-08-19 19:27       ` Kumar Kartikeya Dwivedi
2025-08-19 20:49         ` Andrii Nakryiko
2025-08-20 16:11           ` Kumar Kartikeya Dwivedi
2025-08-20 18:33             ` Andrii Nakryiko
2025-08-28  1:34               ` Alexei Starovoitov
2025-08-28 17:00                 ` Andrii Nakryiko
2025-08-28 17:38                   ` Kumar Kartikeya Dwivedi
2025-08-27 21:03   ` Andrii Nakryiko
2025-08-28 22:29     ` Kumar Kartikeya Dwivedi
2025-08-15 19:21 ` [PATCH bpf-next v2 4/4] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).