* [PATCH bpf-next 0/4] bpf: Introduce deferred task context execution
@ 2025-08-06 14:45 Mykyta Yatsenko
2025-08-06 14:45 ` [PATCH bpf-next 1/4] bpf: bpf task work plumbing Mykyta Yatsenko
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2025-08-06 14:45 UTC (permalink / raw)
To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87; +Cc: Mykyta Yatsenko
From: Mykyta Yatsenko <yatsenko@meta.com>
This patch introduces a new mechanism for BPF programs to schedule
deferred execution in the context of a specific task using the kernel’s
task_work infrastructure.
The new bpf_task_work interface enables BPF use cases that
require sleepable subprogram execution within task context, for example,
scheduling sleepable function from the context that does not
allow sleepable, such as NMI.
Introduced kfuncs bpf_task_work_schedule_signal() and
bpf_task_work_schedule_resume() for scheduling BPF callbacks correspond
to different modes used by task_work (TWA_SIGNAL or
TWA_RESUME/TWA_NMI_CURRENT).
The implementation leverages BPF maps for storing callback metadata.
Indirectly call task_work_add() via irq_work to avoid locking in
potentially NMI context. State transitions are managed via an atomic
state machine (bpf_task_work_state) to ensure correctness under
concurrent usage and deletion.
Mykyta Yatsenko (4):
bpf: bpf task work plumbing
bpf: extract map key pointer calculation
bpf: task work scheduling kfuncs
selftests/bpf: BPF task work scheduling tests
include/linux/bpf.h | 11 +
include/uapi/linux/bpf.h | 4 +
kernel/bpf/arraymap.c | 8 +-
kernel/bpf/btf.c | 15 +
kernel/bpf/hashtab.c | 22 +-
kernel/bpf/helpers.c | 260 ++++++++++++++++--
kernel/bpf/syscall.c | 23 +-
kernel/bpf/verifier.c | 131 ++++++++-
tools/include/uapi/linux/bpf.h | 4 +
.../selftests/bpf/prog_tests/test_task_work.c | 149 ++++++++++
tools/testing/selftests/bpf/progs/task_work.c | 108 ++++++++
.../selftests/bpf/progs/task_work_fail.c | 98 +++++++
12 files changed, 800 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c
create mode 100644 tools/testing/selftests/bpf/progs/task_work.c
create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c
--
2.50.1
^ permalink raw reply [flat|nested] 15+ messages in thread* [PATCH bpf-next 1/4] bpf: bpf task work plumbing 2025-08-06 14:45 [PATCH bpf-next 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko @ 2025-08-06 14:45 ` Mykyta Yatsenko 2025-08-07 17:53 ` Kumar Kartikeya Dwivedi 2025-08-06 14:45 ` [PATCH bpf-next 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko ` (2 subsequent siblings) 3 siblings, 1 reply; 15+ messages in thread From: Mykyta Yatsenko @ 2025-08-06 14:45 UTC (permalink / raw) To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87; +Cc: Mykyta Yatsenko From: Mykyta Yatsenko <yatsenko@meta.com> This patch adds necessary plumbing in verifier, syscall and maps to support handling new kfunc bpf_task_work_schedule and kernel structure bpf_task_work. The idea is similar to how we already handle bpf_wq and bpf_timer. verifier changes validate calls to bpf_task_work_schedule to make sure it is safe and expected invariants hold. btf part is required to detect bpf_task_work structure inside map value and store its offset, which will be used in the next patch to calculate key and value addresses. arraymap and hashtab changes are needed to handle freeing of the bpf_task_work: run code needed to deinitialize it, for example cancel task_work callback if possible. The use of bpf_task_work and proper implementation for kfuncs are introduced in the next patch. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> --- include/linux/bpf.h | 11 +++ include/uapi/linux/bpf.h | 4 + kernel/bpf/arraymap.c | 8 +- kernel/bpf/btf.c | 15 ++++ kernel/bpf/hashtab.c | 22 ++++-- kernel/bpf/helpers.c | 45 +++++++++++ kernel/bpf/syscall.c | 23 +++++- kernel/bpf/verifier.c | 131 ++++++++++++++++++++++++++++++++- tools/include/uapi/linux/bpf.h | 4 + 9 files changed, 247 insertions(+), 16 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f9cd2164ed23..cb83ba0eaed5 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -206,6 +206,7 @@ enum btf_field_type { BPF_WORKQUEUE = (1 << 10), BPF_UPTR = (1 << 11), BPF_RES_SPIN_LOCK = (1 << 12), + BPF_TASK_WORK = (1 << 13), }; typedef void (*btf_dtor_kfunc_t)(void *); @@ -245,6 +246,7 @@ struct btf_record { int timer_off; int wq_off; int refcount_off; + int task_work_off; struct btf_field fields[]; }; @@ -340,6 +342,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type) return "bpf_rb_node"; case BPF_REFCOUNT: return "bpf_refcount"; + case BPF_TASK_WORK: + return "bpf_task_work"; default: WARN_ON_ONCE(1); return "unknown"; @@ -378,6 +382,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type) return sizeof(struct bpf_rb_node); case BPF_REFCOUNT: return sizeof(struct bpf_refcount); + case BPF_TASK_WORK: + return sizeof(struct bpf_task_work); default: WARN_ON_ONCE(1); return 0; @@ -410,6 +416,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type) return __alignof__(struct bpf_rb_node); case BPF_REFCOUNT: return __alignof__(struct bpf_refcount); + case BPF_TASK_WORK: + return __alignof__(struct bpf_task_work); default: WARN_ON_ONCE(1); return 0; @@ -441,6 +449,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr) case BPF_KPTR_REF: case BPF_KPTR_PERCPU: case BPF_UPTR: + case BPF_TASK_WORK: break; default: WARN_ON_ONCE(1); @@ -577,6 +586,7 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, bool lock_src); void bpf_timer_cancel_and_free(void *timer); void bpf_wq_cancel_and_free(void *timer); +void bpf_task_work_cancel_and_free(void *timer); void bpf_list_head_free(const struct btf_field *field, void *list_head, struct bpf_spin_lock *spin_lock); void bpf_rb_root_free(const struct btf_field *field, void *rb_root, @@ -2391,6 +2401,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec); bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *rec_b); void bpf_obj_free_timer(const struct btf_record *rec, void *obj); void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj); +void bpf_obj_free_task_work(const struct btf_record *rec, void *obj); void bpf_obj_free_fields(const struct btf_record *rec, void *obj); void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 233de8677382..e444d9f67829 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -7418,6 +7418,10 @@ struct bpf_timer { __u64 __opaque[2]; } __attribute__((aligned(8))); +struct bpf_task_work { + __u64 __opaque[16]; +} __attribute__((aligned(8))); + struct bpf_wq { __u64 __opaque[2]; } __attribute__((aligned(8))); diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 3d080916faf9..4130d8e76dff 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -431,7 +431,7 @@ static void *array_map_vmalloc_addr(struct bpf_array *array) return (void *)round_down((unsigned long)array, PAGE_SIZE); } -static void array_map_free_timers_wq(struct bpf_map *map) +static void array_map_free_internal_structs(struct bpf_map *map) { struct bpf_array *array = container_of(map, struct bpf_array, map); int i; @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map) /* We don't reset or free fields other than timer and workqueue * on uref dropping to zero. */ - if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) { + if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) { for (i = 0; i < array->map.max_entries; i++) { if (btf_record_has_field(map->record, BPF_TIMER)) bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i)); if (btf_record_has_field(map->record, BPF_WORKQUEUE)) bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i)); + if (btf_record_has_field(map->record, BPF_TASK_WORK)) + bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i)); } } } @@ -783,7 +785,7 @@ const struct bpf_map_ops array_map_ops = { .map_alloc = array_map_alloc, .map_free = array_map_free, .map_get_next_key = array_map_get_next_key, - .map_release_uref = array_map_free_timers_wq, + .map_release_uref = array_map_free_internal_structs, .map_lookup_elem = array_map_lookup_elem, .map_update_elem = array_map_update_elem, .map_delete_elem = array_map_delete_elem, diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 0aff814cb53a..c66f9c6dfc48 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3527,6 +3527,15 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_ goto end; } } + if (field_mask & BPF_TASK_WORK) { + if (!strcmp(name, "bpf_task_work")) { + if (*seen_mask & BPF_TASK_WORK) + return -E2BIG; + *seen_mask |= BPF_TASK_WORK; + type = BPF_TASK_WORK; + goto end; + } + } field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head"); field_mask_test_name(BPF_LIST_NODE, "bpf_list_node"); field_mask_test_name(BPF_RB_ROOT, "bpf_rb_root"); @@ -3693,6 +3702,7 @@ static int btf_find_field_one(const struct btf *btf, case BPF_LIST_NODE: case BPF_RB_NODE: case BPF_REFCOUNT: + case BPF_TASK_WORK: ret = btf_find_struct(btf, var_type, off, sz, field_type, info_cnt ? &info[0] : &tmp); if (ret < 0) @@ -3985,6 +3995,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type rec->timer_off = -EINVAL; rec->wq_off = -EINVAL; rec->refcount_off = -EINVAL; + rec->task_work_off = -EINVAL; for (i = 0; i < cnt; i++) { field_type_size = btf_field_type_size(info_arr[i].type); if (info_arr[i].off + field_type_size > value_size) { @@ -4050,6 +4061,10 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type case BPF_LIST_NODE: case BPF_RB_NODE: break; + case BPF_TASK_WORK: + WARN_ON_ONCE(rec->task_work_off >= 0); + rec->task_work_off = rec->fields[i].offset; + break; default: ret = -EFAULT; goto end; diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 71f9931ac64c..207ad4823b5b 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -215,7 +215,7 @@ static bool htab_has_extra_elems(struct bpf_htab *htab) return !htab_is_percpu(htab) && !htab_is_lru(htab) && !is_fd_htab(htab); } -static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab) +static void htab_free_prealloced_internal_structs(struct bpf_htab *htab) { u32 num_entries = htab->map.max_entries; int i; @@ -233,6 +233,9 @@ static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab) if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE)) bpf_obj_free_workqueue(htab->map.record, htab_elem_value(elem, htab->map.key_size)); + if (btf_record_has_field(htab->map.record, BPF_TASK_WORK)) + bpf_obj_free_task_work(htab->map.record, + htab_elem_value(elem, htab->map.key_size)); cond_resched(); } } @@ -1490,7 +1493,7 @@ static void delete_all_elements(struct bpf_htab *htab) } } -static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab) +static void htab_free_malloced_internal_structs(struct bpf_htab *htab) { int i; @@ -1508,22 +1511,25 @@ static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab) if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE)) bpf_obj_free_workqueue(htab->map.record, htab_elem_value(l, htab->map.key_size)); + if (btf_record_has_field(htab->map.record, BPF_TASK_WORK)) + bpf_obj_free_task_work(htab->map.record, + htab_elem_value(l, htab->map.key_size)); } cond_resched_rcu(); } rcu_read_unlock(); } -static void htab_map_free_timers_and_wq(struct bpf_map *map) +static void htab_map_free_internal_structs(struct bpf_map *map) { struct bpf_htab *htab = container_of(map, struct bpf_htab, map); /* We only free timer and workqueue on uref dropping to zero */ - if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE)) { + if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) { if (!htab_is_prealloc(htab)) - htab_free_malloced_timers_and_wq(htab); + htab_free_malloced_internal_structs(htab); else - htab_free_prealloced_timers_and_wq(htab); + htab_free_prealloced_internal_structs(htab); } } @@ -2255,7 +2261,7 @@ const struct bpf_map_ops htab_map_ops = { .map_alloc = htab_map_alloc, .map_free = htab_map_free, .map_get_next_key = htab_map_get_next_key, - .map_release_uref = htab_map_free_timers_and_wq, + .map_release_uref = htab_map_free_internal_structs, .map_lookup_elem = htab_map_lookup_elem, .map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem, .map_update_elem = htab_map_update_elem, @@ -2276,7 +2282,7 @@ const struct bpf_map_ops htab_lru_map_ops = { .map_alloc = htab_map_alloc, .map_free = htab_map_free, .map_get_next_key = htab_map_get_next_key, - .map_release_uref = htab_map_free_timers_and_wq, + .map_release_uref = htab_map_free_internal_structs, .map_lookup_elem = htab_lru_map_lookup_elem, .map_lookup_and_delete_elem = htab_lru_map_lookup_and_delete_elem, .map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys, diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 6b4877e85a68..322ffcaedc38 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -3703,8 +3703,53 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) return bpf_strnstr(s1__ign, s2__ign, XATTR_SIZE_MAX); } +typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); + +/** + * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode + * @task: Task struct for which callback should be scheduled + * @tw: Pointer to the bpf_task_work struct, to use by kernel internally for bookkeeping + * @map__map: bpf_map which contains bpf_task_work in one of the values + * @callback: pointer to BPF subprogram to call + * @aux__prog: user should pass NULL + * + * Return: 0 if task work has been scheduled successfully, negative error code otherwise + */ +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, + struct bpf_task_work *tw, + struct bpf_map *map__map, + bpf_task_work_callback_t callback, + void *aux__prog) +{ + return 0; +} + +/** + * bpf_task_work_schedule_resume - Schedule BPF callback using task_work_add with TWA_RESUME or + * TWA_NMI_CURRENT mode if scheduling for the current task in the NMI + * @task: Task struct for which callback should be scheduled + * @tw: Pointer to the bpf_task_work struct, to use by kernel internally for bookkeeping + * @map__map: bpf_map which contains bpf_task_work in one of the values + * @callback: pointer to BPF subprogram to call + * @aux__prog: user should pass NULL + * + * Return: 0 if task work has been scheduled successfully, negative error code otherwise + */ +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, + struct bpf_task_work *tw, + struct bpf_map *map__map, + bpf_task_work_callback_t callback, + void *aux__prog) +{ + return 0; +} + __bpf_kfunc_end_defs(); +void bpf_task_work_cancel_and_free(void *val) +{ +} + BTF_KFUNCS_START(generic_btf_ids) #ifdef CONFIG_CRASH_DUMP BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e63039817af3..73f801751280 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -670,6 +670,7 @@ void btf_record_free(struct btf_record *rec) case BPF_TIMER: case BPF_REFCOUNT: case BPF_WORKQUEUE: + case BPF_TASK_WORK: /* Nothing to release */ break; default: @@ -723,6 +724,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec) case BPF_TIMER: case BPF_REFCOUNT: case BPF_WORKQUEUE: + case BPF_TASK_WORK: /* Nothing to acquire */ break; default: @@ -781,6 +783,13 @@ void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj) bpf_wq_cancel_and_free(obj + rec->wq_off); } +void bpf_obj_free_task_work(const struct btf_record *rec, void *obj) +{ + if (WARN_ON_ONCE(!btf_record_has_field(rec, BPF_TASK_WORK))) + return; + bpf_task_work_cancel_and_free(obj + rec->task_work_off); +} + void bpf_obj_free_fields(const struct btf_record *rec, void *obj) { const struct btf_field *fields; @@ -838,6 +847,9 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj) continue; bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off); break; + case BPF_TASK_WORK: + bpf_task_work_cancel_and_free(field_ptr); + break; case BPF_LIST_NODE: case BPF_RB_NODE: case BPF_REFCOUNT: @@ -1234,7 +1246,8 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token, map->record = btf_parse_fields(btf, value_type, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD | - BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR, + BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR | + BPF_TASK_WORK, map->value_size); if (!IS_ERR_OR_NULL(map->record)) { int i; @@ -1306,6 +1319,14 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token, goto free_map_tab; } break; + case BPF_TASK_WORK: + if (map->map_type != BPF_MAP_TYPE_HASH && + map->map_type != BPF_MAP_TYPE_LRU_HASH && + map->map_type != BPF_MAP_TYPE_ARRAY) { + ret = -EOPNOTSUPP; + goto free_map_tab; + } + break; default: /* Fail if map_type checks are missing for a field type */ ret = -EOPNOTSUPP; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 399f03e62508..905dc0c5a73d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -524,9 +524,11 @@ static bool is_sync_callback_calling_function(enum bpf_func_id func_id) func_id == BPF_FUNC_user_ringbuf_drain; } -static bool is_async_callback_calling_function(enum bpf_func_id func_id) +static bool is_task_work_add_kfunc(u32 func_id); + +static bool is_async_callback_calling_function(u32 func_id) { - return func_id == BPF_FUNC_timer_set_callback; + return func_id == BPF_FUNC_timer_set_callback || is_task_work_add_kfunc(func_id); } static bool is_callback_calling_function(enum bpf_func_id func_id) @@ -2236,6 +2238,8 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg) reg->map_uid = reg->id; if (btf_record_has_field(map->inner_map_meta->record, BPF_WORKQUEUE)) reg->map_uid = reg->id; + if (btf_record_has_field(map->inner_map_meta->record, BPF_TASK_WORK)) + reg->map_uid = reg->id; } else if (map->map_type == BPF_MAP_TYPE_XSKMAP) { reg->type = PTR_TO_XDP_SOCK; } else if (map->map_type == BPF_MAP_TYPE_SOCKMAP || @@ -8569,6 +8573,44 @@ static int process_wq_func(struct bpf_verifier_env *env, int regno, return 0; } +static int process_task_work_func(struct bpf_verifier_env *env, int regno, + struct bpf_kfunc_call_arg_meta *meta) +{ + struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno]; + struct bpf_map *map = reg->map_ptr; + bool is_const = tnum_is_const(reg->var_off); + u64 val = reg->var_off.value; + + if (!map->btf) { + verbose(env, "map '%s' has to have BTF in order to use bpf_task_work\n", + map->name); + return -EINVAL; + } + if (!btf_record_has_field(map->record, BPF_TASK_WORK)) { + verbose(env, "map '%s' has no valid bpf_task_work\n", map->name); + return -EINVAL; + } + if (map->record->task_work_off != val + reg->off) { + verbose(env, + "off %lld doesn't point to 'struct bpf_task_work' that is at %d\n", + val + reg->off, map->record->task_work_off); + return -EINVAL; + } + if (!is_const) { + verbose(env, + "bpf_task_work has to be at the constant offset\n"); + return -EINVAL; + } + if (meta->map.ptr) { + verifier_bug(env, "Two map pointers in a bpf_task_work kfunc"); + return -EFAULT; + } + + meta->map.uid = reg->map_uid; + meta->map.ptr = map; + return 0; +} + static int process_kptr_func(struct bpf_verifier_env *env, int regno, struct bpf_call_arg_meta *meta) { @@ -10616,7 +10658,8 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins env->subprog_info[subprog].is_async_cb = true; async_cb = push_async_cb(env, env->subprog_info[subprog].start, insn_idx, subprog, - is_bpf_wq_set_callback_impl_kfunc(insn->imm)); + is_bpf_wq_set_callback_impl_kfunc(insn->imm) || + is_task_work_add_kfunc(insn->imm)); if (!async_cb) return -EFAULT; callee = async_cb->frame[0]; @@ -10929,6 +10972,35 @@ static int set_rbtree_add_callback_state(struct bpf_verifier_env *env, return 0; } +static int set_task_work_schedule_callback_state(struct bpf_verifier_env *env, + struct bpf_func_state *caller, + struct bpf_func_state *callee, + int insn_idx) +{ + struct bpf_map *map_ptr = caller->regs[BPF_REG_3].map_ptr; + + /* + * callback_fn(struct bpf_map *map, void *key, void *value); + */ + callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP; + __mark_reg_known_zero(&callee->regs[BPF_REG_1]); + callee->regs[BPF_REG_1].map_ptr = map_ptr; + + callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY; + __mark_reg_known_zero(&callee->regs[BPF_REG_2]); + callee->regs[BPF_REG_2].map_ptr = map_ptr; + + callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE; + __mark_reg_known_zero(&callee->regs[BPF_REG_3]); + callee->regs[BPF_REG_3].map_ptr = map_ptr; + + /* unused */ + __mark_reg_not_init(env, &callee->regs[BPF_REG_4]); + __mark_reg_not_init(env, &callee->regs[BPF_REG_5]); + callee->in_callback_fn = true; + return 0; +} + static bool is_rbtree_lock_required_kfunc(u32 btf_id); /* Are we currently verifying the callback for a rbtree helper that must @@ -12059,6 +12131,7 @@ enum { KF_ARG_RB_NODE_ID, KF_ARG_WORKQUEUE_ID, KF_ARG_RES_SPIN_LOCK_ID, + KF_ARG_TASK_WORK_ID, }; BTF_ID_LIST(kf_arg_btf_ids) @@ -12069,6 +12142,7 @@ BTF_ID(struct, bpf_rb_root) BTF_ID(struct, bpf_rb_node) BTF_ID(struct, bpf_wq) BTF_ID(struct, bpf_res_spin_lock) +BTF_ID(struct, bpf_task_work) static bool __is_kfunc_ptr_arg_type(const struct btf *btf, const struct btf_param *arg, int type) @@ -12117,6 +12191,11 @@ static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg) return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID); } +static bool is_kfunc_arg_task_work(const struct btf *btf, const struct btf_param *arg) +{ + return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TASK_WORK_ID); +} + static bool is_kfunc_arg_res_spin_lock(const struct btf *btf, const struct btf_param *arg) { return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RES_SPIN_LOCK_ID); @@ -12204,6 +12283,7 @@ enum kfunc_ptr_arg_type { KF_ARG_PTR_TO_WORKQUEUE, KF_ARG_PTR_TO_IRQ_FLAG, KF_ARG_PTR_TO_RES_SPIN_LOCK, + KF_ARG_PTR_TO_TASK_WORK, }; enum special_kfunc_type { @@ -12252,6 +12332,8 @@ enum special_kfunc_type { KF_bpf_res_spin_lock_irqsave, KF_bpf_res_spin_unlock_irqrestore, KF___bpf_trap, + KF_bpf_task_work_schedule_signal, + KF_bpf_task_work_schedule_resume, }; BTF_ID_LIST(special_kfunc_list) @@ -12318,6 +12400,14 @@ BTF_ID(func, bpf_res_spin_unlock) BTF_ID(func, bpf_res_spin_lock_irqsave) BTF_ID(func, bpf_res_spin_unlock_irqrestore) BTF_ID(func, __bpf_trap) +BTF_ID(func, bpf_task_work_schedule_signal) +BTF_ID(func, bpf_task_work_schedule_resume) + +static bool is_task_work_add_kfunc(u32 func_id) +{ + return func_id == special_kfunc_list[KF_bpf_task_work_schedule_signal] || + func_id == special_kfunc_list[KF_bpf_task_work_schedule_resume]; +} static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta) { @@ -12408,6 +12498,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env, if (is_kfunc_arg_wq(meta->btf, &args[argno])) return KF_ARG_PTR_TO_WORKQUEUE; + if (is_kfunc_arg_task_work(meta->btf, &args[argno])) + return KF_ARG_PTR_TO_TASK_WORK; + if (is_kfunc_arg_irq_flag(meta->btf, &args[argno])) return KF_ARG_PTR_TO_IRQ_FLAG; @@ -12751,7 +12844,8 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id) static bool is_async_callback_calling_kfunc(u32 btf_id) { - return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl]; + return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl] || + is_task_work_add_kfunc(btf_id); } static bool is_bpf_throw_kfunc(struct bpf_insn *insn) @@ -13153,6 +13247,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ return -EINVAL; } } + if (meta->map.ptr && reg->map_ptr->record->task_work_off >= 0) { + if (meta->map.ptr != reg->map_ptr || + meta->map.uid != reg->map_uid) { + verbose(env, + "bpf_task_work pointer in R2 map_uid=%d doesn't match map pointer in R3 map_uid=%d\n", + meta->map.uid, reg->map_uid); + return -EINVAL; + } + } meta->map.ptr = reg->map_ptr; meta->map.uid = reg->map_uid; fallthrough; @@ -13185,6 +13288,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ case KF_ARG_PTR_TO_REFCOUNTED_KPTR: case KF_ARG_PTR_TO_CONST_STR: case KF_ARG_PTR_TO_WORKQUEUE: + case KF_ARG_PTR_TO_TASK_WORK: case KF_ARG_PTR_TO_IRQ_FLAG: case KF_ARG_PTR_TO_RES_SPIN_LOCK: break; @@ -13476,6 +13580,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ if (ret < 0) return ret; break; + case KF_ARG_PTR_TO_TASK_WORK: + if (reg->type != PTR_TO_MAP_VALUE) { + verbose(env, "arg#%d doesn't point to a map value\n", i); + return -EINVAL; + } + ret = process_task_work_func(env, regno, meta); + if (ret < 0) + return ret; + break; case KF_ARG_PTR_TO_IRQ_FLAG: if (reg->type != PTR_TO_STACK) { verbose(env, "arg#%d doesn't point to an irq flag on stack\n", i); @@ -13842,6 +13955,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, } } + if (is_task_work_add_kfunc(meta.func_id)) { + err = push_callback_call(env, insn, insn_idx, meta.subprogno, + set_task_work_schedule_callback_state); + if (err) { + verbose(env, "kfunc %s#%d failed callback verification\n", + func_name, meta.func_id); + return err; + } + } + rcu_lock = is_kfunc_bpf_rcu_read_lock(&meta); rcu_unlock = is_kfunc_bpf_rcu_read_unlock(&meta); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 233de8677382..e444d9f67829 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -7418,6 +7418,10 @@ struct bpf_timer { __u64 __opaque[2]; } __attribute__((aligned(8))); +struct bpf_task_work { + __u64 __opaque[16]; +} __attribute__((aligned(8))); + struct bpf_wq { __u64 __opaque[2]; } __attribute__((aligned(8))); -- 2.50.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 1/4] bpf: bpf task work plumbing 2025-08-06 14:45 ` [PATCH bpf-next 1/4] bpf: bpf task work plumbing Mykyta Yatsenko @ 2025-08-07 17:53 ` Kumar Kartikeya Dwivedi 2025-08-15 18:25 ` Mykyta Yatsenko 0 siblings, 1 reply; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2025-08-07 17:53 UTC (permalink / raw) To: Mykyta Yatsenko Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Mykyta Yatsenko On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > > From: Mykyta Yatsenko <yatsenko@meta.com> > > This patch adds necessary plumbing in verifier, syscall and maps to > support handling new kfunc bpf_task_work_schedule and kernel structure > bpf_task_work. The idea is similar to how we already handle bpf_wq and > bpf_timer. > verifier changes validate calls to bpf_task_work_schedule to make sure > it is safe and expected invariants hold. > btf part is required to detect bpf_task_work structure inside map value > and store its offset, which will be used in the next patch to calculate > key and value addresses. > arraymap and hashtab changes are needed to handle freeing of the > bpf_task_work: run code needed to deinitialize it, for example cancel > task_work callback if possible. > The use of bpf_task_work and proper implementation for kfuncs are > introduced in the next patch. > > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> > --- > include/linux/bpf.h | 11 +++ > include/uapi/linux/bpf.h | 4 + > kernel/bpf/arraymap.c | 8 +- > kernel/bpf/btf.c | 15 ++++ > kernel/bpf/hashtab.c | 22 ++++-- > kernel/bpf/helpers.c | 45 +++++++++++ > kernel/bpf/syscall.c | 23 +++++- > kernel/bpf/verifier.c | 131 ++++++++++++++++++++++++++++++++- > tools/include/uapi/linux/bpf.h | 4 + > 9 files changed, 247 insertions(+), 16 deletions(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index f9cd2164ed23..cb83ba0eaed5 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -206,6 +206,7 @@ enum btf_field_type { > BPF_WORKQUEUE = (1 << 10), > BPF_UPTR = (1 << 11), > BPF_RES_SPIN_LOCK = (1 << 12), > + BPF_TASK_WORK = (1 << 13), > }; > > typedef void (*btf_dtor_kfunc_t)(void *); > @@ -245,6 +246,7 @@ struct btf_record { > int timer_off; > int wq_off; > int refcount_off; > + int task_work_off; > struct btf_field fields[]; > }; > > @@ -340,6 +342,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type) > return "bpf_rb_node"; > case BPF_REFCOUNT: > return "bpf_refcount"; > + case BPF_TASK_WORK: > + return "bpf_task_work"; > default: > WARN_ON_ONCE(1); > return "unknown"; > @@ -378,6 +382,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type) > return sizeof(struct bpf_rb_node); > case BPF_REFCOUNT: > return sizeof(struct bpf_refcount); > + case BPF_TASK_WORK: > + return sizeof(struct bpf_task_work); > default: > WARN_ON_ONCE(1); > return 0; > @@ -410,6 +416,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type) > return __alignof__(struct bpf_rb_node); > case BPF_REFCOUNT: > return __alignof__(struct bpf_refcount); > + case BPF_TASK_WORK: > + return __alignof__(struct bpf_task_work); > default: > WARN_ON_ONCE(1); > return 0; > @@ -441,6 +449,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr) > case BPF_KPTR_REF: > case BPF_KPTR_PERCPU: > case BPF_UPTR: > + case BPF_TASK_WORK: > break; > default: > WARN_ON_ONCE(1); > @@ -577,6 +586,7 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, > bool lock_src); > void bpf_timer_cancel_and_free(void *timer); > void bpf_wq_cancel_and_free(void *timer); > +void bpf_task_work_cancel_and_free(void *timer); > void bpf_list_head_free(const struct btf_field *field, void *list_head, > struct bpf_spin_lock *spin_lock); > void bpf_rb_root_free(const struct btf_field *field, void *rb_root, > @@ -2391,6 +2401,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec); > bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *rec_b); > void bpf_obj_free_timer(const struct btf_record *rec, void *obj); > void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj); > +void bpf_obj_free_task_work(const struct btf_record *rec, void *obj); > void bpf_obj_free_fields(const struct btf_record *rec, void *obj); > void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu); > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 233de8677382..e444d9f67829 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -7418,6 +7418,10 @@ struct bpf_timer { > __u64 __opaque[2]; > } __attribute__((aligned(8))); > > +struct bpf_task_work { > + __u64 __opaque[16]; > +} __attribute__((aligned(8))); > + > struct bpf_wq { > __u64 __opaque[2]; > } __attribute__((aligned(8))); > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c > index 3d080916faf9..4130d8e76dff 100644 > --- a/kernel/bpf/arraymap.c > +++ b/kernel/bpf/arraymap.c > @@ -431,7 +431,7 @@ static void *array_map_vmalloc_addr(struct bpf_array *array) > return (void *)round_down((unsigned long)array, PAGE_SIZE); > } > > -static void array_map_free_timers_wq(struct bpf_map *map) > +static void array_map_free_internal_structs(struct bpf_map *map) > { > struct bpf_array *array = container_of(map, struct bpf_array, map); > int i; > @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map) > /* We don't reset or free fields other than timer and workqueue > * on uref dropping to zero. > */ > - if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) { > + if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) { > for (i = 0; i < array->map.max_entries; i++) { > if (btf_record_has_field(map->record, BPF_TIMER)) > bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i)); > if (btf_record_has_field(map->record, BPF_WORKQUEUE)) > bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i)); > + if (btf_record_has_field(map->record, BPF_TASK_WORK)) > + bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i)); > } > } > } > @@ -783,7 +785,7 @@ const struct bpf_map_ops array_map_ops = { > .map_alloc = array_map_alloc, > .map_free = array_map_free, > .map_get_next_key = array_map_get_next_key, > - .map_release_uref = array_map_free_timers_wq, > + .map_release_uref = array_map_free_internal_structs, > .map_lookup_elem = array_map_lookup_elem, > .map_update_elem = array_map_update_elem, > .map_delete_elem = array_map_delete_elem, > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c > index 0aff814cb53a..c66f9c6dfc48 100644 > --- a/kernel/bpf/btf.c > +++ b/kernel/bpf/btf.c > @@ -3527,6 +3527,15 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_ > goto end; > } > } > + if (field_mask & BPF_TASK_WORK) { > + if (!strcmp(name, "bpf_task_work")) { > + if (*seen_mask & BPF_TASK_WORK) > + return -E2BIG; > + *seen_mask |= BPF_TASK_WORK; > + type = BPF_TASK_WORK; > + goto end; > + } > + } > field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head"); > field_mask_test_name(BPF_LIST_NODE, "bpf_list_node"); > field_mask_test_name(BPF_RB_ROOT, "bpf_rb_root"); > @@ -3693,6 +3702,7 @@ static int btf_find_field_one(const struct btf *btf, > case BPF_LIST_NODE: > case BPF_RB_NODE: > case BPF_REFCOUNT: > + case BPF_TASK_WORK: > ret = btf_find_struct(btf, var_type, off, sz, field_type, > info_cnt ? &info[0] : &tmp); > if (ret < 0) > @@ -3985,6 +3995,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type > rec->timer_off = -EINVAL; > rec->wq_off = -EINVAL; > rec->refcount_off = -EINVAL; > + rec->task_work_off = -EINVAL; > for (i = 0; i < cnt; i++) { > field_type_size = btf_field_type_size(info_arr[i].type); > if (info_arr[i].off + field_type_size > value_size) { > @@ -4050,6 +4061,10 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type > case BPF_LIST_NODE: > case BPF_RB_NODE: > break; > + case BPF_TASK_WORK: > + WARN_ON_ONCE(rec->task_work_off >= 0); > + rec->task_work_off = rec->fields[i].offset; > + break; > default: > ret = -EFAULT; > goto end; > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > index 71f9931ac64c..207ad4823b5b 100644 > --- a/kernel/bpf/hashtab.c > +++ b/kernel/bpf/hashtab.c > @@ -215,7 +215,7 @@ static bool htab_has_extra_elems(struct bpf_htab *htab) > return !htab_is_percpu(htab) && !htab_is_lru(htab) && !is_fd_htab(htab); > } > > -static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab) > +static void htab_free_prealloced_internal_structs(struct bpf_htab *htab) > { > u32 num_entries = htab->map.max_entries; > int i; > @@ -233,6 +233,9 @@ static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab) > if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE)) > bpf_obj_free_workqueue(htab->map.record, > htab_elem_value(elem, htab->map.key_size)); > + if (btf_record_has_field(htab->map.record, BPF_TASK_WORK)) > + bpf_obj_free_task_work(htab->map.record, > + htab_elem_value(elem, htab->map.key_size)); > cond_resched(); > } > } > @@ -1490,7 +1493,7 @@ static void delete_all_elements(struct bpf_htab *htab) > } > } > > -static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab) > +static void htab_free_malloced_internal_structs(struct bpf_htab *htab) > { > int i; > > @@ -1508,22 +1511,25 @@ static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab) > if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE)) > bpf_obj_free_workqueue(htab->map.record, > htab_elem_value(l, htab->map.key_size)); > + if (btf_record_has_field(htab->map.record, BPF_TASK_WORK)) > + bpf_obj_free_task_work(htab->map.record, > + htab_elem_value(l, htab->map.key_size)); > } > cond_resched_rcu(); > } > rcu_read_unlock(); > } > > -static void htab_map_free_timers_and_wq(struct bpf_map *map) > +static void htab_map_free_internal_structs(struct bpf_map *map) > { > struct bpf_htab *htab = container_of(map, struct bpf_htab, map); > > /* We only free timer and workqueue on uref dropping to zero */ > - if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE)) { > + if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) { > if (!htab_is_prealloc(htab)) > - htab_free_malloced_timers_and_wq(htab); > + htab_free_malloced_internal_structs(htab); > else > - htab_free_prealloced_timers_and_wq(htab); > + htab_free_prealloced_internal_structs(htab); > } > } > > @@ -2255,7 +2261,7 @@ const struct bpf_map_ops htab_map_ops = { > .map_alloc = htab_map_alloc, > .map_free = htab_map_free, > .map_get_next_key = htab_map_get_next_key, > - .map_release_uref = htab_map_free_timers_and_wq, > + .map_release_uref = htab_map_free_internal_structs, > .map_lookup_elem = htab_map_lookup_elem, > .map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem, > .map_update_elem = htab_map_update_elem, > @@ -2276,7 +2282,7 @@ const struct bpf_map_ops htab_lru_map_ops = { > .map_alloc = htab_map_alloc, > .map_free = htab_map_free, > .map_get_next_key = htab_map_get_next_key, > - .map_release_uref = htab_map_free_timers_and_wq, > + .map_release_uref = htab_map_free_internal_structs, > .map_lookup_elem = htab_lru_map_lookup_elem, > .map_lookup_and_delete_elem = htab_lru_map_lookup_and_delete_elem, > .map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys, > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index 6b4877e85a68..322ffcaedc38 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -3703,8 +3703,53 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) > return bpf_strnstr(s1__ign, s2__ign, XATTR_SIZE_MAX); > } > > +typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); > + > +/** > + * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode > + * @task: Task struct for which callback should be scheduled > + * @tw: Pointer to the bpf_task_work struct, to use by kernel internally for bookkeeping > + * @map__map: bpf_map which contains bpf_task_work in one of the values > + * @callback: pointer to BPF subprogram to call > + * @aux__prog: user should pass NULL > + * > + * Return: 0 if task work has been scheduled successfully, negative error code otherwise > + */ > +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, > + struct bpf_task_work *tw, > + struct bpf_map *map__map, > + bpf_task_work_callback_t callback, > + void *aux__prog) > +{ > + return 0; > +} > + > +/** > + * bpf_task_work_schedule_resume - Schedule BPF callback using task_work_add with TWA_RESUME or > + * TWA_NMI_CURRENT mode if scheduling for the current task in the NMI > + * @task: Task struct for which callback should be scheduled > + * @tw: Pointer to the bpf_task_work struct, to use by kernel internally for bookkeeping > + * @map__map: bpf_map which contains bpf_task_work in one of the values > + * @callback: pointer to BPF subprogram to call > + * @aux__prog: user should pass NULL > + * > + * Return: 0 if task work has been scheduled successfully, negative error code otherwise > + */ > +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, > + struct bpf_task_work *tw, > + struct bpf_map *map__map, > + bpf_task_work_callback_t callback, > + void *aux__prog) > +{ > + return 0; > +} Is there a reason we need separate kfuncs? Why can't we have one with flags for different TWA modes? > + > __bpf_kfunc_end_defs(); > > +void bpf_task_work_cancel_and_free(void *val) > +{ > +} > + > BTF_KFUNCS_START(generic_btf_ids) > #ifdef CONFIG_CRASH_DUMP > BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE) > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index e63039817af3..73f801751280 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -670,6 +670,7 @@ void btf_record_free(struct btf_record *rec) > case BPF_TIMER: > case BPF_REFCOUNT: > case BPF_WORKQUEUE: > + case BPF_TASK_WORK: > /* Nothing to release */ > break; > default: > @@ -723,6 +724,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec) > case BPF_TIMER: > case BPF_REFCOUNT: > case BPF_WORKQUEUE: > + case BPF_TASK_WORK: > /* Nothing to acquire */ > break; > default: > @@ -781,6 +783,13 @@ void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj) > bpf_wq_cancel_and_free(obj + rec->wq_off); > } > > +void bpf_obj_free_task_work(const struct btf_record *rec, void *obj) > +{ > + if (WARN_ON_ONCE(!btf_record_has_field(rec, BPF_TASK_WORK))) > + return; > + bpf_task_work_cancel_and_free(obj + rec->task_work_off); > +} > + > void bpf_obj_free_fields(const struct btf_record *rec, void *obj) > { > const struct btf_field *fields; > @@ -838,6 +847,9 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj) > continue; > bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off); > break; > + case BPF_TASK_WORK: > + bpf_task_work_cancel_and_free(field_ptr); > + break; > case BPF_LIST_NODE: > case BPF_RB_NODE: > case BPF_REFCOUNT: > @@ -1234,7 +1246,8 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token, > > map->record = btf_parse_fields(btf, value_type, > BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD | > - BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR, > + BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR | > + BPF_TASK_WORK, > map->value_size); > if (!IS_ERR_OR_NULL(map->record)) { > int i; > @@ -1306,6 +1319,14 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token, > goto free_map_tab; > } > break; > + case BPF_TASK_WORK: > + if (map->map_type != BPF_MAP_TYPE_HASH && > + map->map_type != BPF_MAP_TYPE_LRU_HASH && > + map->map_type != BPF_MAP_TYPE_ARRAY) { > + ret = -EOPNOTSUPP; > + goto free_map_tab; > + } > + break; > default: > /* Fail if map_type checks are missing for a field type */ > ret = -EOPNOTSUPP; > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 399f03e62508..905dc0c5a73d 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -524,9 +524,11 @@ static bool is_sync_callback_calling_function(enum bpf_func_id func_id) > func_id == BPF_FUNC_user_ringbuf_drain; > } > > -static bool is_async_callback_calling_function(enum bpf_func_id func_id) > +static bool is_task_work_add_kfunc(u32 func_id); > + > +static bool is_async_callback_calling_function(u32 func_id) > { > - return func_id == BPF_FUNC_timer_set_callback; > + return func_id == BPF_FUNC_timer_set_callback || is_task_work_add_kfunc(func_id); Hmm, isn't this for helpers? You probably want to change is_async_callback_calling_kfunc. > } > > static bool is_callback_calling_function(enum bpf_func_id func_id) > @@ -2236,6 +2238,8 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg) > reg->map_uid = reg->id; > if (btf_record_has_field(map->inner_map_meta->record, BPF_WORKQUEUE)) > reg->map_uid = reg->id; > + if (btf_record_has_field(map->inner_map_meta->record, BPF_TASK_WORK)) > + reg->map_uid = reg->id; > } else if (map->map_type == BPF_MAP_TYPE_XSKMAP) { > reg->type = PTR_TO_XDP_SOCK; > } else if (map->map_type == BPF_MAP_TYPE_SOCKMAP || > @@ -8569,6 +8573,44 @@ static int process_wq_func(struct bpf_verifier_env *env, int regno, > return 0; > } > > +static int process_task_work_func(struct bpf_verifier_env *env, int regno, > + struct bpf_kfunc_call_arg_meta *meta) > +{ > + struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno]; > + struct bpf_map *map = reg->map_ptr; > + bool is_const = tnum_is_const(reg->var_off); > + u64 val = reg->var_off.value; > + > + if (!map->btf) { > + verbose(env, "map '%s' has to have BTF in order to use bpf_task_work\n", > + map->name); > + return -EINVAL; > + } > + if (!btf_record_has_field(map->record, BPF_TASK_WORK)) { > + verbose(env, "map '%s' has no valid bpf_task_work\n", map->name); > + return -EINVAL; > + } > + if (map->record->task_work_off != val + reg->off) { > + verbose(env, > + "off %lld doesn't point to 'struct bpf_task_work' that is at %d\n", > + val + reg->off, map->record->task_work_off); > + return -EINVAL; > + } > + if (!is_const) { > + verbose(env, > + "bpf_task_work has to be at the constant offset\n"); > + return -EINVAL; > + } It would make more sense to me to check this before matching val + reg->off. > + if (meta->map.ptr) { > + verifier_bug(env, "Two map pointers in a bpf_task_work kfunc"); > + return -EFAULT; > + } > + > + meta->map.uid = reg->map_uid; > + meta->map.ptr = map; > + return 0; > +} > + > static int process_kptr_func(struct bpf_verifier_env *env, int regno, > struct bpf_call_arg_meta *meta) > { > @@ -10616,7 +10658,8 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins > env->subprog_info[subprog].is_async_cb = true; > async_cb = push_async_cb(env, env->subprog_info[subprog].start, > insn_idx, subprog, > - is_bpf_wq_set_callback_impl_kfunc(insn->imm)); > + is_bpf_wq_set_callback_impl_kfunc(insn->imm) || > + is_task_work_add_kfunc(insn->imm)); > if (!async_cb) > return -EFAULT; > callee = async_cb->frame[0]; > @@ -10929,6 +10972,35 @@ static int set_rbtree_add_callback_state(struct bpf_verifier_env *env, > return 0; > } > > +static int set_task_work_schedule_callback_state(struct bpf_verifier_env *env, > + struct bpf_func_state *caller, > + struct bpf_func_state *callee, > + int insn_idx) > +{ > + struct bpf_map *map_ptr = caller->regs[BPF_REG_3].map_ptr; > + > + /* > + * callback_fn(struct bpf_map *map, void *key, void *value); > + */ > + callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP; > + __mark_reg_known_zero(&callee->regs[BPF_REG_1]); > + callee->regs[BPF_REG_1].map_ptr = map_ptr; > + > + callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY; > + __mark_reg_known_zero(&callee->regs[BPF_REG_2]); > + callee->regs[BPF_REG_2].map_ptr = map_ptr; > + > + callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE; > + __mark_reg_known_zero(&callee->regs[BPF_REG_3]); > + callee->regs[BPF_REG_3].map_ptr = map_ptr; > + > + /* unused */ > + __mark_reg_not_init(env, &callee->regs[BPF_REG_4]); > + __mark_reg_not_init(env, &callee->regs[BPF_REG_5]); > + callee->in_callback_fn = true; > + return 0; > +} > + > static bool is_rbtree_lock_required_kfunc(u32 btf_id); > > /* Are we currently verifying the callback for a rbtree helper that must > @@ -12059,6 +12131,7 @@ enum { > KF_ARG_RB_NODE_ID, > KF_ARG_WORKQUEUE_ID, > KF_ARG_RES_SPIN_LOCK_ID, > + KF_ARG_TASK_WORK_ID, > }; > > BTF_ID_LIST(kf_arg_btf_ids) > @@ -12069,6 +12142,7 @@ BTF_ID(struct, bpf_rb_root) > BTF_ID(struct, bpf_rb_node) > BTF_ID(struct, bpf_wq) > BTF_ID(struct, bpf_res_spin_lock) > +BTF_ID(struct, bpf_task_work) > > static bool __is_kfunc_ptr_arg_type(const struct btf *btf, > const struct btf_param *arg, int type) > @@ -12117,6 +12191,11 @@ static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg) > return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID); > } > > +static bool is_kfunc_arg_task_work(const struct btf *btf, const struct btf_param *arg) > +{ > + return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TASK_WORK_ID); > +} > + > static bool is_kfunc_arg_res_spin_lock(const struct btf *btf, const struct btf_param *arg) > { > return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RES_SPIN_LOCK_ID); > @@ -12204,6 +12283,7 @@ enum kfunc_ptr_arg_type { > KF_ARG_PTR_TO_WORKQUEUE, > KF_ARG_PTR_TO_IRQ_FLAG, > KF_ARG_PTR_TO_RES_SPIN_LOCK, > + KF_ARG_PTR_TO_TASK_WORK, > }; > > enum special_kfunc_type { > @@ -12252,6 +12332,8 @@ enum special_kfunc_type { > KF_bpf_res_spin_lock_irqsave, > KF_bpf_res_spin_unlock_irqrestore, > KF___bpf_trap, > + KF_bpf_task_work_schedule_signal, > + KF_bpf_task_work_schedule_resume, > }; > > BTF_ID_LIST(special_kfunc_list) > @@ -12318,6 +12400,14 @@ BTF_ID(func, bpf_res_spin_unlock) > BTF_ID(func, bpf_res_spin_lock_irqsave) > BTF_ID(func, bpf_res_spin_unlock_irqrestore) > BTF_ID(func, __bpf_trap) > +BTF_ID(func, bpf_task_work_schedule_signal) > +BTF_ID(func, bpf_task_work_schedule_resume) > + > +static bool is_task_work_add_kfunc(u32 func_id) > +{ > + return func_id == special_kfunc_list[KF_bpf_task_work_schedule_signal] || > + func_id == special_kfunc_list[KF_bpf_task_work_schedule_resume]; > +} > > static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta) > { > @@ -12408,6 +12498,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env, > if (is_kfunc_arg_wq(meta->btf, &args[argno])) > return KF_ARG_PTR_TO_WORKQUEUE; > > + if (is_kfunc_arg_task_work(meta->btf, &args[argno])) > + return KF_ARG_PTR_TO_TASK_WORK; > + > if (is_kfunc_arg_irq_flag(meta->btf, &args[argno])) > return KF_ARG_PTR_TO_IRQ_FLAG; > > @@ -12751,7 +12844,8 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id) > > static bool is_async_callback_calling_kfunc(u32 btf_id) > { > - return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl]; > + return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl] || > + is_task_work_add_kfunc(btf_id); > } > > static bool is_bpf_throw_kfunc(struct bpf_insn *insn) > @@ -13153,6 +13247,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ > return -EINVAL; > } > } > + if (meta->map.ptr && reg->map_ptr->record->task_work_off >= 0) { > + if (meta->map.ptr != reg->map_ptr || > + meta->map.uid != reg->map_uid) { > + verbose(env, > + "bpf_task_work pointer in R2 map_uid=%d doesn't match map pointer in R3 map_uid=%d\n", > + meta->map.uid, reg->map_uid); > + return -EINVAL; > + } > + } > meta->map.ptr = reg->map_ptr; > meta->map.uid = reg->map_uid; > fallthrough; > @@ -13185,6 +13288,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ > case KF_ARG_PTR_TO_REFCOUNTED_KPTR: > case KF_ARG_PTR_TO_CONST_STR: > case KF_ARG_PTR_TO_WORKQUEUE: > + case KF_ARG_PTR_TO_TASK_WORK: > case KF_ARG_PTR_TO_IRQ_FLAG: > case KF_ARG_PTR_TO_RES_SPIN_LOCK: > break; > @@ -13476,6 +13580,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ > if (ret < 0) > return ret; > break; > + case KF_ARG_PTR_TO_TASK_WORK: > + if (reg->type != PTR_TO_MAP_VALUE) { > + verbose(env, "arg#%d doesn't point to a map value\n", i); > + return -EINVAL; > + } > + ret = process_task_work_func(env, regno, meta); > + if (ret < 0) > + return ret; > + break; > case KF_ARG_PTR_TO_IRQ_FLAG: > if (reg->type != PTR_TO_STACK) { > verbose(env, "arg#%d doesn't point to an irq flag on stack\n", i); > @@ -13842,6 +13955,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, > } > } > > + if (is_task_work_add_kfunc(meta.func_id)) { > + err = push_callback_call(env, insn, insn_idx, meta.subprogno, > + set_task_work_schedule_callback_state); > + if (err) { > + verbose(env, "kfunc %s#%d failed callback verification\n", > + func_name, meta.func_id); > + return err; > + } > + } > + > rcu_lock = is_kfunc_bpf_rcu_read_lock(&meta); > rcu_unlock = is_kfunc_bpf_rcu_read_unlock(&meta); > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 233de8677382..e444d9f67829 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -7418,6 +7418,10 @@ struct bpf_timer { > __u64 __opaque[2]; > } __attribute__((aligned(8))); > > +struct bpf_task_work { > + __u64 __opaque[16]; > +} __attribute__((aligned(8))); > + > struct bpf_wq { > __u64 __opaque[2]; > } __attribute__((aligned(8))); > -- > 2.50.1 > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 1/4] bpf: bpf task work plumbing 2025-08-07 17:53 ` Kumar Kartikeya Dwivedi @ 2025-08-15 18:25 ` Mykyta Yatsenko 0 siblings, 0 replies; 15+ messages in thread From: Mykyta Yatsenko @ 2025-08-15 18:25 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Mykyta Yatsenko On 8/7/25 18:53, Kumar Kartikeya Dwivedi wrote: > On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: >> From: Mykyta Yatsenko <yatsenko@meta.com> >> >> This patch adds necessary plumbing in verifier, syscall and maps to >> support handling new kfunc bpf_task_work_schedule and kernel structure >> bpf_task_work. The idea is similar to how we already handle bpf_wq and >> bpf_timer. >> verifier changes validate calls to bpf_task_work_schedule to make sure >> it is safe and expected invariants hold. >> btf part is required to detect bpf_task_work structure inside map value >> and store its offset, which will be used in the next patch to calculate >> key and value addresses. >> arraymap and hashtab changes are needed to handle freeing of the >> bpf_task_work: run code needed to deinitialize it, for example cancel >> task_work callback if possible. >> The use of bpf_task_work and proper implementation for kfuncs are >> introduced in the next patch. >> >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> >> --- >> include/linux/bpf.h | 11 +++ >> include/uapi/linux/bpf.h | 4 + >> kernel/bpf/arraymap.c | 8 +- >> kernel/bpf/btf.c | 15 ++++ >> kernel/bpf/hashtab.c | 22 ++++-- >> kernel/bpf/helpers.c | 45 +++++++++++ >> kernel/bpf/syscall.c | 23 +++++- >> kernel/bpf/verifier.c | 131 ++++++++++++++++++++++++++++++++- >> tools/include/uapi/linux/bpf.h | 4 + >> 9 files changed, 247 insertions(+), 16 deletions(-) >> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index f9cd2164ed23..cb83ba0eaed5 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -206,6 +206,7 @@ enum btf_field_type { >> BPF_WORKQUEUE = (1 << 10), >> BPF_UPTR = (1 << 11), >> BPF_RES_SPIN_LOCK = (1 << 12), >> + BPF_TASK_WORK = (1 << 13), >> }; >> >> typedef void (*btf_dtor_kfunc_t)(void *); >> @@ -245,6 +246,7 @@ struct btf_record { >> int timer_off; >> int wq_off; >> int refcount_off; >> + int task_work_off; >> struct btf_field fields[]; >> }; >> >> @@ -340,6 +342,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type) >> return "bpf_rb_node"; >> case BPF_REFCOUNT: >> return "bpf_refcount"; >> + case BPF_TASK_WORK: >> + return "bpf_task_work"; >> default: >> WARN_ON_ONCE(1); >> return "unknown"; >> @@ -378,6 +382,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type) >> return sizeof(struct bpf_rb_node); >> case BPF_REFCOUNT: >> return sizeof(struct bpf_refcount); >> + case BPF_TASK_WORK: >> + return sizeof(struct bpf_task_work); >> default: >> WARN_ON_ONCE(1); >> return 0; >> @@ -410,6 +416,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type) >> return __alignof__(struct bpf_rb_node); >> case BPF_REFCOUNT: >> return __alignof__(struct bpf_refcount); >> + case BPF_TASK_WORK: >> + return __alignof__(struct bpf_task_work); >> default: >> WARN_ON_ONCE(1); >> return 0; >> @@ -441,6 +449,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr) >> case BPF_KPTR_REF: >> case BPF_KPTR_PERCPU: >> case BPF_UPTR: >> + case BPF_TASK_WORK: >> break; >> default: >> WARN_ON_ONCE(1); >> @@ -577,6 +586,7 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, >> bool lock_src); >> void bpf_timer_cancel_and_free(void *timer); >> void bpf_wq_cancel_and_free(void *timer); >> +void bpf_task_work_cancel_and_free(void *timer); >> void bpf_list_head_free(const struct btf_field *field, void *list_head, >> struct bpf_spin_lock *spin_lock); >> void bpf_rb_root_free(const struct btf_field *field, void *rb_root, >> @@ -2391,6 +2401,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec); >> bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *rec_b); >> void bpf_obj_free_timer(const struct btf_record *rec, void *obj); >> void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj); >> +void bpf_obj_free_task_work(const struct btf_record *rec, void *obj); >> void bpf_obj_free_fields(const struct btf_record *rec, void *obj); >> void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu); >> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 233de8677382..e444d9f67829 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -7418,6 +7418,10 @@ struct bpf_timer { >> __u64 __opaque[2]; >> } __attribute__((aligned(8))); >> >> +struct bpf_task_work { >> + __u64 __opaque[16]; >> +} __attribute__((aligned(8))); >> + >> struct bpf_wq { >> __u64 __opaque[2]; >> } __attribute__((aligned(8))); >> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c >> index 3d080916faf9..4130d8e76dff 100644 >> --- a/kernel/bpf/arraymap.c >> +++ b/kernel/bpf/arraymap.c >> @@ -431,7 +431,7 @@ static void *array_map_vmalloc_addr(struct bpf_array *array) >> return (void *)round_down((unsigned long)array, PAGE_SIZE); >> } >> >> -static void array_map_free_timers_wq(struct bpf_map *map) >> +static void array_map_free_internal_structs(struct bpf_map *map) >> { >> struct bpf_array *array = container_of(map, struct bpf_array, map); >> int i; >> @@ -439,12 +439,14 @@ static void array_map_free_timers_wq(struct bpf_map *map) >> /* We don't reset or free fields other than timer and workqueue >> * on uref dropping to zero. >> */ >> - if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) { >> + if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) { >> for (i = 0; i < array->map.max_entries; i++) { >> if (btf_record_has_field(map->record, BPF_TIMER)) >> bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i)); >> if (btf_record_has_field(map->record, BPF_WORKQUEUE)) >> bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i)); >> + if (btf_record_has_field(map->record, BPF_TASK_WORK)) >> + bpf_obj_free_task_work(map->record, array_map_elem_ptr(array, i)); >> } >> } >> } >> @@ -783,7 +785,7 @@ const struct bpf_map_ops array_map_ops = { >> .map_alloc = array_map_alloc, >> .map_free = array_map_free, >> .map_get_next_key = array_map_get_next_key, >> - .map_release_uref = array_map_free_timers_wq, >> + .map_release_uref = array_map_free_internal_structs, >> .map_lookup_elem = array_map_lookup_elem, >> .map_update_elem = array_map_update_elem, >> .map_delete_elem = array_map_delete_elem, >> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c >> index 0aff814cb53a..c66f9c6dfc48 100644 >> --- a/kernel/bpf/btf.c >> +++ b/kernel/bpf/btf.c >> @@ -3527,6 +3527,15 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_ >> goto end; >> } >> } >> + if (field_mask & BPF_TASK_WORK) { >> + if (!strcmp(name, "bpf_task_work")) { >> + if (*seen_mask & BPF_TASK_WORK) >> + return -E2BIG; >> + *seen_mask |= BPF_TASK_WORK; >> + type = BPF_TASK_WORK; >> + goto end; >> + } >> + } >> field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head"); >> field_mask_test_name(BPF_LIST_NODE, "bpf_list_node"); >> field_mask_test_name(BPF_RB_ROOT, "bpf_rb_root"); >> @@ -3693,6 +3702,7 @@ static int btf_find_field_one(const struct btf *btf, >> case BPF_LIST_NODE: >> case BPF_RB_NODE: >> case BPF_REFCOUNT: >> + case BPF_TASK_WORK: >> ret = btf_find_struct(btf, var_type, off, sz, field_type, >> info_cnt ? &info[0] : &tmp); >> if (ret < 0) >> @@ -3985,6 +3995,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type >> rec->timer_off = -EINVAL; >> rec->wq_off = -EINVAL; >> rec->refcount_off = -EINVAL; >> + rec->task_work_off = -EINVAL; >> for (i = 0; i < cnt; i++) { >> field_type_size = btf_field_type_size(info_arr[i].type); >> if (info_arr[i].off + field_type_size > value_size) { >> @@ -4050,6 +4061,10 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type >> case BPF_LIST_NODE: >> case BPF_RB_NODE: >> break; >> + case BPF_TASK_WORK: >> + WARN_ON_ONCE(rec->task_work_off >= 0); >> + rec->task_work_off = rec->fields[i].offset; >> + break; >> default: >> ret = -EFAULT; >> goto end; >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >> index 71f9931ac64c..207ad4823b5b 100644 >> --- a/kernel/bpf/hashtab.c >> +++ b/kernel/bpf/hashtab.c >> @@ -215,7 +215,7 @@ static bool htab_has_extra_elems(struct bpf_htab *htab) >> return !htab_is_percpu(htab) && !htab_is_lru(htab) && !is_fd_htab(htab); >> } >> >> -static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab) >> +static void htab_free_prealloced_internal_structs(struct bpf_htab *htab) >> { >> u32 num_entries = htab->map.max_entries; >> int i; >> @@ -233,6 +233,9 @@ static void htab_free_prealloced_timers_and_wq(struct bpf_htab *htab) >> if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE)) >> bpf_obj_free_workqueue(htab->map.record, >> htab_elem_value(elem, htab->map.key_size)); >> + if (btf_record_has_field(htab->map.record, BPF_TASK_WORK)) >> + bpf_obj_free_task_work(htab->map.record, >> + htab_elem_value(elem, htab->map.key_size)); >> cond_resched(); >> } >> } >> @@ -1490,7 +1493,7 @@ static void delete_all_elements(struct bpf_htab *htab) >> } >> } >> >> -static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab) >> +static void htab_free_malloced_internal_structs(struct bpf_htab *htab) >> { >> int i; >> >> @@ -1508,22 +1511,25 @@ static void htab_free_malloced_timers_and_wq(struct bpf_htab *htab) >> if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE)) >> bpf_obj_free_workqueue(htab->map.record, >> htab_elem_value(l, htab->map.key_size)); >> + if (btf_record_has_field(htab->map.record, BPF_TASK_WORK)) >> + bpf_obj_free_task_work(htab->map.record, >> + htab_elem_value(l, htab->map.key_size)); >> } >> cond_resched_rcu(); >> } >> rcu_read_unlock(); >> } >> >> -static void htab_map_free_timers_and_wq(struct bpf_map *map) >> +static void htab_map_free_internal_structs(struct bpf_map *map) >> { >> struct bpf_htab *htab = container_of(map, struct bpf_htab, map); >> >> /* We only free timer and workqueue on uref dropping to zero */ >> - if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE)) { >> + if (btf_record_has_field(htab->map.record, BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK)) { >> if (!htab_is_prealloc(htab)) >> - htab_free_malloced_timers_and_wq(htab); >> + htab_free_malloced_internal_structs(htab); >> else >> - htab_free_prealloced_timers_and_wq(htab); >> + htab_free_prealloced_internal_structs(htab); >> } >> } >> >> @@ -2255,7 +2261,7 @@ const struct bpf_map_ops htab_map_ops = { >> .map_alloc = htab_map_alloc, >> .map_free = htab_map_free, >> .map_get_next_key = htab_map_get_next_key, >> - .map_release_uref = htab_map_free_timers_and_wq, >> + .map_release_uref = htab_map_free_internal_structs, >> .map_lookup_elem = htab_map_lookup_elem, >> .map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem, >> .map_update_elem = htab_map_update_elem, >> @@ -2276,7 +2282,7 @@ const struct bpf_map_ops htab_lru_map_ops = { >> .map_alloc = htab_map_alloc, >> .map_free = htab_map_free, >> .map_get_next_key = htab_map_get_next_key, >> - .map_release_uref = htab_map_free_timers_and_wq, >> + .map_release_uref = htab_map_free_internal_structs, >> .map_lookup_elem = htab_lru_map_lookup_elem, >> .map_lookup_and_delete_elem = htab_lru_map_lookup_and_delete_elem, >> .map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys, >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >> index 6b4877e85a68..322ffcaedc38 100644 >> --- a/kernel/bpf/helpers.c >> +++ b/kernel/bpf/helpers.c >> @@ -3703,8 +3703,53 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) >> return bpf_strnstr(s1__ign, s2__ign, XATTR_SIZE_MAX); >> } >> >> +typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); >> + >> +/** >> + * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode >> + * @task: Task struct for which callback should be scheduled >> + * @tw: Pointer to the bpf_task_work struct, to use by kernel internally for bookkeeping >> + * @map__map: bpf_map which contains bpf_task_work in one of the values >> + * @callback: pointer to BPF subprogram to call >> + * @aux__prog: user should pass NULL >> + * >> + * Return: 0 if task work has been scheduled successfully, negative error code otherwise >> + */ >> +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, >> + struct bpf_task_work *tw, >> + struct bpf_map *map__map, >> + bpf_task_work_callback_t callback, >> + void *aux__prog) >> +{ >> + return 0; >> +} >> + >> +/** >> + * bpf_task_work_schedule_resume - Schedule BPF callback using task_work_add with TWA_RESUME or >> + * TWA_NMI_CURRENT mode if scheduling for the current task in the NMI >> + * @task: Task struct for which callback should be scheduled >> + * @tw: Pointer to the bpf_task_work struct, to use by kernel internally for bookkeeping >> + * @map__map: bpf_map which contains bpf_task_work in one of the values >> + * @callback: pointer to BPF subprogram to call >> + * @aux__prog: user should pass NULL >> + * >> + * Return: 0 if task work has been scheduled successfully, negative error code otherwise >> + */ >> +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, >> + struct bpf_task_work *tw, >> + struct bpf_map *map__map, >> + bpf_task_work_callback_t callback, >> + void *aux__prog) >> +{ >> + return 0; >> +} > Is there a reason we need separate kfuncs? Why can't we have one with > flags for different TWA modes? We are running into 5 arguments limit here. >> + >> __bpf_kfunc_end_defs(); >> >> +void bpf_task_work_cancel_and_free(void *val) >> +{ >> +} >> + >> BTF_KFUNCS_START(generic_btf_ids) >> #ifdef CONFIG_CRASH_DUMP >> BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE) >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index e63039817af3..73f801751280 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -670,6 +670,7 @@ void btf_record_free(struct btf_record *rec) >> case BPF_TIMER: >> case BPF_REFCOUNT: >> case BPF_WORKQUEUE: >> + case BPF_TASK_WORK: >> /* Nothing to release */ >> break; >> default: >> @@ -723,6 +724,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec) >> case BPF_TIMER: >> case BPF_REFCOUNT: >> case BPF_WORKQUEUE: >> + case BPF_TASK_WORK: >> /* Nothing to acquire */ >> break; >> default: >> @@ -781,6 +783,13 @@ void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj) >> bpf_wq_cancel_and_free(obj + rec->wq_off); >> } >> >> +void bpf_obj_free_task_work(const struct btf_record *rec, void *obj) >> +{ >> + if (WARN_ON_ONCE(!btf_record_has_field(rec, BPF_TASK_WORK))) >> + return; >> + bpf_task_work_cancel_and_free(obj + rec->task_work_off); >> +} >> + >> void bpf_obj_free_fields(const struct btf_record *rec, void *obj) >> { >> const struct btf_field *fields; >> @@ -838,6 +847,9 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj) >> continue; >> bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off); >> break; >> + case BPF_TASK_WORK: >> + bpf_task_work_cancel_and_free(field_ptr); >> + break; >> case BPF_LIST_NODE: >> case BPF_RB_NODE: >> case BPF_REFCOUNT: >> @@ -1234,7 +1246,8 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token, >> >> map->record = btf_parse_fields(btf, value_type, >> BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD | >> - BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR, >> + BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR | >> + BPF_TASK_WORK, >> map->value_size); >> if (!IS_ERR_OR_NULL(map->record)) { >> int i; >> @@ -1306,6 +1319,14 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token, >> goto free_map_tab; >> } >> break; >> + case BPF_TASK_WORK: >> + if (map->map_type != BPF_MAP_TYPE_HASH && >> + map->map_type != BPF_MAP_TYPE_LRU_HASH && >> + map->map_type != BPF_MAP_TYPE_ARRAY) { >> + ret = -EOPNOTSUPP; >> + goto free_map_tab; >> + } >> + break; >> default: >> /* Fail if map_type checks are missing for a field type */ >> ret = -EOPNOTSUPP; >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> index 399f03e62508..905dc0c5a73d 100644 >> --- a/kernel/bpf/verifier.c >> +++ b/kernel/bpf/verifier.c >> @@ -524,9 +524,11 @@ static bool is_sync_callback_calling_function(enum bpf_func_id func_id) >> func_id == BPF_FUNC_user_ringbuf_drain; >> } >> >> -static bool is_async_callback_calling_function(enum bpf_func_id func_id) >> +static bool is_task_work_add_kfunc(u32 func_id); >> + >> +static bool is_async_callback_calling_function(u32 func_id) >> { >> - return func_id == BPF_FUNC_timer_set_callback; >> + return func_id == BPF_FUNC_timer_set_callback || is_task_work_add_kfunc(func_id); > Hmm, isn't this for helpers? You probably want to change > is_async_callback_calling_kfunc. Agreed. > >> } >> >> static bool is_callback_calling_function(enum bpf_func_id func_id) >> @@ -2236,6 +2238,8 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg) >> reg->map_uid = reg->id; >> if (btf_record_has_field(map->inner_map_meta->record, BPF_WORKQUEUE)) >> reg->map_uid = reg->id; >> + if (btf_record_has_field(map->inner_map_meta->record, BPF_TASK_WORK)) >> + reg->map_uid = reg->id; >> } else if (map->map_type == BPF_MAP_TYPE_XSKMAP) { >> reg->type = PTR_TO_XDP_SOCK; >> } else if (map->map_type == BPF_MAP_TYPE_SOCKMAP || >> @@ -8569,6 +8573,44 @@ static int process_wq_func(struct bpf_verifier_env *env, int regno, >> return 0; >> } >> >> +static int process_task_work_func(struct bpf_verifier_env *env, int regno, >> + struct bpf_kfunc_call_arg_meta *meta) >> +{ >> + struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno]; >> + struct bpf_map *map = reg->map_ptr; >> + bool is_const = tnum_is_const(reg->var_off); >> + u64 val = reg->var_off.value; >> + >> + if (!map->btf) { >> + verbose(env, "map '%s' has to have BTF in order to use bpf_task_work\n", >> + map->name); >> + return -EINVAL; >> + } >> + if (!btf_record_has_field(map->record, BPF_TASK_WORK)) { >> + verbose(env, "map '%s' has no valid bpf_task_work\n", map->name); >> + return -EINVAL; >> + } >> + if (map->record->task_work_off != val + reg->off) { >> + verbose(env, >> + "off %lld doesn't point to 'struct bpf_task_work' that is at %d\n", >> + val + reg->off, map->record->task_work_off); >> + return -EINVAL; >> + } >> + if (!is_const) { >> + verbose(env, >> + "bpf_task_work has to be at the constant offset\n"); >> + return -EINVAL; >> + } > It would make more sense to me to check this before matching val + reg->off. Makes sense. Sorry for the delayed response, I was focused on another patch. ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH bpf-next 2/4] bpf: extract map key pointer calculation 2025-08-06 14:45 [PATCH bpf-next 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko 2025-08-06 14:45 ` [PATCH bpf-next 1/4] bpf: bpf task work plumbing Mykyta Yatsenko @ 2025-08-06 14:45 ` Mykyta Yatsenko 2025-08-07 19:01 ` Kumar Kartikeya Dwivedi 2025-08-06 14:45 ` [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko 2025-08-06 14:45 ` [PATCH bpf-next 4/4] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko 3 siblings, 1 reply; 15+ messages in thread From: Mykyta Yatsenko @ 2025-08-06 14:45 UTC (permalink / raw) To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87; +Cc: Mykyta Yatsenko From: Mykyta Yatsenko <yatsenko@meta.com> Calculation of the BPF map key, given the pointer to a value is duplicated in a couple of places in helpers already, in the next patch another use case is introduced as well. This patch extracts that functionality into a separate function. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> --- kernel/bpf/helpers.c | 31 ++++++++++++++----------------- 1 file changed, 14 insertions(+), 17 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 322ffcaedc38..516286f67f0d 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1084,6 +1084,18 @@ const struct bpf_func_proto bpf_snprintf_proto = { .arg5_type = ARG_CONST_SIZE_OR_ZERO, }; +static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx) +{ + if (map->map_type == BPF_MAP_TYPE_ARRAY) { + struct bpf_array *array = + container_of(map, struct bpf_array, map); + + *arr_idx = ((char *)value - array->value) / array->elem_size; + return arr_idx; + } + return (void *)value - round_up(map->key_size, 8); +} + struct bpf_async_cb { struct bpf_map *map; struct bpf_prog *prog; @@ -1166,15 +1178,8 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer) * bpf_map_delete_elem() on the same timer. */ this_cpu_write(hrtimer_running, t); - if (map->map_type == BPF_MAP_TYPE_ARRAY) { - struct bpf_array *array = container_of(map, struct bpf_array, map); - /* compute the key */ - idx = ((char *)value - array->value) / array->elem_size; - key = &idx; - } else { /* hash or lru */ - key = value - round_up(map->key_size, 8); - } + key = map_key_from_value(map, value, &idx); callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0); /* The verifier checked that return value is zero. */ @@ -1200,15 +1205,7 @@ static void bpf_wq_work(struct work_struct *work) if (!callback_fn) return; - if (map->map_type == BPF_MAP_TYPE_ARRAY) { - struct bpf_array *array = container_of(map, struct bpf_array, map); - - /* compute the key */ - idx = ((char *)value - array->value) / array->elem_size; - key = &idx; - } else { /* hash or lru */ - key = value - round_up(map->key_size, 8); - } + key = map_key_from_value(map, value, &idx); rcu_read_lock_trace(); migrate_disable(); -- 2.50.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 2/4] bpf: extract map key pointer calculation 2025-08-06 14:45 ` [PATCH bpf-next 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko @ 2025-08-07 19:01 ` Kumar Kartikeya Dwivedi 0 siblings, 0 replies; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2025-08-07 19:01 UTC (permalink / raw) To: Mykyta Yatsenko Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Mykyta Yatsenko On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > > From: Mykyta Yatsenko <yatsenko@meta.com> > > Calculation of the BPF map key, given the pointer to a value is > duplicated in a couple of places in helpers already, in the next patch > another use case is introduced as well. > This patch extracts that functionality into a separate function. > > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> > --- Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> > kernel/bpf/helpers.c | 31 ++++++++++++++----------------- > 1 file changed, 14 insertions(+), 17 deletions(-) > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index 322ffcaedc38..516286f67f0d 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -1084,6 +1084,18 @@ const struct bpf_func_proto bpf_snprintf_proto = { > .arg5_type = ARG_CONST_SIZE_OR_ZERO, > }; > > +static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx) > +{ > + if (map->map_type == BPF_MAP_TYPE_ARRAY) { > + struct bpf_array *array = > + container_of(map, struct bpf_array, map); nit: Can we keep it on the same line? > + > + *arr_idx = ((char *)value - array->value) / array->elem_size; > + return arr_idx; > + } > + return (void *)value - round_up(map->key_size, 8); > +} > + > struct bpf_async_cb { > struct bpf_map *map; > struct bpf_prog *prog; > @@ -1166,15 +1178,8 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer) > * bpf_map_delete_elem() on the same timer. > */ > this_cpu_write(hrtimer_running, t); > - if (map->map_type == BPF_MAP_TYPE_ARRAY) { > - struct bpf_array *array = container_of(map, struct bpf_array, map); > > - /* compute the key */ > - idx = ((char *)value - array->value) / array->elem_size; > - key = &idx; > - } else { /* hash or lru */ > - key = value - round_up(map->key_size, 8); > - } > + key = map_key_from_value(map, value, &idx); > > callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0); > /* The verifier checked that return value is zero. */ > @@ -1200,15 +1205,7 @@ static void bpf_wq_work(struct work_struct *work) > if (!callback_fn) > return; > > - if (map->map_type == BPF_MAP_TYPE_ARRAY) { > - struct bpf_array *array = container_of(map, struct bpf_array, map); > - > - /* compute the key */ > - idx = ((char *)value - array->value) / array->elem_size; > - key = &idx; > - } else { /* hash or lru */ > - key = value - round_up(map->key_size, 8); > - } > + key = map_key_from_value(map, value, &idx); > > rcu_read_lock_trace(); > migrate_disable(); > -- > 2.50.1 > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs 2025-08-06 14:45 [PATCH bpf-next 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko 2025-08-06 14:45 ` [PATCH bpf-next 1/4] bpf: bpf task work plumbing Mykyta Yatsenko 2025-08-06 14:45 ` [PATCH bpf-next 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko @ 2025-08-06 14:45 ` Mykyta Yatsenko 2025-08-07 17:27 ` Alexei Starovoitov 2025-08-07 18:55 ` Kumar Kartikeya Dwivedi 2025-08-06 14:45 ` [PATCH bpf-next 4/4] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko 3 siblings, 2 replies; 15+ messages in thread From: Mykyta Yatsenko @ 2025-08-06 14:45 UTC (permalink / raw) To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87; +Cc: Mykyta Yatsenko From: Mykyta Yatsenko <yatsenko@meta.com> Implementation of the bpf_task_work_schedule kfuncs. Main components: * struct bpf_task_work_context – Metadata and state management per task work. * enum bpf_task_work_state – A state machine to serialize work scheduling and execution. * bpf_task_work_schedule() – The central helper that initiates scheduling. * bpf_task_work_callback() – Invoked when the actual task_work runs. * bpf_task_work_irq() – An intermediate step (runs in softirq context) to enqueue task work. * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries. Flow of task work scheduling 1) bpf_task_work_schedule_* is called from BPF code. 2) Transition state from STANDBY to PENDING. 3) irq_work_queue() schedules bpf_task_work_irq(). 4) Transition state from PENDING to SCHEDULING. 4) bpf_task_work_irq() attempts task_work_add(). If successful, state transitions to SCHEDULED. 5) Task work calls bpf_task_work_callback(), which transition state to RUNNING. 6) BPF callback is executed 7) Context is cleaned up, refcounts released, state set back to STANDBY. Map value deletion If map value that contains bpf_task_work_context is deleted, BPF map implementation calls bpf_task_work_cancel_and_free(). Deletion is handled by atomically setting state to FREED and releasing references or letting scheduler do that, depending on the last state before the deletion: * SCHEDULING: release references in bpf_task_work_cancel_and_free(), expect bpf_task_work_irq() to cancel task work. * SCHEDULED: release references and try to cancel task work in bpf_task_work_cancel_and_free(). * other states: one of bpf_task_work_irq(), bpf_task_work_schedule(), bpf_task_work_callback() should cleanup upon detecting the state switching to FREED. The state transitions are controlled with atomic_cmpxchg, ensuring: * Only one thread can successfully enqueue work. * Proper handling of concurrent deletes (BPF_TW_FREED). * Safe rollback if task_work_add() fails. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> --- kernel/bpf/helpers.c | 188 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 186 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 516286f67f0d..4c8b1c9be7aa 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -25,6 +25,8 @@ #include <linux/kasan.h> #include <linux/bpf_verifier.h> #include <linux/uaccess.h> +#include <linux/task_work.h> +#include <linux/irq_work.h> #include "../../lib/kstrtox.h" @@ -3702,6 +3704,160 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); +enum bpf_task_work_state { + /* bpf_task_work is ready to be used */ + BPF_TW_STANDBY = 0, + /* bpf_task_work is getting scheduled into irq_work */ + BPF_TW_PENDING, + /* bpf_task_work is in irq_work and getting scheduled into task_work */ + BPF_TW_SCHEDULING, + /* bpf_task_work is scheduled into task_work successfully */ + BPF_TW_SCHEDULED, + /* callback is running */ + BPF_TW_RUNNING, + /* BPF map value storing this bpf_task_work is deleted */ + BPF_TW_FREED, +}; + +struct bpf_task_work_context { + /* map that contains this structure in a value */ + struct bpf_map *map; + /* bpf_task_work_state value, representing the state */ + atomic_t state; + /* bpf_prog that schedules task work */ + struct bpf_prog *prog; + /* task for which callback is scheduled */ + struct task_struct *task; + /* notification mode for task work scheduling */ + enum task_work_notify_mode mode; + /* callback to call from task work */ + bpf_task_work_callback_t callback_fn; + struct callback_head work; + struct irq_work irq_work; +} __aligned(8); + +static bool task_work_match(struct callback_head *head, void *data) +{ + struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work); + + return ctx == data; +} + +static void bpf_reset_task_work_context(struct bpf_task_work_context *ctx) +{ + bpf_prog_put(ctx->prog); + bpf_task_release(ctx->task); + rcu_assign_pointer(ctx->map, NULL); +} + +static void bpf_task_work_callback(struct callback_head *cb) +{ + enum bpf_task_work_state state; + struct bpf_task_work_context *ctx; + struct bpf_map *map; + u32 idx; + void *key; + void *value; + + rcu_read_lock_trace(); + ctx = container_of(cb, struct bpf_task_work_context, work); + + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING); + if (state == BPF_TW_SCHEDULED) + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING); + if (state == BPF_TW_FREED) + goto out; + + map = rcu_dereference(ctx->map); + if (!map) + goto out; + + value = (void *)ctx - map->record->task_work_off; + key = (void *)map_key_from_value(map, value, &idx); + + migrate_disable(); + ctx->callback_fn(map, key, value); + migrate_enable(); + + /* State is running or freed, either way reset. */ + bpf_reset_task_work_context(ctx); + atomic_cmpxchg_release(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY); +out: + rcu_read_unlock_trace(); +} + +static void bpf_task_work_irq(struct irq_work *irq_work) +{ + struct bpf_task_work_context *ctx; + enum bpf_task_work_state state; + int err; + + ctx = container_of(irq_work, struct bpf_task_work_context, irq_work); + + rcu_read_lock_trace(); + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING); + if (state == BPF_TW_FREED) { + bpf_reset_task_work_context(ctx); + goto out; + } + + err = task_work_add(ctx->task, &ctx->work, ctx->mode); + if (err) { + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_PENDING); + if (state == BPF_TW_SCHEDULING) { + bpf_reset_task_work_context(ctx); + atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY); + } + goto out; + } + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED); + if (state == BPF_TW_FREED) + task_work_cancel_match(ctx->task, task_work_match, ctx); +out: + rcu_read_unlock_trace(); +} + +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work_context *ctx, + struct bpf_map *map, bpf_task_work_callback_t callback_fn, + struct bpf_prog_aux *aux, enum task_work_notify_mode mode) +{ + struct bpf_prog *prog; + + BTF_TYPE_EMIT(struct bpf_task_work); + + prog = bpf_prog_inc_not_zero(aux->prog); + if (IS_ERR(prog)) + return -EPERM; + + if (!atomic64_read(&map->usercnt)) { + bpf_prog_put(prog); + return -EPERM; + } + task = bpf_task_acquire(task); + if (!task) { + bpf_prog_put(prog); + return -EPERM; + } + + if (atomic_cmpxchg_acquire(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) { + bpf_task_release(task); + bpf_prog_put(prog); + return -EBUSY; + } + + ctx->task = task; + ctx->callback_fn = callback_fn; + ctx->prog = prog; + ctx->mode = mode; + init_irq_work(&ctx->irq_work, bpf_task_work_irq); + init_task_work(&ctx->work, bpf_task_work_callback); + rcu_assign_pointer(ctx->map, map); + + irq_work_queue(&ctx->irq_work); + + return 0; +} + /** * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode * @task: Task struct for which callback should be scheduled @@ -3718,7 +3874,8 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, bpf_task_work_callback_t callback, void *aux__prog) { - return 0; + return bpf_task_work_schedule(task, (struct bpf_task_work_context *)tw, map__map, + callback, aux__prog, TWA_SIGNAL); } /** @@ -3738,13 +3895,38 @@ __bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, bpf_task_work_callback_t callback, void *aux__prog) { - return 0; + enum task_work_notify_mode mode; + + mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME; + return bpf_task_work_schedule(task, (struct bpf_task_work_context *)tw, map__map, + callback, aux__prog, mode); } __bpf_kfunc_end_defs(); void bpf_task_work_cancel_and_free(void *val) { + struct bpf_task_work_context *ctx = val; + enum bpf_task_work_state state; + + state = atomic_xchg(&ctx->state, BPF_TW_FREED); + switch (state) { + case BPF_TW_SCHEDULED: + task_work_cancel_match(ctx->task, task_work_match, ctx); + fallthrough; + /* Scheduling codepath is trying to schedule task work, reset context here. */ + case BPF_TW_SCHEDULING: + bpf_reset_task_work_context(ctx); + break; + /* work is not initialized, mark as freed and exit */ + case BPF_TW_STANDBY: + /* The context is in interim state, scheduling logic should cleanup. */ + case BPF_TW_PENDING: + /* Callback is already running, it should reset context upon finishing. */ + case BPF_TW_RUNNING: + default: + break; + } } BTF_KFUNCS_START(generic_btf_ids) @@ -3770,6 +3952,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL) BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL) BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL) BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS) #ifdef CONFIG_CGROUPS BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL) -- 2.50.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs 2025-08-06 14:45 ` [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko @ 2025-08-07 17:27 ` Alexei Starovoitov 2025-08-08 0:30 ` Mykyta Yatsenko 2025-08-07 18:55 ` Kumar Kartikeya Dwivedi 1 sibling, 1 reply; 15+ messages in thread From: Alexei Starovoitov @ 2025-08-07 17:27 UTC (permalink / raw) To: Mykyta Yatsenko Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin Lau, Kernel Team, Eduard, Mykyta Yatsenko On Wed, Aug 6, 2025 at 7:46 AM Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > + > +struct bpf_task_work_context { > + /* map that contains this structure in a value */ > + struct bpf_map *map; > + /* bpf_task_work_state value, representing the state */ > + atomic_t state; a hole > + /* bpf_prog that schedules task work */ > + struct bpf_prog *prog; > + /* task for which callback is scheduled */ > + struct task_struct *task; > + /* notification mode for task work scheduling */ > + enum task_work_notify_mode mode; another hole > + /* callback to call from task work */ > + bpf_task_work_callback_t callback_fn; > + struct callback_head work; > + struct irq_work irq_work; > +} __aligned(8); and +struct bpf_task_work { + __u64 __opaque[16]; +} __attribute__((aligned(8))); This is way too fragile. A bunch of data structures in above are not in our control and might be changed without any one mentioning anything on the bpf mailing list, and things will break. If all of the fields were plain pointers we could consider placing bpf_task_work inline in the map value, but with inlined irq_work is imo no go. Indirection is the only option here. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs 2025-08-07 17:27 ` Alexei Starovoitov @ 2025-08-08 0:30 ` Mykyta Yatsenko 0 siblings, 0 replies; 15+ messages in thread From: Mykyta Yatsenko @ 2025-08-08 0:30 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin Lau, Kernel Team, Eduard, Mykyta Yatsenko On 8/7/25 18:27, Alexei Starovoitov wrote: > On Wed, Aug 6, 2025 at 7:46 AM Mykyta Yatsenko > <mykyta.yatsenko5@gmail.com> wrote: >> + >> +struct bpf_task_work_context { >> + /* map that contains this structure in a value */ >> + struct bpf_map *map; >> + /* bpf_task_work_state value, representing the state */ >> + atomic_t state; > a hole > >> + /* bpf_prog that schedules task work */ >> + struct bpf_prog *prog; >> + /* task for which callback is scheduled */ >> + struct task_struct *task; >> + /* notification mode for task work scheduling */ >> + enum task_work_notify_mode mode; > another hole > >> + /* callback to call from task work */ >> + bpf_task_work_callback_t callback_fn; >> + struct callback_head work; >> + struct irq_work irq_work; >> +} __aligned(8); > and > +struct bpf_task_work { > + __u64 __opaque[16]; > +} __attribute__((aligned(8))); > > This is way too fragile. > A bunch of data structures in above are not in our control > and might be changed without any one mentioning anything > on the bpf mailing list, and things will break. > If all of the fields were plain pointers we could consider > placing bpf_task_work inline in the map value, > but with inlined irq_work is imo no go. > Indirection is the only option here. Thanks for taking a look, makes sense we can't embed irq_work and callback_head into the map value. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs 2025-08-06 14:45 ` [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko 2025-08-07 17:27 ` Alexei Starovoitov @ 2025-08-07 18:55 ` Kumar Kartikeya Dwivedi 2025-08-08 0:44 ` Mykyta Yatsenko 1 sibling, 1 reply; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2025-08-07 18:55 UTC (permalink / raw) To: Mykyta Yatsenko Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Mykyta Yatsenko On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > > From: Mykyta Yatsenko <yatsenko@meta.com> > > Implementation of the bpf_task_work_schedule kfuncs. > > Main components: > * struct bpf_task_work_context – Metadata and state management per task > work. > * enum bpf_task_work_state – A state machine to serialize work > scheduling and execution. > * bpf_task_work_schedule() – The central helper that initiates > scheduling. > * bpf_task_work_callback() – Invoked when the actual task_work runs. > * bpf_task_work_irq() – An intermediate step (runs in softirq context) > to enqueue task work. > * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries. > > Flow of task work scheduling > 1) bpf_task_work_schedule_* is called from BPF code. > 2) Transition state from STANDBY to PENDING. > 3) irq_work_queue() schedules bpf_task_work_irq(). > 4) Transition state from PENDING to SCHEDULING. > 4) bpf_task_work_irq() attempts task_work_add(). If successful, state > transitions to SCHEDULED. > 5) Task work calls bpf_task_work_callback(), which transition state to > RUNNING. > 6) BPF callback is executed > 7) Context is cleaned up, refcounts released, state set back to > STANDBY. > > Map value deletion > If map value that contains bpf_task_work_context is deleted, BPF map > implementation calls bpf_task_work_cancel_and_free(). > Deletion is handled by atomically setting state to FREED and > releasing references or letting scheduler do that, depending on the > last state before the deletion: > * SCHEDULING: release references in bpf_task_work_cancel_and_free(), > expect bpf_task_work_irq() to cancel task work. > * SCHEDULED: release references and try to cancel task work in > bpf_task_work_cancel_and_free(). > * other states: one of bpf_task_work_irq(), bpf_task_work_schedule(), > bpf_task_work_callback() should cleanup upon detecting the state > switching to FREED. > > The state transitions are controlled with atomic_cmpxchg, ensuring: > * Only one thread can successfully enqueue work. > * Proper handling of concurrent deletes (BPF_TW_FREED). > * Safe rollback if task_work_add() fails. > In general I am not sure why we need so many acquire/release pairs. Why not use test_and_set_bit etc.? Or simply cmpxchg? What ordering of stores are we depending on that merits acquire/release ordering? We should probably document explicitly. > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> > --- > kernel/bpf/helpers.c | 188 ++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 186 insertions(+), 2 deletions(-) > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index 516286f67f0d..4c8b1c9be7aa 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -25,6 +25,8 @@ > #include <linux/kasan.h> > #include <linux/bpf_verifier.h> > #include <linux/uaccess.h> > +#include <linux/task_work.h> > +#include <linux/irq_work.h> > > #include "../../lib/kstrtox.h" > > @@ -3702,6 +3704,160 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) > > typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); > > +enum bpf_task_work_state { > + /* bpf_task_work is ready to be used */ > + BPF_TW_STANDBY = 0, > + /* bpf_task_work is getting scheduled into irq_work */ > + BPF_TW_PENDING, > + /* bpf_task_work is in irq_work and getting scheduled into task_work */ > + BPF_TW_SCHEDULING, > + /* bpf_task_work is scheduled into task_work successfully */ > + BPF_TW_SCHEDULED, > + /* callback is running */ > + BPF_TW_RUNNING, > + /* BPF map value storing this bpf_task_work is deleted */ > + BPF_TW_FREED, > +}; > + > +struct bpf_task_work_context { > + /* map that contains this structure in a value */ > + struct bpf_map *map; > + /* bpf_task_work_state value, representing the state */ > + atomic_t state; > + /* bpf_prog that schedules task work */ > + struct bpf_prog *prog; > + /* task for which callback is scheduled */ > + struct task_struct *task; > + /* notification mode for task work scheduling */ > + enum task_work_notify_mode mode; > + /* callback to call from task work */ > + bpf_task_work_callback_t callback_fn; > + struct callback_head work; > + struct irq_work irq_work; > +} __aligned(8); I will echo Alexei's comments about the layout. We cannot inline all this in map value. Allocation using an init function or in some control function is probably the only way. > + > +static bool task_work_match(struct callback_head *head, void *data) > +{ > + struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work); > + > + return ctx == data; > +} > + > +static void bpf_reset_task_work_context(struct bpf_task_work_context *ctx) > +{ > + bpf_prog_put(ctx->prog); > + bpf_task_release(ctx->task); > + rcu_assign_pointer(ctx->map, NULL); > +} > + > +static void bpf_task_work_callback(struct callback_head *cb) > +{ > + enum bpf_task_work_state state; > + struct bpf_task_work_context *ctx; > + struct bpf_map *map; > + u32 idx; > + void *key; > + void *value; > + > + rcu_read_lock_trace(); > + ctx = container_of(cb, struct bpf_task_work_context, work); > + > + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING); > + if (state == BPF_TW_SCHEDULED) > + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING); > + if (state == BPF_TW_FREED) > + goto out; I am leaving out commenting on this, since I expect it to change per later comments. > + > + map = rcu_dereference(ctx->map); > + if (!map) > + goto out; > + > + value = (void *)ctx - map->record->task_work_off; > + key = (void *)map_key_from_value(map, value, &idx); > + > + migrate_disable(); > + ctx->callback_fn(map, key, value); > + migrate_enable(); > + > + /* State is running or freed, either way reset. */ > + bpf_reset_task_work_context(ctx); > + atomic_cmpxchg_release(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY); > +out: > + rcu_read_unlock_trace(); > +} > + > +static void bpf_task_work_irq(struct irq_work *irq_work) > +{ > + struct bpf_task_work_context *ctx; > + enum bpf_task_work_state state; > + int err; > + > + ctx = container_of(irq_work, struct bpf_task_work_context, irq_work); > + > + rcu_read_lock_trace(); What's the idea behind rcu_read_lock_trace? Let's add a comment. > + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING); > + if (state == BPF_TW_FREED) { > + bpf_reset_task_work_context(ctx); > + goto out; > + } > + > + err = task_work_add(ctx->task, &ctx->work, ctx->mode); Racy, SCHEDULING->FREE state claim from cancel_and_free will release ctx->task. > + if (err) { > + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_PENDING); Races here look fine, since we don't act on FREED (for this block atleast), cancel_and_free doesn't act on seeing PENDING, so there is interlocking. > + if (state == BPF_TW_SCHEDULING) { > + bpf_reset_task_work_context(ctx); > + atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY); > + } > + goto out; > + } > + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED); > + if (state == BPF_TW_FREED) > + task_work_cancel_match(ctx->task, task_work_match, ctx); It looks like there is a similar race condition here. If BPF_TW_SCHEDULING is set, cancel_and_free may invoke and attempt bpf_task_release() from bpf_reset_task_work_context(). Meanwhile, we will access ctx->task here directly after seeing BPF_TW_FREED. > +out: > + rcu_read_unlock_trace(); > +} > + > +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work_context *ctx, > + struct bpf_map *map, bpf_task_work_callback_t callback_fn, > + struct bpf_prog_aux *aux, enum task_work_notify_mode mode) > +{ > + struct bpf_prog *prog; > + > + BTF_TYPE_EMIT(struct bpf_task_work); > + > + prog = bpf_prog_inc_not_zero(aux->prog); > + if (IS_ERR(prog)) > + return -EPERM; > + > + if (!atomic64_read(&map->usercnt)) { > + bpf_prog_put(prog); > + return -EPERM; > + } > + task = bpf_task_acquire(task); > + if (!task) { > + bpf_prog_put(prog); > + return -EPERM; > + } > + > + if (atomic_cmpxchg_acquire(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) { If we are reusing map values, wouldn't a freed state stay perpetually freed? I.e. after the first delete of array elements etc. it becomes useless. Every array map update would invoke a cancel_and_free. Who resets it? > + bpf_task_release(task); > + bpf_prog_put(prog); > + return -EBUSY; > + } > + > + ctx->task = task; > + ctx->callback_fn = callback_fn; > + ctx->prog = prog; > + ctx->mode = mode; > + init_irq_work(&ctx->irq_work, bpf_task_work_irq); > + init_task_work(&ctx->work, bpf_task_work_callback); > + rcu_assign_pointer(ctx->map, map); > + > + irq_work_queue(&ctx->irq_work); > + > + return 0; > +} > + > /** > * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode > * @task: Task struct for which callback should be scheduled > @@ -3718,7 +3874,8 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, > bpf_task_work_callback_t callback, > void *aux__prog) > { > - return 0; > + return bpf_task_work_schedule(task, (struct bpf_task_work_context *)tw, map__map, > + callback, aux__prog, TWA_SIGNAL); > } > > /** > @@ -3738,13 +3895,38 @@ __bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, > bpf_task_work_callback_t callback, > void *aux__prog) > { > - return 0; > + enum task_work_notify_mode mode; > + > + mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME; > + return bpf_task_work_schedule(task, (struct bpf_task_work_context *)tw, map__map, > + callback, aux__prog, mode); > } > > __bpf_kfunc_end_defs(); > > void bpf_task_work_cancel_and_free(void *val) > { > + struct bpf_task_work_context *ctx = val; > + enum bpf_task_work_state state; > + > + state = atomic_xchg(&ctx->state, BPF_TW_FREED); > + switch (state) { > + case BPF_TW_SCHEDULED: > + task_work_cancel_match(ctx->task, task_work_match, ctx); > + fallthrough; > + /* Scheduling codepath is trying to schedule task work, reset context here. */ > + case BPF_TW_SCHEDULING: > + bpf_reset_task_work_context(ctx); > + break; > + /* work is not initialized, mark as freed and exit */ > + case BPF_TW_STANDBY: > + /* The context is in interim state, scheduling logic should cleanup. */ > + case BPF_TW_PENDING: > + /* Callback is already running, it should reset context upon finishing. */ > + case BPF_TW_RUNNING: > + default: > + break; > + } > } > > BTF_KFUNCS_START(generic_btf_ids) > @@ -3770,6 +3952,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL) > BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL) > BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL) > BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL) > +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS) > +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS) > > #ifdef CONFIG_CGROUPS > BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL) > -- > 2.50.1 > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs 2025-08-07 18:55 ` Kumar Kartikeya Dwivedi @ 2025-08-08 0:44 ` Mykyta Yatsenko 2025-08-09 3:04 ` Kumar Kartikeya Dwivedi 0 siblings, 1 reply; 15+ messages in thread From: Mykyta Yatsenko @ 2025-08-08 0:44 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Mykyta Yatsenko On 8/7/25 19:55, Kumar Kartikeya Dwivedi wrote: > On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: >> From: Mykyta Yatsenko <yatsenko@meta.com> >> >> Implementation of the bpf_task_work_schedule kfuncs. >> >> Main components: >> * struct bpf_task_work_context – Metadata and state management per task >> work. >> * enum bpf_task_work_state – A state machine to serialize work >> scheduling and execution. >> * bpf_task_work_schedule() – The central helper that initiates >> scheduling. >> * bpf_task_work_callback() – Invoked when the actual task_work runs. >> * bpf_task_work_irq() – An intermediate step (runs in softirq context) >> to enqueue task work. >> * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries. >> >> Flow of task work scheduling >> 1) bpf_task_work_schedule_* is called from BPF code. >> 2) Transition state from STANDBY to PENDING. >> 3) irq_work_queue() schedules bpf_task_work_irq(). >> 4) Transition state from PENDING to SCHEDULING. >> 4) bpf_task_work_irq() attempts task_work_add(). If successful, state >> transitions to SCHEDULED. >> 5) Task work calls bpf_task_work_callback(), which transition state to >> RUNNING. >> 6) BPF callback is executed >> 7) Context is cleaned up, refcounts released, state set back to >> STANDBY. >> >> Map value deletion >> If map value that contains bpf_task_work_context is deleted, BPF map >> implementation calls bpf_task_work_cancel_and_free(). >> Deletion is handled by atomically setting state to FREED and >> releasing references or letting scheduler do that, depending on the >> last state before the deletion: >> * SCHEDULING: release references in bpf_task_work_cancel_and_free(), >> expect bpf_task_work_irq() to cancel task work. >> * SCHEDULED: release references and try to cancel task work in >> bpf_task_work_cancel_and_free(). >> * other states: one of bpf_task_work_irq(), bpf_task_work_schedule(), >> bpf_task_work_callback() should cleanup upon detecting the state >> switching to FREED. >> >> The state transitions are controlled with atomic_cmpxchg, ensuring: >> * Only one thread can successfully enqueue work. >> * Proper handling of concurrent deletes (BPF_TW_FREED). >> * Safe rollback if task_work_add() fails. >> > In general I am not sure why we need so many acquire/release pairs. > Why not use test_and_set_bit etc.? Or simply cmpxchg? > What ordering of stores are we depending on that merits > acquire/release ordering? > We should probably document explicitly. > >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> >> --- >> kernel/bpf/helpers.c | 188 ++++++++++++++++++++++++++++++++++++++++++- >> 1 file changed, 186 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >> index 516286f67f0d..4c8b1c9be7aa 100644 >> --- a/kernel/bpf/helpers.c >> +++ b/kernel/bpf/helpers.c >> @@ -25,6 +25,8 @@ >> #include <linux/kasan.h> >> #include <linux/bpf_verifier.h> >> #include <linux/uaccess.h> >> +#include <linux/task_work.h> >> +#include <linux/irq_work.h> >> >> #include "../../lib/kstrtox.h" >> >> @@ -3702,6 +3704,160 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) >> >> typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); >> >> +enum bpf_task_work_state { >> + /* bpf_task_work is ready to be used */ >> + BPF_TW_STANDBY = 0, >> + /* bpf_task_work is getting scheduled into irq_work */ >> + BPF_TW_PENDING, >> + /* bpf_task_work is in irq_work and getting scheduled into task_work */ >> + BPF_TW_SCHEDULING, >> + /* bpf_task_work is scheduled into task_work successfully */ >> + BPF_TW_SCHEDULED, >> + /* callback is running */ >> + BPF_TW_RUNNING, >> + /* BPF map value storing this bpf_task_work is deleted */ >> + BPF_TW_FREED, >> +}; >> + >> +struct bpf_task_work_context { >> + /* map that contains this structure in a value */ >> + struct bpf_map *map; >> + /* bpf_task_work_state value, representing the state */ >> + atomic_t state; >> + /* bpf_prog that schedules task work */ >> + struct bpf_prog *prog; >> + /* task for which callback is scheduled */ >> + struct task_struct *task; >> + /* notification mode for task work scheduling */ >> + enum task_work_notify_mode mode; >> + /* callback to call from task work */ >> + bpf_task_work_callback_t callback_fn; >> + struct callback_head work; >> + struct irq_work irq_work; >> +} __aligned(8); > I will echo Alexei's comments about the layout. We cannot inline all > this in map value. > Allocation using an init function or in some control function is > probably the only way. > >> + >> +static bool task_work_match(struct callback_head *head, void *data) >> +{ >> + struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work); >> + >> + return ctx == data; >> +} >> + >> +static void bpf_reset_task_work_context(struct bpf_task_work_context *ctx) >> +{ >> + bpf_prog_put(ctx->prog); >> + bpf_task_release(ctx->task); >> + rcu_assign_pointer(ctx->map, NULL); >> +} >> + >> +static void bpf_task_work_callback(struct callback_head *cb) >> +{ >> + enum bpf_task_work_state state; >> + struct bpf_task_work_context *ctx; >> + struct bpf_map *map; >> + u32 idx; >> + void *key; >> + void *value; >> + >> + rcu_read_lock_trace(); >> + ctx = container_of(cb, struct bpf_task_work_context, work); >> + >> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING); >> + if (state == BPF_TW_SCHEDULED) >> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING); >> + if (state == BPF_TW_FREED) >> + goto out; > I am leaving out commenting on this, since I expect it to change per > later comments. > >> + >> + map = rcu_dereference(ctx->map); >> + if (!map) >> + goto out; >> + >> + value = (void *)ctx - map->record->task_work_off; >> + key = (void *)map_key_from_value(map, value, &idx); >> + >> + migrate_disable(); >> + ctx->callback_fn(map, key, value); >> + migrate_enable(); >> + >> + /* State is running or freed, either way reset. */ >> + bpf_reset_task_work_context(ctx); >> + atomic_cmpxchg_release(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY); >> +out: >> + rcu_read_unlock_trace(); >> +} >> + >> +static void bpf_task_work_irq(struct irq_work *irq_work) >> +{ >> + struct bpf_task_work_context *ctx; >> + enum bpf_task_work_state state; >> + int err; >> + >> + ctx = container_of(irq_work, struct bpf_task_work_context, irq_work); >> + >> + rcu_read_lock_trace(); > What's the idea behind rcu_read_lock_trace? Let's add a comment. > >> + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING); >> + if (state == BPF_TW_FREED) { >> + bpf_reset_task_work_context(ctx); >> + goto out; >> + } >> + >> + err = task_work_add(ctx->task, &ctx->work, ctx->mode); > Racy, SCHEDULING->FREE state claim from cancel_and_free will release ctx->task. Thanks for pointing this out, I missed that case. > >> + if (err) { >> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_PENDING); > Races here look fine, since we don't act on FREED (for this block > atleast), cancel_and_free doesn't act on seeing PENDING, > so there is interlocking. > >> + if (state == BPF_TW_SCHEDULING) { >> + bpf_reset_task_work_context(ctx); >> + atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY); >> + } >> + goto out; >> + } >> + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED); >> + if (state == BPF_TW_FREED) >> + task_work_cancel_match(ctx->task, task_work_match, ctx); > It looks like there is a similar race condition here. > If BPF_TW_SCHEDULING is set, cancel_and_free may invoke and attempt > bpf_task_release() from bpf_reset_task_work_context(). > Meanwhile, we will access ctx->task here directly after seeing BPF_TW_FREED. Yeah, we should release task_work in this function in case SCHEDULING gets transitioned into FREED. > >> +out: >> + rcu_read_unlock_trace(); >> +} >> + >> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work_context *ctx, >> + struct bpf_map *map, bpf_task_work_callback_t callback_fn, >> + struct bpf_prog_aux *aux, enum task_work_notify_mode mode) >> +{ >> + struct bpf_prog *prog; >> + >> + BTF_TYPE_EMIT(struct bpf_task_work); >> + >> + prog = bpf_prog_inc_not_zero(aux->prog); >> + if (IS_ERR(prog)) >> + return -EPERM; >> + >> + if (!atomic64_read(&map->usercnt)) { >> + bpf_prog_put(prog); >> + return -EPERM; >> + } >> + task = bpf_task_acquire(task); >> + if (!task) { >> + bpf_prog_put(prog); >> + return -EPERM; >> + } >> + >> + if (atomic_cmpxchg_acquire(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) { > If we are reusing map values, wouldn't a freed state stay perpetually freed? > I.e. after the first delete of array elements etc. it becomes useless. > Every array map update would invoke a cancel_and_free. > Who resets it? I'm not sure I understand the question, the idea is that if element is deleted from map, we transition state to FREED and make sure refcounts of the task and prog are released. An element is returned into STANDBY state after task_work is completed or failed, so it can be reused. Could you please elaborate on the scenario you have in mind? > > > >> + bpf_task_release(task); >> + bpf_prog_put(prog); >> + return -EBUSY; >> + } >> + >> + ctx->task = task; >> + ctx->callback_fn = callback_fn; >> + ctx->prog = prog; >> + ctx->mode = mode; >> + init_irq_work(&ctx->irq_work, bpf_task_work_irq); >> + init_task_work(&ctx->work, bpf_task_work_callback); >> + rcu_assign_pointer(ctx->map, map); >> + >> + irq_work_queue(&ctx->irq_work); >> + >> + return 0; >> +} >> + >> /** >> * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode >> * @task: Task struct for which callback should be scheduled >> @@ -3718,7 +3874,8 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, >> bpf_task_work_callback_t callback, >> void *aux__prog) >> { >> - return 0; >> + return bpf_task_work_schedule(task, (struct bpf_task_work_context *)tw, map__map, >> + callback, aux__prog, TWA_SIGNAL); >> } >> >> /** >> @@ -3738,13 +3895,38 @@ __bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, >> bpf_task_work_callback_t callback, >> void *aux__prog) >> { >> - return 0; >> + enum task_work_notify_mode mode; >> + >> + mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME; >> + return bpf_task_work_schedule(task, (struct bpf_task_work_context *)tw, map__map, >> + callback, aux__prog, mode); >> } >> >> __bpf_kfunc_end_defs(); >> >> void bpf_task_work_cancel_and_free(void *val) >> { >> + struct bpf_task_work_context *ctx = val; >> + enum bpf_task_work_state state; >> + >> + state = atomic_xchg(&ctx->state, BPF_TW_FREED); >> + switch (state) { >> + case BPF_TW_SCHEDULED: >> + task_work_cancel_match(ctx->task, task_work_match, ctx); >> + fallthrough; >> + /* Scheduling codepath is trying to schedule task work, reset context here. */ >> + case BPF_TW_SCHEDULING: >> + bpf_reset_task_work_context(ctx); >> + break; >> + /* work is not initialized, mark as freed and exit */ >> + case BPF_TW_STANDBY: >> + /* The context is in interim state, scheduling logic should cleanup. */ >> + case BPF_TW_PENDING: >> + /* Callback is already running, it should reset context upon finishing. */ >> + case BPF_TW_RUNNING: >> + default: >> + break; >> + } >> } >> >> BTF_KFUNCS_START(generic_btf_ids) >> @@ -3770,6 +3952,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL) >> BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL) >> BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL) >> BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL) >> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS) >> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS) >> >> #ifdef CONFIG_CGROUPS >> BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL) >> -- >> 2.50.1 >> >> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs 2025-08-08 0:44 ` Mykyta Yatsenko @ 2025-08-09 3:04 ` Kumar Kartikeya Dwivedi 2025-08-11 20:13 ` Mykyta Yatsenko 0 siblings, 1 reply; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2025-08-09 3:04 UTC (permalink / raw) To: Mykyta Yatsenko Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Mykyta Yatsenko On Fri, 8 Aug 2025 at 02:44, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > > On 8/7/25 19:55, Kumar Kartikeya Dwivedi wrote: > > On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > >> From: Mykyta Yatsenko <yatsenko@meta.com> > >> > >> Implementation of the bpf_task_work_schedule kfuncs. > >> > >> Main components: > >> * struct bpf_task_work_context – Metadata and state management per task > >> work. > >> * enum bpf_task_work_state – A state machine to serialize work > >> scheduling and execution. > >> * bpf_task_work_schedule() – The central helper that initiates > >> scheduling. > >> * bpf_task_work_callback() – Invoked when the actual task_work runs. > >> * bpf_task_work_irq() – An intermediate step (runs in softirq context) > >> to enqueue task work. > >> * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries. > >> > >> Flow of task work scheduling > >> 1) bpf_task_work_schedule_* is called from BPF code. > >> 2) Transition state from STANDBY to PENDING. > >> 3) irq_work_queue() schedules bpf_task_work_irq(). > >> 4) Transition state from PENDING to SCHEDULING. > >> 4) bpf_task_work_irq() attempts task_work_add(). If successful, state > >> transitions to SCHEDULED. > >> 5) Task work calls bpf_task_work_callback(), which transition state to > >> RUNNING. > >> 6) BPF callback is executed > >> 7) Context is cleaned up, refcounts released, state set back to > >> STANDBY. > >> > >> Map value deletion > >> If map value that contains bpf_task_work_context is deleted, BPF map > >> implementation calls bpf_task_work_cancel_and_free(). > >> Deletion is handled by atomically setting state to FREED and > >> releasing references or letting scheduler do that, depending on the > >> last state before the deletion: > >> * SCHEDULING: release references in bpf_task_work_cancel_and_free(), > >> expect bpf_task_work_irq() to cancel task work. > >> * SCHEDULED: release references and try to cancel task work in > >> bpf_task_work_cancel_and_free(). > >> * other states: one of bpf_task_work_irq(), bpf_task_work_schedule(), > >> bpf_task_work_callback() should cleanup upon detecting the state > >> switching to FREED. > >> > >> The state transitions are controlled with atomic_cmpxchg, ensuring: > >> * Only one thread can successfully enqueue work. > >> * Proper handling of concurrent deletes (BPF_TW_FREED). > >> * Safe rollback if task_work_add() fails. > >> > > In general I am not sure why we need so many acquire/release pairs. > > Why not use test_and_set_bit etc.? Or simply cmpxchg? > > What ordering of stores are we depending on that merits > > acquire/release ordering? > > We should probably document explicitly. > > > >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> > >> --- > >> kernel/bpf/helpers.c | 188 ++++++++++++++++++++++++++++++++++++++++++- > >> 1 file changed, 186 insertions(+), 2 deletions(-) > >> > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > >> index 516286f67f0d..4c8b1c9be7aa 100644 > >> --- a/kernel/bpf/helpers.c > >> +++ b/kernel/bpf/helpers.c > >> @@ -25,6 +25,8 @@ > >> #include <linux/kasan.h> > >> #include <linux/bpf_verifier.h> > >> #include <linux/uaccess.h> > >> +#include <linux/task_work.h> > >> +#include <linux/irq_work.h> > >> > >> #include "../../lib/kstrtox.h" > >> > >> @@ -3702,6 +3704,160 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) > >> > >> typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); > >> > >> +enum bpf_task_work_state { > >> + /* bpf_task_work is ready to be used */ > >> + BPF_TW_STANDBY = 0, > >> + /* bpf_task_work is getting scheduled into irq_work */ > >> + BPF_TW_PENDING, > >> + /* bpf_task_work is in irq_work and getting scheduled into task_work */ > >> + BPF_TW_SCHEDULING, > >> + /* bpf_task_work is scheduled into task_work successfully */ > >> + BPF_TW_SCHEDULED, > >> + /* callback is running */ > >> + BPF_TW_RUNNING, > >> + /* BPF map value storing this bpf_task_work is deleted */ > >> + BPF_TW_FREED, > >> +}; > >> + > >> +struct bpf_task_work_context { > >> + /* map that contains this structure in a value */ > >> + struct bpf_map *map; > >> + /* bpf_task_work_state value, representing the state */ > >> + atomic_t state; > >> + /* bpf_prog that schedules task work */ > >> + struct bpf_prog *prog; > >> + /* task for which callback is scheduled */ > >> + struct task_struct *task; > >> + /* notification mode for task work scheduling */ > >> + enum task_work_notify_mode mode; > >> + /* callback to call from task work */ > >> + bpf_task_work_callback_t callback_fn; > >> + struct callback_head work; > >> + struct irq_work irq_work; > >> +} __aligned(8); > > I will echo Alexei's comments about the layout. We cannot inline all > > this in map value. > > Allocation using an init function or in some control function is > > probably the only way. > > > >> + > >> +static bool task_work_match(struct callback_head *head, void *data) > >> +{ > >> + struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work); > >> + > >> + return ctx == data; > >> +} > >> + > >> +static void bpf_reset_task_work_context(struct bpf_task_work_context *ctx) > >> +{ > >> + bpf_prog_put(ctx->prog); > >> + bpf_task_release(ctx->task); > >> + rcu_assign_pointer(ctx->map, NULL); > >> +} > >> + > >> +static void bpf_task_work_callback(struct callback_head *cb) > >> +{ > >> + enum bpf_task_work_state state; > >> + struct bpf_task_work_context *ctx; > >> + struct bpf_map *map; > >> + u32 idx; > >> + void *key; > >> + void *value; > >> + > >> + rcu_read_lock_trace(); > >> + ctx = container_of(cb, struct bpf_task_work_context, work); > >> + > >> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING); > >> + if (state == BPF_TW_SCHEDULED) > >> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING); > >> + if (state == BPF_TW_FREED) > >> + goto out; > > I am leaving out commenting on this, since I expect it to change per > > later comments. > > > >> + > >> + map = rcu_dereference(ctx->map); > >> + if (!map) > >> + goto out; > >> + > >> + value = (void *)ctx - map->record->task_work_off; > >> + key = (void *)map_key_from_value(map, value, &idx); > >> + > >> + migrate_disable(); > >> + ctx->callback_fn(map, key, value); > >> + migrate_enable(); > >> + > >> + /* State is running or freed, either way reset. */ > >> + bpf_reset_task_work_context(ctx); > >> + atomic_cmpxchg_release(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY); > >> +out: > >> + rcu_read_unlock_trace(); > >> +} > >> + > >> +static void bpf_task_work_irq(struct irq_work *irq_work) > >> +{ > >> + struct bpf_task_work_context *ctx; > >> + enum bpf_task_work_state state; > >> + int err; > >> + > >> + ctx = container_of(irq_work, struct bpf_task_work_context, irq_work); > >> + > >> + rcu_read_lock_trace(); > > What's the idea behind rcu_read_lock_trace? Let's add a comment. > > > >> + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING); > >> + if (state == BPF_TW_FREED) { > >> + bpf_reset_task_work_context(ctx); > >> + goto out; > >> + } > >> + > >> + err = task_work_add(ctx->task, &ctx->work, ctx->mode); > > Racy, SCHEDULING->FREE state claim from cancel_and_free will release ctx->task. > Thanks for pointing this out, I missed that case. > > > >> + if (err) { > >> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_PENDING); > > Races here look fine, since we don't act on FREED (for this block > > atleast), cancel_and_free doesn't act on seeing PENDING, > > so there is interlocking. > > > >> + if (state == BPF_TW_SCHEDULING) { > >> + bpf_reset_task_work_context(ctx); > >> + atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY); > >> + } > >> + goto out; > >> + } > >> + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED); > >> + if (state == BPF_TW_FREED) > >> + task_work_cancel_match(ctx->task, task_work_match, ctx); > > It looks like there is a similar race condition here. > > If BPF_TW_SCHEDULING is set, cancel_and_free may invoke and attempt > > bpf_task_release() from bpf_reset_task_work_context(). > > Meanwhile, we will access ctx->task here directly after seeing BPF_TW_FREED. > Yeah, we should release task_work in this function in case SCHEDULING > gets transitioned into FREED. > > > >> +out: > >> + rcu_read_unlock_trace(); > >> +} > >> + > >> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work_context *ctx, > >> + struct bpf_map *map, bpf_task_work_callback_t callback_fn, > >> + struct bpf_prog_aux *aux, enum task_work_notify_mode mode) > >> +{ > >> + struct bpf_prog *prog; > >> + > >> + BTF_TYPE_EMIT(struct bpf_task_work); > >> + > >> + prog = bpf_prog_inc_not_zero(aux->prog); > >> + if (IS_ERR(prog)) > >> + return -EPERM; > >> + > >> + if (!atomic64_read(&map->usercnt)) { > >> + bpf_prog_put(prog); > >> + return -EPERM; > >> + } > >> + task = bpf_task_acquire(task); > >> + if (!task) { > >> + bpf_prog_put(prog); > >> + return -EPERM; > >> + } > >> + > >> + if (atomic_cmpxchg_acquire(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) { > > If we are reusing map values, wouldn't a freed state stay perpetually freed? > > I.e. after the first delete of array elements etc. it becomes useless. > > Every array map update would invoke a cancel_and_free. > > Who resets it? > I'm not sure I understand the question, the idea is that if element is > deleted from map, we > transition state to FREED and make sure refcounts of the task and prog > are released. > > An element is returned into STANDBY state after task_work is completed > or failed, so it can be reused. > Could you please elaborate on the scenario you have in mind? I guess I am confused about where we will go from FREED to STANDBY, if we set it to BPF_TW_FREED in cancel_and_free. When you update an array map element, we always do bpf_obj_free_fields. Typically, this operation leaves the field in a reusable state. I don't see a FREED->STANDBY transition (after going from [SCHEDULED|SCHEDULING]->FREED, only RUNNING->STANDBY in the callback. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs 2025-08-09 3:04 ` Kumar Kartikeya Dwivedi @ 2025-08-11 20:13 ` Mykyta Yatsenko 2025-08-11 20:17 ` Kumar Kartikeya Dwivedi 0 siblings, 1 reply; 15+ messages in thread From: Mykyta Yatsenko @ 2025-08-11 20:13 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Mykyta Yatsenko On 8/9/25 04:04, Kumar Kartikeya Dwivedi wrote: > On Fri, 8 Aug 2025 at 02:44, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: >> On 8/7/25 19:55, Kumar Kartikeya Dwivedi wrote: >>> On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: >>>> From: Mykyta Yatsenko <yatsenko@meta.com> >>>> >>>> Implementation of the bpf_task_work_schedule kfuncs. >>>> >>>> Main components: >>>> * struct bpf_task_work_context – Metadata and state management per task >>>> work. >>>> * enum bpf_task_work_state – A state machine to serialize work >>>> scheduling and execution. >>>> * bpf_task_work_schedule() – The central helper that initiates >>>> scheduling. >>>> * bpf_task_work_callback() – Invoked when the actual task_work runs. >>>> * bpf_task_work_irq() – An intermediate step (runs in softirq context) >>>> to enqueue task work. >>>> * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries. >>>> >>>> Flow of task work scheduling >>>> 1) bpf_task_work_schedule_* is called from BPF code. >>>> 2) Transition state from STANDBY to PENDING. >>>> 3) irq_work_queue() schedules bpf_task_work_irq(). >>>> 4) Transition state from PENDING to SCHEDULING. >>>> 4) bpf_task_work_irq() attempts task_work_add(). If successful, state >>>> transitions to SCHEDULED. >>>> 5) Task work calls bpf_task_work_callback(), which transition state to >>>> RUNNING. >>>> 6) BPF callback is executed >>>> 7) Context is cleaned up, refcounts released, state set back to >>>> STANDBY. >>>> >>>> Map value deletion >>>> If map value that contains bpf_task_work_context is deleted, BPF map >>>> implementation calls bpf_task_work_cancel_and_free(). >>>> Deletion is handled by atomically setting state to FREED and >>>> releasing references or letting scheduler do that, depending on the >>>> last state before the deletion: >>>> * SCHEDULING: release references in bpf_task_work_cancel_and_free(), >>>> expect bpf_task_work_irq() to cancel task work. >>>> * SCHEDULED: release references and try to cancel task work in >>>> bpf_task_work_cancel_and_free(). >>>> * other states: one of bpf_task_work_irq(), bpf_task_work_schedule(), >>>> bpf_task_work_callback() should cleanup upon detecting the state >>>> switching to FREED. >>>> >>>> The state transitions are controlled with atomic_cmpxchg, ensuring: >>>> * Only one thread can successfully enqueue work. >>>> * Proper handling of concurrent deletes (BPF_TW_FREED). >>>> * Safe rollback if task_work_add() fails. >>>> >>> In general I am not sure why we need so many acquire/release pairs. >>> Why not use test_and_set_bit etc.? Or simply cmpxchg? >>> What ordering of stores are we depending on that merits >>> acquire/release ordering? >>> We should probably document explicitly. >>> >>>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> >>>> --- >>>> kernel/bpf/helpers.c | 188 ++++++++++++++++++++++++++++++++++++++++++- >>>> 1 file changed, 186 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >>>> index 516286f67f0d..4c8b1c9be7aa 100644 >>>> --- a/kernel/bpf/helpers.c >>>> +++ b/kernel/bpf/helpers.c >>>> @@ -25,6 +25,8 @@ >>>> #include <linux/kasan.h> >>>> #include <linux/bpf_verifier.h> >>>> #include <linux/uaccess.h> >>>> +#include <linux/task_work.h> >>>> +#include <linux/irq_work.h> >>>> >>>> #include "../../lib/kstrtox.h" >>>> >>>> @@ -3702,6 +3704,160 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) >>>> >>>> typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); >>>> >>>> +enum bpf_task_work_state { >>>> + /* bpf_task_work is ready to be used */ >>>> + BPF_TW_STANDBY = 0, >>>> + /* bpf_task_work is getting scheduled into irq_work */ >>>> + BPF_TW_PENDING, >>>> + /* bpf_task_work is in irq_work and getting scheduled into task_work */ >>>> + BPF_TW_SCHEDULING, >>>> + /* bpf_task_work is scheduled into task_work successfully */ >>>> + BPF_TW_SCHEDULED, >>>> + /* callback is running */ >>>> + BPF_TW_RUNNING, >>>> + /* BPF map value storing this bpf_task_work is deleted */ >>>> + BPF_TW_FREED, >>>> +}; >>>> + >>>> +struct bpf_task_work_context { >>>> + /* map that contains this structure in a value */ >>>> + struct bpf_map *map; >>>> + /* bpf_task_work_state value, representing the state */ >>>> + atomic_t state; >>>> + /* bpf_prog that schedules task work */ >>>> + struct bpf_prog *prog; >>>> + /* task for which callback is scheduled */ >>>> + struct task_struct *task; >>>> + /* notification mode for task work scheduling */ >>>> + enum task_work_notify_mode mode; >>>> + /* callback to call from task work */ >>>> + bpf_task_work_callback_t callback_fn; >>>> + struct callback_head work; >>>> + struct irq_work irq_work; >>>> +} __aligned(8); >>> I will echo Alexei's comments about the layout. We cannot inline all >>> this in map value. >>> Allocation using an init function or in some control function is >>> probably the only way. >>> >>>> + >>>> +static bool task_work_match(struct callback_head *head, void *data) >>>> +{ >>>> + struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work); >>>> + >>>> + return ctx == data; >>>> +} >>>> + >>>> +static void bpf_reset_task_work_context(struct bpf_task_work_context *ctx) >>>> +{ >>>> + bpf_prog_put(ctx->prog); >>>> + bpf_task_release(ctx->task); >>>> + rcu_assign_pointer(ctx->map, NULL); >>>> +} >>>> + >>>> +static void bpf_task_work_callback(struct callback_head *cb) >>>> +{ >>>> + enum bpf_task_work_state state; >>>> + struct bpf_task_work_context *ctx; >>>> + struct bpf_map *map; >>>> + u32 idx; >>>> + void *key; >>>> + void *value; >>>> + >>>> + rcu_read_lock_trace(); >>>> + ctx = container_of(cb, struct bpf_task_work_context, work); >>>> + >>>> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING); >>>> + if (state == BPF_TW_SCHEDULED) >>>> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING); >>>> + if (state == BPF_TW_FREED) >>>> + goto out; >>> I am leaving out commenting on this, since I expect it to change per >>> later comments. >>> >>>> + >>>> + map = rcu_dereference(ctx->map); >>>> + if (!map) >>>> + goto out; >>>> + >>>> + value = (void *)ctx - map->record->task_work_off; >>>> + key = (void *)map_key_from_value(map, value, &idx); >>>> + >>>> + migrate_disable(); >>>> + ctx->callback_fn(map, key, value); >>>> + migrate_enable(); >>>> + >>>> + /* State is running or freed, either way reset. */ >>>> + bpf_reset_task_work_context(ctx); >>>> + atomic_cmpxchg_release(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY); >>>> +out: >>>> + rcu_read_unlock_trace(); >>>> +} >>>> + >>>> +static void bpf_task_work_irq(struct irq_work *irq_work) >>>> +{ >>>> + struct bpf_task_work_context *ctx; >>>> + enum bpf_task_work_state state; >>>> + int err; >>>> + >>>> + ctx = container_of(irq_work, struct bpf_task_work_context, irq_work); >>>> + >>>> + rcu_read_lock_trace(); >>> What's the idea behind rcu_read_lock_trace? Let's add a comment. >>> >>>> + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING); >>>> + if (state == BPF_TW_FREED) { >>>> + bpf_reset_task_work_context(ctx); >>>> + goto out; >>>> + } >>>> + >>>> + err = task_work_add(ctx->task, &ctx->work, ctx->mode); >>> Racy, SCHEDULING->FREE state claim from cancel_and_free will release ctx->task. >> Thanks for pointing this out, I missed that case. >>>> + if (err) { >>>> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_PENDING); >>> Races here look fine, since we don't act on FREED (for this block >>> atleast), cancel_and_free doesn't act on seeing PENDING, >>> so there is interlocking. >>> >>>> + if (state == BPF_TW_SCHEDULING) { >>>> + bpf_reset_task_work_context(ctx); >>>> + atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY); >>>> + } >>>> + goto out; >>>> + } >>>> + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED); >>>> + if (state == BPF_TW_FREED) >>>> + task_work_cancel_match(ctx->task, task_work_match, ctx); >>> It looks like there is a similar race condition here. >>> If BPF_TW_SCHEDULING is set, cancel_and_free may invoke and attempt >>> bpf_task_release() from bpf_reset_task_work_context(). >>> Meanwhile, we will access ctx->task here directly after seeing BPF_TW_FREED. >> Yeah, we should release task_work in this function in case SCHEDULING >> gets transitioned into FREED. >>>> +out: >>>> + rcu_read_unlock_trace(); >>>> +} >>>> + >>>> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work_context *ctx, >>>> + struct bpf_map *map, bpf_task_work_callback_t callback_fn, >>>> + struct bpf_prog_aux *aux, enum task_work_notify_mode mode) >>>> +{ >>>> + struct bpf_prog *prog; >>>> + >>>> + BTF_TYPE_EMIT(struct bpf_task_work); >>>> + >>>> + prog = bpf_prog_inc_not_zero(aux->prog); >>>> + if (IS_ERR(prog)) >>>> + return -EPERM; >>>> + >>>> + if (!atomic64_read(&map->usercnt)) { >>>> + bpf_prog_put(prog); >>>> + return -EPERM; >>>> + } >>>> + task = bpf_task_acquire(task); >>>> + if (!task) { >>>> + bpf_prog_put(prog); >>>> + return -EPERM; >>>> + } >>>> + >>>> + if (atomic_cmpxchg_acquire(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) { >>> If we are reusing map values, wouldn't a freed state stay perpetually freed? >>> I.e. after the first delete of array elements etc. it becomes useless. >>> Every array map update would invoke a cancel_and_free. >>> Who resets it? >> I'm not sure I understand the question, the idea is that if element is >> deleted from map, we >> transition state to FREED and make sure refcounts of the task and prog >> are released. >> >> An element is returned into STANDBY state after task_work is completed >> or failed, so it can be reused. >> Could you please elaborate on the scenario you have in mind? > I guess I am confused about where we will go from FREED to STANDBY, if > we set it to BPF_TW_FREED in cancel_and_free. > When you update an array map element, we always do > bpf_obj_free_fields. Typically, this operation leaves the field in a > reusable state. > I don't see a FREED->STANDBY transition (after going from > [SCHEDULED|SCHEDULING]->FREED, only RUNNING->STANDBY in the callback. I see your point, arraymap does not overwrite those special fields on update, thanks! ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs 2025-08-11 20:13 ` Mykyta Yatsenko @ 2025-08-11 20:17 ` Kumar Kartikeya Dwivedi 0 siblings, 0 replies; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2025-08-11 20:17 UTC (permalink / raw) To: Mykyta Yatsenko Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Mykyta Yatsenko On Mon, 11 Aug 2025 at 22:13, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > > On 8/9/25 04:04, Kumar Kartikeya Dwivedi wrote: > > On Fri, 8 Aug 2025 at 02:44, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > >> On 8/7/25 19:55, Kumar Kartikeya Dwivedi wrote: > >>> On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote: > >>>> From: Mykyta Yatsenko <yatsenko@meta.com> > >>>> > >>>> Implementation of the bpf_task_work_schedule kfuncs. > >>>> > >>>> Main components: > >>>> * struct bpf_task_work_context – Metadata and state management per task > >>>> work. > >>>> * enum bpf_task_work_state – A state machine to serialize work > >>>> scheduling and execution. > >>>> * bpf_task_work_schedule() – The central helper that initiates > >>>> scheduling. > >>>> * bpf_task_work_callback() – Invoked when the actual task_work runs. > >>>> * bpf_task_work_irq() – An intermediate step (runs in softirq context) > >>>> to enqueue task work. > >>>> * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries. > >>>> > >>>> Flow of task work scheduling > >>>> 1) bpf_task_work_schedule_* is called from BPF code. > >>>> 2) Transition state from STANDBY to PENDING. > >>>> 3) irq_work_queue() schedules bpf_task_work_irq(). > >>>> 4) Transition state from PENDING to SCHEDULING. > >>>> 4) bpf_task_work_irq() attempts task_work_add(). If successful, state > >>>> transitions to SCHEDULED. > >>>> 5) Task work calls bpf_task_work_callback(), which transition state to > >>>> RUNNING. > >>>> 6) BPF callback is executed > >>>> 7) Context is cleaned up, refcounts released, state set back to > >>>> STANDBY. > >>>> > >>>> Map value deletion > >>>> If map value that contains bpf_task_work_context is deleted, BPF map > >>>> implementation calls bpf_task_work_cancel_and_free(). > >>>> Deletion is handled by atomically setting state to FREED and > >>>> releasing references or letting scheduler do that, depending on the > >>>> last state before the deletion: > >>>> * SCHEDULING: release references in bpf_task_work_cancel_and_free(), > >>>> expect bpf_task_work_irq() to cancel task work. > >>>> * SCHEDULED: release references and try to cancel task work in > >>>> bpf_task_work_cancel_and_free(). > >>>> * other states: one of bpf_task_work_irq(), bpf_task_work_schedule(), > >>>> bpf_task_work_callback() should cleanup upon detecting the state > >>>> switching to FREED. > >>>> > >>>> The state transitions are controlled with atomic_cmpxchg, ensuring: > >>>> * Only one thread can successfully enqueue work. > >>>> * Proper handling of concurrent deletes (BPF_TW_FREED). > >>>> * Safe rollback if task_work_add() fails. > >>>> > >>> In general I am not sure why we need so many acquire/release pairs. > >>> Why not use test_and_set_bit etc.? Or simply cmpxchg? > >>> What ordering of stores are we depending on that merits > >>> acquire/release ordering? > >>> We should probably document explicitly. > >>> > >>>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> > >>>> --- > >>>> kernel/bpf/helpers.c | 188 ++++++++++++++++++++++++++++++++++++++++++- > >>>> 1 file changed, 186 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > >>>> index 516286f67f0d..4c8b1c9be7aa 100644 > >>>> --- a/kernel/bpf/helpers.c > >>>> +++ b/kernel/bpf/helpers.c > >>>> @@ -25,6 +25,8 @@ > >>>> #include <linux/kasan.h> > >>>> #include <linux/bpf_verifier.h> > >>>> #include <linux/uaccess.h> > >>>> +#include <linux/task_work.h> > >>>> +#include <linux/irq_work.h> > >>>> > >>>> #include "../../lib/kstrtox.h" > >>>> > >>>> @@ -3702,6 +3704,160 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign) > >>>> > >>>> typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *); > >>>> > >>>> +enum bpf_task_work_state { > >>>> + /* bpf_task_work is ready to be used */ > >>>> + BPF_TW_STANDBY = 0, > >>>> + /* bpf_task_work is getting scheduled into irq_work */ > >>>> + BPF_TW_PENDING, > >>>> + /* bpf_task_work is in irq_work and getting scheduled into task_work */ > >>>> + BPF_TW_SCHEDULING, > >>>> + /* bpf_task_work is scheduled into task_work successfully */ > >>>> + BPF_TW_SCHEDULED, > >>>> + /* callback is running */ > >>>> + BPF_TW_RUNNING, > >>>> + /* BPF map value storing this bpf_task_work is deleted */ > >>>> + BPF_TW_FREED, > >>>> +}; > >>>> + > >>>> +struct bpf_task_work_context { > >>>> + /* map that contains this structure in a value */ > >>>> + struct bpf_map *map; > >>>> + /* bpf_task_work_state value, representing the state */ > >>>> + atomic_t state; > >>>> + /* bpf_prog that schedules task work */ > >>>> + struct bpf_prog *prog; > >>>> + /* task for which callback is scheduled */ > >>>> + struct task_struct *task; > >>>> + /* notification mode for task work scheduling */ > >>>> + enum task_work_notify_mode mode; > >>>> + /* callback to call from task work */ > >>>> + bpf_task_work_callback_t callback_fn; > >>>> + struct callback_head work; > >>>> + struct irq_work irq_work; > >>>> +} __aligned(8); > >>> I will echo Alexei's comments about the layout. We cannot inline all > >>> this in map value. > >>> Allocation using an init function or in some control function is > >>> probably the only way. > >>> > >>>> + > >>>> +static bool task_work_match(struct callback_head *head, void *data) > >>>> +{ > >>>> + struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work); > >>>> + > >>>> + return ctx == data; > >>>> +} > >>>> + > >>>> +static void bpf_reset_task_work_context(struct bpf_task_work_context *ctx) > >>>> +{ > >>>> + bpf_prog_put(ctx->prog); > >>>> + bpf_task_release(ctx->task); > >>>> + rcu_assign_pointer(ctx->map, NULL); > >>>> +} > >>>> + > >>>> +static void bpf_task_work_callback(struct callback_head *cb) > >>>> +{ > >>>> + enum bpf_task_work_state state; > >>>> + struct bpf_task_work_context *ctx; > >>>> + struct bpf_map *map; > >>>> + u32 idx; > >>>> + void *key; > >>>> + void *value; > >>>> + > >>>> + rcu_read_lock_trace(); > >>>> + ctx = container_of(cb, struct bpf_task_work_context, work); > >>>> + > >>>> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING); > >>>> + if (state == BPF_TW_SCHEDULED) > >>>> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING); > >>>> + if (state == BPF_TW_FREED) > >>>> + goto out; > >>> I am leaving out commenting on this, since I expect it to change per > >>> later comments. > >>> > >>>> + > >>>> + map = rcu_dereference(ctx->map); > >>>> + if (!map) > >>>> + goto out; > >>>> + > >>>> + value = (void *)ctx - map->record->task_work_off; > >>>> + key = (void *)map_key_from_value(map, value, &idx); > >>>> + > >>>> + migrate_disable(); > >>>> + ctx->callback_fn(map, key, value); > >>>> + migrate_enable(); > >>>> + > >>>> + /* State is running or freed, either way reset. */ > >>>> + bpf_reset_task_work_context(ctx); > >>>> + atomic_cmpxchg_release(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY); > >>>> +out: > >>>> + rcu_read_unlock_trace(); > >>>> +} > >>>> + > >>>> +static void bpf_task_work_irq(struct irq_work *irq_work) > >>>> +{ > >>>> + struct bpf_task_work_context *ctx; > >>>> + enum bpf_task_work_state state; > >>>> + int err; > >>>> + > >>>> + ctx = container_of(irq_work, struct bpf_task_work_context, irq_work); > >>>> + > >>>> + rcu_read_lock_trace(); > >>> What's the idea behind rcu_read_lock_trace? Let's add a comment. > >>> > >>>> + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING); > >>>> + if (state == BPF_TW_FREED) { > >>>> + bpf_reset_task_work_context(ctx); > >>>> + goto out; > >>>> + } > >>>> + > >>>> + err = task_work_add(ctx->task, &ctx->work, ctx->mode); > >>> Racy, SCHEDULING->FREE state claim from cancel_and_free will release ctx->task. > >> Thanks for pointing this out, I missed that case. > >>>> + if (err) { > >>>> + state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_PENDING); > >>> Races here look fine, since we don't act on FREED (for this block > >>> atleast), cancel_and_free doesn't act on seeing PENDING, > >>> so there is interlocking. > >>> > >>>> + if (state == BPF_TW_SCHEDULING) { > >>>> + bpf_reset_task_work_context(ctx); > >>>> + atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY); > >>>> + } > >>>> + goto out; > >>>> + } > >>>> + state = atomic_cmpxchg_release(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED); > >>>> + if (state == BPF_TW_FREED) > >>>> + task_work_cancel_match(ctx->task, task_work_match, ctx); > >>> It looks like there is a similar race condition here. > >>> If BPF_TW_SCHEDULING is set, cancel_and_free may invoke and attempt > >>> bpf_task_release() from bpf_reset_task_work_context(). > >>> Meanwhile, we will access ctx->task here directly after seeing BPF_TW_FREED. > >> Yeah, we should release task_work in this function in case SCHEDULING > >> gets transitioned into FREED. > >>>> +out: > >>>> + rcu_read_unlock_trace(); > >>>> +} > >>>> + > >>>> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work_context *ctx, > >>>> + struct bpf_map *map, bpf_task_work_callback_t callback_fn, > >>>> + struct bpf_prog_aux *aux, enum task_work_notify_mode mode) > >>>> +{ > >>>> + struct bpf_prog *prog; > >>>> + > >>>> + BTF_TYPE_EMIT(struct bpf_task_work); > >>>> + > >>>> + prog = bpf_prog_inc_not_zero(aux->prog); > >>>> + if (IS_ERR(prog)) > >>>> + return -EPERM; > >>>> + > >>>> + if (!atomic64_read(&map->usercnt)) { > >>>> + bpf_prog_put(prog); > >>>> + return -EPERM; > >>>> + } > >>>> + task = bpf_task_acquire(task); > >>>> + if (!task) { > >>>> + bpf_prog_put(prog); > >>>> + return -EPERM; > >>>> + } > >>>> + > >>>> + if (atomic_cmpxchg_acquire(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) { > >>> If we are reusing map values, wouldn't a freed state stay perpetually freed? > >>> I.e. after the first delete of array elements etc. it becomes useless. > >>> Every array map update would invoke a cancel_and_free. > >>> Who resets it? > >> I'm not sure I understand the question, the idea is that if element is > >> deleted from map, we > >> transition state to FREED and make sure refcounts of the task and prog > >> are released. > >> > >> An element is returned into STANDBY state after task_work is completed > >> or failed, so it can be reused. > >> Could you please elaborate on the scenario you have in mind? > > I guess I am confused about where we will go from FREED to STANDBY, if > > we set it to BPF_TW_FREED in cancel_and_free. > > When you update an array map element, we always do > > bpf_obj_free_fields. Typically, this operation leaves the field in a > > reusable state. > > I don't see a FREED->STANDBY transition (after going from > > [SCHEDULED|SCHEDULING]->FREED, only RUNNING->STANDBY in the callback. > I see your point, arraymap does not overwrite those special fields on > update, thanks! > Right, it won't just be array maps though. Hash maps also have memory reuse, without any reinitialization. What you want in bpf_obj_free_fields is to claim the field in a way that it's ready for reuse. It must be left in a state that if a new program obtains a pointer to the map value after that, it can reinit the object and begin using it like if it were allocated anew. ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH bpf-next 4/4] selftests/bpf: BPF task work scheduling tests 2025-08-06 14:45 [PATCH bpf-next 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko ` (2 preceding siblings ...) 2025-08-06 14:45 ` [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko @ 2025-08-06 14:45 ` Mykyta Yatsenko 3 siblings, 0 replies; 15+ messages in thread From: Mykyta Yatsenko @ 2025-08-06 14:45 UTC (permalink / raw) To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87; +Cc: Mykyta Yatsenko From: Mykyta Yatsenko <yatsenko@meta.com> Introducing selftests that check BPF task work scheduling mechanism. Validate that verifier does not accepts incorrect calls to bpf_task_work_schedule kfunc. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> --- .../selftests/bpf/prog_tests/test_task_work.c | 149 ++++++++++++++++++ tools/testing/selftests/bpf/progs/task_work.c | 108 +++++++++++++ .../selftests/bpf/progs/task_work_fail.c | 98 ++++++++++++ 3 files changed, 355 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_work.c create mode 100644 tools/testing/selftests/bpf/progs/task_work.c create mode 100644 tools/testing/selftests/bpf/progs/task_work_fail.c diff --git a/tools/testing/selftests/bpf/prog_tests/test_task_work.c b/tools/testing/selftests/bpf/prog_tests/test_task_work.c new file mode 100644 index 000000000000..9c3c7a46a827 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_task_work.c @@ -0,0 +1,149 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */ +#include <test_progs.h> +#include <string.h> +#include <stdio.h> +#include "task_work.skel.h" +#include "task_work_fail.skel.h" +#include <linux/bpf.h> +#include <linux/perf_event.h> +#include <sys/syscall.h> +#include <time.h> + +static int perf_event_open(__u32 type, __u64 config, int pid) +{ + struct perf_event_attr attr = { + .type = type, + .config = config, + .size = sizeof(struct perf_event_attr), + .sample_period = 100000, + }; + + return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0); +} + +struct elem { + char data[128]; + struct bpf_task_work tw; +}; + +static int verify_map(struct bpf_map *map, const char *expected_data) +{ + int err; + struct elem value; + int processed_values = 0; + int k, sz; + + sz = bpf_map__max_entries(map); + for (k = 0; k < sz; ++k) { + err = bpf_map__lookup_elem(map, &k, sizeof(int), &value, sizeof(struct elem), 0); + if (err) + continue; + if (!ASSERT_EQ(strcmp(expected_data, value.data), 0, "map data")) { + fprintf(stderr, "expected '%s', found '%s' in %s map", expected_data, + value.data, bpf_map__name(map)); + return 2; + } + processed_values++; + } + + return processed_values == 0; +} + +static void task_work_run(const char *prog_name, const char *map_name) +{ + struct task_work *skel; + struct bpf_program *prog; + struct bpf_map *map; + struct bpf_link *link; + int err, pe_fd = 0, pid, status, pipefd[2]; + char user_string[] = "hello world"; + + if (!ASSERT_NEQ(pipe(pipefd), -1, "pipe")) + return; + + pid = fork(); + if (pid == 0) { + __u64 num = 1; + int i; + char buf; + + close(pipefd[1]); + read(pipefd[0], &buf, sizeof(buf)); + close(pipefd[0]); + + for (i = 0; i < 10000; ++i) + num *= time(0) % 7; + (void)num; + exit(0); + } + skel = task_work__open(); + if (!ASSERT_OK_PTR(skel, "task_work__open")) + return; + + bpf_object__for_each_program(prog, skel->obj) { + bpf_program__set_autoload(prog, false); + } + + prog = bpf_object__find_program_by_name(skel->obj, prog_name); + if (!ASSERT_OK_PTR(prog, "prog_name")) + goto cleanup; + bpf_program__set_autoload(prog, true); + bpf_program__set_type(prog, BPF_PROG_TYPE_PERF_EVENT); + skel->bss->user_ptr = (char *)user_string; + + err = task_work__load(skel); + if (!ASSERT_OK(err, "skel_load")) + goto cleanup; + + pe_fd = perf_event_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, pid); + if (pe_fd == -1 && (errno == ENOENT || errno == EOPNOTSUPP)) { + printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n", __func__); + test__skip(); + goto cleanup; + } + if (!ASSERT_NEQ(pe_fd, -1, "pe_fd")) { + fprintf(stderr, "perf_event_open errno: %d, pid: %d\n", errno, pid); + goto cleanup; + } + + link = bpf_program__attach_perf_event(prog, pe_fd); + if (!ASSERT_OK_PTR(link, "attach_perf_event")) + goto cleanup; + + close(pipefd[0]); + write(pipefd[1], user_string, 1); + close(pipefd[1]); + /* Wait to collect some samples */ + waitpid(pid, &status, 0); + pid = 0; + map = bpf_object__find_map_by_name(skel->obj, map_name); + if (!ASSERT_OK_PTR(map, "find map_name")) + goto cleanup; + if (!ASSERT_OK(verify_map(map, user_string), "verify map")) + goto cleanup; +cleanup: + if (pe_fd >= 0) + close(pe_fd); + task_work__destroy(skel); + if (pid) { + close(pipefd[0]); + write(pipefd[1], user_string, 1); + close(pipefd[1]); + waitpid(pid, &status, 0); + } +} + +void test_task_work(void) +{ + if (test__start_subtest("test_task_work_hash_map")) + task_work_run("oncpu_hash_map", "hmap"); + + if (test__start_subtest("test_task_work_array_map")) + task_work_run("oncpu_array_map", "arrmap"); + + if (test__start_subtest("test_task_work_lru_map")) + task_work_run("oncpu_lru_map", "lrumap"); + + RUN_TESTS(task_work_fail); +} diff --git a/tools/testing/selftests/bpf/progs/task_work.c b/tools/testing/selftests/bpf/progs/task_work.c new file mode 100644 index 000000000000..5e761b4a5fd1 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/task_work.c @@ -0,0 +1,108 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */ + +#include <vmlinux.h> +#include <string.h> +#include <stdbool.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "bpf_misc.h" +#include "errno.h" + +char _license[] SEC("license") = "GPL"; + +const void *user_ptr = NULL; + +struct elem { + char data[128]; + struct bpf_task_work tw; +}; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(map_flags, BPF_F_NO_PREALLOC); + __uint(max_entries, 1); + __type(key, int); + __type(value, struct elem); +} hmap SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, int); + __type(value, struct elem); +} arrmap SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_LRU_HASH); + __uint(max_entries, 1); + __type(key, int); + __type(value, struct elem); +} lrumap SEC(".maps"); + +static void process_work(struct bpf_map *map, void *key, void *value) +{ + struct elem *work = value; + + bpf_copy_from_user_str(work->data, sizeof(work->data), (const void *)user_ptr, 0); +} + +int key = 0; + +SEC("perf_event") +int oncpu_hash_map(struct pt_regs *args) +{ + struct elem empty_work = { .data = { 0 } }; + struct elem *work; + struct task_struct *task; + int err; + + task = bpf_get_current_task_btf(); + err = bpf_map_update_elem(&hmap, &key, &empty_work, BPF_NOEXIST); + if (err) + return 0; + work = bpf_map_lookup_elem(&hmap, &key); + if (!work) + return 0; + + bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&hmap, process_work, NULL); + return 0; +} + +SEC("perf_event") +int oncpu_array_map(struct pt_regs *args) +{ + struct elem *work; + struct task_struct *task; + + task = bpf_get_current_task_btf(); + work = bpf_map_lookup_elem(&arrmap, &key); + if (!work) + return 0; + bpf_task_work_schedule_signal(task, &work->tw, (struct bpf_map *)&arrmap, process_work, + NULL); + return 0; +} + +SEC("perf_event") +int oncpu_lru_map(struct pt_regs *args) +{ + struct elem empty_work = { .data = { 0 } }; + struct elem *work; + struct task_struct *task; + int err; + + task = bpf_get_current_task_btf(); + work = bpf_map_lookup_elem(&lrumap, &key); + if (work) + return 0; + err = bpf_map_update_elem(&lrumap, &key, &empty_work, BPF_NOEXIST); + if (err) + return 0; + work = bpf_map_lookup_elem(&lrumap, &key); + if (!work || work->data[0]) + return 0; + bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&lrumap, process_work, + NULL); + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/task_work_fail.c b/tools/testing/selftests/bpf/progs/task_work_fail.c new file mode 100644 index 000000000000..fca7052b805e --- /dev/null +++ b/tools/testing/selftests/bpf/progs/task_work_fail.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */ + +#include <vmlinux.h> +#include <string.h> +#include <stdbool.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "bpf_misc.h" + +char _license[] SEC("license") = "GPL"; + +const void *user_ptr = NULL; + +struct elem { + char data[128]; + struct bpf_task_work tw; +}; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(map_flags, BPF_F_NO_PREALLOC); + __uint(max_entries, 1); + __type(key, int); + __type(value, struct elem); +} hmap SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, int); + __type(value, struct elem); +} arrmap SEC(".maps"); + +static void process_work(struct bpf_map *map, void *key, void *value) +{ + struct elem *work = value; + + bpf_copy_from_user_str(work->data, sizeof(work->data), (const void *)user_ptr, 0); +} + +int key = 0; + +SEC("perf_event") +__failure __msg("doesn't match map pointer in R3") +int mismatch_map(struct pt_regs *args) +{ + struct elem *work; + struct task_struct *task; + + task = bpf_get_current_task_btf(); + work = bpf_map_lookup_elem(&arrmap, &key); + if (!work) + return 0; + bpf_task_work_schedule_resume(task, &work->tw, (struct bpf_map *)&hmap, + process_work, NULL); + return 0; +} + +SEC("perf_event") +__failure __msg("arg#1 doesn't point to a map value") +int no_map_task_work(struct pt_regs *args) +{ + struct task_struct *task; + struct bpf_task_work tw; + + task = bpf_get_current_task_btf(); + bpf_task_work_schedule_resume(task, &tw, (struct bpf_map *)&hmap, + process_work, NULL); + return 0; +} + +SEC("perf_event") +__failure __msg("Possibly NULL pointer passed to trusted arg1") +int task_work_null(struct pt_regs *args) +{ + struct task_struct *task; + + task = bpf_get_current_task_btf(); + bpf_task_work_schedule_resume(task, NULL, (struct bpf_map *)&hmap, + process_work, NULL); + return 0; +} + +SEC("perf_event") +__failure __msg("Possibly NULL pointer passed to trusted arg2") +int map_null(struct pt_regs *args) +{ + struct elem *work; + struct task_struct *task; + + task = bpf_get_current_task_btf(); + work = bpf_map_lookup_elem(&arrmap, &key); + if (!work) + return 0; + bpf_task_work_schedule_resume(task, &work->tw, NULL, process_work, NULL); + return 0; +} -- 2.50.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-08-15 18:25 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-06 14:45 [PATCH bpf-next 0/4] bpf: Introduce deferred task context execution Mykyta Yatsenko 2025-08-06 14:45 ` [PATCH bpf-next 1/4] bpf: bpf task work plumbing Mykyta Yatsenko 2025-08-07 17:53 ` Kumar Kartikeya Dwivedi 2025-08-15 18:25 ` Mykyta Yatsenko 2025-08-06 14:45 ` [PATCH bpf-next 2/4] bpf: extract map key pointer calculation Mykyta Yatsenko 2025-08-07 19:01 ` Kumar Kartikeya Dwivedi 2025-08-06 14:45 ` [PATCH bpf-next 3/4] bpf: task work scheduling kfuncs Mykyta Yatsenko 2025-08-07 17:27 ` Alexei Starovoitov 2025-08-08 0:30 ` Mykyta Yatsenko 2025-08-07 18:55 ` Kumar Kartikeya Dwivedi 2025-08-08 0:44 ` Mykyta Yatsenko 2025-08-09 3:04 ` Kumar Kartikeya Dwivedi 2025-08-11 20:13 ` Mykyta Yatsenko 2025-08-11 20:17 ` Kumar Kartikeya Dwivedi 2025-08-06 14:45 ` [PATCH bpf-next 4/4] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.