* [PATCH bpf-next v1 0/4] Close race in freeing special fields and map value
@ 2026-02-25 18:51 Kumar Kartikeya Dwivedi
2026-02-25 18:51 ` [PATCH bpf-next v1 1/4] bpf: Register dtor for freeing special fields Kumar Kartikeya Dwivedi
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Mykyta Yatsenko, kkd,
kernel-team
There exists a race across various map types where the freeing of
special fields (tw, timer, wq, kptr, etc.) can be done eagerly when a
logical delete operation is done on a map value, such that the program
which continues to have access to such a map value can recreate the
fields and cause them to leak.
The set contains fixes for this case. It is a continuation of Mykyta's
previous attempt in [0], but applies to all fields. A test is included
which reproduces the bug reliably in absence of the fixes.
[0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com
Kumar Kartikeya Dwivedi (4):
bpf: Register dtor for freeing special fields
bpf: Delay freeing fields in local storage
bpf: Retire rcu_trace_implies_rcu_gp() from local storage
selftests/bpf: Add tests for special fields races
include/linux/bpf_mem_alloc.h | 4 +
kernel/bpf/bpf_local_storage.c | 65 ++++----
kernel/bpf/hashtab.c | 37 +++++
kernel/bpf/memalloc.c | 42 +++--
.../selftests/bpf/prog_tests/map_kptr_race.c | 155 ++++++++++++++++++
.../selftests/bpf/progs/map_kptr_race.c | 145 ++++++++++++++++
6 files changed, 403 insertions(+), 45 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/map_kptr_race.c
create mode 100644 tools/testing/selftests/bpf/progs/map_kptr_race.c
base-commit: f620af11c27b8ec9994a39fe968aa778112d1566
--
2.47.3
^ permalink raw reply [flat|nested] 15+ messages in thread* [PATCH bpf-next v1 1/4] bpf: Register dtor for freeing special fields 2026-02-25 18:51 [PATCH bpf-next v1 0/4] Close race in freeing special fields and map value Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 ` Kumar Kartikeya Dwivedi 2026-02-25 18:51 ` [PATCH bpf-next v1 2/4] bpf: Delay freeing fields in local storage Kumar Kartikeya Dwivedi ` (3 subsequent siblings) 4 siblings, 0 replies; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 UTC (permalink / raw) To: bpf Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Mykyta Yatsenko, kkd, kernel-team There is a race window where BPF hash map elements can leak special fields if the program with access to the map value recreates these special fields between the check_and_free_fields done on the map value and its eventual return to the memory allocator. Several ways were explored prior to this patch, most notably [0] tried to use a poison value to reject attempts to recreate special fields for map values that have been logically deleted but still accessible to BPF programs (either while sitting in the free list or when reused). While this approach works well for task work, timers, wq, etc., it is harder to apply the idea to kptrs, which have a similar race and failure mode. Instead, we change bpf_mem_alloc to allow registering destructor for allocated elements, such that when they are returned to the allocator, any special fields created while they were accessible to programs in the mean time will be freed. If these values get reused, we do not free the fields again before handing the element back. The special fields thus may remain initialized while the map value sits in a free list. When bpf_mem_alloc is retired in the future, a similar concept can be introduced to kmalloc_nolock-backed kmem_cache, paired with the existing idea of a constructor. [0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> --- include/linux/bpf_mem_alloc.h | 4 ++++ kernel/bpf/hashtab.c | 37 ++++++++++++++++++++++++++++++ kernel/bpf/memalloc.c | 42 +++++++++++++++++++++++------------ 3 files changed, 69 insertions(+), 14 deletions(-) diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index e45162ef59bb..7517eaf94ac9 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -14,6 +14,8 @@ struct bpf_mem_alloc { struct obj_cgroup *objcg; bool percpu; struct work_struct work; + void (*dtor)(void *obj, void *ctx); + void *dtor_ctx; }; /* 'size != 0' is for bpf_mem_alloc which manages fixed-size objects. @@ -32,6 +34,8 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma, struct obj_cgroup *objcg /* The percpu allocation with a specific unit size. */ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size); void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); +void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, + void (*dtor)(void *obj, void *ctx), void *ctx); /* Check the allocation size for kmalloc equivalent allocator */ int bpf_mem_alloc_check_size(bool percpu, size_t size); diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 3b9d297a53be..74f7a6f44c50 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -457,6 +457,32 @@ static int htab_map_alloc_check(union bpf_attr *attr) return 0; } +static void htab_mem_dtor(void *obj, void *ctx) +{ + struct bpf_htab *htab = ctx; + struct htab_elem *elem = obj; + void *map_value; + + if (IS_ERR_OR_NULL(htab->map.record)) + return; + + map_value = htab_elem_value(elem, htab->map.key_size); + bpf_obj_free_fields(htab->map.record, map_value); +} + +static void htab_pcpu_mem_dtor(void *obj, void *ctx) +{ + struct bpf_htab *htab = ctx; + void __percpu *pptr = *(void __percpu **)obj; + int cpu; + + if (IS_ERR_OR_NULL(htab->map.record)) + return; + + for_each_possible_cpu(cpu) + bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu)); +} + static struct bpf_map *htab_map_alloc(union bpf_attr *attr) { bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH || @@ -569,6 +595,17 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) round_up(htab->map.value_size, 8), true); if (err) goto free_map_locked; + /* See comment below */ + bpf_mem_alloc_set_dtor(&htab->pcpu_ma, htab_pcpu_mem_dtor, htab); + } else { + /* + * Register the dtor unconditionally. map->record is + * set by map_create() after map_alloc() returns, so it + * is always NULL at this point. The dtor checks + * IS_ERR_OR_NULL(htab->map.record) and becomes a no-op + * for maps without special fields. + */ + bpf_mem_alloc_set_dtor(&htab->ma, htab_mem_dtor, htab); } } diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index bd45dda9dc35..137e855c718b 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -102,6 +102,7 @@ struct bpf_mem_cache { int percpu_size; bool draining; struct bpf_mem_cache *tgt; + struct bpf_mem_alloc *ma; /* list of objects to be freed after RCU GP */ struct llist_head free_by_rcu; @@ -260,12 +261,15 @@ static void free_one(void *obj, bool percpu) kfree(obj); } -static int free_all(struct llist_node *llnode, bool percpu) +static int free_all(struct llist_node *llnode, bool percpu, + struct bpf_mem_alloc *ma) { struct llist_node *pos, *t; int cnt = 0; llist_for_each_safe(pos, t, llnode) { + if (ma->dtor) + ma->dtor((void *)pos + LLIST_NODE_SZ, ma->dtor_ctx); free_one(pos, percpu); cnt++; } @@ -276,7 +280,7 @@ static void __free_rcu(struct rcu_head *head) { struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace); - free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size); + free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size, c->ma); atomic_set(&c->call_rcu_ttrace_in_progress, 0); } @@ -308,7 +312,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c) if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)) { if (unlikely(READ_ONCE(c->draining))) { llnode = llist_del_all(&c->free_by_rcu_ttrace); - free_all(llnode, !!c->percpu_size); + free_all(llnode, !!c->percpu_size, c->ma); } return; } @@ -417,7 +421,7 @@ static void check_free_by_rcu(struct bpf_mem_cache *c) dec_active(c, &flags); if (unlikely(READ_ONCE(c->draining))) { - free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size); + free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size, c->ma); atomic_set(&c->call_rcu_in_progress, 0); } else { call_rcu_hurry(&c->rcu, __free_by_rcu); @@ -542,6 +546,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; + c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); } @@ -564,6 +569,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; + c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); @@ -616,6 +622,7 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; + c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); @@ -624,7 +631,7 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) return 0; } -static void drain_mem_cache(struct bpf_mem_cache *c) +static void drain_mem_cache(struct bpf_mem_cache *c, struct bpf_mem_alloc *ma) { bool percpu = !!c->percpu_size; @@ -635,13 +642,13 @@ static void drain_mem_cache(struct bpf_mem_cache *c) * Except for waiting_for_gp_ttrace list, there are no concurrent operations * on these lists, so it is safe to use __llist_del_all(). */ - free_all(llist_del_all(&c->free_by_rcu_ttrace), percpu); - free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu); - free_all(__llist_del_all(&c->free_llist), percpu); - free_all(__llist_del_all(&c->free_llist_extra), percpu); - free_all(__llist_del_all(&c->free_by_rcu), percpu); - free_all(__llist_del_all(&c->free_llist_extra_rcu), percpu); - free_all(llist_del_all(&c->waiting_for_gp), percpu); + free_all(llist_del_all(&c->free_by_rcu_ttrace), percpu, ma); + free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu, ma); + free_all(__llist_del_all(&c->free_llist), percpu, ma); + free_all(__llist_del_all(&c->free_llist_extra), percpu, ma); + free_all(__llist_del_all(&c->free_by_rcu), percpu, ma); + free_all(__llist_del_all(&c->free_llist_extra_rcu), percpu, ma); + free_all(llist_del_all(&c->waiting_for_gp), percpu, ma); } static void check_mem_cache(struct bpf_mem_cache *c) @@ -751,7 +758,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) c = per_cpu_ptr(ma->cache, cpu); WRITE_ONCE(c->draining, true); irq_work_sync(&c->refill_work); - drain_mem_cache(c); + drain_mem_cache(c, ma); rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } @@ -766,7 +773,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) c = &cc->cache[i]; WRITE_ONCE(c->draining, true); irq_work_sync(&c->refill_work); - drain_mem_cache(c); + drain_mem_cache(c, ma); rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } @@ -1014,3 +1021,10 @@ int bpf_mem_alloc_check_size(bool percpu, size_t size) return 0; } + +void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, + void (*dtor)(void *obj, void *ctx), void *ctx) +{ + ma->dtor = dtor; + ma->dtor_ctx = ctx; +} -- 2.47.3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH bpf-next v1 2/4] bpf: Delay freeing fields in local storage 2026-02-25 18:51 [PATCH bpf-next v1 0/4] Close race in freeing special fields and map value Kumar Kartikeya Dwivedi 2026-02-25 18:51 ` [PATCH bpf-next v1 1/4] bpf: Register dtor for freeing special fields Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 ` Kumar Kartikeya Dwivedi 2026-02-25 19:31 ` bot+bpf-ci 2026-02-25 18:51 ` [PATCH bpf-next v1 3/4] bpf: Retire rcu_trace_implies_rcu_gp() from " Kumar Kartikeya Dwivedi ` (2 subsequent siblings) 4 siblings, 1 reply; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 UTC (permalink / raw) To: bpf Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Mykyta Yatsenko, kkd, kernel-team Currently, when use_kmalloc_nolock is false, the freeing of fields for a local storage selem is done eagerly before waiting for the RCU or RCU tasks trace grace period to elapse. This opens up a window where the program which has access to the selem can recreate the fields after the freeing of fields is done eagerly, causing memory leaks when the element is finally freed and returned to the kernel. Make a few changes to address this. First, delay the freeing of fields until after the grace periods have expired using a __bpf_selem_free_rcu wrapper which is eventually invoked after transitioning through the necessary number of grace period waits. Replace usage of the kfree_rcu with call_rcu to be able to take a custom callback. Finally, care needs to be taken to extend the rcu barriers for all cases, and not just when use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be in flight for either case and access the smap field, which is used to obtain the BTF record to walk over special fields in the map value. Fixes: 9bac675e6368 ("bpf: Postpone bpf_obj_free_fields to the rcu callback") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> --- kernel/bpf/bpf_local_storage.c | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c index b28f07d3a0db..0825a92dcd6d 100644 --- a/kernel/bpf/bpf_local_storage.c +++ b/kernel/bpf/bpf_local_storage.c @@ -164,16 +164,25 @@ static void bpf_local_storage_free(struct bpf_local_storage *local_storage, bpf_local_storage_free_trace_rcu); } -/* rcu tasks trace callback for use_kmalloc_nolock == false */ -static void __bpf_selem_free_trace_rcu(struct rcu_head *rcu) +/* rcu callback for use_kmalloc_nolock == false */ +static void __bpf_selem_free_rcu(struct rcu_head *rcu) { struct bpf_local_storage_elem *selem; selem = container_of(rcu, struct bpf_local_storage_elem, rcu); + /* bpf_selem_unlink_nofail may have already cleared smap and freed fields. */ + if (SDATA(selem)->smap) + bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data); + kfree(selem); +} + +/* rcu tasks trace callback for use_kmalloc_nolock == false */ +static void __bpf_selem_free_trace_rcu(struct rcu_head *rcu) +{ if (rcu_trace_implies_rcu_gp()) - kfree(selem); + __bpf_selem_free_rcu(rcu); else - kfree_rcu(selem, rcu); + call_rcu(rcu, __bpf_selem_free_rcu); } /* Handle use_kmalloc_nolock == false */ @@ -181,7 +190,7 @@ static void __bpf_selem_free(struct bpf_local_storage_elem *selem, bool vanilla_rcu) { if (vanilla_rcu) - kfree_rcu(selem, rcu); + call_rcu(&selem->rcu, __bpf_selem_free_rcu); else call_rcu_tasks_trace(&selem->rcu, __bpf_selem_free_trace_rcu); } @@ -214,18 +223,12 @@ static void bpf_selem_free_trace_rcu(struct rcu_head *rcu) void bpf_selem_free(struct bpf_local_storage_elem *selem, bool reuse_now) { - struct bpf_local_storage_map *smap; - - smap = rcu_dereference_check(SDATA(selem)->smap, bpf_rcu_lock_held()); - if (!selem->use_kmalloc_nolock) { /* * No uptr will be unpin even when reuse_now == false since uptr * is only supported in task local storage, where * smap->use_kmalloc_nolock == true. */ - if (smap) - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data); __bpf_selem_free(selem, reuse_now); return; } @@ -958,10 +961,9 @@ void bpf_local_storage_map_free(struct bpf_map *map, */ synchronize_rcu(); - if (smap->use_kmalloc_nolock) { - rcu_barrier_tasks_trace(); - rcu_barrier(); - } + /* smap remains in use regardless of kmalloc_nolock, so wait unconditionally. */ + rcu_barrier_tasks_trace(); + rcu_barrier(); kvfree(smap->buckets); bpf_map_area_free(smap); } -- 2.47.3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next v1 2/4] bpf: Delay freeing fields in local storage 2026-02-25 18:51 ` [PATCH bpf-next v1 2/4] bpf: Delay freeing fields in local storage Kumar Kartikeya Dwivedi @ 2026-02-25 19:31 ` bot+bpf-ci 0 siblings, 0 replies; 15+ messages in thread From: bot+bpf-ci @ 2026-02-25 19:31 UTC (permalink / raw) To: memxor, bpf Cc: ast, andrii, daniel, martin.lau, eddyz87, yatsenko, kkd, kernel-team, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 1886 bytes --] > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c > index b28f07d3a0db..0825a92dcd6d 100644 > --- a/kernel/bpf/bpf_local_storage.c > +++ b/kernel/bpf/bpf_local_storage.c [ ... ] > +/* rcu callback for use_kmalloc_nolock == false */ > +static void __bpf_selem_free_rcu(struct rcu_head *rcu) > +{ > + struct bpf_local_storage_elem *selem; > + > + selem = container_of(rcu, struct bpf_local_storage_elem, rcu); > + /* bpf_selem_unlink_nofail may have already cleared smap and freed fields. */ > + if (SDATA(selem)->smap) > + bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data); > + kfree(selem); > +} The smap field is declared __rcu in struct bpf_local_storage_data: struct bpf_local_storage_map __rcu *smap; Should this use rcu_dereference_check() to be consistent with the existing pattern in bpf_selem_free_rcu()? smap = rcu_dereference_check(SDATA(selem)->smap, 1); if (smap) bpf_obj_free_fields(smap->map.record, SDATA(selem)->data); As written, the pointer is loaded from memory twice (once for the if check and once for the dereference) without READ_ONCE or a local variable, and sparse would flag the direct access to an __rcu pointer. --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/22411606914 AI-authorship-score: low AI-authorship-explanation: Focused, technically precise RCU bug fix by a well-known BPF maintainer showing deep subsystem expertise. issues-found: 1 issue-severity-score: low issue-severity-explanation: Missing rcu_dereference_check() for __rcu pointer access causes sparse warnings and double-loads the pointer, but no concurrent writer exists at callback time so functional impact is minimal. ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH bpf-next v1 3/4] bpf: Retire rcu_trace_implies_rcu_gp() from local storage 2026-02-25 18:51 [PATCH bpf-next v1 0/4] Close race in freeing special fields and map value Kumar Kartikeya Dwivedi 2026-02-25 18:51 ` [PATCH bpf-next v1 1/4] bpf: Register dtor for freeing special fields Kumar Kartikeya Dwivedi 2026-02-25 18:51 ` [PATCH bpf-next v1 2/4] bpf: Delay freeing fields in local storage Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 ` Kumar Kartikeya Dwivedi 2026-02-25 18:55 ` Paul E. McKenney 2026-02-25 18:51 ` [PATCH bpf-next v1 4/4] selftests/bpf: Add tests for special fields races Kumar Kartikeya Dwivedi 2026-02-26 9:31 ` [syzbot ci] Re: Close race in freeing special fields and map value syzbot ci 4 siblings, 1 reply; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 UTC (permalink / raw) To: bpf Cc: Paul E. McKenney, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Mykyta Yatsenko, kkd, kernel-team This assumption will always hold going forward, hence just remove the various checks and assume it is true with a comment for the uninformed reader. Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> --- kernel/bpf/bpf_local_storage.c | 37 +++++++++++++++++----------------- 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c index 0825a92dcd6d..84fe6364c75d 100644 --- a/kernel/bpf/bpf_local_storage.c +++ b/kernel/bpf/bpf_local_storage.c @@ -107,14 +107,12 @@ static void __bpf_local_storage_free_trace_rcu(struct rcu_head *rcu) { struct bpf_local_storage *local_storage; - /* If RCU Tasks Trace grace period implies RCU grace period, do - * kfree(), else do kfree_rcu(). + /* + * RCU Tasks Trace grace period implies RCU grace period, do + * kfree() directly. */ local_storage = container_of(rcu, struct bpf_local_storage, rcu); - if (rcu_trace_implies_rcu_gp()) - kfree(local_storage); - else - kfree_rcu(local_storage, rcu); + kfree(local_storage); } /* Handle use_kmalloc_nolock == false */ @@ -138,10 +136,11 @@ static void bpf_local_storage_free_rcu(struct rcu_head *rcu) static void bpf_local_storage_free_trace_rcu(struct rcu_head *rcu) { - if (rcu_trace_implies_rcu_gp()) - bpf_local_storage_free_rcu(rcu); - else - call_rcu(rcu, bpf_local_storage_free_rcu); + /* + * RCU Tasks Trace grace period implies RCU grace period, do + * kfree() directly. + */ + bpf_local_storage_free_rcu(rcu); } static void bpf_local_storage_free(struct bpf_local_storage *local_storage, @@ -179,10 +178,11 @@ static void __bpf_selem_free_rcu(struct rcu_head *rcu) /* rcu tasks trace callback for use_kmalloc_nolock == false */ static void __bpf_selem_free_trace_rcu(struct rcu_head *rcu) { - if (rcu_trace_implies_rcu_gp()) - __bpf_selem_free_rcu(rcu); - else - call_rcu(rcu, __bpf_selem_free_rcu); + /* + * RCU Tasks Trace grace period implies RCU grace period, do + * kfree() directly. + */ + __bpf_selem_free_rcu(rcu); } /* Handle use_kmalloc_nolock == false */ @@ -214,10 +214,11 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu) static void bpf_selem_free_trace_rcu(struct rcu_head *rcu) { - if (rcu_trace_implies_rcu_gp()) - bpf_selem_free_rcu(rcu); - else - call_rcu(rcu, bpf_selem_free_rcu); + /* + * RCU Tasks Trace grace period implies RCU grace period, do + * kfree() directly. + */ + bpf_selem_free_rcu(rcu); } void bpf_selem_free(struct bpf_local_storage_elem *selem, -- 2.47.3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH bpf-next v1 3/4] bpf: Retire rcu_trace_implies_rcu_gp() from local storage 2026-02-25 18:51 ` [PATCH bpf-next v1 3/4] bpf: Retire rcu_trace_implies_rcu_gp() from " Kumar Kartikeya Dwivedi @ 2026-02-25 18:55 ` Paul E. McKenney 0 siblings, 0 replies; 15+ messages in thread From: Paul E. McKenney @ 2026-02-25 18:55 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Mykyta Yatsenko, kkd, kernel-team On Wed, Feb 25, 2026 at 10:51:20AM -0800, Kumar Kartikeya Dwivedi wrote: > This assumption will always hold going forward, hence just remove the > various checks and assume it is true with a comment for the uninformed > reader. > > Cc: Paul E. McKenney <paulmck@kernel.org> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Just be *very* careful with backports. ;-) Thanx, Paul > --- > kernel/bpf/bpf_local_storage.c | 37 +++++++++++++++++----------------- > 1 file changed, 19 insertions(+), 18 deletions(-) > > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c > index 0825a92dcd6d..84fe6364c75d 100644 > --- a/kernel/bpf/bpf_local_storage.c > +++ b/kernel/bpf/bpf_local_storage.c > @@ -107,14 +107,12 @@ static void __bpf_local_storage_free_trace_rcu(struct rcu_head *rcu) > { > struct bpf_local_storage *local_storage; > > - /* If RCU Tasks Trace grace period implies RCU grace period, do > - * kfree(), else do kfree_rcu(). > + /* > + * RCU Tasks Trace grace period implies RCU grace period, do > + * kfree() directly. > */ > local_storage = container_of(rcu, struct bpf_local_storage, rcu); > - if (rcu_trace_implies_rcu_gp()) > - kfree(local_storage); > - else > - kfree_rcu(local_storage, rcu); > + kfree(local_storage); > } > > /* Handle use_kmalloc_nolock == false */ > @@ -138,10 +136,11 @@ static void bpf_local_storage_free_rcu(struct rcu_head *rcu) > > static void bpf_local_storage_free_trace_rcu(struct rcu_head *rcu) > { > - if (rcu_trace_implies_rcu_gp()) > - bpf_local_storage_free_rcu(rcu); > - else > - call_rcu(rcu, bpf_local_storage_free_rcu); > + /* > + * RCU Tasks Trace grace period implies RCU grace period, do > + * kfree() directly. > + */ > + bpf_local_storage_free_rcu(rcu); > } > > static void bpf_local_storage_free(struct bpf_local_storage *local_storage, > @@ -179,10 +178,11 @@ static void __bpf_selem_free_rcu(struct rcu_head *rcu) > /* rcu tasks trace callback for use_kmalloc_nolock == false */ > static void __bpf_selem_free_trace_rcu(struct rcu_head *rcu) > { > - if (rcu_trace_implies_rcu_gp()) > - __bpf_selem_free_rcu(rcu); > - else > - call_rcu(rcu, __bpf_selem_free_rcu); > + /* > + * RCU Tasks Trace grace period implies RCU grace period, do > + * kfree() directly. > + */ > + __bpf_selem_free_rcu(rcu); > } > > /* Handle use_kmalloc_nolock == false */ > @@ -214,10 +214,11 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu) > > static void bpf_selem_free_trace_rcu(struct rcu_head *rcu) > { > - if (rcu_trace_implies_rcu_gp()) > - bpf_selem_free_rcu(rcu); > - else > - call_rcu(rcu, bpf_selem_free_rcu); > + /* > + * RCU Tasks Trace grace period implies RCU grace period, do > + * kfree() directly. > + */ > + bpf_selem_free_rcu(rcu); > } > > void bpf_selem_free(struct bpf_local_storage_elem *selem, > -- > 2.47.3 > ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH bpf-next v1 4/4] selftests/bpf: Add tests for special fields races 2026-02-25 18:51 [PATCH bpf-next v1 0/4] Close race in freeing special fields and map value Kumar Kartikeya Dwivedi ` (2 preceding siblings ...) 2026-02-25 18:51 ` [PATCH bpf-next v1 3/4] bpf: Retire rcu_trace_implies_rcu_gp() from " Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 ` Kumar Kartikeya Dwivedi 2026-02-26 9:31 ` [syzbot ci] Re: Close race in freeing special fields and map value syzbot ci 4 siblings, 0 replies; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-02-25 18:51 UTC (permalink / raw) To: bpf Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Mykyta Yatsenko, kkd, kernel-team Add a couple of tests to ensure that the refcount drops to zero when we exercise the race where creation of a special field succeeds the logical bpf_obj_free_fields done when deleting an element. Prior to previous changes, the fields would be freed eagerly and repopulate and end up leaking, causing the reference to not drop down correctly. Running this test on a kernel without fixes will cause a hang in delete_module, since the module reference stays active due to the leaked kptr not dropping it. After the fixes tests succeed as expected. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> --- .../selftests/bpf/prog_tests/map_kptr_race.c | 155 ++++++++++++++++++ .../selftests/bpf/progs/map_kptr_race.c | 145 ++++++++++++++++ 2 files changed, 300 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/map_kptr_race.c create mode 100644 tools/testing/selftests/bpf/progs/map_kptr_race.c diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr_race.c b/tools/testing/selftests/bpf/prog_tests/map_kptr_race.c new file mode 100644 index 000000000000..941e3d9880ed --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/map_kptr_race.c @@ -0,0 +1,155 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */ +#include <test_progs.h> +#include <network_helpers.h> + +#include "map_kptr_race.skel.h" + +static int get_map_id(int map_fd) +{ + struct bpf_map_info info = {}; + __u32 len = sizeof(info); + + if (!ASSERT_OK(bpf_map_get_info_by_fd(map_fd, &info, &len), "get_map_info")) + return -1; + return info.id; +} + +static int read_refs(struct map_kptr_race *skel) +{ + LIBBPF_OPTS(bpf_test_run_opts, opts); + int ret; + + ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.count_ref), &opts); + if (!ASSERT_OK(ret, "count_ref run")) + return -1; + if (!ASSERT_OK(opts.retval, "count_ref retval")) + return -1; + return skel->bss->num_of_refs; +} + +static void test_htab_leak(void) +{ + LIBBPF_OPTS(bpf_test_run_opts, opts, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .repeat = 1, + ); + struct map_kptr_race *skel, *watcher; + int ret, map_id; + + skel = map_kptr_race__open_and_load(); + if (!ASSERT_OK_PTR(skel, "open_and_load")) + return; + + ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_htab_leak), &opts); + if (!ASSERT_OK(ret, "test_htab_leak run")) + goto out_skel; + if (!ASSERT_OK(opts.retval, "test_htab_leak retval")) + goto out_skel; + + map_id = get_map_id(bpf_map__fd(skel->maps.race_hash_map)); + if (!ASSERT_GE(map_id, 0, "map_id")) + goto out_skel; + + watcher = map_kptr_race__open_and_load(); + if (!ASSERT_OK_PTR(watcher, "watcher open_and_load")) + goto out_skel; + + watcher->bss->target_map_id = map_id; + watcher->links.map_put = bpf_program__attach(watcher->progs.map_put); + if (!ASSERT_OK_PTR(watcher->links.map_put, "attach fentry")) + goto out_watcher; + watcher->links.htab_map_free = bpf_program__attach(watcher->progs.htab_map_free); + if (!ASSERT_OK_PTR(watcher->links.htab_map_free, "attach fexit")) + goto out_watcher; + + map_kptr_race__destroy(skel); + skel = NULL; + + kern_sync_rcu(); + + while (!READ_ONCE(watcher->bss->map_freed)) + sched_yield(); + + ASSERT_EQ(watcher->bss->map_freed, 1, "map_freed"); + ASSERT_EQ(read_refs(watcher), 2, "htab refcount"); + +out_watcher: + map_kptr_race__destroy(watcher); +out_skel: + map_kptr_race__destroy(skel); +} + +static void test_sk_ls_leak(void) +{ + struct map_kptr_race *skel, *watcher; + int listen_fd = -1, client_fd = -1, map_id; + + skel = map_kptr_race__open_and_load(); + if (!ASSERT_OK_PTR(skel, "open_and_load")) + return; + + if (!ASSERT_OK(map_kptr_race__attach(skel), "attach")) + goto out_skel; + + listen_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0); + if (!ASSERT_GE(listen_fd, 0, "start_server")) + goto out_skel; + + client_fd = connect_to_fd(listen_fd, 0); + if (!ASSERT_GE(client_fd, 0, "connect_to_fd")) + goto out_skel; + + if (!ASSERT_EQ(skel->bss->sk_ls_leak_done, 1, "sk_ls_leak_done")) + goto out_skel; + + close(client_fd); + client_fd = -1; + close(listen_fd); + listen_fd = -1; + + map_id = get_map_id(bpf_map__fd(skel->maps.sk_ls_race_map)); + if (!ASSERT_GE(map_id, 0, "map_id")) + goto out_skel; + + watcher = map_kptr_race__open_and_load(); + if (!ASSERT_OK_PTR(watcher, "watcher open_and_load")) + goto out_skel; + + watcher->bss->target_map_id = map_id; + watcher->links.map_put = bpf_program__attach(watcher->progs.map_put); + if (!ASSERT_OK_PTR(watcher->links.map_put, "attach fentry")) + goto out_watcher; + watcher->links.sk_map_free = bpf_program__attach(watcher->progs.sk_map_free); + if (!ASSERT_OK_PTR(watcher->links.sk_map_free, "attach fexit")) + goto out_watcher; + + map_kptr_race__destroy(skel); + skel = NULL; + + kern_sync_rcu(); + + while (!READ_ONCE(watcher->bss->map_freed)) + sched_yield(); + + ASSERT_EQ(watcher->bss->map_freed, 1, "map_freed"); + ASSERT_EQ(read_refs(watcher), 2, "sk_ls refcount"); + +out_watcher: + map_kptr_race__destroy(watcher); +out_skel: + if (client_fd >= 0) + close(client_fd); + if (listen_fd >= 0) + close(listen_fd); + map_kptr_race__destroy(skel); +} + +void serial_test_map_kptr_race(void) +{ + if (test__start_subtest("htab_leak")) + test_htab_leak(); + if (test__start_subtest("sk_ls_leak")) + test_sk_ls_leak(); +} diff --git a/tools/testing/selftests/bpf/progs/map_kptr_race.c b/tools/testing/selftests/bpf/progs/map_kptr_race.c new file mode 100644 index 000000000000..1731f42bf5bc --- /dev/null +++ b/tools/testing/selftests/bpf/progs/map_kptr_race.c @@ -0,0 +1,145 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "../test_kmods/bpf_testmod_kfunc.h" + +struct map_value { + struct prog_test_ref_kfunc __kptr *ref_ptr; +}; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct map_value); + __uint(max_entries, 1); +} race_hash_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_SK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct map_value); +} sk_ls_race_map SEC(".maps"); + +int num_of_refs; +int sk_ls_leak_done; +int target_map_id; +int map_freed; + +SEC("tc") +int test_htab_leak(struct __sk_buff *skb) +{ + struct prog_test_ref_kfunc *p, *old; + struct map_value val = {}; + struct map_value *v; + int key = 0; + + if (bpf_map_update_elem(&race_hash_map, &key, &val, BPF_ANY)) + return 1; + + v = bpf_map_lookup_elem(&race_hash_map, &key); + if (!v) + return 2; + + p = bpf_kfunc_call_test_acquire(&(unsigned long){0}); + if (!p) + return 3; + old = bpf_kptr_xchg(&v->ref_ptr, p); + if (old) + bpf_kfunc_call_test_release(old); + + bpf_map_delete_elem(&race_hash_map, &key); + + p = bpf_kfunc_call_test_acquire(&(unsigned long){0}); + if (!p) + return 4; + old = bpf_kptr_xchg(&v->ref_ptr, p); + if (old) + bpf_kfunc_call_test_release(old); + + return 0; +} + +SEC("tp_btf/inet_sock_set_state") +int BPF_PROG(test_sk_ls_leak, struct sock *sk, int oldstate, int newstate) +{ + struct prog_test_ref_kfunc *p, *old; + struct map_value *v; + + if (newstate != BPF_TCP_SYN_SENT) + return 0; + + if (sk_ls_leak_done) + return 0; + + v = bpf_sk_storage_get(&sk_ls_race_map, sk, NULL, + BPF_SK_STORAGE_GET_F_CREATE); + if (!v) + return 0; + + p = bpf_kfunc_call_test_acquire(&(unsigned long){0}); + if (!p) + return 0; + old = bpf_kptr_xchg(&v->ref_ptr, p); + if (old) + bpf_kfunc_call_test_release(old); + + bpf_sk_storage_delete(&sk_ls_race_map, sk); + + p = bpf_kfunc_call_test_acquire(&(unsigned long){0}); + if (!p) + return 0; + old = bpf_kptr_xchg(&v->ref_ptr, p); + if (old) + bpf_kfunc_call_test_release(old); + + sk_ls_leak_done = 1; + return 0; +} + +long target_map_ptr; + +SEC("fentry/bpf_map_put") +int BPF_PROG(map_put, struct bpf_map *map) +{ + if (target_map_id && map->id == (u32)target_map_id) + target_map_ptr = (long)map; + return 0; +} + +SEC("fexit/htab_map_free") +int BPF_PROG(htab_map_free, struct bpf_map *map) +{ + if (target_map_ptr && (long)map == target_map_ptr) + map_freed = 1; + return 0; +} + +SEC("fexit/bpf_sk_storage_map_free") +int BPF_PROG(sk_map_free, struct bpf_map *map) +{ + if (target_map_ptr && (long)map == target_map_ptr) + map_freed = 1; + return 0; +} + +SEC("syscall") +int count_ref(void *ctx) +{ + struct prog_test_ref_kfunc *p; + unsigned long arg = 0; + + p = bpf_kfunc_call_test_acquire(&arg); + if (!p) + return 1; + + num_of_refs = p->cnt.refs.counter; + + bpf_kfunc_call_test_release(p); + return 0; +} + +char _license[] SEC("license") = "GPL"; -- 2.47.3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [syzbot ci] Re: Close race in freeing special fields and map value 2026-02-25 18:51 [PATCH bpf-next v1 0/4] Close race in freeing special fields and map value Kumar Kartikeya Dwivedi ` (3 preceding siblings ...) 2026-02-25 18:51 ` [PATCH bpf-next v1 4/4] selftests/bpf: Add tests for special fields races Kumar Kartikeya Dwivedi @ 2026-02-26 9:31 ` syzbot ci 2026-02-26 15:05 ` Kumar Kartikeya Dwivedi ` (3 more replies) 4 siblings, 4 replies; 15+ messages in thread From: syzbot ci @ 2026-02-26 9:31 UTC (permalink / raw) To: andrii, ast, bpf, daniel, eddyz87, kernel-team, kkd, martin.lau, memxor, paulmck, yatsenko Cc: syzbot, syzkaller-bugs syzbot ci has tested the following series [v1] Close race in freeing special fields and map value https://lore.kernel.org/all/20260225185121.2057388-1-memxor@gmail.com * [PATCH bpf-next v1 1/4] bpf: Register dtor for freeing special fields * [PATCH bpf-next v1 2/4] bpf: Delay freeing fields in local storage * [PATCH bpf-next v1 3/4] bpf: Retire rcu_trace_implies_rcu_gp() from local storage * [PATCH bpf-next v1 4/4] selftests/bpf: Add tests for special fields races and found the following issue: KASAN: slab-use-after-free Read in free_all Full report is available here: https://ci.syzbot.org/series/ad9c7ce6-d861-4afe-821a-e8fae6120b12 *** KASAN: slab-use-after-free Read in free_all tree: bpf-next URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git base: f620af11c27b8ec9994a39fe968aa778112d1566 arch: amd64 compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8 config: https://ci.syzbot.org/builds/b2735b0b-963d-41a7-bb33-db10494a24e1/config syz repro: https://ci.syzbot.org/findings/3ee61f99-7d80-4986-a0b1-492ffb6b1c46/syz_repro ================================================================== BUG: KASAN: slab-use-after-free in free_all+0x84/0x140 kernel/bpf/memalloc.c:271 Read of size 8 at addr ffff888112ee9260 by task swapper/0/0 CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xba/0x230 mm/kasan/report.c:482 kasan_report+0x117/0x150 mm/kasan/report.c:595 free_all+0x84/0x140 kernel/bpf/memalloc.c:271 do_call_rcu_ttrace+0x385/0x400 kernel/bpf/memalloc.c:315 __free_by_rcu+0x23b/0x3f0 kernel/bpf/memalloc.c:381 rcu_do_batch kernel/rcu/tree.c:2617 [inline] rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869 handle_softirqs+0x22a/0x870 kernel/softirq.c:622 __do_softirq kernel/softirq.c:656 [inline] invoke_softirq kernel/softirq.c:496 [inline] __irq_exit_rcu+0x5f/0x150 kernel/softirq.c:723 irq_exit_rcu+0x9/0x30 kernel/softirq.c:739 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1056 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1056 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697 RIP: 0010:pv_native_safe_halt+0xf/0x20 arch/x86/kernel/paravirt.c:63 Code: 8e 6d 02 c3 cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 43 3d 1c 00 fb f4 <e9> 7c ea 02 00 cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 RSP: 0018:ffffffff8e407dc0 EFLAGS: 00000202 RAX: 00000000000d2297 RBX: ffffffff819a80ad RCX: 0000000080000001 RDX: 0000000000000001 RSI: ffffffff8def22fa RDI: ffffffff8c27a800 RBP: ffffffff8e407eb0 R08: ffff88812103395b R09: 1ffff1102420672b R10: dffffc0000000000 R11: ffffed102420672c R12: ffffffff90117db0 R13: 1ffffffff1c929d8 R14: 0000000000000000 R15: 0000000000000000 arch_safe_halt arch/x86/kernel/process.c:766 [inline] default_idle+0x9/0x20 arch/x86/kernel/process.c:767 default_idle_call+0x72/0xb0 kernel/sched/idle.c:122 cpuidle_idle_call kernel/sched/idle.c:191 [inline] do_idle+0x1bd/0x500 kernel/sched/idle.c:332 cpu_startup_entry+0x43/0x60 kernel/sched/idle.c:430 rest_init+0x2de/0x300 init/main.c:760 start_kernel+0x385/0x3d0 init/main.c:1210 x86_64_start_reservations+0x24/0x30 arch/x86/kernel/head64.c:310 x86_64_start_kernel+0x143/0x1c0 arch/x86/kernel/head64.c:291 common_startup_64+0x13e/0x147 </TASK> Allocated by task 5961: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 poison_kmalloc_redzone mm/kasan/common.c:398 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 kasan_kmalloc include/linux/kasan.h:263 [inline] __do_kmalloc_node mm/slub.c:5219 [inline] __kmalloc_node_noprof+0x4e0/0x7c0 mm/slub.c:5225 kmalloc_node_noprof include/linux/slab.h:1093 [inline] __bpf_map_area_alloc kernel/bpf/syscall.c:398 [inline] bpf_map_area_alloc+0x64/0x170 kernel/bpf/syscall.c:411 trie_alloc+0x14f/0x340 kernel/bpf/lpm_trie.c:588 map_create+0xafd/0x16a0 kernel/bpf/syscall.c:1507 __sys_bpf+0x6e1/0x950 kernel/bpf/syscall.c:6210 __do_sys_bpf kernel/bpf/syscall.c:6341 [inline] __se_sys_bpf kernel/bpf/syscall.c:6339 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6339 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 83: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 poison_slab_object mm/kasan/common.c:253 [inline] __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:2687 [inline] slab_free mm/slub.c:6124 [inline] kfree+0x1c1/0x630 mm/slub.c:6442 bpf_map_free kernel/bpf/syscall.c:892 [inline] bpf_map_free_deferred+0x217/0x460 kernel/bpf/syscall.c:919 process_one_work kernel/workqueue.c:3275 [inline] process_scheduled_works+0xb02/0x1830 kernel/workqueue.c:3358 worker_thread+0xa50/0xfc0 kernel/workqueue.c:3439 kthread+0x388/0x470 kernel/kthread.c:467 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 Last potentially related work creation: kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556 insert_work+0x3d/0x330 kernel/workqueue.c:2199 __queue_work+0xd03/0x1020 kernel/workqueue.c:2354 queue_work_on+0x106/0x1d0 kernel/workqueue.c:2405 bpf_map_put_with_uref kernel/bpf/syscall.c:975 [inline] bpf_map_release+0x127/0x140 kernel/bpf/syscall.c:985 __fput+0x44f/0xa70 fs/file_table.c:469 task_work_run+0x1d9/0x270 kernel/task_work.c:233 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] __exit_to_user_mode_loop kernel/entry/common.c:67 [inline] exit_to_user_mode_loop+0xed/0x480 kernel/entry/common.c:98 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline] syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline] do_syscall_64+0x32d/0xf80 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff888112ee9000 which belongs to the cache kmalloc-cg-1k of size 1024 The buggy address is located 608 bytes inside of freed 1024-byte region [ffff888112ee9000, ffff888112ee9400) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x112ee8 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 memcg:ffff888106b17c01 flags: 0x17ff00000000040(head|node=0|zone=2|lastcpupid=0x7ff) page_type: f5(slab) raw: 017ff00000000040 ffff88810005d280 dead000000000100 dead000000000122 raw: 0000000000000000 0000000800100010 00000000f5000000 ffff888106b17c01 head: 017ff00000000040 ffff88810005d280 dead000000000100 dead000000000122 head: 0000000000000000 0000000800100010 00000000f5000000 ffff888106b17c01 head: 017ff00000000003 ffffea00044bba01 00000000ffffffff 00000000ffffffff head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5550, tgid 5550 (dhcpcd), ts 37867090520, free_ts 37834172120 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x231/0x280 mm/page_alloc.c:1889 prep_new_page mm/page_alloc.c:1897 [inline] get_page_from_freelist+0x24dc/0x2580 mm/page_alloc.c:3962 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5250 alloc_slab_page mm/slub.c:3255 [inline] allocate_slab+0x77/0x660 mm/slub.c:3444 new_slab mm/slub.c:3502 [inline] refill_objects+0x331/0x3c0 mm/slub.c:7134 refill_sheaf mm/slub.c:2804 [inline] __pcs_replace_empty_main+0x2b9/0x620 mm/slub.c:4578 alloc_from_pcs mm/slub.c:4681 [inline] slab_alloc_node mm/slub.c:4815 [inline] __do_kmalloc_node mm/slub.c:5218 [inline] __kmalloc_node_track_caller_noprof+0x572/0x7b0 mm/slub.c:5327 kmalloc_reserve net/core/skbuff.c:635 [inline] __alloc_skb+0x2c1/0x7d0 net/core/skbuff.c:713 alloc_skb include/linux/skbuff.h:1383 [inline] alloc_skb_with_frags+0xca/0x890 net/core/skbuff.c:6750 sock_alloc_send_pskb+0x878/0x990 net/core/sock.c:2995 unix_dgram_sendmsg+0x460/0x18e0 net/unix/af_unix.c:2125 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] sock_write_iter+0x503/0x550 net/socket.c:1195 do_iter_readv_writev+0x619/0x8c0 fs/read_write.c:-1 vfs_writev+0x33c/0x990 fs/read_write.c:1059 do_writev+0x154/0x2e0 fs/read_write.c:1105 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 page last free pid 5678 tgid 5678 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] __free_pages_prepare mm/page_alloc.c:1433 [inline] __free_frozen_pages+0xc2b/0xdb0 mm/page_alloc.c:2978 __slab_free+0x263/0x2b0 mm/slub.c:5532 qlink_free mm/kasan/quarantine.c:163 [inline] qlist_free_all+0x97/0x100 mm/kasan/quarantine.c:179 kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286 __kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350 kasan_slab_alloc include/linux/kasan.h:253 [inline] slab_post_alloc_hook mm/slub.c:4501 [inline] slab_alloc_node mm/slub.c:4830 [inline] __do_kmalloc_node mm/slub.c:5218 [inline] __kmalloc_noprof+0x316/0x760 mm/slub.c:5231 kmalloc_noprof include/linux/slab.h:966 [inline] tomoyo_realpath_from_path+0xe3/0x5d0 security/tomoyo/realpath.c:251 tomoyo_realpath_nofollow+0xc3/0x110 security/tomoyo/realpath.c:304 tomoyo_find_next_domain+0x259/0x1aa0 security/tomoyo/domain.c:731 tomoyo_bprm_check_security+0x11b/0x180 security/tomoyo/tomoyo.c:102 security_bprm_check+0x85/0x240 security/security.c:794 search_binary_handler fs/exec.c:1654 [inline] exec_binprm fs/exec.c:1696 [inline] bprm_execve+0x896/0x1460 fs/exec.c:1748 do_execveat_common+0x50d/0x690 fs/exec.c:1846 __do_sys_execve fs/exec.c:1930 [inline] __se_sys_execve fs/exec.c:1924 [inline] __x64_sys_execve+0x97/0xc0 fs/exec.c:1924 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Memory state around the buggy address: ffff888112ee9100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888112ee9180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888112ee9200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888112ee9280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888112ee9300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== ---------------- Code disassembly (best guess): 0: 8e 6d 02 mov 0x2(%rbp),%gs 3: c3 ret 4: cc int3 5: cc int3 6: cc int3 7: cc int3 8: cc int3 9: cc int3 a: cc int3 b: 90 nop c: 90 nop d: 90 nop e: 90 nop f: 90 nop 10: 90 nop 11: 90 nop 12: 90 nop 13: 90 nop 14: 90 nop 15: 90 nop 16: 90 nop 17: 90 nop 18: 90 nop 19: 90 nop 1a: 90 nop 1b: f3 0f 1e fa endbr64 1f: eb 07 jmp 0x28 21: 0f 00 2d 43 3d 1c 00 verw 0x1c3d43(%rip) # 0x1c3d6b 28: fb sti 29: f4 hlt * 2a: e9 7c ea 02 00 jmp 0x2eaab <-- trapping instruction 2f: cc int3 30: cc int3 31: cc int3 32: cc int3 33: cc int3 34: cc int3 35: cc int3 36: cc int3 37: cc int3 38: cc int3 39: cc int3 3a: cc int3 3b: 90 nop 3c: 90 nop 3d: 90 nop 3e: 90 nop 3f: 90 nop *** If these findings have caused you to resend the series or submit a separate fix, please add the following tag to your commit message: Tested-by: syzbot@syzkaller.appspotmail.com --- This report is generated by a bot. It may contain errors. syzbot ci engineers can be reached at syzkaller@googlegroups.com. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [syzbot ci] Re: Close race in freeing special fields and map value 2026-02-26 9:31 ` [syzbot ci] Re: Close race in freeing special fields and map value syzbot ci @ 2026-02-26 15:05 ` Kumar Kartikeya Dwivedi 2026-02-26 17:36 ` Kumar Kartikeya Dwivedi ` (2 subsequent siblings) 3 siblings, 0 replies; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-02-26 15:05 UTC (permalink / raw) To: syzbot ci Cc: andrii, ast, bpf, daniel, eddyz87, kernel-team, kkd, martin.lau, paulmck, yatsenko, syzbot, syzkaller-bugs On Thu, 26 Feb 2026 at 10:31, syzbot ci <syzbot+ci3826af8b4f91bf97@syzkaller.appspotmail.com> wrote: > > syzbot ci has tested the following series > > [v1] Close race in freeing special fields and map value > https://lore.kernel.org/all/20260225185121.2057388-1-memxor@gmail.com > * [PATCH bpf-next v1 1/4] bpf: Register dtor for freeing special fields > * [PATCH bpf-next v1 2/4] bpf: Delay freeing fields in local storage > * [PATCH bpf-next v1 3/4] bpf: Retire rcu_trace_implies_rcu_gp() from local storage > * [PATCH bpf-next v1 4/4] selftests/bpf: Add tests for special fields races > > and found the following issue: > KASAN: slab-use-after-free Read in free_all > > Full report is available here: > https://ci.syzbot.org/series/ad9c7ce6-d861-4afe-821a-e8fae6120b12 > > *** > > KASAN: slab-use-after-free Read in free_all > > tree: bpf-next > URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git > base: f620af11c27b8ec9994a39fe968aa778112d1566 > arch: amd64 > compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8 > config: https://ci.syzbot.org/builds/b2735b0b-963d-41a7-bb33-db10494a24e1/config > syz repro: https://ci.syzbot.org/findings/3ee61f99-7d80-4986-a0b1-492ffb6b1c46/syz_repro > > ================================================================== > BUG: KASAN: slab-use-after-free in free_all+0x84/0x140 kernel/bpf/memalloc.c:271 > Read of size 8 at addr ffff888112ee9260 by task swapper/0/0 > > CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full) > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > Call Trace: > <IRQ> > dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 > print_address_description mm/kasan/report.c:378 [inline] > print_report+0xba/0x230 mm/kasan/report.c:482 > kasan_report+0x117/0x150 mm/kasan/report.c:595 > free_all+0x84/0x140 kernel/bpf/memalloc.c:271 > do_call_rcu_ttrace+0x385/0x400 kernel/bpf/memalloc.c:315 > __free_by_rcu+0x23b/0x3f0 kernel/bpf/memalloc.c:381 > rcu_do_batch kernel/rcu/tree.c:2617 [inline] > rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869 > handle_softirqs+0x22a/0x870 kernel/softirq.c:622 > __do_softirq kernel/softirq.c:656 [inline] > invoke_softirq kernel/softirq.c:496 [inline] > __irq_exit_rcu+0x5f/0x150 kernel/softirq.c:723 > irq_exit_rcu+0x9/0x30 kernel/softirq.c:739 > instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1056 [inline] > sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1056 > </IRQ> > <TASK> > asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697 > RIP: 0010:pv_native_safe_halt+0xf/0x20 arch/x86/kernel/paravirt.c:63 > Code: 8e 6d 02 c3 cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 43 3d 1c 00 fb f4 <e9> 7c ea 02 00 cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 > RSP: 0018:ffffffff8e407dc0 EFLAGS: 00000202 > RAX: 00000000000d2297 RBX: ffffffff819a80ad RCX: 0000000080000001 > RDX: 0000000000000001 RSI: ffffffff8def22fa RDI: ffffffff8c27a800 > RBP: ffffffff8e407eb0 R08: ffff88812103395b R09: 1ffff1102420672b > R10: dffffc0000000000 R11: ffffed102420672c R12: ffffffff90117db0 > R13: 1ffffffff1c929d8 R14: 0000000000000000 R15: 0000000000000000 > arch_safe_halt arch/x86/kernel/process.c:766 [inline] > default_idle+0x9/0x20 arch/x86/kernel/process.c:767 > default_idle_call+0x72/0xb0 kernel/sched/idle.c:122 > cpuidle_idle_call kernel/sched/idle.c:191 [inline] > do_idle+0x1bd/0x500 kernel/sched/idle.c:332 > cpu_startup_entry+0x43/0x60 kernel/sched/idle.c:430 > rest_init+0x2de/0x300 init/main.c:760 > start_kernel+0x385/0x3d0 init/main.c:1210 > x86_64_start_reservations+0x24/0x30 arch/x86/kernel/head64.c:310 > x86_64_start_kernel+0x143/0x1c0 arch/x86/kernel/head64.c:291 > common_startup_64+0x13e/0x147 > </TASK> > > Allocated by task 5961: > kasan_save_stack mm/kasan/common.c:57 [inline] > kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 > poison_kmalloc_redzone mm/kasan/common.c:398 [inline] > __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 > kasan_kmalloc include/linux/kasan.h:263 [inline] > __do_kmalloc_node mm/slub.c:5219 [inline] > __kmalloc_node_noprof+0x4e0/0x7c0 mm/slub.c:5225 > kmalloc_node_noprof include/linux/slab.h:1093 [inline] > __bpf_map_area_alloc kernel/bpf/syscall.c:398 [inline] > bpf_map_area_alloc+0x64/0x170 kernel/bpf/syscall.c:411 > trie_alloc+0x14f/0x340 kernel/bpf/lpm_trie.c:588 > map_create+0xafd/0x16a0 kernel/bpf/syscall.c:1507 > __sys_bpf+0x6e1/0x950 kernel/bpf/syscall.c:6210 > __do_sys_bpf kernel/bpf/syscall.c:6341 [inline] > __se_sys_bpf kernel/bpf/syscall.c:6339 [inline] > __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6339 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Freed by task 83: > kasan_save_stack mm/kasan/common.c:57 [inline] > kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 > kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 > poison_slab_object mm/kasan/common.c:253 [inline] > __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 > kasan_slab_free include/linux/kasan.h:235 [inline] > slab_free_hook mm/slub.c:2687 [inline] > slab_free mm/slub.c:6124 [inline] > kfree+0x1c1/0x630 mm/slub.c:6442 > bpf_map_free kernel/bpf/syscall.c:892 [inline] > bpf_map_free_deferred+0x217/0x460 kernel/bpf/syscall.c:919 > process_one_work kernel/workqueue.c:3275 [inline] > process_scheduled_works+0xb02/0x1830 kernel/workqueue.c:3358 > worker_thread+0xa50/0xfc0 kernel/workqueue.c:3439 > kthread+0x388/0x470 kernel/kthread.c:467 > ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158 > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 > > Last potentially related work creation: > kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57 > kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556 > insert_work+0x3d/0x330 kernel/workqueue.c:2199 > __queue_work+0xd03/0x1020 kernel/workqueue.c:2354 > queue_work_on+0x106/0x1d0 kernel/workqueue.c:2405 > bpf_map_put_with_uref kernel/bpf/syscall.c:975 [inline] > bpf_map_release+0x127/0x140 kernel/bpf/syscall.c:985 > __fput+0x44f/0xa70 fs/file_table.c:469 > task_work_run+0x1d9/0x270 kernel/task_work.c:233 > resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] > __exit_to_user_mode_loop kernel/entry/common.c:67 [inline] > exit_to_user_mode_loop+0xed/0x480 kernel/entry/common.c:98 > __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline] > syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline] > syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline] > do_syscall_64+0x32d/0xf80 arch/x86/entry/syscall_64.c:100 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > [...] This is because the 'ma' is embedded in map struct, so we have an optimization to speed up map destruction without waiting for pending callbacks' rcu_barrier() by copying it out. Typically no callbacks are in progress by the time map is gone, so it wasn't seen in tests, but UAF can happen when we kmemdup() and proceed with freeing the bpf_map, such that the 'ma' pointer becomes invalid. Just storing dtor and dtor ctx directly is not ok either, since it accesses the map struct and the corresponding btf_record to decide how to free fields. It will cause the same UAF. The right way for me appears to be to duplicate the btf_record (it's not very big anyway) from the map and store it as the dtor ctx. To remove dependency on ma->dtor, just store dtor directly in the bpf_mem_caches along with dtor ctx. Then the dtor can take care of everything else. ^ permalink raw reply [flat|nested] 15+ messages in thread
* [syzbot ci] Re: Close race in freeing special fields and map value 2026-02-26 9:31 ` [syzbot ci] Re: Close race in freeing special fields and map value syzbot ci 2026-02-26 15:05 ` Kumar Kartikeya Dwivedi @ 2026-02-26 17:36 ` Kumar Kartikeya Dwivedi 2026-02-26 17:38 ` syzbot ci 2026-02-26 18:28 ` Kumar Kartikeya Dwivedi 2026-02-26 18:36 ` [PATCH] please work syzbot Kumar Kartikeya Dwivedi 3 siblings, 1 reply; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-02-26 17:36 UTC (permalink / raw) To: syzbot+ci3826af8b4f91bf97 Cc: andrii, ast, bpf, daniel, eddyz87, kernel-team, kkd, martin.lau, memxor, paulmck, syzbot, syzkaller-bugs, yatsenko #syz test --- diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index 7517eaf94ac9..4ce0d27f8ea2 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -14,7 +14,7 @@ struct bpf_mem_alloc { struct obj_cgroup *objcg; bool percpu; struct work_struct work; - void (*dtor)(void *obj, void *ctx); + void (*dtor_ctx_free)(void *ctx); void *dtor_ctx; }; @@ -35,7 +35,9 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma, struct obj_cgroup *objcg int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size); void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, - void (*dtor)(void *obj, void *ctx), void *ctx); + void (*dtor)(void *obj, void *ctx), + void (*dtor_ctx_free)(void *ctx), + void *ctx); /* Check the allocation size for kmalloc equivalent allocator */ int bpf_mem_alloc_check_size(bool percpu, size_t size); diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 74f7a6f44c50..582f0192b7e1 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -125,6 +125,11 @@ struct htab_elem { char key[] __aligned(8); }; +struct htab_btf_record { + struct btf_record *record; + u32 key_size; +}; + static inline bool htab_is_prealloc(const struct bpf_htab *htab) { return !(htab->map.map_flags & BPF_F_NO_PREALLOC); @@ -459,28 +464,80 @@ static int htab_map_alloc_check(union bpf_attr *attr) static void htab_mem_dtor(void *obj, void *ctx) { - struct bpf_htab *htab = ctx; + struct htab_btf_record *hrec = ctx; struct htab_elem *elem = obj; void *map_value; - if (IS_ERR_OR_NULL(htab->map.record)) + if (IS_ERR_OR_NULL(hrec->record)) return; - map_value = htab_elem_value(elem, htab->map.key_size); - bpf_obj_free_fields(htab->map.record, map_value); + map_value = htab_elem_value(elem, hrec->key_size); + bpf_obj_free_fields(hrec->record, map_value); } static void htab_pcpu_mem_dtor(void *obj, void *ctx) { - struct bpf_htab *htab = ctx; void __percpu *pptr = *(void __percpu **)obj; + struct htab_btf_record *hrec = ctx; int cpu; - if (IS_ERR_OR_NULL(htab->map.record)) + if (IS_ERR_OR_NULL(hrec->record)) return; for_each_possible_cpu(cpu) - bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu)); + bpf_obj_free_fields(hrec->record, per_cpu_ptr(pptr, cpu)); +} + +static void htab_dtor_ctx_free(void *ctx) +{ + struct htab_btf_record *hrec = ctx; + + btf_record_free(hrec->record); + kfree(ctx); +} + +static int htab_set_dtor(const struct bpf_htab *htab, void (*dtor)(void *, void *)) +{ + u32 key_size = htab->map.key_size; + const struct bpf_mem_alloc *ma; + struct htab_btf_record *hrec; + int err; + + /* No need for dtors. */ + if (IS_ERR_OR_NULL(htab->map.record)) + return 0; + + hrec = kzalloc(sizeof(*hrec), GFP_KERNEL); + if (!hrec) + return -ENOMEM; + hrec->key_size = key_size; + hrec->record = btf_record_dup(htab->map.record); + if (IS_ERR(hrec->record)) { + err = PTR_ERR(hrec->record); + kfree(hrec); + return err; + } + ma = htab_is_percpu(htab) ? &htab->pcpu_ma : &htab->ma; + /* Kinda sad, but cast away const-ness since we change ma->dtor. */ + bpf_mem_alloc_set_dtor((struct bpf_mem_alloc *)ma, dtor, htab_dtor_ctx_free, hrec); + return 0; +} + +static int htab_map_check_btf(const struct bpf_map *map, const struct btf *btf, + const struct btf_type *key_type, const struct btf_type *value_type) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + + if (htab_is_prealloc(htab)) + return 0; + /* + * We must set the dtor using this callback, as map's BTF record is not + * populated in htab_map_alloc(), so it will always appear as NULL. + */ + if (htab_is_percpu(htab)) + return htab_set_dtor(htab, htab_pcpu_mem_dtor); + else + return htab_set_dtor(htab, htab_mem_dtor); } static struct bpf_map *htab_map_alloc(union bpf_attr *attr) @@ -595,17 +652,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) round_up(htab->map.value_size, 8), true); if (err) goto free_map_locked; - /* See comment below */ - bpf_mem_alloc_set_dtor(&htab->pcpu_ma, htab_pcpu_mem_dtor, htab); - } else { - /* - * Register the dtor unconditionally. map->record is - * set by map_create() after map_alloc() returns, so it - * is always NULL at this point. The dtor checks - * IS_ERR_OR_NULL(htab->map.record) and becomes a no-op - * for maps without special fields. - */ - bpf_mem_alloc_set_dtor(&htab->ma, htab_mem_dtor, htab); } } @@ -2318,6 +2364,7 @@ const struct bpf_map_ops htab_map_ops = { .map_seq_show_elem = htab_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab), .map_btf_id = &htab_map_btf_ids[0], @@ -2340,6 +2387,7 @@ const struct bpf_map_ops htab_lru_map_ops = { .map_seq_show_elem = htab_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_lru), .map_btf_id = &htab_map_btf_ids[0], @@ -2519,6 +2567,7 @@ const struct bpf_map_ops htab_percpu_map_ops = { .map_seq_show_elem = htab_percpu_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_percpu), .map_btf_id = &htab_map_btf_ids[0], @@ -2539,6 +2588,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = { .map_seq_show_elem = htab_percpu_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_lru_percpu), .map_btf_id = &htab_map_btf_ids[0], diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 137e855c718b..682a9f34214b 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -102,7 +102,8 @@ struct bpf_mem_cache { int percpu_size; bool draining; struct bpf_mem_cache *tgt; - struct bpf_mem_alloc *ma; + void (*dtor)(void *obj, void *ctx); + void *dtor_ctx; /* list of objects to be freed after RCU GP */ struct llist_head free_by_rcu; @@ -261,15 +262,14 @@ static void free_one(void *obj, bool percpu) kfree(obj); } -static int free_all(struct llist_node *llnode, bool percpu, - struct bpf_mem_alloc *ma) +static int free_all(struct bpf_mem_cache *c, struct llist_node *llnode, bool percpu) { struct llist_node *pos, *t; int cnt = 0; llist_for_each_safe(pos, t, llnode) { - if (ma->dtor) - ma->dtor((void *)pos + LLIST_NODE_SZ, ma->dtor_ctx); + if (c->dtor) + c->dtor((void *)pos + LLIST_NODE_SZ, c->dtor_ctx); free_one(pos, percpu); cnt++; } @@ -280,7 +280,7 @@ static void __free_rcu(struct rcu_head *head) { struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace); - free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size, c->ma); + free_all(c, llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size); atomic_set(&c->call_rcu_ttrace_in_progress, 0); } @@ -312,7 +312,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c) if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)) { if (unlikely(READ_ONCE(c->draining))) { llnode = llist_del_all(&c->free_by_rcu_ttrace); - free_all(llnode, !!c->percpu_size, c->ma); + free_all(c, llnode, !!c->percpu_size); } return; } @@ -421,7 +421,7 @@ static void check_free_by_rcu(struct bpf_mem_cache *c) dec_active(c, &flags); if (unlikely(READ_ONCE(c->draining))) { - free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size, c->ma); + free_all(c, llist_del_all(&c->waiting_for_gp), !!c->percpu_size); atomic_set(&c->call_rcu_in_progress, 0); } else { call_rcu_hurry(&c->rcu, __free_by_rcu); @@ -546,7 +546,6 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); } @@ -569,7 +568,6 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); @@ -622,7 +620,6 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); @@ -631,7 +628,7 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) return 0; } -static void drain_mem_cache(struct bpf_mem_cache *c, struct bpf_mem_alloc *ma) +static void drain_mem_cache(struct bpf_mem_cache *c) { bool percpu = !!c->percpu_size; @@ -642,13 +639,13 @@ static void drain_mem_cache(struct bpf_mem_cache *c, struct bpf_mem_alloc *ma) * Except for waiting_for_gp_ttrace list, there are no concurrent operations * on these lists, so it is safe to use __llist_del_all(). */ - free_all(llist_del_all(&c->free_by_rcu_ttrace), percpu, ma); - free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu, ma); - free_all(__llist_del_all(&c->free_llist), percpu, ma); - free_all(__llist_del_all(&c->free_llist_extra), percpu, ma); - free_all(__llist_del_all(&c->free_by_rcu), percpu, ma); - free_all(__llist_del_all(&c->free_llist_extra_rcu), percpu, ma); - free_all(llist_del_all(&c->waiting_for_gp), percpu, ma); + free_all(c, llist_del_all(&c->free_by_rcu_ttrace), percpu); + free_all(c, llist_del_all(&c->waiting_for_gp_ttrace), percpu); + free_all(c, __llist_del_all(&c->free_llist), percpu); + free_all(c, __llist_del_all(&c->free_llist_extra), percpu); + free_all(c, __llist_del_all(&c->free_by_rcu), percpu); + free_all(c, __llist_del_all(&c->free_llist_extra_rcu), percpu); + free_all(c, llist_del_all(&c->waiting_for_gp), percpu); } static void check_mem_cache(struct bpf_mem_cache *c) @@ -687,6 +684,9 @@ static void check_leaked_objs(struct bpf_mem_alloc *ma) static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma) { + /* We can free dtor ctx only once all callbacks are done using it. */ + if (ma->dtor_ctx_free) + ma->dtor_ctx_free(ma->dtor_ctx); check_leaked_objs(ma); free_percpu(ma->cache); free_percpu(ma->caches); @@ -758,7 +758,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) c = per_cpu_ptr(ma->cache, cpu); WRITE_ONCE(c->draining, true); irq_work_sync(&c->refill_work); - drain_mem_cache(c, ma); + drain_mem_cache(c); rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } @@ -773,7 +773,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) c = &cc->cache[i]; WRITE_ONCE(c->draining, true); irq_work_sync(&c->refill_work); - drain_mem_cache(c, ma); + drain_mem_cache(c); rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } @@ -1022,9 +1022,31 @@ int bpf_mem_alloc_check_size(bool percpu, size_t size) return 0; } -void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, - void (*dtor)(void *obj, void *ctx), void *ctx) +void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, void (*dtor)(void *obj, void *ctx), + void (*dtor_ctx_free)(void *ctx), void *ctx) { - ma->dtor = dtor; + struct bpf_mem_caches *cc; + struct bpf_mem_cache *c; + int cpu, i; + + ma->dtor_ctx_free = dtor_ctx_free; ma->dtor_ctx = ctx; + + if (ma->cache) { + for_each_possible_cpu(cpu) { + c = per_cpu_ptr(ma->cache, cpu); + c->dtor = dtor; + c->dtor_ctx = ctx; + } + } + if (ma->caches) { + for_each_possible_cpu(cpu) { + cc = per_cpu_ptr(ma->caches, cpu); + for (i = 0; i < NUM_CACHES; i++) { + c = &cc->cache[i]; + c->dtor = dtor; + c->dtor_ctx = ctx; + } + } + } } ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [syzbot ci] Re: Close race in freeing special fields and map value 2026-02-26 17:36 ` Kumar Kartikeya Dwivedi @ 2026-02-26 17:38 ` syzbot ci 0 siblings, 0 replies; 15+ messages in thread From: syzbot ci @ 2026-02-26 17:38 UTC (permalink / raw) To: memxor Cc: andrii, ast, bpf, daniel, eddyz87, kernel-team, kkd, martin.lau, memxor, paulmck, syzbot, syzkaller-bugs, yatsenko Unknown command ^ permalink raw reply [flat|nested] 15+ messages in thread
* [syzbot ci] Re: Close race in freeing special fields and map value 2026-02-26 9:31 ` [syzbot ci] Re: Close race in freeing special fields and map value syzbot ci 2026-02-26 15:05 ` Kumar Kartikeya Dwivedi 2026-02-26 17:36 ` Kumar Kartikeya Dwivedi @ 2026-02-26 18:28 ` Kumar Kartikeya Dwivedi 2026-02-26 18:30 ` syzbot ci 2026-02-26 18:36 ` [PATCH] please work syzbot Kumar Kartikeya Dwivedi 3 siblings, 1 reply; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-02-26 18:28 UTC (permalink / raw) To: syzbot+ci3826af8b4f91bf97 Cc: andrii, ast, bpf, daniel, eddyz87, kernel-team, kkd, martin.lau, memxor, paulmck, syzbot, syzkaller-bugs, yatsenko #syz test: diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index 7517eaf94ac9..4ce0d27f8ea2 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -14,7 +14,7 @@ struct bpf_mem_alloc { struct obj_cgroup *objcg; bool percpu; struct work_struct work; - void (*dtor)(void *obj, void *ctx); + void (*dtor_ctx_free)(void *ctx); void *dtor_ctx; }; @@ -35,7 +35,9 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma, struct obj_cgroup *objcg int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size); void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, - void (*dtor)(void *obj, void *ctx), void *ctx); + void (*dtor)(void *obj, void *ctx), + void (*dtor_ctx_free)(void *ctx), + void *ctx); /* Check the allocation size for kmalloc equivalent allocator */ int bpf_mem_alloc_check_size(bool percpu, size_t size); diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 74f7a6f44c50..582f0192b7e1 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -125,6 +125,11 @@ struct htab_elem { char key[] __aligned(8); }; +struct htab_btf_record { + struct btf_record *record; + u32 key_size; +}; + static inline bool htab_is_prealloc(const struct bpf_htab *htab) { return !(htab->map.map_flags & BPF_F_NO_PREALLOC); @@ -459,28 +464,80 @@ static int htab_map_alloc_check(union bpf_attr *attr) static void htab_mem_dtor(void *obj, void *ctx) { - struct bpf_htab *htab = ctx; + struct htab_btf_record *hrec = ctx; struct htab_elem *elem = obj; void *map_value; - if (IS_ERR_OR_NULL(htab->map.record)) + if (IS_ERR_OR_NULL(hrec->record)) return; - map_value = htab_elem_value(elem, htab->map.key_size); - bpf_obj_free_fields(htab->map.record, map_value); + map_value = htab_elem_value(elem, hrec->key_size); + bpf_obj_free_fields(hrec->record, map_value); } static void htab_pcpu_mem_dtor(void *obj, void *ctx) { - struct bpf_htab *htab = ctx; void __percpu *pptr = *(void __percpu **)obj; + struct htab_btf_record *hrec = ctx; int cpu; - if (IS_ERR_OR_NULL(htab->map.record)) + if (IS_ERR_OR_NULL(hrec->record)) return; for_each_possible_cpu(cpu) - bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu)); + bpf_obj_free_fields(hrec->record, per_cpu_ptr(pptr, cpu)); +} + +static void htab_dtor_ctx_free(void *ctx) +{ + struct htab_btf_record *hrec = ctx; + + btf_record_free(hrec->record); + kfree(ctx); +} + +static int htab_set_dtor(const struct bpf_htab *htab, void (*dtor)(void *, void *)) +{ + u32 key_size = htab->map.key_size; + const struct bpf_mem_alloc *ma; + struct htab_btf_record *hrec; + int err; + + /* No need for dtors. */ + if (IS_ERR_OR_NULL(htab->map.record)) + return 0; + + hrec = kzalloc(sizeof(*hrec), GFP_KERNEL); + if (!hrec) + return -ENOMEM; + hrec->key_size = key_size; + hrec->record = btf_record_dup(htab->map.record); + if (IS_ERR(hrec->record)) { + err = PTR_ERR(hrec->record); + kfree(hrec); + return err; + } + ma = htab_is_percpu(htab) ? &htab->pcpu_ma : &htab->ma; + /* Kinda sad, but cast away const-ness since we change ma->dtor. */ + bpf_mem_alloc_set_dtor((struct bpf_mem_alloc *)ma, dtor, htab_dtor_ctx_free, hrec); + return 0; +} + +static int htab_map_check_btf(const struct bpf_map *map, const struct btf *btf, + const struct btf_type *key_type, const struct btf_type *value_type) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + + if (htab_is_prealloc(htab)) + return 0; + /* + * We must set the dtor using this callback, as map's BTF record is not + * populated in htab_map_alloc(), so it will always appear as NULL. + */ + if (htab_is_percpu(htab)) + return htab_set_dtor(htab, htab_pcpu_mem_dtor); + else + return htab_set_dtor(htab, htab_mem_dtor); } static struct bpf_map *htab_map_alloc(union bpf_attr *attr) @@ -595,17 +652,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) round_up(htab->map.value_size, 8), true); if (err) goto free_map_locked; - /* See comment below */ - bpf_mem_alloc_set_dtor(&htab->pcpu_ma, htab_pcpu_mem_dtor, htab); - } else { - /* - * Register the dtor unconditionally. map->record is - * set by map_create() after map_alloc() returns, so it - * is always NULL at this point. The dtor checks - * IS_ERR_OR_NULL(htab->map.record) and becomes a no-op - * for maps without special fields. - */ - bpf_mem_alloc_set_dtor(&htab->ma, htab_mem_dtor, htab); } } @@ -2318,6 +2364,7 @@ const struct bpf_map_ops htab_map_ops = { .map_seq_show_elem = htab_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab), .map_btf_id = &htab_map_btf_ids[0], @@ -2340,6 +2387,7 @@ const struct bpf_map_ops htab_lru_map_ops = { .map_seq_show_elem = htab_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_lru), .map_btf_id = &htab_map_btf_ids[0], @@ -2519,6 +2567,7 @@ const struct bpf_map_ops htab_percpu_map_ops = { .map_seq_show_elem = htab_percpu_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_percpu), .map_btf_id = &htab_map_btf_ids[0], @@ -2539,6 +2588,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = { .map_seq_show_elem = htab_percpu_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_lru_percpu), .map_btf_id = &htab_map_btf_ids[0], diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 137e855c718b..682a9f34214b 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -102,7 +102,8 @@ struct bpf_mem_cache { int percpu_size; bool draining; struct bpf_mem_cache *tgt; - struct bpf_mem_alloc *ma; + void (*dtor)(void *obj, void *ctx); + void *dtor_ctx; /* list of objects to be freed after RCU GP */ struct llist_head free_by_rcu; @@ -261,15 +262,14 @@ static void free_one(void *obj, bool percpu) kfree(obj); } -static int free_all(struct llist_node *llnode, bool percpu, - struct bpf_mem_alloc *ma) +static int free_all(struct bpf_mem_cache *c, struct llist_node *llnode, bool percpu) { struct llist_node *pos, *t; int cnt = 0; llist_for_each_safe(pos, t, llnode) { - if (ma->dtor) - ma->dtor((void *)pos + LLIST_NODE_SZ, ma->dtor_ctx); + if (c->dtor) + c->dtor((void *)pos + LLIST_NODE_SZ, c->dtor_ctx); free_one(pos, percpu); cnt++; } @@ -280,7 +280,7 @@ static void __free_rcu(struct rcu_head *head) { struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace); - free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size, c->ma); + free_all(c, llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size); atomic_set(&c->call_rcu_ttrace_in_progress, 0); } @@ -312,7 +312,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c) if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)) { if (unlikely(READ_ONCE(c->draining))) { llnode = llist_del_all(&c->free_by_rcu_ttrace); - free_all(llnode, !!c->percpu_size, c->ma); + free_all(c, llnode, !!c->percpu_size); } return; } @@ -421,7 +421,7 @@ static void check_free_by_rcu(struct bpf_mem_cache *c) dec_active(c, &flags); if (unlikely(READ_ONCE(c->draining))) { - free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size, c->ma); + free_all(c, llist_del_all(&c->waiting_for_gp), !!c->percpu_size); atomic_set(&c->call_rcu_in_progress, 0); } else { call_rcu_hurry(&c->rcu, __free_by_rcu); @@ -546,7 +546,6 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); } @@ -569,7 +568,6 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); @@ -622,7 +620,6 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); @@ -631,7 +628,7 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) return 0; } -static void drain_mem_cache(struct bpf_mem_cache *c, struct bpf_mem_alloc *ma) +static void drain_mem_cache(struct bpf_mem_cache *c) { bool percpu = !!c->percpu_size; @@ -642,13 +639,13 @@ static void drain_mem_cache(struct bpf_mem_cache *c, struct bpf_mem_alloc *ma) * Except for waiting_for_gp_ttrace list, there are no concurrent operations * on these lists, so it is safe to use __llist_del_all(). */ - free_all(llist_del_all(&c->free_by_rcu_ttrace), percpu, ma); - free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu, ma); - free_all(__llist_del_all(&c->free_llist), percpu, ma); - free_all(__llist_del_all(&c->free_llist_extra), percpu, ma); - free_all(__llist_del_all(&c->free_by_rcu), percpu, ma); - free_all(__llist_del_all(&c->free_llist_extra_rcu), percpu, ma); - free_all(llist_del_all(&c->waiting_for_gp), percpu, ma); + free_all(c, llist_del_all(&c->free_by_rcu_ttrace), percpu); + free_all(c, llist_del_all(&c->waiting_for_gp_ttrace), percpu); + free_all(c, __llist_del_all(&c->free_llist), percpu); + free_all(c, __llist_del_all(&c->free_llist_extra), percpu); + free_all(c, __llist_del_all(&c->free_by_rcu), percpu); + free_all(c, __llist_del_all(&c->free_llist_extra_rcu), percpu); + free_all(c, llist_del_all(&c->waiting_for_gp), percpu); } static void check_mem_cache(struct bpf_mem_cache *c) @@ -687,6 +684,9 @@ static void check_leaked_objs(struct bpf_mem_alloc *ma) static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma) { + /* We can free dtor ctx only once all callbacks are done using it. */ + if (ma->dtor_ctx_free) + ma->dtor_ctx_free(ma->dtor_ctx); check_leaked_objs(ma); free_percpu(ma->cache); free_percpu(ma->caches); @@ -758,7 +758,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) c = per_cpu_ptr(ma->cache, cpu); WRITE_ONCE(c->draining, true); irq_work_sync(&c->refill_work); - drain_mem_cache(c, ma); + drain_mem_cache(c); rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } @@ -773,7 +773,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) c = &cc->cache[i]; WRITE_ONCE(c->draining, true); irq_work_sync(&c->refill_work); - drain_mem_cache(c, ma); + drain_mem_cache(c); rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } @@ -1022,9 +1022,31 @@ int bpf_mem_alloc_check_size(bool percpu, size_t size) return 0; } -void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, - void (*dtor)(void *obj, void *ctx), void *ctx) +void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, void (*dtor)(void *obj, void *ctx), + void (*dtor_ctx_free)(void *ctx), void *ctx) { - ma->dtor = dtor; + struct bpf_mem_caches *cc; + struct bpf_mem_cache *c; + int cpu, i; + + ma->dtor_ctx_free = dtor_ctx_free; ma->dtor_ctx = ctx; + + if (ma->cache) { + for_each_possible_cpu(cpu) { + c = per_cpu_ptr(ma->cache, cpu); + c->dtor = dtor; + c->dtor_ctx = ctx; + } + } + if (ma->caches) { + for_each_possible_cpu(cpu) { + cc = per_cpu_ptr(ma->caches, cpu); + for (i = 0; i < NUM_CACHES; i++) { + c = &cc->cache[i]; + c->dtor = dtor; + c->dtor_ctx = ctx; + } + } + } } ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [syzbot ci] Re: Close race in freeing special fields and map value 2026-02-26 18:28 ` Kumar Kartikeya Dwivedi @ 2026-02-26 18:30 ` syzbot ci 0 siblings, 0 replies; 15+ messages in thread From: syzbot ci @ 2026-02-26 18:30 UTC (permalink / raw) To: memxor Cc: andrii, ast, bpf, daniel, eddyz87, kernel-team, kkd, martin.lau, memxor, paulmck, syzbot, syzkaller-bugs, yatsenko Unknown command ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] please work syzbot 2026-02-26 9:31 ` [syzbot ci] Re: Close race in freeing special fields and map value syzbot ci ` (2 preceding siblings ...) 2026-02-26 18:28 ` Kumar Kartikeya Dwivedi @ 2026-02-26 18:36 ` Kumar Kartikeya Dwivedi 2026-02-26 18:42 ` syzbot ci 3 siblings, 1 reply; 15+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-02-26 18:36 UTC (permalink / raw) To: syzbot+ci3826af8b4f91bf97 Cc: andrii, ast, bpf, daniel, eddyz87, kernel-team, kkd, martin.lau, memxor, paulmck, syzbot, syzkaller-bugs, yatsenko Main things to remember: -dtor is set fro m map_check_btf because record cannot be inspected in alloc. -set dtor for each cache individually. -free ctx after callbacks are done running. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> --- #syz test include/linux/bpf_mem_alloc.h | 6 ++- kernel/bpf/hashtab.c | 86 +++++++++++++++++++++++++++-------- kernel/bpf/memalloc.c | 70 ++++++++++++++++++---------- 3 files changed, 118 insertions(+), 44 deletions(-) diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index 7517eaf94ac9..4ce0d27f8ea2 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -14,7 +14,7 @@ struct bpf_mem_alloc { struct obj_cgroup *objcg; bool percpu; struct work_struct work; - void (*dtor)(void *obj, void *ctx); + void (*dtor_ctx_free)(void *ctx); void *dtor_ctx; }; @@ -35,7 +35,9 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma, struct obj_cgroup *objcg int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size); void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, - void (*dtor)(void *obj, void *ctx), void *ctx); + void (*dtor)(void *obj, void *ctx), + void (*dtor_ctx_free)(void *ctx), + void *ctx); /* Check the allocation size for kmalloc equivalent allocator */ int bpf_mem_alloc_check_size(bool percpu, size_t size); diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 74f7a6f44c50..582f0192b7e1 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -125,6 +125,11 @@ struct htab_elem { char key[] __aligned(8); }; +struct htab_btf_record { + struct btf_record *record; + u32 key_size; +}; + static inline bool htab_is_prealloc(const struct bpf_htab *htab) { return !(htab->map.map_flags & BPF_F_NO_PREALLOC); @@ -459,28 +464,80 @@ static int htab_map_alloc_check(union bpf_attr *attr) static void htab_mem_dtor(void *obj, void *ctx) { - struct bpf_htab *htab = ctx; + struct htab_btf_record *hrec = ctx; struct htab_elem *elem = obj; void *map_value; - if (IS_ERR_OR_NULL(htab->map.record)) + if (IS_ERR_OR_NULL(hrec->record)) return; - map_value = htab_elem_value(elem, htab->map.key_size); - bpf_obj_free_fields(htab->map.record, map_value); + map_value = htab_elem_value(elem, hrec->key_size); + bpf_obj_free_fields(hrec->record, map_value); } static void htab_pcpu_mem_dtor(void *obj, void *ctx) { - struct bpf_htab *htab = ctx; void __percpu *pptr = *(void __percpu **)obj; + struct htab_btf_record *hrec = ctx; int cpu; - if (IS_ERR_OR_NULL(htab->map.record)) + if (IS_ERR_OR_NULL(hrec->record)) return; for_each_possible_cpu(cpu) - bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu)); + bpf_obj_free_fields(hrec->record, per_cpu_ptr(pptr, cpu)); +} + +static void htab_dtor_ctx_free(void *ctx) +{ + struct htab_btf_record *hrec = ctx; + + btf_record_free(hrec->record); + kfree(ctx); +} + +static int htab_set_dtor(const struct bpf_htab *htab, void (*dtor)(void *, void *)) +{ + u32 key_size = htab->map.key_size; + const struct bpf_mem_alloc *ma; + struct htab_btf_record *hrec; + int err; + + /* No need for dtors. */ + if (IS_ERR_OR_NULL(htab->map.record)) + return 0; + + hrec = kzalloc(sizeof(*hrec), GFP_KERNEL); + if (!hrec) + return -ENOMEM; + hrec->key_size = key_size; + hrec->record = btf_record_dup(htab->map.record); + if (IS_ERR(hrec->record)) { + err = PTR_ERR(hrec->record); + kfree(hrec); + return err; + } + ma = htab_is_percpu(htab) ? &htab->pcpu_ma : &htab->ma; + /* Kinda sad, but cast away const-ness since we change ma->dtor. */ + bpf_mem_alloc_set_dtor((struct bpf_mem_alloc *)ma, dtor, htab_dtor_ctx_free, hrec); + return 0; +} + +static int htab_map_check_btf(const struct bpf_map *map, const struct btf *btf, + const struct btf_type *key_type, const struct btf_type *value_type) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + + if (htab_is_prealloc(htab)) + return 0; + /* + * We must set the dtor using this callback, as map's BTF record is not + * populated in htab_map_alloc(), so it will always appear as NULL. + */ + if (htab_is_percpu(htab)) + return htab_set_dtor(htab, htab_pcpu_mem_dtor); + else + return htab_set_dtor(htab, htab_mem_dtor); } static struct bpf_map *htab_map_alloc(union bpf_attr *attr) @@ -595,17 +652,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) round_up(htab->map.value_size, 8), true); if (err) goto free_map_locked; - /* See comment below */ - bpf_mem_alloc_set_dtor(&htab->pcpu_ma, htab_pcpu_mem_dtor, htab); - } else { - /* - * Register the dtor unconditionally. map->record is - * set by map_create() after map_alloc() returns, so it - * is always NULL at this point. The dtor checks - * IS_ERR_OR_NULL(htab->map.record) and becomes a no-op - * for maps without special fields. - */ - bpf_mem_alloc_set_dtor(&htab->ma, htab_mem_dtor, htab); } } @@ -2318,6 +2364,7 @@ const struct bpf_map_ops htab_map_ops = { .map_seq_show_elem = htab_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab), .map_btf_id = &htab_map_btf_ids[0], @@ -2340,6 +2387,7 @@ const struct bpf_map_ops htab_lru_map_ops = { .map_seq_show_elem = htab_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_lru), .map_btf_id = &htab_map_btf_ids[0], @@ -2519,6 +2567,7 @@ const struct bpf_map_ops htab_percpu_map_ops = { .map_seq_show_elem = htab_percpu_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_percpu), .map_btf_id = &htab_map_btf_ids[0], @@ -2539,6 +2588,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = { .map_seq_show_elem = htab_percpu_map_seq_show_elem, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, + .map_check_btf = htab_map_check_btf, .map_mem_usage = htab_map_mem_usage, BATCH_OPS(htab_lru_percpu), .map_btf_id = &htab_map_btf_ids[0], diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 137e855c718b..682a9f34214b 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -102,7 +102,8 @@ struct bpf_mem_cache { int percpu_size; bool draining; struct bpf_mem_cache *tgt; - struct bpf_mem_alloc *ma; + void (*dtor)(void *obj, void *ctx); + void *dtor_ctx; /* list of objects to be freed after RCU GP */ struct llist_head free_by_rcu; @@ -261,15 +262,14 @@ static void free_one(void *obj, bool percpu) kfree(obj); } -static int free_all(struct llist_node *llnode, bool percpu, - struct bpf_mem_alloc *ma) +static int free_all(struct bpf_mem_cache *c, struct llist_node *llnode, bool percpu) { struct llist_node *pos, *t; int cnt = 0; llist_for_each_safe(pos, t, llnode) { - if (ma->dtor) - ma->dtor((void *)pos + LLIST_NODE_SZ, ma->dtor_ctx); + if (c->dtor) + c->dtor((void *)pos + LLIST_NODE_SZ, c->dtor_ctx); free_one(pos, percpu); cnt++; } @@ -280,7 +280,7 @@ static void __free_rcu(struct rcu_head *head) { struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace); - free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size, c->ma); + free_all(c, llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size); atomic_set(&c->call_rcu_ttrace_in_progress, 0); } @@ -312,7 +312,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c) if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)) { if (unlikely(READ_ONCE(c->draining))) { llnode = llist_del_all(&c->free_by_rcu_ttrace); - free_all(llnode, !!c->percpu_size, c->ma); + free_all(c, llnode, !!c->percpu_size); } return; } @@ -421,7 +421,7 @@ static void check_free_by_rcu(struct bpf_mem_cache *c) dec_active(c, &flags); if (unlikely(READ_ONCE(c->draining))) { - free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size, c->ma); + free_all(c, llist_del_all(&c->waiting_for_gp), !!c->percpu_size); atomic_set(&c->call_rcu_in_progress, 0); } else { call_rcu_hurry(&c->rcu, __free_by_rcu); @@ -546,7 +546,6 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); } @@ -569,7 +568,6 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); @@ -622,7 +620,6 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; - c->ma = ma; init_refill_work(c); prefill_mem_cache(c, cpu); @@ -631,7 +628,7 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) return 0; } -static void drain_mem_cache(struct bpf_mem_cache *c, struct bpf_mem_alloc *ma) +static void drain_mem_cache(struct bpf_mem_cache *c) { bool percpu = !!c->percpu_size; @@ -642,13 +639,13 @@ static void drain_mem_cache(struct bpf_mem_cache *c, struct bpf_mem_alloc *ma) * Except for waiting_for_gp_ttrace list, there are no concurrent operations * on these lists, so it is safe to use __llist_del_all(). */ - free_all(llist_del_all(&c->free_by_rcu_ttrace), percpu, ma); - free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu, ma); - free_all(__llist_del_all(&c->free_llist), percpu, ma); - free_all(__llist_del_all(&c->free_llist_extra), percpu, ma); - free_all(__llist_del_all(&c->free_by_rcu), percpu, ma); - free_all(__llist_del_all(&c->free_llist_extra_rcu), percpu, ma); - free_all(llist_del_all(&c->waiting_for_gp), percpu, ma); + free_all(c, llist_del_all(&c->free_by_rcu_ttrace), percpu); + free_all(c, llist_del_all(&c->waiting_for_gp_ttrace), percpu); + free_all(c, __llist_del_all(&c->free_llist), percpu); + free_all(c, __llist_del_all(&c->free_llist_extra), percpu); + free_all(c, __llist_del_all(&c->free_by_rcu), percpu); + free_all(c, __llist_del_all(&c->free_llist_extra_rcu), percpu); + free_all(c, llist_del_all(&c->waiting_for_gp), percpu); } static void check_mem_cache(struct bpf_mem_cache *c) @@ -687,6 +684,9 @@ static void check_leaked_objs(struct bpf_mem_alloc *ma) static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma) { + /* We can free dtor ctx only once all callbacks are done using it. */ + if (ma->dtor_ctx_free) + ma->dtor_ctx_free(ma->dtor_ctx); check_leaked_objs(ma); free_percpu(ma->cache); free_percpu(ma->caches); @@ -758,7 +758,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) c = per_cpu_ptr(ma->cache, cpu); WRITE_ONCE(c->draining, true); irq_work_sync(&c->refill_work); - drain_mem_cache(c, ma); + drain_mem_cache(c); rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } @@ -773,7 +773,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) c = &cc->cache[i]; WRITE_ONCE(c->draining, true); irq_work_sync(&c->refill_work); - drain_mem_cache(c, ma); + drain_mem_cache(c); rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } @@ -1022,9 +1022,31 @@ int bpf_mem_alloc_check_size(bool percpu, size_t size) return 0; } -void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, - void (*dtor)(void *obj, void *ctx), void *ctx) +void bpf_mem_alloc_set_dtor(struct bpf_mem_alloc *ma, void (*dtor)(void *obj, void *ctx), + void (*dtor_ctx_free)(void *ctx), void *ctx) { - ma->dtor = dtor; + struct bpf_mem_caches *cc; + struct bpf_mem_cache *c; + int cpu, i; + + ma->dtor_ctx_free = dtor_ctx_free; ma->dtor_ctx = ctx; + + if (ma->cache) { + for_each_possible_cpu(cpu) { + c = per_cpu_ptr(ma->cache, cpu); + c->dtor = dtor; + c->dtor_ctx = ctx; + } + } + if (ma->caches) { + for_each_possible_cpu(cpu) { + cc = per_cpu_ptr(ma->caches, cpu); + for (i = 0; i < NUM_CACHES; i++) { + c = &cc->cache[i]; + c->dtor = dtor; + c->dtor_ctx = ctx; + } + } + } } -- 2.47.3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] please work syzbot 2026-02-26 18:36 ` [PATCH] please work syzbot Kumar Kartikeya Dwivedi @ 2026-02-26 18:42 ` syzbot ci 0 siblings, 0 replies; 15+ messages in thread From: syzbot ci @ 2026-02-26 18:42 UTC (permalink / raw) To: memxor Cc: andrii, ast, bpf, daniel, eddyz87, kernel-team, kkd, martin.lau, memxor, paulmck, syzbot, syzkaller-bugs, yatsenko Unknown command ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2026-02-26 18:42 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-02-25 18:51 [PATCH bpf-next v1 0/4] Close race in freeing special fields and map value Kumar Kartikeya Dwivedi 2026-02-25 18:51 ` [PATCH bpf-next v1 1/4] bpf: Register dtor for freeing special fields Kumar Kartikeya Dwivedi 2026-02-25 18:51 ` [PATCH bpf-next v1 2/4] bpf: Delay freeing fields in local storage Kumar Kartikeya Dwivedi 2026-02-25 19:31 ` bot+bpf-ci 2026-02-25 18:51 ` [PATCH bpf-next v1 3/4] bpf: Retire rcu_trace_implies_rcu_gp() from " Kumar Kartikeya Dwivedi 2026-02-25 18:55 ` Paul E. McKenney 2026-02-25 18:51 ` [PATCH bpf-next v1 4/4] selftests/bpf: Add tests for special fields races Kumar Kartikeya Dwivedi 2026-02-26 9:31 ` [syzbot ci] Re: Close race in freeing special fields and map value syzbot ci 2026-02-26 15:05 ` Kumar Kartikeya Dwivedi 2026-02-26 17:36 ` Kumar Kartikeya Dwivedi 2026-02-26 17:38 ` syzbot ci 2026-02-26 18:28 ` Kumar Kartikeya Dwivedi 2026-02-26 18:30 ` syzbot ci 2026-02-26 18:36 ` [PATCH] please work syzbot Kumar Kartikeya Dwivedi 2026-02-26 18:42 ` syzbot ci
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.