* [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs
@ 2026-05-29 17:42 Vlad Poenaru
2026-05-29 19:19 ` Emil Tsalapatis
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
0 siblings, 2 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-05-29 17:42 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf
Cc: Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen, linux-kernel, stable
trie_lookup_elem() annotates its rcu_dereference_check() walks with
only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
classic RCU readers but fails for sleepable BPF programs, which enter
via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
A sleepable LSM hook that ends up doing bpf_map_lookup_elem() on an LPM
trie therefore triggers lockdep on debug kernels:
=============================
WARNING: suspicious RCU usage
7.1.0-... Tainted: G E
-----------------------------
kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
1 lock held by net_tests/540:
#0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
at: __bpf_prog_enter_sleepable+0x26/0x280
Call Trace:
dump_stack_lvl
lockdep_rcu_suspicious
trie_lookup_elem
bpf_prog_..._enforce_security_socket_connect
bpf_trampoline_...
security_socket_connect
__sys_connect
do_syscall_64
This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
against the trie's reclaim path -- but it spams the console once per
distinct callsite on every debug kernel running a sleepable BPF LSM
that does map lookups on an LPM trie, which is increasingly common.
Other map types already use the bpf_rcu_lock_held() helper, which
accepts all three contexts (classic, BH, Tasks Trace). Use it here as
well, matching the established convention.
Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
kernel/bpf/lpm_trie.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 0f57608b385d..ac36063cb7e6 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
/* Start walking the trie from the root node ... */
- for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
+ for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
node;) {
unsigned int next_bit;
size_t matchlen;
@@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
*/
next_bit = extract_bit(key->data, node->prefixlen);
node = rcu_dereference_check(node->child[next_bit],
- rcu_read_lock_bh_held());
+ bpf_rcu_lock_held());
}
if (!found)
--
2.53.0-Meta
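
The commit message above says that rcu_dereference_check(p, c) accepts an access when "c || rcu_read_lock_held()" holds. The following is a minimal userspace sketch of that predicate, not kernel code: struct model_ctx and the model_* functions are hypothetical names introduced only for illustration, and bpf_rcu_lock_held() is modeled simply as "any of the three read-side locks is held", as the message describes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the read-side state lockdep would see. */
struct model_ctx {
    bool rcu;       /* classic rcu_read_lock() held */
    bool rcu_bh;    /* rcu_read_lock_bh() held, e.g. XDP/NAPI */
    bool rcu_trace; /* only rcu_read_lock_trace() held, sleepable BPF */
};

/* Models rcu_dereference_check(p, c): fine when "c || rcu_read_lock_held()". */
bool model_deref_check_ok(const struct model_ctx *ctx, bool c)
{
    assert(ctx);
    return c || ctx->rcu;
}

/* Old annotation in trie_lookup_elem(): c = rcu_read_lock_bh_held(). */
bool model_old_lookup_ok(const struct model_ctx *ctx)
{
    assert(ctx);
    return model_deref_check_ok(ctx, ctx->rcu_bh);
}

/* New annotation: c = bpf_rcu_lock_held(), modeled as any of the three. */
bool model_new_lookup_ok(const struct model_ctx *ctx)
{
    assert(ctx);
    return model_deref_check_ok(ctx,
                                ctx->rcu || ctx->rcu_bh || ctx->rcu_trace);
}
```

With a context that sets only rcu_trace, model_old_lookup_ok() returns false (the modeled splat) while model_new_lookup_ok() returns true. Contexts that set rcu or rcu_bh pass both checks, which matches the claim that XDP/NAPI and classic readers were never affected.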
* Re: [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs
2026-05-29 17:42 [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs Vlad Poenaru
@ 2026-05-29 19:19 ` Emil Tsalapatis
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
1 sibling, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-05-29 19:19 UTC (permalink / raw)
To: Vlad Poenaru, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf
Cc: Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen, linux-kernel, stable

On Fri May 29, 2026 at 1:42 PM EDT, Vlad Poenaru wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with
> only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
> resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
> classic RCU readers but fails for sleepable BPF programs, which enter
> via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
>
> A sleepable LSM hook that ends up doing bpf_map_lookup_elem() on an LPM
> trie therefore triggers lockdep on debug kernels:
>
> =============================
> WARNING: suspicious RCU usage
> 7.1.0-... Tainted: G E
> -----------------------------
> kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
> 1 lock held by net_tests/540:
> #0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
> at: __bpf_prog_enter_sleepable+0x26/0x280
> Call Trace:
> dump_stack_lvl
> lockdep_rcu_suspicious
> trie_lookup_elem
> bpf_prog_..._enforce_security_socket_connect
> bpf_trampoline_...
> security_socket_connect
> __sys_connect
> do_syscall_64
>
> This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
> against the trie's reclaim path -- but it spams the console once per
> distinct callsite on every debug kernel running a sleepable BPF LSM
> that does map lookups on an LPM trie, which is increasingly common.
>
> Other map types already use the bpf_rcu_lock_held() helper, which
> accepts all three contexts (classic, BH, Tasks Trace). Use it here as
> well, matching the established convention.
>
> Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
> Cc: stable@vger.kernel.org
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
> kernel/bpf/lpm_trie.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 0f57608b385d..ac36063cb7e6 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
>
> /* Start walking the trie from the root node ... */
>
> - for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
> + for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
> node;) {
> unsigned int next_bit;
> size_t matchlen;
> @@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
> */
> next_bit = extract_bit(key->data, node->prefixlen);
> node = rcu_dereference_check(node->child[next_bit],
> - rcu_read_lock_bh_held());
> + bpf_rcu_lock_held());
> }
>
> if (!found)
* [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries
2026-05-29 17:42 [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs Vlad Poenaru
2026-05-29 19:19 ` Emil Tsalapatis
@ 2026-06-09 13:55 ` Vlad Poenaru
2026-06-09 13:55 ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
` (2 more replies)
1 sibling, 3 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC (permalink / raw)
To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel

trie_lookup_elem() annotates its rcu_dereference_check() walks with only
rcu_read_lock_bh_held(), so a sleepable BPF program that touches an LPM
trie (e.g. a sleepable LSM hook calling bpf_map_lookup_elem()) trips a
"suspicious RCU usage" lockdep splat on debug kernels: it holds only
rcu_read_lock_trace(), which that annotation does not accept.

Patch 1 relaxes the rcu_dereference annotations in the trie walks so they
no longer trip lockdep from the Tasks Trace context, including the
trie_update_elem()/trie_delete_elem() writer walks (protected by
trie->lock). Patch 2 adds BPF_MAP_TYPE_LPM_TRIE to the verifier's
sleepable map whitelist so sleepable programs can reference an LPM trie
directly, not just as the inner map of a map-of-maps. LPM trie nodes are
reclaimed via bpf_mem_cache_free_rcu(), which chains a regular RCU grace
period into a Tasks Trace grace period before freeing -- the same
discipline BPF_MAP_TYPE_HASH relies on for sleepable access.

Changes since v1:
- Split into a 2-patch series.
- Patch 1 now also converts the trie_update_elem()/trie_delete_elem()
walks from rcu_dereference() to rcu_dereference_protected(*p, 1),
addressing review feedback that v1 only fixed the lookup path and
left the same splat on the writer paths.
- New patch 2 adds the verifier whitelist entry so the fix is actually
reachable for directly-referenced LPM tries.
- Retitled v1 ("Allow lookups from sleepable BPF programs").

v1: https://lore.kernel.org/all/20260529174233.2954240-1-vlad.wing@gmail.com/

Vlad Poenaru (2):
bpf, lpm_trie: Allow access from sleepable BPF programs
bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly

kernel/bpf/lpm_trie.c | 8 ++++----
kernel/bpf/verifier.c | 1 +
2 files changed, 5 insertions(+), 4 deletions(-)

--
2.53.0-Meta
* [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
@ 2026-06-09 13:55 ` Vlad Poenaru
2026-06-09 16:36 ` Emil Tsalapatis
2026-06-09 13:55 ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
2026-06-09 19:50 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries patchwork-bot+netdevbpf
2 siblings, 1 reply; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC (permalink / raw)
To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel, stable

trie_lookup_elem() annotates its rcu_dereference_check() walks with
only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
classic RCU readers but fails for sleepable BPF programs, which enter
via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().

trie_update_elem() and trie_delete_elem() have the same problem in a
different form: they walk the trie with plain rcu_dereference(), which
asserts rcu_read_lock_held() unconditionally. Both are reachable from
sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem
helpers, and from the syscall path under classic rcu_read_lock(). In
the writer paths the trie is actually protected by trie->lock (an
rqspinlock taken across the walk); we never relied on the RCU read-side
lock to keep nodes alive there.

A sleepable LSM hook that ends up touching an LPM trie therefore
triggers lockdep on debug kernels:

=============================
WARNING: suspicious RCU usage
7.1.0-... Tainted: G E
-----------------------------
kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
1 lock held by net_tests/540:
#0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
at: __bpf_prog_enter_sleepable+0x26/0x280
Call Trace:
dump_stack_lvl
lockdep_rcu_suspicious
trie_lookup_elem
bpf_prog_..._enforce_security_socket_connect
bpf_trampoline_...
security_socket_connect
__sys_connect
do_syscall_64

This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
against the trie's reclaim path -- but it spams the console once per
distinct callsite on every debug kernel running a sleepable BPF LSM
that touches an LPM trie, which is increasingly common.

For the lookup path, switch the rcu_dereference_check() annotation
from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all
three contexts (classic, BH, Tasks Trace). Other map types already
follow this convention.

For trie_update_elem() and trie_delete_elem(), annotate the walks as
rcu_dereference_protected(*p, 1) -- matching trie_free() in the same
file -- since trie->lock is held across the walk. rqspinlock has no
lockdep_map, so the predicate degenerates to '1' rather than
lockdep_is_held(&trie->lock); the protection is real but not
machine-verifiable. trie_get_next_key() also uses bare
rcu_dereference() but is reachable only from the BPF syscall, which
holds classic rcu_read_lock() before dispatching, so it is left
untouched.

Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
kernel/bpf/lpm_trie.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 0f57608b385d..4d6f25db9ba1 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
/* Start walking the trie from the root node ... */
- for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
+ for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
node;) {
unsigned int next_bit;
size_t matchlen;
@@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
*/
next_bit = extract_bit(key->data, node->prefixlen);
node = rcu_dereference_check(node->child[next_bit],
- rcu_read_lock_bh_held());
+ bpf_rcu_lock_held());
}
if (!found)
@@ -359,7 +359,7 @@ static long trie_update_elem(struct bpf_map *map,
*/
slot = &trie->root;
- while ((node = rcu_dereference(*slot))) {
+ while ((node = rcu_dereference_protected(*slot, 1))) {
matchlen = longest_prefix_match(trie, node, key);
if (node->prefixlen != matchlen ||
@@ -482,7 +482,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
trim = &trie->root;
trim2 = trim;
parent = NULL;
- while ((node = rcu_dereference(*trim))) {
+ while ((node = rcu_dereference_protected(*trim, 1))) {
matchlen = longest_prefix_match(trie, node, key);
if (node->prefixlen != matchlen ||
--
2.53.0-Meta
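
The writer-side reasoning in this patch can be sketched in userspace as follows. This is an illustrative model only: struct writer_ctx and the model_* functions are hypothetical names, not kernel APIs. The sketch assumes, as the commit message states, that plain rcu_dereference() makes lockdep require rcu_read_lock_held(), while rcu_dereference_protected(p, c) makes it evaluate only the caller-supplied condition c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what a writer holds while walking the trie. */
struct writer_ctx {
    bool rcu;       /* classic rcu_read_lock() held (syscall path) */
    bool rcu_trace; /* only rcu_read_lock_trace() held (sleepable program) */
    bool trie_lock; /* trie->lock held across the walk */
};

/* Models plain rcu_dereference(): lockdep wants rcu_read_lock_held(). */
bool model_plain_deref_ok(const struct writer_ctx *ctx)
{
    assert(ctx);
    return ctx->rcu;
}

/* Models rcu_dereference_protected(p, c): lockdep only evaluates c. */
bool model_protected_deref_ok(const struct writer_ctx *ctx, bool c)
{
    assert(ctx);
    return c;
}
```

For a sleepable writer (rcu_trace and trie_lock set), model_plain_deref_ok() returns false, whereas model_protected_deref_ok(ctx, true) returns true. Passing a constant true also succeeds for a context where trie_lock is false, which is the cost the commit message acknowledges: the protection is real but lockdep cannot verify it. If a lock-held predicate were available, passing ctx->trie_lock instead would catch that case in this model.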
* Re: [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
2026-06-09 13:55 ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
@ 2026-06-09 16:36 ` Emil Tsalapatis
0 siblings, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-06-09 16:36 UTC (permalink / raw)
To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel, stable

On Tue Jun 9, 2026 at 9:55 AM EDT, Vlad Poenaru wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with
> only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
> resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
> classic RCU readers but fails for sleepable BPF programs, which enter
> via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
>
> trie_update_elem() and trie_delete_elem() have the same problem in a
> different form: they walk the trie with plain rcu_dereference(), which
> asserts rcu_read_lock_held() unconditionally. Both are reachable from
> sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem
> helpers, and from the syscall path under classic rcu_read_lock(). In
> the writer paths the trie is actually protected by trie->lock (an
> rqspinlock taken across the walk); we never relied on the RCU read-side
> lock to keep nodes alive there.
>
> A sleepable LSM hook that ends up touching an LPM trie therefore
> triggers lockdep on debug kernels:
>
> =============================
> WARNING: suspicious RCU usage
> 7.1.0-... Tainted: G E
> -----------------------------
> kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
> 1 lock held by net_tests/540:
> #0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
> at: __bpf_prog_enter_sleepable+0x26/0x280
> Call Trace:
> dump_stack_lvl
> lockdep_rcu_suspicious
> trie_lookup_elem
> bpf_prog_..._enforce_security_socket_connect
> bpf_trampoline_...
> security_socket_connect
> __sys_connect
> do_syscall_64
>
> This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
> against the trie's reclaim path -- but it spams the console once per
> distinct callsite on every debug kernel running a sleepable BPF LSM
> that touches an LPM trie, which is increasingly common.
>
> For the lookup path, switch the rcu_dereference_check() annotation
> from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all
> three contexts (classic, BH, Tasks Trace). Other map types already
> follow this convention.
>
> For trie_update_elem() and trie_delete_elem(), annotate the walks as
> rcu_dereference_protected(*p, 1) -- matching trie_free() in the same
> file -- since trie->lock is held across the walk. rqspinlock has no
> lockdep_map, so the predicate degenerates to '1' rather than
> lockdep_is_held(&trie->lock); the protection is real but not
> machine-verifiable. trie_get_next_key() also uses bare
> rcu_dereference() but is reachable only from the BPF syscall, which
> holds classic rcu_read_lock() before dispatching, so it is left
> untouched.
>
> Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
> Cc: stable@vger.kernel.org
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
> kernel/bpf/lpm_trie.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 0f57608b385d..4d6f25db9ba1 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
>
> /* Start walking the trie from the root node ... */
>
> - for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
> + for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
> node;) {
> unsigned int next_bit;
> size_t matchlen;
> @@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
> */
> next_bit = extract_bit(key->data, node->prefixlen);
> node = rcu_dereference_check(node->child[next_bit],
> - rcu_read_lock_bh_held());
> + bpf_rcu_lock_held());
> }
>
> if (!found)
> @@ -359,7 +359,7 @@ static long trie_update_elem(struct bpf_map *map,
> */
> slot = &trie->root;
>
> - while ((node = rcu_dereference(*slot))) {
> + while ((node = rcu_dereference_protected(*slot, 1))) {
> matchlen = longest_prefix_match(trie, node, key);
>
> if (node->prefixlen != matchlen ||
> @@ -482,7 +482,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
> trim = &trie->root;
> trim2 = trim;
> parent = NULL;
> - while ((node = rcu_dereference(*trim))) {
> + while ((node = rcu_dereference_protected(*trim, 1))) {
> matchlen = longest_prefix_match(trie, node, key);
>
> if (node->prefixlen != matchlen ||
* [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
2026-06-09 13:55 ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
@ 2026-06-09 13:55 ` Vlad Poenaru
2026-06-09 16:19 ` Emil Tsalapatis
2026-06-10 1:53 ` Hou Tao
2026-06-09 19:50 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries patchwork-bot+netdevbpf
2 siblings, 2 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC (permalink / raw)
To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel

The previous change relaxed the rcu_dereference annotations in
lpm_trie.c so the trie walks no longer trip lockdep when reached from a
sleepable BPF program holding only rcu_read_lock_trace(). By itself
that only helps tries reached as the inner map of a map-of-maps, or
from the classic-RCU syscall path: a sleepable program that references
an LPM trie directly is still rejected at load time by
check_map_prog_compatibility(), whose sleepable whitelist omits
BPF_MAP_TYPE_LPM_TRIE:

Sleepable programs can only use array, hash, ringbuf and local storage maps

LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
into a Tasks Trace grace period before the node -- and the value
embedded in it that trie_lookup_elem() returns to the program -- is
released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
on for sleepable access, so a value handed to a sleepable reader cannot
be freed while the program is still running under rcu_read_lock_trace().
The writer paths take trie->lock across the walk and never relied on the
RCU read-side lock to keep nodes alive.

Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
programs can use LPM tries directly.

Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
kernel/bpf/verifier.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7fb88e1cd7c4..71c1e59e4df4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
case BPF_MAP_TYPE_PERCPU_HASH:
case BPF_MAP_TYPE_PERCPU_ARRAY:
case BPF_MAP_TYPE_LRU_PERCPU_HASH:
+ case BPF_MAP_TYPE_LPM_TRIE:
case BPF_MAP_TYPE_ARRAY_OF_MAPS:
case BPF_MAP_TYPE_HASH_OF_MAPS:
case BPF_MAP_TYPE_RINGBUF:
--
2.53.0-Meta
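
The load-time behavior described in this patch can be pictured with a small userspace model. It is a sketch under stated assumptions, not the verifier's code: enum model_map_type, its members, and model_sleepable_map_ok() are hypothetical stand-ins, and the model assumes the whitelist is applied to the map type the program references directly, which is how the commit message describes the map-of-maps case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a handful of map types. */
enum model_map_type {
    MODEL_MAP_HASH,
    MODEL_MAP_ARRAY,
    MODEL_MAP_LPM_TRIE,
    MODEL_MAP_HASH_OF_MAPS,
    MODEL_MAP_RINGBUF,
    MODEL_MAP_OTHER /* any type that is not on the sleepable list */
};

/*
 * Models the sleepable whitelist check for the directly referenced map.
 * 'patched' selects whether the LPM trie entry has been added.
 */
bool model_sleepable_map_ok(enum model_map_type direct, bool patched)
{
    assert(direct <= MODEL_MAP_OTHER);

    switch (direct) {
    case MODEL_MAP_HASH:
    case MODEL_MAP_ARRAY:
    case MODEL_MAP_HASH_OF_MAPS:
    case MODEL_MAP_RINGBUF:
        return true;
    case MODEL_MAP_LPM_TRIE:
        return patched;
    default:
        return false;
    }
}
```

In this model a program that references MODEL_MAP_HASH_OF_MAPS is accepted with or without the patch, even if the inner map is an LPM trie, because only the outer type is checked. A program that references MODEL_MAP_LPM_TRIE directly is accepted only when patched is true.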
* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
2026-06-09 13:55 ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
@ 2026-06-09 16:19 ` Emil Tsalapatis
2026-06-10 1:53 ` Hou Tao
1 sibling, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-06-09 16:19 UTC (permalink / raw)
To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel

On Tue Jun 9, 2026 at 9:55 AM EDT, Vlad Poenaru wrote:
> The previous change relaxed the rcu_dereference annotations in
> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
> sleepable BPF program holding only rcu_read_lock_trace(). By itself
> that only helps tries reached as the inner map of a map-of-maps, or
> from the classic-RCU syscall path: a sleepable program that references
> an LPM trie directly is still rejected at load time by
> check_map_prog_compatibility(), whose sleepable whitelist omits
> BPF_MAP_TYPE_LPM_TRIE:
>
> Sleepable programs can only use array, hash, ringbuf and local storage maps
>
> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
> into a Tasks Trace grace period before the node -- and the value
> embedded in it that trie_lookup_elem() returns to the program -- is
> released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
> on for sleepable access, so a value handed to a sleepable reader cannot
> be freed while the program is still running under rcu_read_lock_trace().
> The writer paths take trie->lock across the walk and never relied on the
> RCU read-side lock to keep nodes alive.
>
> Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
> programs can use LPM tries directly.
>
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
> kernel/bpf/verifier.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7fb88e1cd7c4..71c1e59e4df4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> case BPF_MAP_TYPE_PERCPU_HASH:
> case BPF_MAP_TYPE_PERCPU_ARRAY:
> case BPF_MAP_TYPE_LRU_PERCPU_HASH:
> + case BPF_MAP_TYPE_LPM_TRIE:
> case BPF_MAP_TYPE_ARRAY_OF_MAPS:
> case BPF_MAP_TYPE_HASH_OF_MAPS:
> case BPF_MAP_TYPE_RINGBUF:
* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
2026-06-09 13:55 ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
2026-06-09 16:19 ` Emil Tsalapatis
@ 2026-06-10 1:53 ` Hou Tao
2026-06-10 2:34 ` Alexei Starovoitov
1 sibling, 1 reply; 10+ messages in thread
From: Hou Tao @ 2026-06-10 1:53 UTC (permalink / raw)
To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel

Hi,

On 6/9/2026 9:55 PM, Vlad Poenaru wrote:
> The previous change relaxed the rcu_dereference annotations in
> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
> sleepable BPF program holding only rcu_read_lock_trace(). By itself
> that only helps tries reached as the inner map of a map-of-maps, or
> from the classic-RCU syscall path: a sleepable program that references
> an LPM trie directly is still rejected at load time by
> check_map_prog_compatibility(), whose sleepable whitelist omits
> BPF_MAP_TYPE_LPM_TRIE:
>
> Sleepable programs can only use array, hash, ringbuf and local storage maps
>
> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
> into a Tasks Trace grace period before the node -- and the value
> embedded in it that trie_lookup_elem() returns to the program -- is
> released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
> on for sleepable access, so a value handed to a sleepable reader cannot
> be freed while the program is still running under rcu_read_lock_trace().
> The writer paths take trie->lock across the walk and never relied on the
> RCU read-side lock to keep nodes alive.

For trie_lookup_elem(), I do not think it is safe to enable use from
sleepable programs as this patch does, because the lookup may return an
unexpected value. The main reason is that rcu_read_lock_trace() cannot
guarantee that the node currently being looked up will not be reused
concurrently by another update. rcu_read_lock() does provide that
guarantee, because bpf_mem_cache_free_rcu() makes a node reusable only
after one RCU grace period. I think the hash-table case has a similar
problem, though it already uses some tricks (the hlist_nulls_node
variants) to mitigate it.

>
> Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
> programs can use LPM tries directly.
>
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
> ---
> kernel/bpf/verifier.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7fb88e1cd7c4..71c1e59e4df4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> case BPF_MAP_TYPE_PERCPU_HASH:
> case BPF_MAP_TYPE_PERCPU_ARRAY:
> case BPF_MAP_TYPE_LRU_PERCPU_HASH:
> + case BPF_MAP_TYPE_LPM_TRIE:
> case BPF_MAP_TYPE_ARRAY_OF_MAPS:
> case BPF_MAP_TYPE_HASH_OF_MAPS:
> case BPF_MAP_TYPE_RINGBUF:
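
The reuse concern raised in this reply can be illustrated with a deterministic, single-threaded userspace simulation. It is only a sketch of the scenario as described in the message: struct model_node, struct model_cache, and the model_* functions are hypothetical, the one-slot cache is a drastic simplification of an allocator cache, and the interleaving that a real race would need is written out as sequential steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified trie node: one prefix and its value. */
struct model_node {
    uint32_t prefix;
    uint32_t prefixlen;
    int value;
};

/* Hypothetical one-slot allocator cache. */
struct model_cache {
    struct model_node slot;
    bool allocated;    /* slot belongs to a live or not yet reusable node */
    bool pending_free; /* freed, waiting for a regular RCU grace period */
};

struct model_node *model_alloc(struct model_cache *c, uint32_t prefix,
                               uint32_t prefixlen, int value)
{
    assert(c);
    if (c->allocated)
        return NULL; /* not reusable yet */
    c->slot = (struct model_node){ prefix, prefixlen, value };
    c->allocated = true;
    return &c->slot;
}

/* Models freeing a node: reuse is deferred to the next regular RCU GP. */
void model_free_rcu(struct model_cache *c)
{
    assert(c && c->allocated);
    c->pending_free = true;
}

/* A regular RCU GP ends; a Tasks Trace reader does not hold it back. */
void model_rcu_gp_end(struct model_cache *c)
{
    assert(c);
    if (c->pending_free) {
        c->pending_free = false;
        c->allocated = false; /* reusable, memory itself stays valid */
    }
}

/* Returns the value a sleepable reader sees after the slot is reused. */
int model_stale_read(void)
{
    struct model_cache cache = { .allocated = false };
    struct model_node *n = model_alloc(&cache, 0x0a000000u, 8, 1);
    const struct model_node *reader = n; /* reader found 10.0.0.0/8 */
    struct model_node *reused;

    model_free_rcu(&cache);   /* writer deletes the entry */
    model_rcu_gp_end(&cache); /* regular GP passes while reader runs */
    reused = model_alloc(&cache, 0xc0a80000u, 16, 2); /* 192.168.0.0/16 */
    assert(reused == n);      /* same memory, different key and value */

    return reader->value;
}
```

In this model the reader's pointer remains valid memory throughout, so there is no use-after-free, but model_stale_read() returns 2, the value stored for a different prefix, rather than the 1 the reader originally matched. A reader under classic rcu_read_lock() would have delayed the grace period, and with it the reuse, until it finished.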
* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
2026-06-10 1:53 ` Hou Tao
@ 2026-06-10 2:34 ` Alexei Starovoitov
0 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2026-06-10 2:34 UTC (permalink / raw)
To: Hou Tao, Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel

On Tue Jun 9, 2026 at 6:53 PM PDT, Hou Tao wrote:
> Hi,
>
> On 6/9/2026 9:55 PM, Vlad Poenaru wrote:
>> The previous change relaxed the rcu_dereference annotations in
>> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
>> sleepable BPF program holding only rcu_read_lock_trace(). By itself
>> that only helps tries reached as the inner map of a map-of-maps, or
>> from the classic-RCU syscall path: a sleepable program that references
>> an LPM trie directly is still rejected at load time by
>> check_map_prog_compatibility(), whose sleepable whitelist omits
>> BPF_MAP_TYPE_LPM_TRIE:
>>
>> Sleepable programs can only use array, hash, ringbuf and local storage maps
>>
>> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
>> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
>> into a Tasks Trace grace period before the node -- and the value
>> embedded in it that trie_lookup_elem() returns to the program -- is
>> released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
>> on for sleepable access, so a value handed to a sleepable reader cannot
>> be freed while the program is still running under rcu_read_lock_trace().
>> The writer paths take trie->lock across the walk and never relied on the
>> RCU read-side lock to keep nodes alive.
>
> For trie_lookup_elem(), I do not think it is safe to enable use from
> sleepable programs as this patch does, because the lookup may return an
> unexpected value. The main reason is that rcu_read_lock_trace() cannot
> guarantee that the node currently being looked up will not be reused
> concurrently by another update. rcu_read_lock() does provide that
> guarantee, because bpf_mem_cache_free_rcu() makes a node reusable only
> after one RCU grace period. I think the hash-table case has a similar
> problem, though it already uses some tricks (the hlist_nulls_node
> variants) to mitigate it.

You're correct. I remember that discussion.
Yet people already use lpm via map-in-map bug/workaround.
So I applied this set to make lpm-in-sleepable usage official
and force us to do a proper fix.
Also both AI bots didn't spot an issue, so the bug won't be
discovered immediately and we won't see a flurry of
"security" reports with slop "fixes".
AI isn't that smart yet.
* Re: [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
2026-06-09 13:55 ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
2026-06-09 13:55 ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
@ 2026-06-09 19:50 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 10+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-06-09 19:50 UTC (permalink / raw)
To: Vlad Poenaru
Cc: bpf, ast, daniel, andrii, john.fastabend, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, toke, emil, linux-kernel

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 9 Jun 2026 06:55:56 -0700 you wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with only
> rcu_read_lock_bh_held(), so a sleepable BPF program that touches an LPM
> trie (e.g. a sleepable LSM hook calling bpf_map_lookup_elem()) trips a
> "suspicious RCU usage" lockdep splat on debug kernels: it holds only
> rcu_read_lock_trace(), which that annotation does not accept.
>
> Patch 1 relaxes the rcu_dereference annotations in the trie walks so they
> no longer trip lockdep from the Tasks Trace context, including the
> trie_update_elem()/trie_delete_elem() writer walks (protected by
> trie->lock). Patch 2 adds BPF_MAP_TYPE_LPM_TRIE to the verifier's
> sleepable map whitelist so sleepable programs can reference an LPM trie
> directly, not just as the inner map of a map-of-maps. LPM trie nodes are
> reclaimed via bpf_mem_cache_free_rcu(), which chains a regular RCU grace
> period into a Tasks Trace grace period before freeing -- the same
> discipline BPF_MAP_TYPE_HASH relies on for sleepable access.
>
> [...]

Here is the summary with links:
- [bpf,v2,1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
https://git.kernel.org/bpf/bpf-next/c/2f884d371faf
- [bpf,v2,2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
https://git.kernel.org/bpf/bpf-next/c/a3d76e27bbbf

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html