* [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs
@ 2026-05-29 17:42 Vlad Poenaru
2026-05-29 19:19 ` Emil Tsalapatis
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
0 siblings, 2 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-05-29 17:42 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf
Cc: Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen, linux-kernel, stable
trie_lookup_elem() annotates its rcu_dereference_check() walks with
only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
classic RCU readers but fails for sleepable BPF programs, which enter
via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
A sleepable LSM hook that ends up doing bpf_map_lookup_elem() on an LPM
trie therefore triggers lockdep on debug kernels:
=============================
WARNING: suspicious RCU usage
7.1.0-... Tainted: G E
-----------------------------
kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
1 lock held by net_tests/540:
#0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
at: __bpf_prog_enter_sleepable+0x26/0x280
Call Trace:
dump_stack_lvl
lockdep_rcu_suspicious
trie_lookup_elem
bpf_prog_..._enforce_security_socket_connect
bpf_trampoline_...
security_socket_connect
__sys_connect
do_syscall_64
This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
against the trie's reclaim path -- but it spams the console once per
distinct callsite on every debug kernel running a sleepable BPF LSM
that does map lookups on an LPM trie, which is increasingly common.
Other map types already use the bpf_rcu_lock_held() helper, which
accepts all three contexts (classic, BH, Tasks Trace). Use it here as
well, matching the established convention.
Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
kernel/bpf/lpm_trie.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 0f57608b385d..ac36063cb7e6 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
/* Start walking the trie from the root node ... */
- for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
+ for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
node;) {
unsigned int next_bit;
size_t matchlen;
@@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
*/
next_bit = extract_bit(key->data, node->prefixlen);
node = rcu_dereference_check(node->child[next_bit],
- rcu_read_lock_bh_held());
+ bpf_rcu_lock_held());
}
if (!found)
--
2.53.0-Meta
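
The predicate logic described in the commit message above can be sketched in
userspace C. This is only a model under stated assumptions: struct reader_ctx
and the *_ok() helpers are hypothetical names invented for illustration, and
the three booleans stand in for lockdep's view of which read-side primitive
the caller holds. bpf_rcu_lock_held() is modeled as "classic || BH || Tasks
Trace", following the description in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for lockdep state: which read-side "locks" are held. */
struct reader_ctx {
	bool rcu;	/* classic rcu_read_lock() */
	bool rcu_bh;	/* rcu_read_lock_bh() */
	bool rcu_trace;	/* rcu_read_lock_trace() */
};

/* Model of rcu_dereference_check(p, c): accepted if c || rcu_read_lock_held(). */
static bool deref_check_ok(const struct reader_ctx *ctx, bool c)
{
	return c || ctx->rcu;
}

/* Old annotation in trie_lookup_elem(): c = rcu_read_lock_bh_held(). */
static bool old_lookup_ok(const struct reader_ctx *ctx)
{
	return deref_check_ok(ctx, ctx->rcu_bh);
}

/* New annotation: c = bpf_rcu_lock_held(), modeled as classic || BH || trace. */
static bool new_lookup_ok(const struct reader_ctx *ctx)
{
	return deref_check_ok(ctx, ctx->rcu || ctx->rcu_bh || ctx->rcu_trace);
}

/* Walk through the three contexts named in the commit message. */
static void lookup_annotation_demo(void)
{
	struct reader_ctx classic = { .rcu = true };
	struct reader_ctx xdp = { .rcu_bh = true };
	struct reader_ctx sleepable = { .rcu_trace = true };

	assert(old_lookup_ok(&classic) && new_lookup_ok(&classic));
	assert(old_lookup_ok(&xdp) && new_lookup_ok(&xdp));
	assert(!old_lookup_ok(&sleepable));	/* the case that splats */
	assert(new_lookup_ok(&sleepable));	/* accepted after the change */
}
```

Calling lookup_annotation_demo() exercises all three contexts; only the
Tasks Trace-only reader changes outcome between the old and new annotation.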
* Re: [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs
2026-05-29 17:42 [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs Vlad Poenaru
@ 2026-05-29 19:19 ` Emil Tsalapatis
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
1 sibling, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-05-29 19:19 UTC (permalink / raw)
To: Vlad Poenaru, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, bpf
Cc: Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen, linux-kernel, stable
On Fri May 29, 2026 at 1:42 PM EDT, Vlad Poenaru wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with
> only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
> resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
> classic RCU readers but fails for sleepable BPF programs, which enter
> via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
>
> A sleepable LSM hook that ends up doing bpf_map_lookup_elem() on an LPM
> trie therefore triggers lockdep on debug kernels:
>
> =============================
> WARNING: suspicious RCU usage
> 7.1.0-... Tainted: G E
> -----------------------------
> kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
> 1 lock held by net_tests/540:
> #0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
> at: __bpf_prog_enter_sleepable+0x26/0x280
> Call Trace:
> dump_stack_lvl
> lockdep_rcu_suspicious
> trie_lookup_elem
> bpf_prog_..._enforce_security_socket_connect
> bpf_trampoline_...
> security_socket_connect
> __sys_connect
> do_syscall_64
>
> This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
> against the trie's reclaim path -- but it spams the console once per
> distinct callsite on every debug kernel running a sleepable BPF LSM
> that does map lookups on an LPM trie, which is increasingly common.
>
> Other map types already use the bpf_rcu_lock_held() helper, which
> accepts all three contexts (classic, BH, Tasks Trace). Use it here as
> well, matching the established convention.
>
> Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
> Cc: stable@vger.kernel.org
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/bpf/lpm_trie.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 0f57608b385d..ac36063cb7e6 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
>
> /* Start walking the trie from the root node ... */
>
> - for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
> + for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
> node;) {
> unsigned int next_bit;
> size_t matchlen;
> @@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
> */
> next_bit = extract_bit(key->data, node->prefixlen);
> node = rcu_dereference_check(node->child[next_bit],
> - rcu_read_lock_bh_held());
> + bpf_rcu_lock_held());
> }
>
> if (!found)
* [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries
2026-05-29 17:42 [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs Vlad Poenaru
2026-05-29 19:19 ` Emil Tsalapatis
@ 2026-06-09 13:55 ` Vlad Poenaru
2026-06-09 13:55 ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
` (2 more replies)
1 sibling, 3 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC (permalink / raw)
To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel
trie_lookup_elem() annotates its rcu_dereference_check() walks with only
rcu_read_lock_bh_held(), so a sleepable BPF program that touches an LPM
trie (e.g. a sleepable LSM hook calling bpf_map_lookup_elem()) trips a
"suspicious RCU usage" lockdep splat on debug kernels: it holds only
rcu_read_lock_trace(), which that annotation does not accept.
Patch 1 relaxes the rcu_dereference annotations in the trie walks so they
no longer trip lockdep from the Tasks Trace context, including the
trie_update_elem()/trie_delete_elem() writer walks (protected by
trie->lock). Patch 2 adds BPF_MAP_TYPE_LPM_TRIE to the verifier's
sleepable map whitelist so sleepable programs can reference an LPM trie
directly, not just as the inner map of a map-of-maps. LPM trie nodes are
reclaimed via bpf_mem_cache_free_rcu(), which chains a regular RCU grace
period into a Tasks Trace grace period before freeing -- the same
discipline BPF_MAP_TYPE_HASH relies on for sleepable access.
Changes since v1:
- Split into a 2-patch series.
- Patch 1 now also converts the trie_update_elem()/trie_delete_elem()
walks from rcu_dereference() to rcu_dereference_protected(*p, 1),
addressing review feedback that v1 only fixed the lookup path and left
the same splat on the writer paths.
- New patch 2 adds the verifier whitelist entry so the fix is actually
reachable for directly-referenced LPM tries.
- Retitled v1 ("Allow lookups from sleepable BPF programs").
v1: https://lore.kernel.org/all/20260529174233.2954240-1-vlad.wing@gmail.com/
Vlad Poenaru (2):
bpf, lpm_trie: Allow access from sleepable BPF programs
bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
kernel/bpf/lpm_trie.c | 8 ++++----
kernel/bpf/verifier.c | 1 +
2 files changed, 5 insertions(+), 4 deletions(-)
--
2.53.0-Meta
* [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
@ 2026-06-09 13:55 ` Vlad Poenaru
2026-06-09 16:36 ` Emil Tsalapatis
2026-06-09 13:55 ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
2026-06-09 19:50 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries patchwork-bot+netdevbpf
2 siblings, 1 reply; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC (permalink / raw)
To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel, stable
trie_lookup_elem() annotates its rcu_dereference_check() walks with
only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
classic RCU readers but fails for sleepable BPF programs, which enter
via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
trie_update_elem() and trie_delete_elem() have the same problem in a
different form: they walk the trie with plain rcu_dereference(), which
asserts rcu_read_lock_held() unconditionally. Both are reachable from
sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem
helpers, and from the syscall path under classic rcu_read_lock(). In
the writer paths the trie is actually protected by trie->lock (an
rqspinlock taken across the walk); we never relied on the RCU read-side
lock to keep nodes alive there.
A sleepable LSM hook that ends up touching an LPM trie therefore
triggers lockdep on debug kernels:
=============================
WARNING: suspicious RCU usage
7.1.0-... Tainted: G E
-----------------------------
kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
1 lock held by net_tests/540:
#0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
at: __bpf_prog_enter_sleepable+0x26/0x280
Call Trace:
dump_stack_lvl
lockdep_rcu_suspicious
trie_lookup_elem
bpf_prog_..._enforce_security_socket_connect
bpf_trampoline_...
security_socket_connect
__sys_connect
do_syscall_64
This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
against the trie's reclaim path -- but it spams the console once per
distinct callsite on every debug kernel running a sleepable BPF LSM
that touches an LPM trie, which is increasingly common.
For the lookup path, switch the rcu_dereference_check() annotation
from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all
three contexts (classic, BH, Tasks Trace). Other map types already
follow this convention.
For trie_update_elem() and trie_delete_elem(), annotate the walks as
rcu_dereference_protected(*p, 1) -- matching trie_free() in the same
file -- since trie->lock is held across the walk. rqspinlock has no
lockdep_map, so the predicate degenerates to '1' rather than
lockdep_is_held(&trie->lock); the protection is real but not
machine-verifiable. trie_get_next_key() also uses bare
rcu_dereference() but is reachable only from the BPF syscall, which
holds classic rcu_read_lock() before dispatching, so it is left
untouched.
Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
kernel/bpf/lpm_trie.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 0f57608b385d..4d6f25db9ba1 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
/* Start walking the trie from the root node ... */
- for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
+ for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
node;) {
unsigned int next_bit;
size_t matchlen;
@@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
*/
next_bit = extract_bit(key->data, node->prefixlen);
node = rcu_dereference_check(node->child[next_bit],
- rcu_read_lock_bh_held());
+ bpf_rcu_lock_held());
}
if (!found)
@@ -359,7 +359,7 @@ static long trie_update_elem(struct bpf_map *map,
*/
slot = &trie->root;
- while ((node = rcu_dereference(*slot))) {
+ while ((node = rcu_dereference_protected(*slot, 1))) {
matchlen = longest_prefix_match(trie, node, key);
if (node->prefixlen != matchlen ||
@@ -482,7 +482,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
trim = &trie->root;
trim2 = trim;
parent = NULL;
- while ((node = rcu_dereference(*trim))) {
+ while ((node = rcu_dereference_protected(*trim, 1))) {
matchlen = longest_prefix_match(trie, node, key);
if (node->prefixlen != matchlen ||
--
2.53.0-Meta
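
The writer-side pattern from patch 1 (walk a pointer-to-pointer "slot" while
the map lock is held, and annotate each load as lock-protected rather than
RCU-reader-protected) can be sketched as a userspace toy. Everything here is
an assumption made for illustration: toy_trie, toy_node, toy_insert() and
toy_deref_protected() are hypothetical, the tree is a plain binary search
tree rather than an LPM trie, and a boolean flag stands in for trie->lock.
Unlike the kernel patch, which has to pass a constant 1 because rqspinlock
has no lockdep_map, the toy can check its flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_node {
	unsigned int key;
	struct toy_node *child[2];
};

struct toy_trie {
	struct toy_node *root;
	bool locked;	/* stands in for trie->lock being held */
};

/* Model of rcu_dereference_protected(p, c): a plain load, with c asserted. */
#define toy_deref_protected(p, c) (assert(c), (p))

/* Returns 1 if inserted, 0 if the key already exists, -1 on allocation failure. */
static int toy_insert(struct toy_trie *t, unsigned int key)
{
	struct toy_node **slot, *node, *new_node;
	int ret = 1;

	t->locked = true;		/* "take" the lock across the whole walk */
	slot = &t->root;
	while ((node = toy_deref_protected(*slot, t->locked))) {
		if (node->key == key) {
			ret = 0;
			goto out;
		}
		slot = &node->child[key > node->key];
	}

	new_node = calloc(1, sizeof(*new_node));
	if (!new_node) {
		ret = -1;
		goto out;
	}
	new_node->key = key;
	*slot = new_node;		/* kernel code would publish with rcu_assign_pointer() */
out:
	t->locked = false;		/* "release" the lock */
	return ret;
}

/* Count nodes, used only to check the toy's result. */
static size_t toy_count(const struct toy_node *n)
{
	return n ? 1 + toy_count(n->child[0]) + toy_count(n->child[1]) : 0;
}
```

The point of the sketch is that the walk never consults any reader-side
state: its safety comes from the lock alone, which is why a reader-context
assertion such as plain rcu_dereference() is the wrong annotation there.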
* [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
2026-06-09 13:55 ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
@ 2026-06-09 13:55 ` Vlad Poenaru
2026-06-09 16:19 ` Emil Tsalapatis
2026-06-10 1:53 ` Hou Tao
2026-06-09 19:50 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries patchwork-bot+netdevbpf
2 siblings, 2 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC (permalink / raw)
To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
John Fastabend, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel
The previous change relaxed the rcu_dereference annotations in
lpm_trie.c so the trie walks no longer trip lockdep when reached from a
sleepable BPF program holding only rcu_read_lock_trace(). By itself
that only helps tries reached as the inner map of a map-of-maps, or
from the classic-RCU syscall path: a sleepable program that references
an LPM trie directly is still rejected at load time by
check_map_prog_compatibility(), whose sleepable whitelist omits
BPF_MAP_TYPE_LPM_TRIE:
Sleepable programs can only use array, hash, ringbuf and local storage maps
LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
into a Tasks Trace grace period before the node -- and the value
embedded in it that trie_lookup_elem() returns to the program -- is
released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
on for sleepable access, so a value handed to a sleepable reader cannot
be freed while the program is still running under rcu_read_lock_trace().
The writer paths take trie->lock across the walk and never relied on the
RCU read-side lock to keep nodes alive.
Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
programs can use LPM tries directly.
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
kernel/bpf/verifier.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7fb88e1cd7c4..71c1e59e4df4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
case BPF_MAP_TYPE_PERCPU_HASH:
case BPF_MAP_TYPE_PERCPU_ARRAY:
case BPF_MAP_TYPE_LRU_PERCPU_HASH:
+ case BPF_MAP_TYPE_LPM_TRIE:
case BPF_MAP_TYPE_ARRAY_OF_MAPS:
case BPF_MAP_TYPE_HASH_OF_MAPS:
case BPF_MAP_TYPE_RINGBUF:
--
2.53.0-Meta
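
The effect of the one-line verifier change can be modeled as a switch over
map types. This is a hedged sketch, not the verifier's code: enum
toy_map_type, its members, and toy_check_sleepable_map() are hypothetical
names, the list is heavily trimmed, and the lpm_allowed flag exists only to
compare behaviour before and after the patch.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-ins for a few BPF_MAP_TYPE_* values. */
enum toy_map_type {
	TOY_MAP_HASH,
	TOY_MAP_ARRAY,
	TOY_MAP_LPM_TRIE,
	TOY_MAP_RINGBUF,
	TOY_MAP_UNLISTED,	/* any type the sleepable list does not name */
};

/*
 * Returns 0 if a sleepable program may reference the map directly, or -1 if
 * the load would be rejected. lpm_allowed selects the pre-patch (false) or
 * post-patch (true) behaviour.
 */
static int toy_check_sleepable_map(enum toy_map_type type, bool lpm_allowed)
{
	switch (type) {
	case TOY_MAP_HASH:
	case TOY_MAP_ARRAY:
	case TOY_MAP_RINGBUF:
		return 0;
	case TOY_MAP_LPM_TRIE:
		return lpm_allowed ? 0 : -1;
	default:
		return -1;
	}
}
```

In the model, only the TOY_MAP_LPM_TRIE result changes between the two
settings, mirroring the single added case label in the diff.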
* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
2026-06-09 13:55 ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
@ 2026-06-09 16:19 ` Emil Tsalapatis
2026-06-10 1:53 ` Hou Tao
1 sibling, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-06-09 16:19 UTC (permalink / raw)
To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
Yonghong Song, Jiri Olsa, Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel
On Tue Jun 9, 2026 at 9:55 AM EDT, Vlad Poenaru wrote:
> The previous change relaxed the rcu_dereference annotations in
> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
> sleepable BPF program holding only rcu_read_lock_trace(). By itself
> that only helps tries reached as the inner map of a map-of-maps, or
> from the classic-RCU syscall path: a sleepable program that references
> an LPM trie directly is still rejected at load time by
> check_map_prog_compatibility(), whose sleepable whitelist omits
> BPF_MAP_TYPE_LPM_TRIE:
>
> Sleepable programs can only use array, hash, ringbuf and local storage maps
>
> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
> into a Tasks Trace grace period before the node -- and the value
> embedded in it that trie_lookup_elem() returns to the program -- is
> released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
> on for sleepable access, so a value handed to a sleepable reader cannot
> be freed while the program is still running under rcu_read_lock_trace().
> The writer paths take trie->lock across the walk and never relied on the
> RCU read-side lock to keep nodes alive.
>
> Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
> programs can use LPM tries directly.
>
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/bpf/verifier.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7fb88e1cd7c4..71c1e59e4df4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> case BPF_MAP_TYPE_PERCPU_HASH:
> case BPF_MAP_TYPE_PERCPU_ARRAY:
> case BPF_MAP_TYPE_LRU_PERCPU_HASH:
> + case BPF_MAP_TYPE_LPM_TRIE:
> case BPF_MAP_TYPE_ARRAY_OF_MAPS:
> case BPF_MAP_TYPE_HASH_OF_MAPS:
> case BPF_MAP_TYPE_RINGBUF:
* Re: [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
2026-06-09 13:55 ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
@ 2026-06-09 16:36 ` Emil Tsalapatis
0 siblings, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-06-09 16:36 UTC (permalink / raw)
To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
Yonghong Song, Jiri Olsa, Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel, stable
On Tue Jun 9, 2026 at 9:55 AM EDT, Vlad Poenaru wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with
> only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
> resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
> classic RCU readers but fails for sleepable BPF programs, which enter
> via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
>
> trie_update_elem() and trie_delete_elem() have the same problem in a
> different form: they walk the trie with plain rcu_dereference(), which
> asserts rcu_read_lock_held() unconditionally. Both are reachable from
> sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem
> helpers, and from the syscall path under classic rcu_read_lock(). In
> the writer paths the trie is actually protected by trie->lock (an
> rqspinlock taken across the walk); we never relied on the RCU read-side
> lock to keep nodes alive there.
>
> A sleepable LSM hook that ends up touching an LPM trie therefore
> triggers lockdep on debug kernels:
>
> =============================
> WARNING: suspicious RCU usage
> 7.1.0-... Tainted: G E
> -----------------------------
> kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
> 1 lock held by net_tests/540:
> #0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
> at: __bpf_prog_enter_sleepable+0x26/0x280
> Call Trace:
> dump_stack_lvl
> lockdep_rcu_suspicious
> trie_lookup_elem
> bpf_prog_..._enforce_security_socket_connect
> bpf_trampoline_...
> security_socket_connect
> __sys_connect
> do_syscall_64
>
> This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
> against the trie's reclaim path -- but it spams the console once per
> distinct callsite on every debug kernel running a sleepable BPF LSM
> that touches an LPM trie, which is increasingly common.
>
> For the lookup path, switch the rcu_dereference_check() annotation
> from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all
> three contexts (classic, BH, Tasks Trace). Other map types already
> follow this convention.
>
> For trie_update_elem() and trie_delete_elem(), annotate the walks as
> rcu_dereference_protected(*p, 1) -- matching trie_free() in the same
> file -- since trie->lock is held across the walk. rqspinlock has no
> lockdep_map, so the predicate degenerates to '1' rather than
> lockdep_is_held(&trie->lock); the protection is real but not
> machine-verifiable. trie_get_next_key() also uses bare
> rcu_dereference() but is reachable only from the BPF syscall, which
> holds classic rcu_read_lock() before dispatching, so it is left
> untouched.
>
> Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
> Cc: stable@vger.kernel.org
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/bpf/lpm_trie.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 0f57608b385d..4d6f25db9ba1 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
>
> /* Start walking the trie from the root node ... */
>
> - for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
> + for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
> node;) {
> unsigned int next_bit;
> size_t matchlen;
> @@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
> */
> next_bit = extract_bit(key->data, node->prefixlen);
> node = rcu_dereference_check(node->child[next_bit],
> - rcu_read_lock_bh_held());
> + bpf_rcu_lock_held());
> }
>
> if (!found)
> @@ -359,7 +359,7 @@ static long trie_update_elem(struct bpf_map *map,
> */
> slot = &trie->root;
>
> - while ((node = rcu_dereference(*slot))) {
> + while ((node = rcu_dereference_protected(*slot, 1))) {
> matchlen = longest_prefix_match(trie, node, key);
>
> if (node->prefixlen != matchlen ||
> @@ -482,7 +482,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
> trim = &trie->root;
> trim2 = trim;
> parent = NULL;
> - while ((node = rcu_dereference(*trim))) {
> + while ((node = rcu_dereference_protected(*trim, 1))) {
> matchlen = longest_prefix_match(trie, node, key);
>
> if (node->prefixlen != matchlen ||
* Re: [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries
2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
2026-06-09 13:55 ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
2026-06-09 13:55 ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
@ 2026-06-09 19:50 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 10+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-06-09 19:50 UTC (permalink / raw)
To: Vlad Poenaru
Cc: bpf, ast, daniel, andrii, john.fastabend, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, toke, emil, linux-kernel
Hello:
This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:
On Tue, 9 Jun 2026 06:55:56 -0700 you wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with only
> rcu_read_lock_bh_held(), so a sleepable BPF program that touches an LPM
> trie (e.g. a sleepable LSM hook calling bpf_map_lookup_elem()) trips a
> "suspicious RCU usage" lockdep splat on debug kernels: it holds only
> rcu_read_lock_trace(), which that annotation does not accept.
>
> Patch 1 relaxes the rcu_dereference annotations in the trie walks so they
> no longer trip lockdep from the Tasks Trace context, including the
> trie_update_elem()/trie_delete_elem() writer walks (protected by
> trie->lock). Patch 2 adds BPF_MAP_TYPE_LPM_TRIE to the verifier's
> sleepable map whitelist so sleepable programs can reference an LPM trie
> directly, not just as the inner map of a map-of-maps. LPM trie nodes are
> reclaimed via bpf_mem_cache_free_rcu(), which chains a regular RCU grace
> period into a Tasks Trace grace period before freeing -- the same
> discipline BPF_MAP_TYPE_HASH relies on for sleepable access.
>
> [...]
Here is the summary with links:
- [bpf,v2,1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
https://git.kernel.org/bpf/bpf-next/c/2f884d371faf
- [bpf,v2,2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
https://git.kernel.org/bpf/bpf-next/c/a3d76e27bbbf
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
2026-06-09 13:55 ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
2026-06-09 16:19 ` Emil Tsalapatis
@ 2026-06-10 1:53 ` Hou Tao
2026-06-10 2:34 ` Alexei Starovoitov
1 sibling, 1 reply; 10+ messages in thread
From: Hou Tao @ 2026-06-10 1:53 UTC (permalink / raw)
To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
Yonghong Song, Jiri Olsa, Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel
Hi,
On 6/9/2026 9:55 PM, Vlad Poenaru wrote:
> The previous change relaxed the rcu_dereference annotations in
> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
> sleepable BPF program holding only rcu_read_lock_trace(). By itself
> that only helps tries reached as the inner map of a map-of-maps, or
> from the classic-RCU syscall path: a sleepable program that references
> an LPM trie directly is still rejected at load time by
> check_map_prog_compatibility(), whose sleepable whitelist omits
> BPF_MAP_TYPE_LPM_TRIE:
>
> Sleepable programs can only use array, hash, ringbuf and local storage maps
>
> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
> into a Tasks Trace grace period before the node -- and the value
> embedded in it that trie_lookup_elem() returns to the program -- is
> released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
> on for sleepable access, so a value handed to a sleepable reader cannot
> be freed while the program is still running under rcu_read_lock_trace().
> The writer paths take trie->lock across the walk and never relied on the
> RCU read-side lock to keep nodes alive.
For trie_lookup_elem(), I think it is not safe to enable its use in
sleepable programs as this patch does, because the lookup may return an
unexpected value. The main reason is that rcu_read_lock_trace() cannot
guarantee that the node currently being looked up will not be reused
concurrently by another update. rcu_read_lock() does provide that
guarantee, because bpf_mem_cache_free_rcu() makes a node reusable only
after one RCU grace period. I think the hash-table case has a similar
problem, though it already uses some tricks (the hlist_nulls_node
variants) to mitigate it.
>
> Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
> programs can use LPM tries directly.
>
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
> ---
> kernel/bpf/verifier.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7fb88e1cd7c4..71c1e59e4df4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> case BPF_MAP_TYPE_PERCPU_HASH:
> case BPF_MAP_TYPE_PERCPU_ARRAY:
> case BPF_MAP_TYPE_LRU_PERCPU_HASH:
> + case BPF_MAP_TYPE_LPM_TRIE:
> case BPF_MAP_TYPE_ARRAY_OF_MAPS:
> case BPF_MAP_TYPE_HASH_OF_MAPS:
> case BPF_MAP_TYPE_RINGBUF:
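
The reuse concern raised in this reply can be illustrated with a
single-threaded toy. It is a sketch under stated assumptions, not the
behaviour of the real allocator: toy_cache, its three states, and every
function below are hypothetical, and the only rule taken from the thread is
that a freed node becomes reusable once a regular RCU grace period has
elapsed, regardless of any reader that holds only the Tasks Trace read-side
lock.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
	unsigned int prefixlen;
	int value;
};

enum toy_state { TOY_FREE, TOY_LIVE, TOY_WAIT_GP };

/* One-slot cache: the same memory is handed out again once it is free. */
struct toy_cache {
	struct toy_node slot;
	enum toy_state state;
};

static struct toy_node *toy_alloc(struct toy_cache *c, unsigned int prefixlen,
				  int value)
{
	if (c->state != TOY_FREE)
		return NULL;
	c->state = TOY_LIVE;
	c->slot.prefixlen = prefixlen;
	c->slot.value = value;
	return &c->slot;
}

/* Writer deletes the node; it waits for a regular RCU grace period. */
static void toy_free_rcu(struct toy_cache *c)
{
	assert(c->state == TOY_LIVE);
	c->state = TOY_WAIT_GP;
}

/* Regular RCU grace period ends. Trace-only readers are not waited for. */
static void toy_rcu_gp_done(struct toy_cache *c)
{
	if (c->state == TOY_WAIT_GP)
		c->state = TOY_FREE;
}

/* Returns the value a trace-only reader observes through its stale pointer. */
static int toy_reuse_scenario(void)
{
	struct toy_cache c = { .state = TOY_FREE };
	struct toy_node *seen, *reused;

	seen = toy_alloc(&c, 24, 100);	/* reader finds this node mid-walk */
	toy_free_rcu(&c);		/* writer deletes it */
	toy_rcu_gp_done(&c);		/* classic GP ends, reader still running */
	reused = toy_alloc(&c, 8, 200);	/* another update reuses the memory */

	return reused == seen ? seen->value : -1;
}
```

In this model toy_reuse_scenario() returns 200: the memory is never freed
under the reader, so there is no use-after-free, but the reader ends up
looking at a node that now belongs to a different prefix.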
* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
2026-06-10 1:53 ` Hou Tao
@ 2026-06-10 2:34 ` Alexei Starovoitov
0 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2026-06-10 2:34 UTC (permalink / raw)
To: Hou Tao, Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, John Fastabend, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
Yonghong Song, Jiri Olsa, Toke Høiland-Jørgensen
Cc: Emil Tsalapatis, linux-kernel
On Tue Jun 9, 2026 at 6:53 PM PDT, Hou Tao wrote:
> Hi,
>
> On 6/9/2026 9:55 PM, Vlad Poenaru wrote:
>> The previous change relaxed the rcu_dereference annotations in
>> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
>> sleepable BPF program holding only rcu_read_lock_trace(). By itself
>> that only helps tries reached as the inner map of a map-of-maps, or
>> from the classic-RCU syscall path: a sleepable program that references
>> an LPM trie directly is still rejected at load time by
>> check_map_prog_compatibility(), whose sleepable whitelist omits
>> BPF_MAP_TYPE_LPM_TRIE:
>>
>> Sleepable programs can only use array, hash, ringbuf and local storage maps
>>
>> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
>> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
>> into a Tasks Trace grace period before the node -- and the value
>> embedded in it that trie_lookup_elem() returns to the program -- is
>> released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
>> on for sleepable access, so a value handed to a sleepable reader cannot
>> be freed while the program is still running under rcu_read_lock_trace().
>> The writer paths take trie->lock across the walk and never relied on the
>> RCU read-side lock to keep nodes alive.
>
> For trie_lookup_elem(), I think it is not safe to enable its use in
> sleepable programs as this patch does, because the lookup may return an
> unexpected value. The main reason is that rcu_read_lock_trace() cannot
> guarantee that the node currently being looked up will not be reused
> concurrently by another update. rcu_read_lock() does provide that
> guarantee, because bpf_mem_cache_free_rcu() makes a node reusable only
> after one RCU grace period. I think the hash-table case has a similar
> problem, though it already uses some tricks (the hlist_nulls_node
> variants) to mitigate it.
You're correct. I remember that discussion.
Yet people already use lpm via map-in-map bug/workaround.
So I applied this set to make lpm-in-sleepable usage official
and force us to do a proper fix.
Also both AI bots didn't spot an issue, so the bug won't be
discovered immediately and we won't see a flurry of
"security" reports with slop "fixes". AI isn't that smart yet.