The Linux Kernel Mailing List
* [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs
@ 2026-05-29 17:42 Vlad Poenaru
  2026-05-29 19:19 ` Emil Tsalapatis
  2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
  0 siblings, 2 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-05-29 17:42 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf
  Cc: Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa,
	Toke Høiland-Jørgensen, linux-kernel, stable

trie_lookup_elem() annotates its rcu_dereference_check() walks with
only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
classic RCU readers but fails for sleepable BPF programs, which enter
via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().

A sleepable LSM hook that ends up doing bpf_map_lookup_elem() on an LPM
trie therefore triggers lockdep on debug kernels:

  =============================
  WARNING: suspicious RCU usage
  7.1.0-... Tainted: G            E
  -----------------------------
  kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
  1 lock held by net_tests/540:
   #0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
       at: __bpf_prog_enter_sleepable+0x26/0x280
  Call Trace:
   dump_stack_lvl
   lockdep_rcu_suspicious
   trie_lookup_elem
   bpf_prog_..._enforce_security_socket_connect
   bpf_trampoline_...
   security_socket_connect
   __sys_connect
   do_syscall_64

This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
against the trie's reclaim path -- but it spams the console once per
distinct callsite on every debug kernel running a sleepable BPF LSM
that does map lookups on an LPM trie, which is increasingly common.

Other map types already use the bpf_rcu_lock_held() helper, which
accepts all three contexts (classic, BH, Tasks Trace). Use it here as
well, matching the established convention.

Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
 kernel/bpf/lpm_trie.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 0f57608b385d..ac36063cb7e6 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
 
 	/* Start walking the trie from the root node ... */
 
-	for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
+	for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
 	     node;) {
 		unsigned int next_bit;
 		size_t matchlen;
@@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
 		 */
 		next_bit = extract_bit(key->data, node->prefixlen);
 		node = rcu_dereference_check(node->child[next_bit],
-					     rcu_read_lock_bh_held());
+					     bpf_rcu_lock_held());
 	}
 
 	if (!found)
-- 
2.53.0-Meta



* Re: [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs
  2026-05-29 17:42 [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs Vlad Poenaru
@ 2026-05-29 19:19 ` Emil Tsalapatis
  2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
  1 sibling, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-05-29 19:19 UTC
  To: Vlad Poenaru, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf
  Cc: Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa,
	Toke Høiland-Jørgensen, linux-kernel, stable

On Fri May 29, 2026 at 1:42 PM EDT, Vlad Poenaru wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with
> only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
> resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
> classic RCU readers but fails for sleepable BPF programs, which enter
> via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
>
> A sleepable LSM hook that ends up doing bpf_map_lookup_elem() on an LPM
> trie therefore triggers lockdep on debug kernels:
>
>   =============================
>   WARNING: suspicious RCU usage
>   7.1.0-... Tainted: G            E
>   -----------------------------
>   kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
>   1 lock held by net_tests/540:
>    #0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
>        at: __bpf_prog_enter_sleepable+0x26/0x280
>   Call Trace:
>    dump_stack_lvl
>    lockdep_rcu_suspicious
>    trie_lookup_elem
>    bpf_prog_..._enforce_security_socket_connect
>    bpf_trampoline_...
>    security_socket_connect
>    __sys_connect
>    do_syscall_64
>
> This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
> against the trie's reclaim path -- but it spams the console once per
> distinct callsite on every debug kernel running a sleepable BPF LSM
> that does map lookups on an LPM trie, which is increasingly common.
>
> Other map types already use the bpf_rcu_lock_held() helper, which
> accepts all three contexts (classic, BH, Tasks Trace). Use it here as
> well, matching the established convention.
>
> Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
> Cc: stable@vger.kernel.org
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
>  kernel/bpf/lpm_trie.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 0f57608b385d..ac36063cb7e6 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
>  
>  	/* Start walking the trie from the root node ... */
>  
> -	for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
> +	for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
>  	     node;) {
>  		unsigned int next_bit;
>  		size_t matchlen;
> @@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
>  		 */
>  		next_bit = extract_bit(key->data, node->prefixlen);
>  		node = rcu_dereference_check(node->child[next_bit],
> -					     rcu_read_lock_bh_held());
> +					     bpf_rcu_lock_held());
>  	}
>  
>  	if (!found)



* [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries
  2026-05-29 17:42 [PATCH bpf] bpf, lpm_trie: Allow lookups from sleepable BPF programs Vlad Poenaru
  2026-05-29 19:19 ` Emil Tsalapatis
@ 2026-06-09 13:55 ` Vlad Poenaru
  2026-06-09 13:55   ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
                     ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	Toke Høiland-Jørgensen
  Cc: Emil Tsalapatis, linux-kernel

trie_lookup_elem() annotates its rcu_dereference_check() walks with only
rcu_read_lock_bh_held(), so a sleepable BPF program that touches an LPM
trie (e.g. a sleepable LSM hook calling bpf_map_lookup_elem()) trips a
"suspicious RCU usage" lockdep splat on debug kernels: it holds only
rcu_read_lock_trace(), which that annotation does not accept.
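
As a rough illustration of the predicate involved, here is a userspace toy
model (not kernel code). It assumes only what the thread states:
rcu_dereference_check(p, c) passes lockdep when "c || rcu_read_lock_held()"
is true, and bpf_rcu_lock_held() accepts the classic, BH and Tasks Trace
contexts. The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of which read-side locks the caller holds. */
struct rcu_ctx {
	bool classic;	/* models rcu_read_lock() */
	bool bh;	/* models rcu_read_lock_bh() */
	bool trace;	/* models rcu_read_lock_trace() */
};

/* Models rcu_dereference_check(p, c): passes if c || rcu_read_lock_held(). */
static bool deref_check_passes(const struct rcu_ctx *ctx, bool c)
{
	return c || ctx->classic;
}

/* Old annotation: c is rcu_read_lock_bh_held(). */
static bool old_annotation_ok(const struct rcu_ctx *ctx)
{
	return deref_check_passes(ctx, ctx->bh);
}

/* New annotation: c is bpf_rcu_lock_held(), modelled as "any of the three". */
static bool new_annotation_ok(const struct rcu_ctx *ctx)
{
	return deref_check_passes(ctx, ctx->classic || ctx->bh || ctx->trace);
}

/* A sleepable program holds only the Tasks Trace read-side lock. */
static void demo_sleepable(void)
{
	struct rcu_ctx sleepable = { .trace = true };

	assert(!old_annotation_ok(&sleepable));	/* old check would splat */
	assert(new_annotation_ok(&sleepable));	/* new check accepts it */
}
```

Calling demo_sleepable() exercises the one context that differs between the
two annotations; BH-only and classic-only callers pass under both.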

Patch 1 relaxes the rcu_dereference annotations in the trie walks so they
no longer trip lockdep from the Tasks Trace context, including the
trie_update_elem()/trie_delete_elem() writer walks (protected by
trie->lock). Patch 2 adds BPF_MAP_TYPE_LPM_TRIE to the verifier's
sleepable map whitelist so sleepable programs can reference an LPM trie
directly, not just as the inner map of a map-of-maps. LPM trie nodes are
reclaimed via bpf_mem_cache_free_rcu(), which chains a regular RCU grace
period into a Tasks Trace grace period before freeing -- the same
discipline BPF_MAP_TYPE_HASH relies on for sleepable access.
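
The chained reclaim described above can be sketched as a small state
machine. This is a userspace toy under stated assumptions, not the
bpf_mem_alloc implementation: the enum, struct and function names are all
hypothetical, and each grace period is reduced to a single function call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lifecycle of a node handed to a deferred-free routine. */
enum gp_state { GP_LIVE, GP_WAIT_RCU, GP_WAIT_TRACE, GP_FREED };

struct gp_node {
	enum gp_state state;
};

/* Writer has unlinked the node and queued it for deferred freeing. */
static void gp_free_rcu(struct gp_node *n)
{
	n->state = GP_WAIT_RCU;
}

/* A regular RCU grace period ended: hand the node to Tasks Trace. */
static void gp_rcu_elapsed(struct gp_node *n)
{
	if (n->state == GP_WAIT_RCU)
		n->state = GP_WAIT_TRACE;
}

/* A Tasks Trace grace period ended: only now is the memory released. */
static void gp_trace_elapsed(struct gp_node *n)
{
	if (n->state == GP_WAIT_TRACE)
		n->state = GP_FREED;
}

static bool gp_is_freed(const struct gp_node *n)
{
	return n->state == GP_FREED;
}

/* Freeing requires both grace periods, in that order. */
static void demo_chained_free(void)
{
	struct gp_node n = { .state = GP_LIVE };

	gp_free_rcu(&n);
	gp_rcu_elapsed(&n);
	assert(!gp_is_freed(&n));	/* still waiting on Tasks Trace */
	gp_trace_elapsed(&n);
	assert(gp_is_freed(&n));
}
```

In this model a Tasks Trace grace period that ends before the regular RCU
one has no effect, which is the ordering the "chains ... into" wording
implies.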

Changes since v1:
- Split into a 2-patch series.
- Patch 1 now also converts the trie_update_elem()/trie_delete_elem()
  walks from rcu_dereference() to rcu_dereference_protected(*p, 1),
  addressing review feedback that v1 only fixed the lookup path and left
  the same splat on the writer paths.
- New patch 2 adds the verifier whitelist entry so the fix is actually
  reachable for directly-referenced LPM tries.
- Retitled v1 ("Allow lookups from sleepable BPF programs").

v1: https://lore.kernel.org/all/20260529174233.2954240-1-vlad.wing@gmail.com/

Vlad Poenaru (2):
  bpf, lpm_trie: Allow access from sleepable BPF programs
  bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly

 kernel/bpf/lpm_trie.c | 8 ++++----
 kernel/bpf/verifier.c | 1 +
 2 files changed, 5 insertions(+), 4 deletions(-)

--
2.53.0-Meta



* [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
  2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
@ 2026-06-09 13:55   ` Vlad Poenaru
  2026-06-09 16:36     ` Emil Tsalapatis
  2026-06-09 13:55   ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
  2026-06-09 19:50   ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries patchwork-bot+netdevbpf
  2 siblings, 1 reply; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	Toke Høiland-Jørgensen
  Cc: Emil Tsalapatis, linux-kernel, stable

trie_lookup_elem() annotates its rcu_dereference_check() walks with
only rcu_read_lock_bh_held().  Because rcu_dereference_check(p, c)
resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
classic RCU readers but fails for sleepable BPF programs, which enter
via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().

trie_update_elem() and trie_delete_elem() have the same problem in a
different form: they walk the trie with plain rcu_dereference(), which
asserts rcu_read_lock_held() unconditionally.  Both are reachable from
sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem
helpers, and from the syscall path under classic rcu_read_lock().  In
the writer paths the trie is actually protected by trie->lock (an
rqspinlock taken across the walk); we never relied on the RCU read-side
lock to keep nodes alive there.

A sleepable LSM hook that ends up touching an LPM trie therefore
triggers lockdep on debug kernels:

  =============================
  WARNING: suspicious RCU usage
  7.1.0-... Tainted: G            E
  -----------------------------
  kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
  1 lock held by net_tests/540:
   #0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
       at: __bpf_prog_enter_sleepable+0x26/0x280
  Call Trace:
   dump_stack_lvl
   lockdep_rcu_suspicious
   trie_lookup_elem
   bpf_prog_..._enforce_security_socket_connect
   bpf_trampoline_...
   security_socket_connect
   __sys_connect
   do_syscall_64

This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
against the trie's reclaim path -- but it spams the console once per
distinct callsite on every debug kernel running a sleepable BPF LSM
that touches an LPM trie, which is increasingly common.

For the lookup path, switch the rcu_dereference_check() annotation
from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all
three contexts (classic, BH, Tasks Trace).  Other map types already
follow this convention.

For trie_update_elem() and trie_delete_elem(), annotate the walks as
rcu_dereference_protected(*p, 1) -- matching trie_free() in the same
file -- since trie->lock is held across the walk.  rqspinlock has no
lockdep_map, so the predicate degenerates to '1' rather than
lockdep_is_held(&trie->lock); the protection is real but not
machine-verifiable.  trie_get_next_key() also uses bare
rcu_dereference() but is reachable only from the BPF syscall, which
holds classic rcu_read_lock() before dispatching, so it is left
untouched.

Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
 kernel/bpf/lpm_trie.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 0f57608b385d..4d6f25db9ba1 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
 
 	/* Start walking the trie from the root node ... */
 
-	for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
+	for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
 	     node;) {
 		unsigned int next_bit;
 		size_t matchlen;
@@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
 		 */
 		next_bit = extract_bit(key->data, node->prefixlen);
 		node = rcu_dereference_check(node->child[next_bit],
-					     rcu_read_lock_bh_held());
+					     bpf_rcu_lock_held());
 	}
 
 	if (!found)
@@ -359,7 +359,7 @@ static long trie_update_elem(struct bpf_map *map,
 	 */
 	slot = &trie->root;
 
-	while ((node = rcu_dereference(*slot))) {
+	while ((node = rcu_dereference_protected(*slot, 1))) {
 		matchlen = longest_prefix_match(trie, node, key);
 
 		if (node->prefixlen != matchlen ||
@@ -482,7 +482,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 	trim = &trie->root;
 	trim2 = trim;
 	parent = NULL;
-	while ((node = rcu_dereference(*trim))) {
+	while ((node = rcu_dereference_protected(*trim, 1))) {
 		matchlen = longest_prefix_match(trie, node, key);
 
 		if (node->prefixlen != matchlen ||
-- 
2.53.0-Meta



* [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
  2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
  2026-06-09 13:55   ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
@ 2026-06-09 13:55   ` Vlad Poenaru
  2026-06-09 16:19     ` Emil Tsalapatis
  2026-06-10  1:53     ` Hou Tao
  2026-06-09 19:50   ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries patchwork-bot+netdevbpf
  2 siblings, 2 replies; 10+ messages in thread
From: Vlad Poenaru @ 2026-06-09 13:55 UTC
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	Toke Høiland-Jørgensen
  Cc: Emil Tsalapatis, linux-kernel

The previous change relaxed the rcu_dereference annotations in
lpm_trie.c so the trie walks no longer trip lockdep when reached from a
sleepable BPF program holding only rcu_read_lock_trace().  By itself
that only helps tries reached as the inner map of a map-of-maps, or
from the classic-RCU syscall path: a sleepable program that references
an LPM trie directly is still rejected at load time by
check_map_prog_compatibility(), whose sleepable whitelist omits
BPF_MAP_TYPE_LPM_TRIE:

  Sleepable programs can only use array, hash, ringbuf and local storage maps

LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
into a Tasks Trace grace period before the node -- and the value
embedded in it that trie_lookup_elem() returns to the program -- is
released.  That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
on for sleepable access, so a value handed to a sleepable reader cannot
be freed while the program is still running under rcu_read_lock_trace().
The writer paths take trie->lock across the walk and never relied on the
RCU read-side lock to keep nodes alive.

Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
programs can use LPM tries directly.

Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7fb88e1cd7c4..71c1e59e4df4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		case BPF_MAP_TYPE_PERCPU_HASH:
 		case BPF_MAP_TYPE_PERCPU_ARRAY:
 		case BPF_MAP_TYPE_LRU_PERCPU_HASH:
+		case BPF_MAP_TYPE_LPM_TRIE:
 		case BPF_MAP_TYPE_ARRAY_OF_MAPS:
 		case BPF_MAP_TYPE_HASH_OF_MAPS:
 		case BPF_MAP_TYPE_RINGBUF:
-- 
2.53.0-Meta



* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
  2026-06-09 13:55   ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
@ 2026-06-09 16:19     ` Emil Tsalapatis
  2026-06-10  1:53     ` Hou Tao
  1 sibling, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-06-09 16:19 UTC
  To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, John Fastabend, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
	Yonghong Song, Jiri Olsa, Toke Høiland-Jørgensen
  Cc: Emil Tsalapatis, linux-kernel

On Tue Jun 9, 2026 at 9:55 AM EDT, Vlad Poenaru wrote:
> The previous change relaxed the rcu_dereference annotations in
> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
> sleepable BPF program holding only rcu_read_lock_trace().  By itself
> that only helps tries reached as the inner map of a map-of-maps, or
> from the classic-RCU syscall path: a sleepable program that references
> an LPM trie directly is still rejected at load time by
> check_map_prog_compatibility(), whose sleepable whitelist omits
> BPF_MAP_TYPE_LPM_TRIE:
>
>   Sleepable programs can only use array, hash, ringbuf and local storage maps
>
> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
> into a Tasks Trace grace period before the node -- and the value
> embedded in it that trie_lookup_elem() returns to the program -- is
> released.  That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
> on for sleepable access, so a value handed to a sleepable reader cannot
> be freed while the program is still running under rcu_read_lock_trace().
> The writer paths take trie->lock across the walk and never relied on the
> RCU read-side lock to keep nodes alive.
>
> Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
> programs can use LPM tries directly.
>
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
>  kernel/bpf/verifier.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7fb88e1cd7c4..71c1e59e4df4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>  		case BPF_MAP_TYPE_PERCPU_HASH:
>  		case BPF_MAP_TYPE_PERCPU_ARRAY:
>  		case BPF_MAP_TYPE_LRU_PERCPU_HASH:
> +		case BPF_MAP_TYPE_LPM_TRIE:
>  		case BPF_MAP_TYPE_ARRAY_OF_MAPS:
>  		case BPF_MAP_TYPE_HASH_OF_MAPS:
>  		case BPF_MAP_TYPE_RINGBUF:



* Re: [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
  2026-06-09 13:55   ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
@ 2026-06-09 16:36     ` Emil Tsalapatis
  0 siblings, 0 replies; 10+ messages in thread
From: Emil Tsalapatis @ 2026-06-09 16:36 UTC
  To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, John Fastabend, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
	Yonghong Song, Jiri Olsa, Toke Høiland-Jørgensen
  Cc: Emil Tsalapatis, linux-kernel, stable

On Tue Jun 9, 2026 at 9:55 AM EDT, Vlad Poenaru wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with
> only rcu_read_lock_bh_held().  Because rcu_dereference_check(p, c)
> resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
> classic RCU readers but fails for sleepable BPF programs, which enter
> via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
>
> trie_update_elem() and trie_delete_elem() have the same problem in a
> different form: they walk the trie with plain rcu_dereference(), which
> asserts rcu_read_lock_held() unconditionally.  Both are reachable from
> sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem
> helpers, and from the syscall path under classic rcu_read_lock().  In
> the writer paths the trie is actually protected by trie->lock (an
> rqspinlock taken across the walk); we never relied on the RCU read-side
> lock to keep nodes alive there.
>
> A sleepable LSM hook that ends up touching an LPM trie therefore
> triggers lockdep on debug kernels:
>
>   =============================
>   WARNING: suspicious RCU usage
>   7.1.0-... Tainted: G            E
>   -----------------------------
>   kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
>   1 lock held by net_tests/540:
>    #0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
>        at: __bpf_prog_enter_sleepable+0x26/0x280
>   Call Trace:
>    dump_stack_lvl
>    lockdep_rcu_suspicious
>    trie_lookup_elem
>    bpf_prog_..._enforce_security_socket_connect
>    bpf_trampoline_...
>    security_socket_connect
>    __sys_connect
>    do_syscall_64
>
> This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
> against the trie's reclaim path -- but it spams the console once per
> distinct callsite on every debug kernel running a sleepable BPF LSM
> that touches an LPM trie, which is increasingly common.
>
> For the lookup path, switch the rcu_dereference_check() annotation
> from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all
> three contexts (classic, BH, Tasks Trace).  Other map types already
> follow this convention.
>
> For trie_update_elem() and trie_delete_elem(), annotate the walks as
> rcu_dereference_protected(*p, 1) -- matching trie_free() in the same
> file -- since trie->lock is held across the walk.  rqspinlock has no
> lockdep_map, so the predicate degenerates to '1' rather than
> lockdep_is_held(&trie->lock); the protection is real but not
> machine-verifiable.  trie_get_next_key() also uses bare
> rcu_dereference() but is reachable only from the BPF syscall, which
> holds classic rcu_read_lock() before dispatching, so it is left
> untouched.
>
> Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
> Cc: stable@vger.kernel.org
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
>  kernel/bpf/lpm_trie.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 0f57608b385d..4d6f25db9ba1 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -246,7 +246,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
>  
>  	/* Start walking the trie from the root node ... */
>  
> -	for (node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());
> +	for (node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());
>  	     node;) {
>  		unsigned int next_bit;
>  		size_t matchlen;
> @@ -280,7 +280,7 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
>  		 */
>  		next_bit = extract_bit(key->data, node->prefixlen);
>  		node = rcu_dereference_check(node->child[next_bit],
> -					     rcu_read_lock_bh_held());
> +					     bpf_rcu_lock_held());
>  	}
>  
>  	if (!found)
> @@ -359,7 +359,7 @@ static long trie_update_elem(struct bpf_map *map,
>  	 */
>  	slot = &trie->root;
>  
> -	while ((node = rcu_dereference(*slot))) {
> +	while ((node = rcu_dereference_protected(*slot, 1))) {
>  		matchlen = longest_prefix_match(trie, node, key);
>  
>  		if (node->prefixlen != matchlen ||
> @@ -482,7 +482,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
>  	trim = &trie->root;
>  	trim2 = trim;
>  	parent = NULL;
> -	while ((node = rcu_dereference(*trim))) {
> +	while ((node = rcu_dereference_protected(*trim, 1))) {
>  		matchlen = longest_prefix_match(trie, node, key);
>  
>  		if (node->prefixlen != matchlen ||



* Re: [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries
  2026-06-09 13:55 ` [PATCH bpf v2 0/2] bpf, lpm_trie: Allow sleepable BPF programs to use LPM tries Vlad Poenaru
  2026-06-09 13:55   ` [PATCH bpf v2 1/2] bpf, lpm_trie: Allow access from sleepable BPF programs Vlad Poenaru
  2026-06-09 13:55   ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
@ 2026-06-09 19:50   ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 10+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-06-09 19:50 UTC
  To: Vlad Poenaru
  Cc: bpf, ast, daniel, andrii, john.fastabend, martin.lau, eddyz87,
	memxor, song, yonghong.song, jolsa, toke, emil, linux-kernel

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue,  9 Jun 2026 06:55:56 -0700 you wrote:
> trie_lookup_elem() annotates its rcu_dereference_check() walks with only
> rcu_read_lock_bh_held(), so a sleepable BPF program that touches an LPM
> trie (e.g. a sleepable LSM hook calling bpf_map_lookup_elem()) trips a
> "suspicious RCU usage" lockdep splat on debug kernels: it holds only
> rcu_read_lock_trace(), which that annotation does not accept.
> 
> Patch 1 relaxes the rcu_dereference annotations in the trie walks so they
> no longer trip lockdep from the Tasks Trace context, including the
> trie_update_elem()/trie_delete_elem() writer walks (protected by
> trie->lock). Patch 2 adds BPF_MAP_TYPE_LPM_TRIE to the verifier's
> sleepable map whitelist so sleepable programs can reference an LPM trie
> directly, not just as the inner map of a map-of-maps. LPM trie nodes are
> reclaimed via bpf_mem_cache_free_rcu(), which chains a regular RCU grace
> period into a Tasks Trace grace period before freeing -- the same
> discipline BPF_MAP_TYPE_HASH relies on for sleepable access.
> 
> [...]

Here is the summary with links:
  - [bpf,v2,1/2] bpf, lpm_trie: Allow access from sleepable BPF programs
    https://git.kernel.org/bpf/bpf-next/c/2f884d371faf
  - [bpf,v2,2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
    https://git.kernel.org/bpf/bpf-next/c/a3d76e27bbbf

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
  2026-06-09 13:55   ` [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly Vlad Poenaru
  2026-06-09 16:19     ` Emil Tsalapatis
@ 2026-06-10  1:53     ` Hou Tao
  2026-06-10  2:34       ` Alexei Starovoitov
  1 sibling, 1 reply; 10+ messages in thread
From: Hou Tao @ 2026-06-10  1:53 UTC
  To: Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, John Fastabend, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
	Yonghong Song, Jiri Olsa, Toke Høiland-Jørgensen
  Cc: Emil Tsalapatis, linux-kernel

Hi,

On 6/9/2026 9:55 PM, Vlad Poenaru wrote:
> The previous change relaxed the rcu_dereference annotations in
> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
> sleepable BPF program holding only rcu_read_lock_trace().  By itself
> that only helps tries reached as the inner map of a map-of-maps, or
> from the classic-RCU syscall path: a sleepable program that references
> an LPM trie directly is still rejected at load time by
> check_map_prog_compatibility(), whose sleepable whitelist omits
> BPF_MAP_TYPE_LPM_TRIE:
>
>   Sleepable programs can only use array, hash, ringbuf and local storage maps
>
> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
> into a Tasks Trace grace period before the node -- and the value
> embedded in it that trie_lookup_elem() returns to the program -- is
> released.  That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
> on for sleepable access, so a value handed to a sleepable reader cannot
> be freed while the program is still running under rcu_read_lock_trace().
> The writer paths take trie->lock across the walk and never relied on the
> RCU read-side lock to keep nodes alive.

For trie_lookup_elem(), I think it is not safe to enable its use in
sleepable programs as the patch does, because it may return an
unexpected value. The main reason is that rcu_read_lock_trace() cannot
guarantee that the node currently being looked up will not be reused
concurrently by another update. rcu_read_lock() does provide that
guarantee, because bpf_mem_cache_free_rcu() makes a node reusable only
after one RCU grace period. I think the hash-table case has a similar
problem, though it already uses some tricks (the hlist_nulls_node
variants) to mitigate it.
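
The concern raised here can be sketched with a userspace toy model. It
assumes only what this message states: a deleted node becomes reusable after
one regular RCU grace period, a classic reader holds that grace period open,
and a Tasks Trace reader does not. All names are hypothetical, and nothing
below reflects the actual allocator code.

```c
#include <assert.h>
#include <stdbool.h>

struct reuse_node {
	int value;
	bool reusable;	/* set once a regular RCU grace period has ended */
};

struct reuse_reader {
	bool classic;	/* models holding rcu_read_lock() */
	bool trace;	/* models holding rcu_read_lock_trace() */
};

/* A regular RCU grace period cannot end while a classic reader is active. */
static bool reuse_rcu_gp_can_end(const struct reuse_reader *r)
{
	return !r->classic;
}

/* Deleted node: becomes reusable once the regular grace period ends. */
static void reuse_try_recycle(struct reuse_node *n,
			      const struct reuse_reader *r)
{
	if (reuse_rcu_gp_can_end(r))
		n->reusable = true;
}

/* A concurrent update may pick up a reusable node and overwrite it. */
static bool reuse_update(struct reuse_node *n, int new_value)
{
	if (!n->reusable)
		return false;
	n->value = new_value;
	n->reusable = false;
	return true;
}

/* A Tasks Trace reader still points at the node while it is recycled. */
static void demo_trace_reader_sees_reuse(void)
{
	struct reuse_node n = { .value = 1 };
	struct reuse_reader r = { .trace = true };
	const struct reuse_node *seen = &n;	/* reader's pointer */

	reuse_try_recycle(&n, &r);
	assert(reuse_update(&n, 2));
	assert(seen->value == 2);	/* memory is valid, content changed */
}
```

With a classic reader (classic = true) reuse_try_recycle() leaves the node
untouched, so reuse_update() fails and the reader keeps seeing the original
value.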
>
> Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
> programs can use LPM tries directly.
>
> Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
> ---
>  kernel/bpf/verifier.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7fb88e1cd7c4..71c1e59e4df4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18122,6 +18122,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>  		case BPF_MAP_TYPE_PERCPU_HASH:
>  		case BPF_MAP_TYPE_PERCPU_ARRAY:
>  		case BPF_MAP_TYPE_LRU_PERCPU_HASH:
> +		case BPF_MAP_TYPE_LPM_TRIE:
>  		case BPF_MAP_TYPE_ARRAY_OF_MAPS:
>  		case BPF_MAP_TYPE_HASH_OF_MAPS:
>  		case BPF_MAP_TYPE_RINGBUF:



* Re: [PATCH bpf v2 2/2] bpf, lpm_trie: Allow sleepable programs to use LPM trie maps directly
  2026-06-10  1:53     ` Hou Tao
@ 2026-06-10  2:34       ` Alexei Starovoitov
  0 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2026-06-10  2:34 UTC
  To: Hou Tao, Vlad Poenaru, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, John Fastabend, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
	Yonghong Song, Jiri Olsa, Toke Høiland-Jørgensen
  Cc: Emil Tsalapatis, linux-kernel

On Tue Jun 9, 2026 at 6:53 PM PDT, Hou Tao wrote:
> Hi,
>
> On 6/9/2026 9:55 PM, Vlad Poenaru wrote:
>> The previous change relaxed the rcu_dereference annotations in
>> lpm_trie.c so the trie walks no longer trip lockdep when reached from a
>> sleepable BPF program holding only rcu_read_lock_trace().  By itself
>> that only helps tries reached as the inner map of a map-of-maps, or
>> from the classic-RCU syscall path: a sleepable program that references
>> an LPM trie directly is still rejected at load time by
>> check_map_prog_compatibility(), whose sleepable whitelist omits
>> BPF_MAP_TYPE_LPM_TRIE:
>>
>>   Sleepable programs can only use array, hash, ringbuf and local storage maps
>>
>> LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
>> with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
>> into a Tasks Trace grace period before the node -- and the value
>> embedded in it that trie_lookup_elem() returns to the program -- is
>> released.  That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
>> on for sleepable access, so a value handed to a sleepable reader cannot
>> be freed while the program is still running under rcu_read_lock_trace().
>> The writer paths take trie->lock across the walk and never relied on the
>> RCU read-side lock to keep nodes alive.
>
> For trie_lookup_elem(), I think it is not safe to enable its use in
> sleepable programs as the patch does, because it may return an
> unexpected value. The main reason is that rcu_read_lock_trace() cannot
> guarantee that the node currently being looked up will not be reused
> concurrently by another update. rcu_read_lock() does provide that
> guarantee, because bpf_mem_cache_free_rcu() makes a node reusable only
> after one RCU grace period. I think the hash-table case has a similar
> problem, though it already uses some tricks (the hlist_nulls_node
> variants) to mitigate it.

You're correct. I remember that discussion.
Yet people already use lpm via map-in-map bug/workaround.
So I applied this set to make lpm-in-sleepable usage official
and force us to do a proper fix.

Also both AI bots didn't spot an issue, so the bug won't be
discovered immediately and we won't see a flurry of
"security" reports with slop "fixes". AI isn't that smart yet.

