BPF List
* [PATCH bpf v3 0/9] Fixes for LPM trie
@ 2024-12-06 11:06 Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 1/9] bpf: Remove unnecessary check when updating " Hou Tao
                   ` (9 more replies)
  0 siblings, 10 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Hi,

This patch set fixes several issues in the LPM trie. These issues were
found while adding new test cases or were reported by syzbot.

The patch set is structured as follows:

Patch #1~#2 are clean-ups for lpm_trie_update_elem().
Patch #3 handles BPF_EXIST and BPF_NOEXIST correctly for LPM trie.
Patch #4 fixes the accounting of n_entries when doing in-place update.
Patch #5 fixes the exact match condition in trie_get_next_key(), which
otherwise may skip keys when the passed key is not found in the map.
Patch #6~#7 switch from kmalloc() to the bpf memory allocator for LPM
trie to fix several lock order warnings reported by syzbot, and enable
raw_spinlock_t for LPM trie again. After these changes, the LPM trie
will be closer to being usable in any context (the reentrance check of
trie->lock is still missing, but it is on my todo list).
Patch #8: move test_lpm_map to map_tests to make it run regularly.
Patch #9: add test cases for the issues fixed by patch #3~#5.

Please see individual patches for more details. Comments are always
welcome.

Change Log:
v3:
  * patch #2: remove the unnecessary NULL-init for im_node
  * patch #6: allocate the leaf node before disabling IRQs to lower
    the possibility of -ENOMEM when leaf_size is large; free
    these nodes outside the trie lock (Suggested by Alexei)
  * collect review and ack tags (Thanks to Toke & Daniel)

v2: https://lore.kernel.org/bpf/20241127004641.1118269-1-houtao@huaweicloud.com/
  * collect review tags (Thanks to Toke)
  * drop "Add bpf_mem_cache_is_mergeable() helper" patch
  * patch #3~#4: add Fixes tags
  * patch #4: rename the helper to trie_check_add_elem() and increase
    n_entries in it.
  * patch #6: use one bpf mem allocator and update commit message to
    clarify that using bpf mem allocator is more appropriate.
  * patch #7: update commit message to add the possible max running time
    for update operation.
  * patch #9: update commit message to specify the purpose of these test
    cases.

v1: https://lore.kernel.org/bpf/20241118010808.2243555-1-houtao@huaweicloud.com/

Hou Tao (9):
  bpf: Remove unnecessary check when updating LPM trie
  bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
  bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
  bpf: Handle in-place update for full LPM trie correctly
  bpf: Fix exact match conditions in trie_get_next_key()
  bpf: Switch to bpf mem allocator for LPM trie
  bpf: Use raw_spinlock_t for LPM trie
  selftests/bpf: Move test_lpm_map.c to map_tests
  selftests/bpf: Add more test cases for LPM trie

 kernel/bpf/lpm_trie.c                         | 133 +++---
 tools/testing/selftests/bpf/.gitignore        |   1 -
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../lpm_trie_map_basic_ops.c}                 | 405 +++++++++++++++++-
 4 files changed, 484 insertions(+), 57 deletions(-)
 rename tools/testing/selftests/bpf/{test_lpm_map.c => map_tests/lpm_trie_map_basic_ops.c} (65%)

-- 
2.29.2



* [PATCH bpf v3 1/9] bpf: Remove unnecessary check when updating LPM trie
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 2/9] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem Hou Tao
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

When "node->prefixlen == matchlen" is true, it means that the node is
fully matched. If "node->prefixlen == key->prefixlen" is false, it means
the prefix length of key is greater than the prefix length of node,
otherwise, matchlen will not be equal with node->prefixlen. However, it
also implies that the prefix length of node must be less than
max_prefixlen.

Therefore, "node->prefixlen == trie->max_prefixlen" will always be false
when the check of "node->prefixlen == key->prefixlen" returns false.
Remove this unnecessary comparison.
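
As a concrete illustration (hypothetical numbers, assuming a trie with
32-bit data, so trie->max_prefixlen == 32):

	/* node->prefixlen == 16, key->prefixlen == 24, matchlen == 16:
	 * the first loop condition (node->prefixlen != matchlen) is false
	 * and the second (node->prefixlen == key->prefixlen) is false too.
	 * Since key->prefixlen <= trie->max_prefixlen is validated on
	 * entry, node->prefixlen < key->prefixlen <= 32 follows, so
	 * "node->prefixlen == trie->max_prefixlen" can never hold here.
	 */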

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/lpm_trie.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9b60eda0f727..73fd593d3745 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -364,8 +364,7 @@ static long trie_update_elem(struct bpf_map *map,
 		matchlen = longest_prefix_match(trie, node, key);
 
 		if (node->prefixlen != matchlen ||
-		    node->prefixlen == key->prefixlen ||
-		    node->prefixlen == trie->max_prefixlen)
+		    node->prefixlen == key->prefixlen)
 			break;
 
 		next_bit = extract_bit(key->data, node->prefixlen);
-- 
2.29.2



* [PATCH bpf v3 2/9] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 1/9] bpf: Remove unnecessary check when updating " Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 3/9] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

There is no need to call kfree(im_node) when updating an element fails,
because im_node is NULL on every failure path: either it has not been
assigned yet, or its allocation just failed and left it NULL. Remove the
unnecessary kfree() for im_node.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/lpm_trie.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 73fd593d3745..b5e281a55760 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -315,7 +315,7 @@ static long trie_update_elem(struct bpf_map *map,
 			     void *_key, void *value, u64 flags)
 {
 	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
-	struct lpm_trie_node *node, *im_node = NULL, *new_node = NULL;
+	struct lpm_trie_node *node, *im_node, *new_node = NULL;
 	struct lpm_trie_node *free_node = NULL;
 	struct lpm_trie_node __rcu **slot;
 	struct bpf_lpm_trie_key_u8 *key = _key;
@@ -431,9 +431,7 @@ static long trie_update_elem(struct bpf_map *map,
 	if (ret) {
 		if (new_node)
 			trie->n_entries--;
-
 		kfree(new_node);
-		kfree(im_node);
 	}
 
 	spin_unlock_irqrestore(&trie->lock, irq_flags);
-- 
2.29.2



* [PATCH bpf v3 3/9] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 1/9] bpf: Remove unnecessary check when updating " Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 2/9] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 4/9] bpf: Handle in-place update for full LPM trie correctly Hou Tao
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST
flags. These flags can be specified by users and are relevant since the
LPM trie supports exact matches during update.
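
A minimal userspace sketch of the intended semantics (a hypothetical
snippet, not part of the patch; it mirrors the selftests added later in
this series):

	struct {
		__u32 prefixlen;
		__u32 data;
	} key = { .prefixlen = 16, .data = 0 };
	__u32 value = 1;
	int err;

	/* Create-only: succeeds the first time ... */
	err = bpf_map_update_elem(map_fd, &key, &value, BPF_NOEXIST);
	/* ... and fails with -EEXIST once the exact prefix exists. */
	err = bpf_map_update_elem(map_fd, &key, &value, BPF_NOEXIST);
	/* Update-only: fails with -ENOENT when there is no exact match
	 * (an intermediate node does not count as a match).
	 */
	key.prefixlen = 8;
	err = bpf_map_update_elem(map_fd, &key, &value, BPF_EXIST);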

Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/lpm_trie.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index b5e281a55760..be5bf0389532 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -375,6 +375,10 @@ static long trie_update_elem(struct bpf_map *map,
 	 * simply assign the @new_node to that slot and be done.
 	 */
 	if (!node) {
+		if (flags == BPF_EXIST) {
+			ret = -ENOENT;
+			goto out;
+		}
 		rcu_assign_pointer(*slot, new_node);
 		goto out;
 	}
@@ -383,18 +387,31 @@ static long trie_update_elem(struct bpf_map *map,
 	 * which already has the correct data array set.
 	 */
 	if (node->prefixlen == matchlen) {
+		if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) {
+			if (flags == BPF_NOEXIST) {
+				ret = -EEXIST;
+				goto out;
+			}
+			trie->n_entries--;
+		} else if (flags == BPF_EXIST) {
+			ret = -ENOENT;
+			goto out;
+		}
+
 		new_node->child[0] = node->child[0];
 		new_node->child[1] = node->child[1];
 
-		if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
-			trie->n_entries--;
-
 		rcu_assign_pointer(*slot, new_node);
 		free_node = node;
 
 		goto out;
 	}
 
+	if (flags == BPF_EXIST) {
+		ret = -ENOENT;
+		goto out;
+	}
+
 	/* If the new node matches the prefix completely, it must be inserted
 	 * as an ancestor. Simply insert it between @node and *@slot.
 	 */
-- 
2.29.2



* [PATCH bpf v3 4/9] bpf: Handle in-place update for full LPM trie correctly
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
                   ` (2 preceding siblings ...)
  2024-12-06 11:06 ` [PATCH bpf v3 3/9] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 5/9] bpf: Fix exact match conditions in trie_get_next_key() Hou Tao
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

When an LPM trie is full, in-place updates of existing elements
incorrectly return -ENOSPC.

Fix this by deferring the check of trie->n_entries. For new insertions,
n_entries must not exceed max_entries. However, in-place updates are
allowed even when the trie is full.
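
A minimal sketch of the expected behavior (hypothetical snippet,
mirroring the test_lpm_trie_update_full_map case added later in this
series): with max_entries already reached, adding a new element fails
while overwriting an existing one succeeds.

	/* the trie already holds max_entries elements; &key is one of them */
	err = bpf_map_update_elem(map_fd, &new_key, &value, BPF_ANY);
	/* expected: err == -ENOSPC */
	err = bpf_map_update_elem(map_fd, &key, &value, BPF_EXIST);
	/* expected: err == 0, not -ENOSPC */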

Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/lpm_trie.c | 44 +++++++++++++++++++++----------------------
 1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index be5bf0389532..df6cc0a1c9bf 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -310,6 +310,16 @@ static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
 	return node;
 }
 
+static int trie_check_add_elem(struct lpm_trie *trie, u64 flags)
+{
+	if (flags == BPF_EXIST)
+		return -ENOENT;
+	if (trie->n_entries == trie->map.max_entries)
+		return -ENOSPC;
+	trie->n_entries++;
+	return 0;
+}
+
 /* Called from syscall or from eBPF program */
 static long trie_update_elem(struct bpf_map *map,
 			     void *_key, void *value, u64 flags)
@@ -333,20 +343,12 @@ static long trie_update_elem(struct bpf_map *map,
 	spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Allocate and fill a new node */
-
-	if (trie->n_entries == trie->map.max_entries) {
-		ret = -ENOSPC;
-		goto out;
-	}
-
 	new_node = lpm_trie_node_alloc(trie, value);
 	if (!new_node) {
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	trie->n_entries++;
-
 	new_node->prefixlen = key->prefixlen;
 	RCU_INIT_POINTER(new_node->child[0], NULL);
 	RCU_INIT_POINTER(new_node->child[1], NULL);
@@ -375,10 +377,10 @@ static long trie_update_elem(struct bpf_map *map,
 	 * simply assign the @new_node to that slot and be done.
 	 */
 	if (!node) {
-		if (flags == BPF_EXIST) {
-			ret = -ENOENT;
+		ret = trie_check_add_elem(trie, flags);
+		if (ret)
 			goto out;
-		}
+
 		rcu_assign_pointer(*slot, new_node);
 		goto out;
 	}
@@ -392,10 +394,10 @@ static long trie_update_elem(struct bpf_map *map,
 				ret = -EEXIST;
 				goto out;
 			}
-			trie->n_entries--;
-		} else if (flags == BPF_EXIST) {
-			ret = -ENOENT;
-			goto out;
+		} else {
+			ret = trie_check_add_elem(trie, flags);
+			if (ret)
+				goto out;
 		}
 
 		new_node->child[0] = node->child[0];
@@ -407,10 +409,9 @@ static long trie_update_elem(struct bpf_map *map,
 		goto out;
 	}
 
-	if (flags == BPF_EXIST) {
-		ret = -ENOENT;
+	ret = trie_check_add_elem(trie, flags);
+	if (ret)
 		goto out;
-	}
 
 	/* If the new node matches the prefix completely, it must be inserted
 	 * as an ancestor. Simply insert it between @node and *@slot.
@@ -424,6 +425,7 @@ static long trie_update_elem(struct bpf_map *map,
 
 	im_node = lpm_trie_node_alloc(trie, NULL);
 	if (!im_node) {
+		trie->n_entries--;
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -445,12 +447,8 @@ static long trie_update_elem(struct bpf_map *map,
 	rcu_assign_pointer(*slot, im_node);
 
 out:
-	if (ret) {
-		if (new_node)
-			trie->n_entries--;
+	if (ret)
 		kfree(new_node);
-	}
-
 	spin_unlock_irqrestore(&trie->lock, irq_flags);
 	kfree_rcu(free_node, rcu);
 
-- 
2.29.2



* [PATCH bpf v3 5/9] bpf: Fix exact match conditions in trie_get_next_key()
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
                   ` (3 preceding siblings ...)
  2024-12-06 11:06 ` [PATCH bpf v3 4/9] bpf: Handle in-place update for full LPM trie correctly Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
an exact match. However, this is incorrect: when the target key doesn't
fully match the found node (i.e., node->prefixlen != matchlen), the node
and the key may still have the same prefixlen. The check returns the
expected result when the passed key exists in the trie, but when a
recently-deleted or nonexistent key is passed to trie_get_next_key(), it
may skip keys and return an incorrect result.

Fix it by using node->prefixlen == matchlen to identify exact matches.
When this condition is true after the search, it also implies that
node->prefixlen equals key->prefixlen; otherwise, the search would have
returned NULL instead.
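
A minimal sketch of the affected pattern (hypothetical snippet, using
the lpm_trie_int_key layout from the selftests added later in this
series):

	struct lpm_trie_int_key next, *cur = NULL;

	while (!bpf_map_get_next_key(map_fd, cur, &next)) {
		/* delete the minimal key; the next call then passes in a
		 * recently-deleted key, which used to skip entries
		 */
		bpf_map_delete_elem(map_fd, &next);
		cur = &next;
	}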

Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/lpm_trie.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index df6cc0a1c9bf..9ba6ae145239 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -645,7 +645,7 @@ static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
 	struct lpm_trie_node **node_stack = NULL;
 	int err = 0, stack_ptr = -1;
 	unsigned int next_bit;
-	size_t matchlen;
+	size_t matchlen = 0;
 
 	/* The get_next_key follows postorder. For the 4 node example in
 	 * the top of this file, the trie_get_next_key() returns the following
@@ -684,7 +684,7 @@ static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
 		next_bit = extract_bit(key->data, node->prefixlen);
 		node = rcu_dereference(node->child[next_bit]);
 	}
-	if (!node || node->prefixlen != key->prefixlen ||
+	if (!node || node->prefixlen != matchlen ||
 	    (node->flags & LPM_TREE_NODE_FLAG_IM))
 		goto find_leftmost;
 
-- 
2.29.2



* [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
                   ` (4 preceding siblings ...)
  2024-12-06 11:06 ` [PATCH bpf v3 5/9] bpf: Fix exact match conditions in trie_get_next_key() Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2025-09-19 21:28   ` Andrii Nakryiko
  2024-12-06 11:06 ` [PATCH bpf v3 7/9] bpf: Use raw_spinlock_t " Hou Tao
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Multiple syzbot warnings have been reported. These warnings are mainly
about the lock order between trie->lock and kmalloc()'s internal lock.
See report [1] as an example:

======================================================
WARNING: possible circular locking dependency detected
6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted
------------------------------------------------------
syz.3.2069/15008 is trying to acquire lock:
ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ...

but task is already holding lock:
ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ...

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&trie->lock){-.-.}-{2:2}:
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x3a/0x60
       trie_delete_elem+0xb0/0x820
       ___bpf_prog_run+0x3e51/0xabd0
       __bpf_prog_run32+0xc1/0x100
       bpf_dispatcher_nop_func
       ......
       bpf_trace_run2+0x231/0x590
       __bpf_trace_contention_end+0xca/0x110
       trace_contention_end.constprop.0+0xea/0x170
       __pv_queued_spin_lock_slowpath+0x28e/0xcc0
       pv_queued_spin_lock_slowpath
       queued_spin_lock_slowpath
       queued_spin_lock
       do_raw_spin_lock+0x210/0x2c0
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x42/0x60
       __put_partials+0xc3/0x170
       qlink_free
       qlist_free_all+0x4e/0x140
       kasan_quarantine_reduce+0x192/0x1e0
       __kasan_slab_alloc+0x69/0x90
       kasan_slab_alloc
       slab_post_alloc_hook
       slab_alloc_node
       kmem_cache_alloc_node_noprof+0x153/0x310
       __alloc_skb+0x2b1/0x380
       ......

-> #0 (&n->list_lock){-.-.}-{2:2}:
       check_prev_add
       check_prevs_add
       validate_chain
       __lock_acquire+0x2478/0x3b30
       lock_acquire
       lock_acquire+0x1b1/0x560
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x3a/0x60
       get_partial_node.part.0+0x20/0x350
       get_partial_node
       get_partial
       ___slab_alloc+0x65b/0x1870
       __slab_alloc.constprop.0+0x56/0xb0
       __slab_alloc_node
       slab_alloc_node
       __do_kmalloc_node
       __kmalloc_node_noprof+0x35c/0x440
       kmalloc_node_noprof
       bpf_map_kmalloc_node+0x98/0x4a0
       lpm_trie_node_alloc
       trie_update_elem+0x1ef/0xe00
       bpf_map_update_value+0x2c1/0x6c0
       map_update_elem+0x623/0x910
       __sys_bpf+0x90c/0x49a0
       ...

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&trie->lock);
                               lock(&n->list_lock);
                               lock(&trie->lock);
  lock(&n->list_lock);

 *** DEADLOCK ***

[1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7

A bpf program attached to trace_contention_end() triggers after
acquiring &n->list_lock. The program invokes trie_delete_elem(), which
then acquires trie->lock. However, it is possible that another
process is invoking trie_update_elem(). trie_update_elem() will acquire
trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke
get_partial_node() and try to acquire &n->list_lock (not necessarily the
same lock object). Therefore, lockdep warns about the circular locking
dependency.

Invoking kmalloc() before acquiring trie->lock could fix the warning.
However, since BPF programs can be invoked from any context (e.g.,
through kprobe/tracepoint/fentry), there may still be lock ordering
problems for internal locks in kmalloc() or trie->lock itself.

To eliminate these potential lock ordering problems with kmalloc()'s
internal locks, replace kmalloc()/kfree()/kfree_rcu() with equivalent
BPF memory allocator APIs that can be invoked in any context. The lock
ordering problems with trie->lock (e.g., reentrance) will be handled
separately.

Three aspects of this change require explanation:

1. Intermediate and leaf nodes are allocated from the same allocator.
Since the value size of LPM trie is usually small, using a single
allocator reduces the memory overhead of the BPF memory allocator.

2. Leaf nodes are allocated before disabling IRQs. This handles cases
where leaf_size is large (e.g., > 4KB - 8) and updates require
intermediate node allocation. If leaf nodes were allocated in the
IRQ-disabled region, the free objects in the BPF memory allocator would
not be refilled in time and the intermediate node allocation may fail.

3. Paired migrate_{disable|enable}() calls for node alloc and free. The
BPF memory allocator uses per-CPU structs internally, so these paired
calls are necessary to guarantee correctness.
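
A condensed sketch of the pairing described in point 3 (it mirrors the
hunks below; nothing beyond the patch itself):

	migrate_disable();
	node = bpf_mem_cache_alloc(&trie->ma);
	migrate_enable();
	...
	migrate_disable();
	bpf_mem_cache_free_rcu(&trie->ma, node);
	migrate_enable();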

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/lpm_trie.c | 71 +++++++++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 23 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9ba6ae145239..f850360e75ce 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -15,6 +15,7 @@
 #include <net/ipv6.h>
 #include <uapi/linux/btf.h>
 #include <linux/btf_ids.h>
+#include <linux/bpf_mem_alloc.h>
 
 /* Intermediate node */
 #define LPM_TREE_NODE_FLAG_IM BIT(0)
@@ -22,7 +23,6 @@
 struct lpm_trie_node;
 
 struct lpm_trie_node {
-	struct rcu_head rcu;
 	struct lpm_trie_node __rcu	*child[2];
 	u32				prefixlen;
 	u32				flags;
@@ -32,6 +32,7 @@ struct lpm_trie_node {
 struct lpm_trie {
 	struct bpf_map			map;
 	struct lpm_trie_node __rcu	*root;
+	struct bpf_mem_alloc		ma;
 	size_t				n_entries;
 	size_t				max_prefixlen;
 	size_t				data_size;
@@ -287,17 +288,18 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
 	return found->data + trie->data_size;
 }
 
-static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
-						 const void *value)
+static struct lpm_trie_node *lpm_trie_node_alloc(struct lpm_trie *trie,
+						 const void *value,
+						 bool disable_migration)
 {
 	struct lpm_trie_node *node;
-	size_t size = sizeof(struct lpm_trie_node) + trie->data_size;
 
-	if (value)
-		size += trie->map.value_size;
+	if (disable_migration)
+		migrate_disable();
+	node = bpf_mem_cache_alloc(&trie->ma);
+	if (disable_migration)
+		migrate_enable();
 
-	node = bpf_map_kmalloc_node(&trie->map, size, GFP_NOWAIT | __GFP_NOWARN,
-				    trie->map.numa_node);
 	if (!node)
 		return NULL;
 
@@ -325,7 +327,7 @@ static long trie_update_elem(struct bpf_map *map,
 			     void *_key, void *value, u64 flags)
 {
 	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
-	struct lpm_trie_node *node, *im_node, *new_node = NULL;
+	struct lpm_trie_node *node, *im_node, *new_node;
 	struct lpm_trie_node *free_node = NULL;
 	struct lpm_trie_node __rcu **slot;
 	struct bpf_lpm_trie_key_u8 *key = _key;
@@ -340,14 +342,14 @@ static long trie_update_elem(struct bpf_map *map,
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	spin_lock_irqsave(&trie->lock, irq_flags);
+	/* Allocate and fill a new node. Need to disable migration before
+	 * invoking bpf_mem_cache_alloc().
+	 */
+	new_node = lpm_trie_node_alloc(trie, value, true);
+	if (!new_node)
+		return -ENOMEM;
 
-	/* Allocate and fill a new node */
-	new_node = lpm_trie_node_alloc(trie, value);
-	if (!new_node) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	spin_lock_irqsave(&trie->lock, irq_flags);
 
 	new_node->prefixlen = key->prefixlen;
 	RCU_INIT_POINTER(new_node->child[0], NULL);
@@ -423,7 +425,8 @@ static long trie_update_elem(struct bpf_map *map,
 		goto out;
 	}
 
-	im_node = lpm_trie_node_alloc(trie, NULL);
+	/* migration is disabled within the locked scope */
+	im_node = lpm_trie_node_alloc(trie, NULL, false);
 	if (!im_node) {
 		trie->n_entries--;
 		ret = -ENOMEM;
@@ -447,10 +450,13 @@ static long trie_update_elem(struct bpf_map *map,
 	rcu_assign_pointer(*slot, im_node);
 
 out:
-	if (ret)
-		kfree(new_node);
 	spin_unlock_irqrestore(&trie->lock, irq_flags);
-	kfree_rcu(free_node, rcu);
+
+	migrate_disable();
+	if (ret)
+		bpf_mem_cache_free(&trie->ma, new_node);
+	bpf_mem_cache_free_rcu(&trie->ma, free_node);
+	migrate_enable();
 
 	return ret;
 }
@@ -548,8 +554,11 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 
 out:
 	spin_unlock_irqrestore(&trie->lock, irq_flags);
-	kfree_rcu(free_parent, rcu);
-	kfree_rcu(free_node, rcu);
+
+	migrate_disable();
+	bpf_mem_cache_free_rcu(&trie->ma, free_parent);
+	bpf_mem_cache_free_rcu(&trie->ma, free_node);
+	migrate_enable();
 
 	return ret;
 }
@@ -571,6 +580,8 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 static struct bpf_map *trie_alloc(union bpf_attr *attr)
 {
 	struct lpm_trie *trie;
+	size_t leaf_size;
+	int err;
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 ||
@@ -595,7 +606,17 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 
 	spin_lock_init(&trie->lock);
 
+	/* Allocate intermediate and leaf nodes from the same allocator */
+	leaf_size = sizeof(struct lpm_trie_node) + trie->data_size +
+		    trie->map.value_size;
+	err = bpf_mem_alloc_init(&trie->ma, leaf_size, false);
+	if (err)
+		goto free_out;
 	return &trie->map;
+
+free_out:
+	bpf_map_area_free(trie);
+	return ERR_PTR(err);
 }
 
 static void trie_free(struct bpf_map *map)
@@ -627,13 +648,17 @@ static void trie_free(struct bpf_map *map)
 				continue;
 			}
 
-			kfree(node);
+			/* No bpf program may access the map, so freeing the
+			 * node without waiting for the extra RCU GP.
+			 */
+			bpf_mem_cache_raw_free(node);
 			RCU_INIT_POINTER(*slot, NULL);
 			break;
 		}
 	}
 
 out:
+	bpf_mem_alloc_destroy(&trie->ma);
 	bpf_map_area_free(trie);
 }
 
-- 
2.29.2



* [PATCH bpf v3 7/9] bpf: Use raw_spinlock_t for LPM trie
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
                   ` (5 preceding siblings ...)
  2024-12-06 11:06 ` [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 8/9] selftests/bpf: Move test_lpm_map.c to map_tests Hou Tao
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

After switching from kmalloc() to the bpf memory allocator, there will
be no blocking operation during the update of an LPM trie. Therefore,
change trie->lock from spinlock_t to raw_spinlock_t to make the LPM trie
usable in atomic context, even on RT kernels.

The max value of prefixlen is 2048, so update or deletion operations
will find the target after at most 2048 comparisons. In a test case
constructed to update an element after 2048 comparisons on an 8-CPU VM,
the average and maximal times for such an update operation are about
210us and 900us respectively.
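
For reference, the 2048 bound follows directly from the maximal key size
(a sketch, assuming LPM_DATA_SIZE_MAX == 256 as defined in
kernel/bpf/lpm_trie.c):

	max_prefixlen = data_size * 8 <= LPM_DATA_SIZE_MAX * 8 = 2048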

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/lpm_trie.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index f850360e75ce..f8bc1e096182 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -36,7 +36,7 @@ struct lpm_trie {
 	size_t				n_entries;
 	size_t				max_prefixlen;
 	size_t				data_size;
-	spinlock_t			lock;
+	raw_spinlock_t			lock;
 };
 
 /* This trie implements a longest prefix match algorithm that can be used to
@@ -349,7 +349,7 @@ static long trie_update_elem(struct bpf_map *map,
 	if (!new_node)
 		return -ENOMEM;
 
-	spin_lock_irqsave(&trie->lock, irq_flags);
+	raw_spin_lock_irqsave(&trie->lock, irq_flags);
 
 	new_node->prefixlen = key->prefixlen;
 	RCU_INIT_POINTER(new_node->child[0], NULL);
@@ -450,7 +450,7 @@ static long trie_update_elem(struct bpf_map *map,
 	rcu_assign_pointer(*slot, im_node);
 
 out:
-	spin_unlock_irqrestore(&trie->lock, irq_flags);
+	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
 
 	migrate_disable();
 	if (ret)
@@ -477,7 +477,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	spin_lock_irqsave(&trie->lock, irq_flags);
+	raw_spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Walk the tree looking for an exact key/length match and keeping
 	 * track of the path we traverse.  We will need to know the node
@@ -553,7 +553,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 	free_node = node;
 
 out:
-	spin_unlock_irqrestore(&trie->lock, irq_flags);
+	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
 
 	migrate_disable();
 	bpf_mem_cache_free_rcu(&trie->ma, free_parent);
@@ -604,7 +604,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 			  offsetof(struct bpf_lpm_trie_key_u8, data);
 	trie->max_prefixlen = trie->data_size * 8;
 
-	spin_lock_init(&trie->lock);
+	raw_spin_lock_init(&trie->lock);
 
 	/* Allocate intermediate and leaf nodes from the same allocator */
 	leaf_size = sizeof(struct lpm_trie_node) + trie->data_size +
-- 
2.29.2



* [PATCH bpf v3 8/9] selftests/bpf: Move test_lpm_map.c to map_tests
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
                   ` (6 preceding siblings ...)
  2024-12-06 11:06 ` [PATCH bpf v3 7/9] bpf: Use raw_spinlock_t " Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2024-12-06 11:06 ` [PATCH bpf v3 9/9] selftests/bpf: Add more test cases for LPM trie Hou Tao
  2024-12-06 17:40 ` [PATCH bpf v3 0/9] Fixes " patchwork-bot+netdevbpf
  9 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Move test_lpm_map.c to map_tests/ to include the LPM trie test cases in
the regular test_maps run. Most code remains unchanged, including the
use of assert(). Only n_lookups is reduced from 64K to 512, which
decreases the test_lpm_map runtime from 37s to 0.7s.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/.gitignore                 |  1 -
 tools/testing/selftests/bpf/Makefile                   |  2 +-
 .../lpm_trie_map_basic_ops.c}                          | 10 +++-------
 3 files changed, 4 insertions(+), 9 deletions(-)
 rename tools/testing/selftests/bpf/{test_lpm_map.c => map_tests/lpm_trie_map_basic_ops.c} (99%)

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index c2a1842c3d8b..e9c377001f93 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -5,7 +5,6 @@ bpf-syscall*
 test_verifier
 test_maps
 test_lru_map
-test_lpm_map
 test_tag
 FEATURE-DUMP.libbpf
 FEATURE-DUMP.selftests
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 6ad3b1ba1920..7eeb3cbe18c7 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -83,7 +83,7 @@ CLANG_CPUV4 := 1
 endif
 
 # Order correspond to 'make run_tests' order
-TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
+TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_progs \
 	test_sockmap \
 	test_tcpnotify_user test_sysctl \
 	test_progs-no_alu32
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
similarity index 99%
rename from tools/testing/selftests/bpf/test_lpm_map.c
rename to tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
index d98c72dc563e..f375c89d78a4 100644
--- a/tools/testing/selftests/bpf/test_lpm_map.c
+++ b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
@@ -223,7 +223,7 @@ static void test_lpm_map(int keysize)
 	n_matches = 0;
 	n_matches_after_delete = 0;
 	n_nodes = 1 << 8;
-	n_lookups = 1 << 16;
+	n_lookups = 1 << 9;
 
 	data = alloca(keysize);
 	memset(data, 0, keysize);
@@ -770,16 +770,13 @@ static void test_lpm_multi_thread(void)
 	close(map_fd);
 }
 
-int main(void)
+void test_lpm_trie_map_basic_ops(void)
 {
 	int i;
 
 	/* we want predictable, pseudo random tests */
 	srand(0xf00ba1);
 
-	/* Use libbpf 1.0 API mode */
-	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
-
 	test_lpm_basic();
 	test_lpm_order();
 
@@ -792,6 +789,5 @@ int main(void)
 	test_lpm_get_next_key();
 	test_lpm_multi_thread();
 
-	printf("test_lpm: OK\n");
-	return 0;
+	printf("%s: PASS\n", __func__);
 }
-- 
2.29.2



* [PATCH bpf v3 9/9] selftests/bpf: Add more test cases for LPM trie
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
                   ` (7 preceding siblings ...)
  2024-12-06 11:06 ` [PATCH bpf v3 8/9] selftests/bpf: Move test_lpm_map.c to map_tests Hou Tao
@ 2024-12-06 11:06 ` Hou Tao
  2024-12-06 17:40 ` [PATCH bpf v3 0/9] Fixes " patchwork-bot+netdevbpf
  9 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2024-12-06 11:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Add more test cases for LPM trie in test_maps:

1) test_lpm_trie_update_flags
It constructs various use cases for BPF_EXIST and BPF_NOEXIST and checks
whether the return value of the update operation is as expected.

2) test_lpm_trie_update_full_map
It tests the update operations on a full LPM trie map. Adding a new node
will fail and overwriting the value of an existing node will succeed.

3) test_lpm_trie_iterate_strs and test_lpm_trie_iterate_ints
These two test cases check whether the iteration through get_next_key is
sorted and as expected. They delete the minimal key after each iteration
and check whether the next iteration returns the second minimal key. The
only difference between them is that the former saves strings in the LPM
trie and the latter saves integers. Without the get_next_key fix, these
two cases fail as shown below:
  test_lpm_trie_iterate_strs(1091):FAIL:iterate #2 got abc exp abS
  test_lpm_trie_iterate_ints(1142):FAIL:iterate #1 got 0x2 exp 0x1

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../bpf/map_tests/lpm_trie_map_basic_ops.c    | 395 ++++++++++++++++++
 1 file changed, 395 insertions(+)

diff --git a/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
index f375c89d78a4..d32e4edac930 100644
--- a/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
+++ b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
@@ -20,10 +20,12 @@
 #include <string.h>
 #include <time.h>
 #include <unistd.h>
+#include <endian.h>
 #include <arpa/inet.h>
 #include <sys/time.h>
 
 #include <bpf/bpf.h>
+#include <test_maps.h>
 
 #include "bpf_util.h"
 
@@ -33,6 +35,22 @@ struct tlpm_node {
 	uint8_t key[];
 };
 
+struct lpm_trie_bytes_key {
+	union {
+		struct bpf_lpm_trie_key_hdr hdr;
+		__u32 prefixlen;
+	};
+	unsigned char data[8];
+};
+
+struct lpm_trie_int_key {
+	union {
+		struct bpf_lpm_trie_key_hdr hdr;
+		__u32 prefixlen;
+	};
+	unsigned int data;
+};
+
 static struct tlpm_node *tlpm_match(struct tlpm_node *list,
 				    const uint8_t *key,
 				    size_t n_bits);
@@ -770,6 +788,378 @@ static void test_lpm_multi_thread(void)
 	close(map_fd);
 }
 
+static int lpm_trie_create(unsigned int key_size, unsigned int value_size, unsigned int max_entries)
+{
+	LIBBPF_OPTS(bpf_map_create_opts, opts);
+	int fd;
+
+	opts.map_flags = BPF_F_NO_PREALLOC;
+	fd = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE, "lpm_trie", key_size, value_size, max_entries,
+			    &opts);
+	CHECK(fd < 0, "bpf_map_create", "error %d\n", errno);
+
+	return fd;
+}
+
+static void test_lpm_trie_update_flags(void)
+{
+	struct lpm_trie_int_key key;
+	unsigned int value, got;
+	int fd, err;
+
+	fd = lpm_trie_create(sizeof(key), sizeof(value), 3);
+
+	/* invalid flags (Error) */
+	key.prefixlen = 32;
+	key.data = 0;
+	value = 0;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_F_LOCK);
+	CHECK(err != -EINVAL, "invalid update flag", "error %d\n", err);
+
+	/* invalid flags (Error) */
+	key.prefixlen = 32;
+	key.data = 0;
+	value = 0;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST | BPF_EXIST);
+	CHECK(err != -EINVAL, "invalid update flag", "error %d\n", err);
+
+	/* overwrite an empty LPM trie (Error) */
+	key.prefixlen = 32;
+	key.data = 0;
+	value = 2;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+	CHECK(err != -ENOENT, "overwrite empty qp-trie", "error %d\n", err);
+
+	/* add a new node */
+	key.prefixlen = 16;
+	key.data = 0;
+	value = 1;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	CHECK(err, "add new elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup elem", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* add the same node as new node (Error) */
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	CHECK(err != -EEXIST, "add new elem again", "error %d\n", err);
+
+	/* overwrite the existed node */
+	value = 4;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+	CHECK(err, "overwrite elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup elem", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* overwrite the node */
+	value = 1;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_ANY);
+	CHECK(err, "update elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup elem", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* overwrite a non-existent node which is the prefix of the first
+	 * node (Error).
+	 */
+	key.prefixlen = 8;
+	key.data = 0;
+	value = 2;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+	CHECK(err != -ENOENT, "overwrite nonexistent elem", "error %d\n", err);
+
+	/* add a new node which is the prefix of the first node */
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	CHECK(err, "add new elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup key", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* add another new node which will be the sibling of the first node */
+	key.prefixlen = 9;
+	key.data = htobe32(1 << 23);
+	value = 5;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	CHECK(err, "add new elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup key", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* overwrite the third node */
+	value = 3;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_ANY);
+	CHECK(err, "overwrite elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup key", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* delete the second node to make it an intermediate node */
+	key.prefixlen = 8;
+	key.data = 0;
+	err = bpf_map_delete_elem(fd, &key);
+	CHECK(err, "del elem", "error %d\n", err);
+
+	/* overwrite the intermediate node (Error) */
+	value = 2;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+	CHECK(err != -ENOENT, "overwrite nonexistent elem", "error %d\n", err);
+
+	close(fd);
+}
+
+static void test_lpm_trie_update_full_map(void)
+{
+	struct lpm_trie_int_key key;
+	int value, got;
+	int fd, err;
+
+	fd = lpm_trie_create(sizeof(key), sizeof(value), 3);
+
+	/* add a new node */
+	key.prefixlen = 16;
+	key.data = 0;
+	value = 0;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	CHECK(err, "add new elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup elem", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* add new node */
+	key.prefixlen = 8;
+	key.data = 0;
+	value = 1;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	CHECK(err, "add new elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup elem", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* add new node */
+	key.prefixlen = 9;
+	key.data = htobe32(1 << 23);
+	value = 2;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	CHECK(err, "add new elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup elem", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* try to add more node (Error) */
+	key.prefixlen = 32;
+	key.data = 0;
+	value = 3;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_ANY);
+	CHECK(err != -ENOSPC, "add to full trie", "error %d\n", err);
+
+	/* update the value of an existed node with BPF_EXIST */
+	key.prefixlen = 16;
+	key.data = 0;
+	value = 4;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+	CHECK(err, "overwrite elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup elem", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	/* update the value of an existed node with BPF_ANY */
+	key.prefixlen = 9;
+	key.data = htobe32(1 << 23);
+	value = 5;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_ANY);
+	CHECK(err, "overwrite elem", "error %d\n", err);
+	got = 0;
+	err = bpf_map_lookup_elem(fd, &key, &got);
+	CHECK(err, "lookup elem", "error %d\n", err);
+	CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+	close(fd);
+}
+
+static int cmp_str(const void *a, const void *b)
+{
+	const char *str_a = *(const char **)a, *str_b = *(const char **)b;
+
+	return strcmp(str_a, str_b);
+}
+
+/* Save strings in LPM trie. The trailing '\0' for each string will be
+ * accounted in the prefixlen. The strings returned during the iteration
+ * should be sorted as expected.
+ */
+static void test_lpm_trie_iterate_strs(void)
+{
+	static const char * const keys[] = {
+		"ab", "abO", "abc", "abo", "abS", "abcd",
+	};
+	const char *sorted_keys[ARRAY_SIZE(keys)];
+	struct lpm_trie_bytes_key key, next_key;
+	unsigned int value, got, i, j, len;
+	struct lpm_trie_bytes_key *cur;
+	int fd, err;
+
+	fd = lpm_trie_create(sizeof(key), sizeof(value), ARRAY_SIZE(keys));
+
+	for (i = 0; i < ARRAY_SIZE(keys); i++) {
+		unsigned int flags;
+
+		/* add i-th element */
+		flags = i % 2 ? BPF_NOEXIST : 0;
+		len = strlen(keys[i]);
+		/* include the trailing '\0' */
+		key.prefixlen = (len + 1) * 8;
+		memset(key.data, 0, sizeof(key.data));
+		memcpy(key.data, keys[i], len);
+		value = i + 100;
+		err = bpf_map_update_elem(fd, &key, &value, flags);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+
+		err = bpf_map_lookup_elem(fd, &key, &got);
+		CHECK(err, "lookup elem", "#%u error %d\n", i, err);
+		CHECK(got != value, "lookup elem", "#%u expect %u got %u\n", i, value, got);
+
+		/* re-add i-th element (Error) */
+		err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		CHECK(err != -EEXIST, "re-add elem", "#%u error %d\n", i, err);
+
+		/* Overwrite i-th element */
+		flags = i % 2 ? 0 : BPF_EXIST;
+		value = i;
+		err = bpf_map_update_elem(fd, &key, &value, flags);
+		CHECK(err, "update elem", "error %d\n", err);
+
+		/* Lookup #[0~i] elements */
+		for (j = 0; j <= i; j++) {
+			len = strlen(keys[j]);
+			key.prefixlen = (len + 1) * 8;
+			memset(key.data, 0, sizeof(key.data));
+			memcpy(key.data, keys[j], len);
+			err = bpf_map_lookup_elem(fd, &key, &got);
+			CHECK(err, "lookup elem", "#%u/%u error %d\n", i, j, err);
+			CHECK(got != j, "lookup elem", "#%u/%u expect %u got %u\n",
+			      i, j, value, got);
+		}
+	}
+
+	/* Add element to a full LPM trie (Error) */
+	key.prefixlen = sizeof(key.data) * 8;
+	memset(key.data, 0, sizeof(key.data));
+	value = 0;
+	err = bpf_map_update_elem(fd, &key, &value, 0);
+	CHECK(err != -ENOSPC, "add to full qp-trie", "error %d\n", err);
+
+	/* Iterate sorted elements: no deletion */
+	memcpy(sorted_keys, keys, sizeof(keys));
+	qsort(sorted_keys, ARRAY_SIZE(sorted_keys), sizeof(sorted_keys[0]), cmp_str);
+	cur = NULL;
+	for (i = 0; i < ARRAY_SIZE(sorted_keys); i++) {
+		len = strlen(sorted_keys[i]);
+		err = bpf_map_get_next_key(fd, cur, &next_key);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+		CHECK(next_key.prefixlen != (len + 1) * 8, "iterate",
+		      "#%u invalid len %u expect %u\n",
+		      i, next_key.prefixlen, (len + 1) * 8);
+		CHECK(memcmp(sorted_keys[i], next_key.data, len + 1), "iterate",
+		      "#%u got %.*s exp %.*s\n", i, len, next_key.data, len, sorted_keys[i]);
+
+		cur = &next_key;
+	}
+	err = bpf_map_get_next_key(fd, cur, &next_key);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	/* Iterate sorted elements: delete the found key after each iteration */
+	cur = NULL;
+	for (i = 0; i < ARRAY_SIZE(sorted_keys); i++) {
+		len = strlen(sorted_keys[i]);
+		err = bpf_map_get_next_key(fd, cur, &next_key);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+		CHECK(next_key.prefixlen != (len + 1) * 8, "iterate",
+		      "#%u invalid len %u expect %u\n",
+		      i, next_key.prefixlen, (len + 1) * 8);
+		CHECK(memcmp(sorted_keys[i], next_key.data, len + 1), "iterate",
+		      "#%u got %.*s exp %.*s\n", i, len, next_key.data, len, sorted_keys[i]);
+
+		cur = &next_key;
+
+		err = bpf_map_delete_elem(fd, cur);
+		CHECK(err, "delete", "#%u error %d\n", i, err);
+	}
+	err = bpf_map_get_next_key(fd, cur, &next_key);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	close(fd);
+}
+
+/* Use the fixed prefixlen (32) and save integers in LPM trie. The iteration of
+ * LPM trie will return these integers in big-endian order, therefore, convert
+ * these integers to big-endian before update. After each iteration, delete the
+ * found key (the smallest integer) and expect the next iteration will return
+ * the second smallest number.
+ */
+static void test_lpm_trie_iterate_ints(void)
+{
+	struct lpm_trie_int_key key, next_key;
+	unsigned int i, max_entries;
+	struct lpm_trie_int_key *cur;
+	unsigned int *data_set;
+	int fd, err;
+	bool value;
+
+	max_entries = 4096;
+	data_set = calloc(max_entries, sizeof(*data_set));
+	CHECK(!data_set, "calloc", "no mem\n");
+	for (i = 0; i < max_entries; i++)
+		data_set[i] = i;
+
+	fd = lpm_trie_create(sizeof(key), sizeof(value), max_entries);
+	value = true;
+	for (i = 0; i < max_entries; i++) {
+		key.prefixlen = 32;
+		key.data = htobe32(data_set[i]);
+
+		err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	cur = NULL;
+	for (i = 0; i < max_entries; i++) {
+		err = bpf_map_get_next_key(fd, cur, &next_key);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+		CHECK(next_key.prefixlen != 32, "iterate", "#%u invalid len %u\n",
+		      i, next_key.prefixlen);
+		CHECK(be32toh(next_key.data) != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, be32toh(next_key.data), data_set[i]);
+		cur = &next_key;
+
+		/*
+		 * Delete the minimal key; the next call of
+		 * bpf_map_get_next_key() will return the second minimal key.
+		 */
+		err = bpf_map_delete_elem(fd, &next_key);
+		CHECK(err, "del elem", "#%u elem error %d\n", i, err);
+	}
+	err = bpf_map_get_next_key(fd, cur, &next_key);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	err = bpf_map_get_next_key(fd, NULL, &next_key);
+	CHECK(err != -ENOENT, "no-empty qp-trie", "error %d\n", err);
+
+	free(data_set);
+
+	close(fd);
+}
+
 void test_lpm_trie_map_basic_ops(void)
 {
 	int i;
@@ -789,5 +1179,10 @@ void test_lpm_trie_map_basic_ops(void)
 	test_lpm_get_next_key();
 	test_lpm_multi_thread();
 
+	test_lpm_trie_update_flags();
+	test_lpm_trie_update_full_map();
+	test_lpm_trie_iterate_strs();
+	test_lpm_trie_iterate_ints();
+
 	printf("%s: PASS\n", __func__);
 }
-- 
2.29.2



* Re: [PATCH bpf v3 0/9] Fixes for LPM trie
  2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
                   ` (8 preceding siblings ...)
  2024-12-06 11:06 ` [PATCH bpf v3 9/9] selftests/bpf: Add more test cases for LPM trie Hou Tao
@ 2024-12-06 17:40 ` patchwork-bot+netdevbpf
  9 siblings, 0 replies; 17+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-12-06 17:40 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, martin.lau, alexei.starovoitov, andrii, eddyz87, song,
	haoluo, yonghong.song, daniel, kpsingh, sdf, jolsa,
	john.fastabend, toke, bigeasy, tglx, linux, houtao1, xukuohai

Hello:

This series was applied to bpf/bpf.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Fri,  6 Dec 2024 19:06:13 +0800 you wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Hi,
> 
> This patch set fixes several issues for LPM trie. These issues were
> found during adding new test cases or were reported by syzbot.
> 
> [...]

Here is the summary with links:
  - [bpf,v3,1/9] bpf: Remove unnecessary check when updating LPM trie
    https://git.kernel.org/bpf/bpf/c/156c977c539e
  - [bpf,v3,2/9] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
    https://git.kernel.org/bpf/bpf/c/3d5611b4d7ef
  - [bpf,v3,3/9] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
    https://git.kernel.org/bpf/bpf/c/eae6a075e953
  - [bpf,v3,4/9] bpf: Handle in-place update for full LPM trie correctly
    https://git.kernel.org/bpf/bpf/c/532d6b36b2bf
  - [bpf,v3,5/9] bpf: Fix exact match conditions in trie_get_next_key()
    https://git.kernel.org/bpf/bpf/c/27abc7b3fa2e
  - [bpf,v3,6/9] bpf: Switch to bpf mem allocator for LPM trie
    https://git.kernel.org/bpf/bpf/c/3d8dc43eb2a3
  - [bpf,v3,7/9] bpf: Use raw_spinlock_t for LPM trie
    https://git.kernel.org/bpf/bpf/c/6a5c63d43c02
  - [bpf,v3,8/9] selftests/bpf: Move test_lpm_map.c to map_tests
    https://git.kernel.org/bpf/bpf/c/3e18f5f1e5a1
  - [bpf,v3,9/9] selftests/bpf: Add more test cases for LPM trie
    https://git.kernel.org/bpf/bpf/c/04d4ce91b0be

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie
  2024-12-06 11:06 ` [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
@ 2025-09-19 21:28   ` Andrii Nakryiko
  2025-09-23  1:33     ` Hou Tao
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2025-09-19 21:28 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

On Fri, Dec 6, 2024 at 2:54 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Multiple syzbot warnings have been reported. These warnings are mainly
> about the lock order between trie->lock and kmalloc()'s internal lock.
> See report [1] as an example:
>
> [...]
>
> [1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7
>
> A bpf program attached to trace_contention_end() triggers after
> acquiring &n->list_lock. The program invokes trie_delete_elem(), which
> then acquires trie->lock. However, it is possible that another
> process is invoking trie_update_elem(). trie_update_elem() will acquire
> trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke
> get_partial_node() and try to acquire &n->list_lock (not necessarily the
> same lock object). Therefore, lockdep warns about the circular locking
> dependency.
>
> Invoking kmalloc() before acquiring trie->lock could fix the warning.
> However, since BPF programs can be invoked from any context (e.g.,
> through kprobe/tracepoint/fentry), there may still be lock ordering
> problems for internal locks in kmalloc() or trie->lock itself.
>
> To eliminate these potential lock ordering problems with kmalloc()'s
> internal locks, replace kmalloc()/kfree()/kfree_rcu() with equivalent
> BPF memory allocator APIs that can be invoked in any context. The lock
> ordering problems with trie->lock (e.g., reentrance) will be handled
> separately.
>
> Three aspects of this change require explanation:
>
> 1. Intermediate and leaf nodes are allocated from the same allocator.
> Since the value size of an LPM trie is usually small, using a single
> allocator reduces the memory overhead of the BPF memory allocator.
>
> 2. Leaf nodes are allocated before disabling IRQs. This handles cases
> where leaf_size is large (e.g., > 4KB - 8) and updates require
> intermediate node allocation. If leaf nodes were allocated in the
> IRQ-disabled region, the free objects in the BPF memory allocator would
> not be refilled in time and the intermediate node allocation might fail.
>
> 3. Paired migrate_{disable|enable}() calls for node alloc and free. The
> BPF memory allocator uses per-CPU structs internally, so these paired
> calls are necessary to guarantee correctness.
>
> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  kernel/bpf/lpm_trie.c | 71 +++++++++++++++++++++++++++++--------------
>  1 file changed, 48 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 9ba6ae145239..f850360e75ce 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -15,6 +15,7 @@
>  #include <net/ipv6.h>
>  #include <uapi/linux/btf.h>
>  #include <linux/btf_ids.h>
> +#include <linux/bpf_mem_alloc.h>
>
>  /* Intermediate node */
>  #define LPM_TREE_NODE_FLAG_IM BIT(0)
> @@ -22,7 +23,6 @@
>  struct lpm_trie_node;
>
>  struct lpm_trie_node {
> -       struct rcu_head rcu;
>         struct lpm_trie_node __rcu      *child[2];
>         u32                             prefixlen;
>         u32                             flags;
> @@ -32,6 +32,7 @@ struct lpm_trie_node {
>  struct lpm_trie {
>         struct bpf_map                  map;
>         struct lpm_trie_node __rcu      *root;
> +       struct bpf_mem_alloc            ma;
>         size_t                          n_entries;
>         size_t                          max_prefixlen;
>         size_t                          data_size;
> @@ -287,17 +288,18 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)

Hey Hao,

We recently got a warning from trie_lookup_elem(), triggered by the

rcu_dereference_check(trie->root, rcu_read_lock_bh_held())

check, when an LPM trie map was used from a sleepable BPF program.

It seems like it can simply be converted to bpf_rcu_lock_held(), because
with your switch to bpf_mem_alloc all the nodes are now both RCU and
RCU Tasks Trace protected. Is my thinking correct?
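
Concretely, the conversion I have in mind is just (a sketch, not a
tested patch):

  /* in trie_lookup_elem(), today: */
  node = rcu_dereference_check(trie->root, rcu_read_lock_bh_held());

  /* proposed: bpf_rcu_lock_held() also covers plain RCU and RCU Tasks
   * Trace read sections, which is what sleepable programs hold: */
  node = rcu_dereference_check(trie->root, bpf_rcu_lock_held());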

Can you please double check? Thanks!

[...]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie
  2025-09-19 21:28   ` Andrii Nakryiko
@ 2025-09-23  1:33     ` Hou Tao
  2025-09-24 23:31       ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2025-09-23  1:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai



On 9/20/2025 5:28 AM, Andrii Nakryiko wrote:
> On Fri, Dec 6, 2024 at 2:54 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> [...]
> Hey Hao,

Hi Andrii,

Actually my name is Hou Tao :)
>
> We recently got a warning from trie_lookup_elem(), triggered by the
>
> rcu_dereference_check(trie->root, rcu_read_lock_bh_held())
>
> check, when an LPM trie map was used from a sleepable BPF program.
>
> It seems like it can simply be converted to bpf_rcu_lock_held(), because
> with your switch to bpf_mem_alloc all the nodes are now both RCU and
> RCU Tasks Trace protected. Is my thinking correct?
>
> Can you please double check? Thanks!

No. Although the node is freed only after both one RCU GP and one RCU
Tasks Trace GP have elapsed, the node may be reused after just one RCU
GP. Therefore, when a sleepable program looks up the trie, it may find
and read a reused node, and the returned result will be unexpected due
to the reuse.
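
To illustrate the window (a timeline sketch based on my understanding
of bpf_mem_alloc's free_rcu semantics, not code from the tree):

  sleepable prog (RCU Tasks Trace)     updater + bpf_mem_alloc
  --------------------------------     -----------------------
  lookup finds node N
                                       trie_delete_elem() frees N
                                       one RCU GP elapses
                                       N is reused for a new entry
  reads N: not a use-after-free
  (the memory returns to slab only
  after an additional RCU Tasks
  Trace GP), but N's key and value
  now belong to a different entry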

>
> [...]


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie
  2025-09-23  1:33     ` Hou Tao
@ 2025-09-24 23:31       ` Andrii Nakryiko
  2025-10-10 17:04         ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2025-09-24 23:31 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

On Mon, Sep 22, 2025 at 6:33 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
>
>
> On 9/20/2025 5:28 AM, Andrii Nakryiko wrote:
> > On Fri, Dec 6, 2024 at 2:54 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> [...]
> > Hey Hao,
>
> Hi Andrii,
>
> Actually my name is Hou Tao :)

Oops, I'm sorry for butchering your name, Hou!

> >
> > We recently got a warning from trie_lookup_elem(), triggered by the
> >
> > rcu_dereference_check(trie->root, rcu_read_lock_bh_held())
> >
> > check, when an LPM trie map was used from a sleepable BPF program.
> >
> > It seems like it can simply be converted to bpf_rcu_lock_held(), because
> > with your switch to bpf_mem_alloc all the nodes are now both RCU and
> > RCU Tasks Trace protected. Is my thinking correct?
> >
> > Can you please double check? Thanks!
>
> No. Although the node is freed only after both one RCU GP and one RCU
> Tasks Trace GP have elapsed, the node may be reused after just one RCU
> GP. Therefore, when a sleepable program looks up the trie, it may find
> and read a reused node, and the returned result will be unexpected due
> to the reuse.
>

That's two different things, no? Because we do lookups without a lock,
it's possible for a (non-sleepable or sleepable, doesn't matter) BPF
program to update/delete a trie node while some other BPF program looks
it up without locking. I don't think anything fundamentally changes
here. We have similar behavior for other BPF maps (hashmap, for
example), regardless of sleepable or not.
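
(For comparison, the hashmap lookup path already tolerates all three
read-side flavors; going from memory, __htab_map_lookup_elem() does
roughly:

  WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
               !rcu_read_lock_bh_held());

so a sleepable program doesn't trip it.)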

But the problem here is specifically about the overly eager
rcu_dereference_check() check. That memory is not going to be freed, so
it's "safe" to dereference it, even if it might be reused for another
node.

Either that, or we need to disable the LPM trie map for sleepable
programs until we somehow fix this memory reuse, no?

> >
> > [...]
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie
  2025-09-24 23:31       ` Andrii Nakryiko
@ 2025-10-10 17:04         ` Andrii Nakryiko
  2025-10-13  0:59           ` Hou Tao
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2025-10-10 17:04 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

On Wed, Sep 24, 2025 at 4:31 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Sep 22, 2025 at 6:33 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> >
> >
> > On 9/20/2025 5:28 AM, Andrii Nakryiko wrote:
> > > On Fri, Dec 6, 2024 at 2:54 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> [...]
> > > Hey Hao,
> >
> > Hi Andrii,
> >
> > Actually my name is Hou Tao :)
>
> Oops, I'm sorry for butchering your name, Hou!
>
> > >
> > > We recently got a warning from trie_lookup_elem(), triggered by the
> > >
> > > rcu_dereference_check(trie->root, rcu_read_lock_bh_held())
> > >
> > > check, when an LPM trie map was used from a sleepable BPF program.
> > >
> > > It seems like it can simply be converted to bpf_rcu_lock_held(), because
> > > with your switch to bpf_mem_alloc all the nodes are now both RCU and
> > > RCU Tasks Trace protected. Is my thinking correct?
> > >
> > > Can you please double check? Thanks!
> >
> > No. Although the node is freed only after both one RCU GP and one RCU
> > Tasks Trace GP have elapsed, the node may be reused after just one RCU
> > GP. Therefore, when a sleepable program looks up the trie, it may find
> > and read a reused node, and the returned result will be unexpected due
> > to the reuse.
> >
>
> That's two different things, no? Because we do lookups without a lock,
> it's possible for a (non-sleepable or sleepable, doesn't matter) BPF
> program to update/delete a trie node while some other BPF program looks
> it up without locking. I don't think anything fundamentally changes
> here. We have similar behavior for other BPF maps (hashmap, for
> example), regardless of sleepable or not.
>
> But the problem here is specifically about the overly eager
> rcu_dereference_check() check. That memory is not going to be freed, so
> it's "safe" to dereference it, even if it might be reused for another
> node.
>
> Either that, or we need to disable the LPM trie map for sleepable
> programs until we somehow fix this memory reuse, no?

Hey Hou,

So, it's actually more interesting than what I initially thought.
LPM_TRIE is not currently allowed for sleepable programs, but we have
a bug in the verifier where you can sneak such a map into a sleepable
program through map-in-map. It seems like we are missing a check of the
inner map for sleepable compatibility.

The question for you is: how hard is it to make LPM_TRIE support
sleepable programs? You mentioned some unintended memory reuse; can
that be fixed/changed? Can you please take a look? Thanks!
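
For the record, the smuggling looks roughly like this on the BPF side
(an illustrative sketch, not a tested reproducer; map sizes, names and
the sleepable attach point are made up):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  struct inner_trie {
          __uint(type, BPF_MAP_TYPE_LPM_TRIE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __uint(max_entries, 16);
          __uint(key_size, 8);    /* 4-byte prefixlen + 4 data bytes */
          __uint(value_size, 4);
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
          __uint(max_entries, 1);
          __type(key, __u32);
          __array(values, struct inner_trie);
  } outer SEC(".maps");

  SEC("fentry.s/do_nanosleep")    /* sleepable program */
  int BPF_PROG(sneak_lpm_lookup)
  {
          struct { __u32 prefixlen; __u32 data; } key = { 32, 0 };
          __u32 idx = 0;
          void *inner;

          /* the verifier doesn't check the inner map's sleepable
           * compatibility here, so this loads: */
          inner = bpf_map_lookup_elem(&outer, &idx);
          if (inner)
                  bpf_map_lookup_elem(inner, &key);
          return 0;
  }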

>
> > >
> > > [...]
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie
  2025-10-10 17:04         ` Andrii Nakryiko
@ 2025-10-13  0:59           ` Hou Tao
  2025-10-13 23:18             ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2025-10-13  0:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

Hi Andrii,

On 10/11/2025 1:04 AM, Andrii Nakryiko wrote:
> On Wed, Sep 24, 2025 at 4:31 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Mon, Sep 22, 2025 at 6:33 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>
>>>
>>> On 9/20/2025 5:28 AM, Andrii Nakryiko wrote:
>>>> On Fri, Dec 6, 2024 at 2:54 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> [...]
>>>> Hey Hao,
>>> Hi Andrii,
>>>
>>> Actually my name is Hou Tao :)
>> Oops, I'm sorry for butchering your name, Hou!
>>
>>>> We recently got a warning from trie_lookup_elem(), triggered by the
>>>>
>>>> rcu_dereference_check(trie->root, rcu_read_lock_bh_held())
>>>>
>>>> check, when an LPM trie map was used from a sleepable BPF program.
>>>>
>>>> It seems like it can simply be converted to bpf_rcu_lock_held(), because
>>>> with your switch to bpf_mem_alloc all the nodes are now both RCU and
>>>> RCU Tasks Trace protected. Is my thinking correct?
>>>>
>>>> Can you please double check? Thanks!
>>> No. Although the node is freed only after both one RCU GP and one RCU
>>> Tasks Trace GP have elapsed, the node may be reused after just one RCU
>>> GP. Therefore, when a sleepable program looks up the trie, it may find
>>> and read a reused node, and the returned result will be unexpected due
>>> to the reuse.
>>>

Sorry for the delayed response.
>> That's two different things, no? Because we do lookups without a lock,
>> it's possible for a (non-sleepable or sleepable, doesn't matter) BPF
>> program to update/delete a trie node while some other BPF program looks
>> it up without locking. I don't think anything fundamentally changes
>> here. We have similar behavior for other BPF maps (hashmap, for
>> example), regardless of sleepable or not.
>>
>> But the problem here is specifically about the overly eager
>> rcu_dereference_check() check. That memory is not going to be freed, so
>> it's "safe" to dereference it, even if it might be reused for another
>> node.
>>
>> Either that, or we need to disable the LPM trie map for sleepable
>> programs until we somehow fix this memory reuse, no?
> Hey Hou,
>
> So, it's actually more interesting than what I initially thought.
> LPM_TRIE is not currently allowed for sleepable programs, but we have
> a bug in the verifier where you can sneak such a map into a sleepable
> program through map-in-map. It seems like we are missing a check of the
> inner map for sleepable compatibility.
>
> The question for you is: how hard is it to make LPM_TRIE support
> sleepable programs? You mentioned some unintended memory reuse; can
> that be fixed/changed? Can you please take a look? Thanks!

Instead of updating LPM_TRIE to handle memory reuse, why not use
bpf_rcu_read_lock/bpf_rcu_read_unlock to ensure the reuse will not
happen during the lookup?
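
Something like this on the program side (a sketch; "trie" stands for an
LPM trie map reachable by the sleepable program, and it assumes the
verifier learns to accept/enforce this pairing for map lookups):

  void bpf_rcu_read_lock(void) __ksym;
  void bpf_rcu_read_unlock(void) __ksym;

  SEC("fentry.s/do_nanosleep")
  int BPF_PROG(lpm_lookup_under_rcu)
  {
          struct { __u32 prefixlen; __u32 data; } key = { 32, 0 };
          __u32 *val;

          bpf_rcu_read_lock();
          /* no RCU GP (and hence no node reuse) can complete while
           * the trie is walked inside this section */
          val = bpf_map_lookup_elem(&trie, &key);
          if (val) {
                  /* consume *val before unlocking */
          }
          bpf_rcu_read_unlock();
          return 0;
  }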
>
>>>> [...]
> .


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie
  2025-10-13  0:59           ` Hou Tao
@ 2025-10-13 23:18             ` Andrii Nakryiko
  0 siblings, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2025-10-13 23:18 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Toke Høiland-Jørgensen,
	Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
	houtao1, xukuohai

On Sun, Oct 12, 2025 at 5:59 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi Andrii,
>
> On 10/11/2025 1:04 AM, Andrii Nakryiko wrote:
> > On Wed, Sep 24, 2025 at 4:31 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> >> On Mon, Sep 22, 2025 at 6:33 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>
> >>>
> >>> On 9/20/2025 5:28 AM, Andrii Nakryiko wrote:
> >>>> On Fri, Dec 6, 2024 at 2:54 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>>> [...]
> >>>> Hey Hao,
> >>> Hi Andrii,
> >>>
> >>> Actually my name is Hou Tao :)
> >> Oops, I'm sorry for butchering your name, Hou!
> >>
> >>>> We recently got a warning from trie_lookup_elem(), triggered by the
> >>>>
> >>>> rcu_dereference_check(trie->root, rcu_read_lock_bh_held())
> >>>>
> >>>> check, when an LPM trie map was used from a sleepable BPF program.
> >>>>
> >>>> It seems like it can simply be converted to bpf_rcu_lock_held(), because
> >>>> with your switch to bpf_mem_alloc all the nodes are now both RCU and
> >>>> RCU Tasks Trace protected. Is my thinking correct?
> >>>>
> >>>> Can you please double check? Thanks!
> >>> No. Although the node is freed only after both one RCU GP and one RCU
> >>> Tasks Trace GP have elapsed, the node may be reused after just one RCU
> >>> GP. Therefore, when a sleepable program looks up the trie, it may find
> >>> and read a reused node, and the returned result will be unexpected due
> >>> to the reuse.
> >>>
>
> Sorry for the delayed response.
> >> That's two different things, no? Because we do lookups without a lock,
> >> it's possible for a (non-sleepable or sleepable, doesn't matter) BPF
> >> program to update/delete a trie node while some other BPF program looks
> >> it up without locking. I don't think anything fundamentally changes
> >> here. We have similar behavior for other BPF maps (hashmap, for
> >> example), regardless of sleepable or not.
> >>
> >> But the problem here is specifically about the overly eager
> >> rcu_dereference_check() check. That memory is not going to be freed, so
> >> it's "safe" to dereference it, even if it might be reused for another
> >> node.
> >>
> >> Either that, or we need to disable the LPM trie map for sleepable
> >> programs until we somehow fix this memory reuse, no?
> > Hey Hou,
> >
> > So, it's actually more interesting than what I initially thought.
> > LPM_TRIE is not currently allowed for sleepable programs, but we have
> > a bug in the verifier where you can sneak such a map into a sleepable
> > program through map-in-map. It seems like we are missing a check of the
> > inner map for sleepable compatibility.
> >
> > The question for you is: how hard is it to make LPM_TRIE support
> > sleepable programs? You mentioned some unintended memory reuse; can
> > that be fixed/changed? Can you please take a look? Thanks!
>
> Instead of updating LPM_TRIE to handle memory reuse, why not use
> bpf_rcu_read_lock/bpf_rcu_read_unlock to ensure the reuse will not
> happen during the lookup?

That should work as well, I suppose. We'll need some changes on the
verifier side to support this kind of enforcement for map operations,
is that right? Do you plan to work on this?

> >
> >>>> [...]
> > .
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-10-13 23:19 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-06 11:06 [PATCH bpf v3 0/9] Fixes for LPM trie Hou Tao
2024-12-06 11:06 ` [PATCH bpf v3 1/9] bpf: Remove unnecessary check when updating " Hou Tao
2024-12-06 11:06 ` [PATCH bpf v3 2/9] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem Hou Tao
2024-12-06 11:06 ` [PATCH bpf v3 3/9] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
2024-12-06 11:06 ` [PATCH bpf v3 4/9] bpf: Handle in-place update for full LPM trie correctly Hou Tao
2024-12-06 11:06 ` [PATCH bpf v3 5/9] bpf: Fix exact match conditions in trie_get_next_key() Hou Tao
2024-12-06 11:06 ` [PATCH bpf v3 6/9] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
2025-09-19 21:28   ` Andrii Nakryiko
2025-09-23  1:33     ` Hou Tao
2025-09-24 23:31       ` Andrii Nakryiko
2025-10-10 17:04         ` Andrii Nakryiko
2025-10-13  0:59           ` Hou Tao
2025-10-13 23:18             ` Andrii Nakryiko
2024-12-06 11:06 ` [PATCH bpf v3 7/9] bpf: Use raw_spinlock_t " Hou Tao
2024-12-06 11:06 ` [PATCH bpf v3 8/9] selftests/bpf: Move test_lpm_map.c to map_tests Hou Tao
2024-12-06 11:06 ` [PATCH bpf v3 9/9] selftests/bpf: Add more test cases for LPM trie Hou Tao
2024-12-06 17:40 ` [PATCH bpf v3 0/9] Fixes " patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox