* [PATCH bpf-next 00/10] Fixes for LPM trie
@ 2024-11-18 1:07 Hou Tao
2024-11-18 1:07 ` [PATCH bpf-next 01/10] bpf: Remove unnecessary check when updating " Hou Tao
` (11 more replies)
0 siblings, 12 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:07 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
Hi,
This patch set fixes several issues in the LPM trie. These issues were
found while adding new test cases or were reported by syzbot.
The patch set is structured as follows:
Patch #1~#2 are clean-ups for lpm_trie_update_elem().
Patch #3 handles BPF_EXIST and BPF_NOEXIST correctly for LPM trie.
Patch #4 fixes the accounting of n_entries when doing an in-place update.
Patch #5 fixes the exact-match condition in trie_get_next_key(), which
otherwise may skip keys when the passed key is not found in the map.
Patch #6~#7 switch from kmalloc() to the bpf memory allocator for LPM
trie to fix several lock-order warnings reported by syzbot.
Patch #8 enables raw_spinlock_t for LPM trie again after switching to
the bpf memory allocator.
Patch #9 moves test_lpm_map to map_tests to make it run regularly.
Patch #10 adds more basic test cases for LPM trie.
Please see individual patches for more details. Comments are always
welcome.
Hou Tao (10):
bpf: Remove unnecessary check when updating LPM trie
bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
bpf: Handle in-place update for full LPM trie correctly
bpf: Fix exact match conditions in trie_get_next_key()
bpf: Add bpf_mem_cache_is_mergeable() helper
bpf: Switch to bpf mem allocator for LPM trie
bpf: Use raw_spinlock_t for LPM trie
selftests/bpf: Move test_lpm_map.c to map_tests
selftests/bpf: Add more test cases for LPM trie
include/linux/bpf_mem_alloc.h | 1 +
kernel/bpf/lpm_trie.c | 158 +++++--
kernel/bpf/memalloc.c | 12 +
tools/testing/selftests/bpf/.gitignore | 1 -
tools/testing/selftests/bpf/Makefile | 2 +-
.../lpm_trie_map_basic_ops.c} | 401 +++++++++++++++++-
6 files changed, 523 insertions(+), 52 deletions(-)
rename tools/testing/selftests/bpf/{test_lpm_map.c => map_tests/lpm_trie_map_basic_ops.c} (64%)
--
2.29.2
* [PATCH bpf-next 01/10] bpf: Remove unnecessary check when updating LPM trie
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
@ 2024-11-18 1:07 ` Hou Tao
2024-11-21 10:22 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 02/10] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem Hou Tao
` (10 subsequent siblings)
11 siblings, 1 reply; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:07 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
When "node->prefixlen == matchlen" is true, the node is fully matched.
If "node->prefixlen == key->prefixlen" is then false, the prefix length
of the key must be greater than the prefix length of the node;
otherwise, matchlen would not be equal to node->prefixlen. This in turn
implies that the prefix length of the node must be less than
max_prefixlen.
Therefore, "node->prefixlen == trie->max_prefixlen" is always false
when the check "node->prefixlen == key->prefixlen" returns false.
Remove this unnecessary comparison.
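The argument can be checked exhaustively in a small userspace model
(illustrative only; MAX_PREFIXLEN, the invariants, and the function
names are ours, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PREFIXLEN 32	/* e.g. an IPv4 trie */

/* Loop-break condition before the patch. n = node->prefixlen,
 * k = key->prefixlen, m = matchlen.
 */
static bool break_before(unsigned int n, unsigned int k, unsigned int m)
{
	return n != m || n == k || n == MAX_PREFIXLEN;
}

/* Loop-break condition after the patch: the max_prefixlen test is gone. */
static bool break_after(unsigned int n, unsigned int k, unsigned int m)
{
	return n != m || n == k;
}

/* longest_prefix_match() guarantees matchlen <= min(n, k), and both
 * prefix lengths are capped at max_prefixlen. Under those invariants
 * the two conditions never disagree.
 */
static bool conditions_equivalent(void)
{
	for (unsigned int n = 0; n <= MAX_PREFIXLEN; n++)
		for (unsigned int k = 0; k <= MAX_PREFIXLEN; k++)
			for (unsigned int m = 0; m <= n && m <= k; m++)
				if (break_before(n, k, m) != break_after(n, k, m))
					return false;
	return true;
}
```

The enumeration covers every reachable (n, k, m) triple, so dropping
the comparison cannot change the loop's behavior.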
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/lpm_trie.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9b60eda0f727..73fd593d3745 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -364,8 +364,7 @@ static long trie_update_elem(struct bpf_map *map,
matchlen = longest_prefix_match(trie, node, key);
if (node->prefixlen != matchlen ||
- node->prefixlen == key->prefixlen ||
- node->prefixlen == trie->max_prefixlen)
+ node->prefixlen == key->prefixlen)
break;
next_bit = extract_bit(key->data, node->prefixlen);
--
2.29.2
* [PATCH bpf-next 02/10] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
2024-11-18 1:07 ` [PATCH bpf-next 01/10] bpf: Remove unnecessary check when updating " Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-21 10:25 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
` (9 subsequent siblings)
11 siblings, 1 reply; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
There is no need to call kfree(im_node) when updating an element
fails, because im_node must be NULL on that path. Remove the
unnecessary kfree() for im_node.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/lpm_trie.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 73fd593d3745..c6f036e3044b 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -431,9 +431,7 @@ static long trie_update_elem(struct bpf_map *map,
if (ret) {
if (new_node)
trie->n_entries--;
-
kfree(new_node);
- kfree(im_node);
}
spin_unlock_irqrestore(&trie->lock, irq_flags);
--
2.29.2
* [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
2024-11-18 1:07 ` [PATCH bpf-next 01/10] bpf: Remove unnecessary check when updating " Hou Tao
2024-11-18 1:08 ` [PATCH bpf-next 02/10] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-18 13:39 ` Thomas Weißschuh
2024-11-21 10:32 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly Hou Tao
` (8 subsequent siblings)
11 siblings, 2 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
An update of an LPM trie may exactly match an existing node; therefore,
add the missing handling for the BPF_EXIST and BPF_NOEXIST flags.
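As a hedged sketch of the semantics the patch adds (the model below is
a userspace simplification; only the BPF_* flag values are taken from
include/uapi/linux/bpf.h, the rest is illustrative):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flag values from include/uapi/linux/bpf.h */
#define BPF_ANY		0
#define BPF_NOEXIST	1
#define BPF_EXIST	2

/* Hypothetical model of the checks this patch adds: @exact is true
 * when the walk found a non-intermediate node that exactly matches
 * the new key.
 */
static int lpm_check_flags(bool exact, unsigned long long flags)
{
	if (flags == BPF_NOEXIST && exact)
		return -EEXIST;	/* caller demanded a fresh entry */
	if (flags == BPF_EXIST && !exact)
		return -ENOENT;	/* caller demanded a replacement */
	return 0;
}
```

BPF_ANY keeps the old behavior; only the two restrictive flags gain
error paths.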
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/lpm_trie.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index c6f036e3044b..4300bd51ec6e 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -375,6 +375,10 @@ static long trie_update_elem(struct bpf_map *map,
* simply assign the @new_node to that slot and be done.
*/
if (!node) {
+ if (flags == BPF_EXIST) {
+ ret = -ENOENT;
+ goto out;
+ }
rcu_assign_pointer(*slot, new_node);
goto out;
}
@@ -383,18 +387,31 @@ static long trie_update_elem(struct bpf_map *map,
* which already has the correct data array set.
*/
if (node->prefixlen == matchlen) {
+ if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) {
+ if (flags == BPF_NOEXIST) {
+ ret = -EEXIST;
+ goto out;
+ }
+ trie->n_entries--;
+ } else if (flags == BPF_EXIST) {
+ ret = -ENOENT;
+ goto out;
+ }
+
new_node->child[0] = node->child[0];
new_node->child[1] = node->child[1];
- if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
- trie->n_entries--;
-
rcu_assign_pointer(*slot, new_node);
free_node = node;
goto out;
}
+ if (flags == BPF_EXIST) {
+ ret = -ENOENT;
+ goto out;
+ }
+
/* If the new node matches the prefix completely, it must be inserted
* as an ancestor. Simply insert it between @node and *@slot.
*/
--
2.29.2
* [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
` (2 preceding siblings ...)
2024-11-18 1:08 ` [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-18 13:13 ` Sebastian Andrzej Siewior
2024-11-21 10:53 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 05/10] bpf: Fix exact match conditions in trie_get_next_key() Hou Tao
` (7 subsequent siblings)
11 siblings, 2 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
When an LPM trie is full, an in-place update of an existing element
incorrectly returns -ENOSPC.
Fix this by deferring the check of trie->n_entries: a new insertion
must not push n_entries past max_entries, but an in-place update is
allowed even when the trie is full.
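A userspace sketch of the fixed decision logic (names and structure
are ours; flag handling from patch #3 is omitted for brevity):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define BPF_EXIST 2	/* from include/uapi/linux/bpf.h */

/* Mirrors the trie_check_noreplace_update() helper this patch adds:
 * the capacity check applies only when a new entry would be created.
 */
static int check_noreplace_update(size_t n_entries, size_t max_entries,
				  unsigned long long flags)
{
	if (flags == BPF_EXIST)
		return -ENOENT;	/* there is no entry to replace */
	if (n_entries == max_entries)
		return -ENOSPC;
	return 0;
}

/* Simplified update path: an exact match replaces in place and never
 * consults capacity; only insertions of new entries can hit -ENOSPC.
 */
static int model_update(bool exact_match, size_t n_entries,
			size_t max_entries, unsigned long long flags)
{
	if (exact_match)
		return 0;	/* in-place update, allowed even when full */
	return check_noreplace_update(n_entries, max_entries, flags);
}
```

Before the patch the capacity check ran unconditionally at the top of
trie_update_elem(), so the exact-match path above could never be
reached on a full trie.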
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/lpm_trie.c | 46 +++++++++++++++++++++----------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 4300bd51ec6e..ff57e35357ae 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -310,6 +310,15 @@ static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
return node;
}
+static int trie_check_noreplace_update(const struct lpm_trie *trie, u64 flags)
+{
+ if (flags == BPF_EXIST)
+ return -ENOENT;
+ if (trie->n_entries == trie->map.max_entries)
+ return -ENOSPC;
+ return 0;
+}
+
/* Called from syscall or from eBPF program */
static long trie_update_elem(struct bpf_map *map,
void *_key, void *value, u64 flags)
@@ -333,20 +342,12 @@ static long trie_update_elem(struct bpf_map *map,
spin_lock_irqsave(&trie->lock, irq_flags);
/* Allocate and fill a new node */
-
- if (trie->n_entries == trie->map.max_entries) {
- ret = -ENOSPC;
- goto out;
- }
-
new_node = lpm_trie_node_alloc(trie, value);
if (!new_node) {
ret = -ENOMEM;
goto out;
}
- trie->n_entries++;
-
new_node->prefixlen = key->prefixlen;
RCU_INIT_POINTER(new_node->child[0], NULL);
RCU_INIT_POINTER(new_node->child[1], NULL);
@@ -375,10 +376,11 @@ static long trie_update_elem(struct bpf_map *map,
* simply assign the @new_node to that slot and be done.
*/
if (!node) {
- if (flags == BPF_EXIST) {
- ret = -ENOENT;
+ ret = trie_check_noreplace_update(trie, flags);
+ if (ret)
goto out;
- }
+
+ trie->n_entries++;
rcu_assign_pointer(*slot, new_node);
goto out;
}
@@ -392,10 +394,11 @@ static long trie_update_elem(struct bpf_map *map,
ret = -EEXIST;
goto out;
}
- trie->n_entries--;
- } else if (flags == BPF_EXIST) {
- ret = -ENOENT;
- goto out;
+ } else {
+ ret = trie_check_noreplace_update(trie, flags);
+ if (ret)
+ goto out;
+ trie->n_entries++;
}
new_node->child[0] = node->child[0];
@@ -407,15 +410,15 @@ static long trie_update_elem(struct bpf_map *map,
goto out;
}
- if (flags == BPF_EXIST) {
- ret = -ENOENT;
+ ret = trie_check_noreplace_update(trie, flags);
+ if (ret)
goto out;
- }
/* If the new node matches the prefix completely, it must be inserted
* as an ancestor. Simply insert it between @node and *@slot.
*/
if (matchlen == key->prefixlen) {
+ trie->n_entries++;
next_bit = extract_bit(node->data, matchlen);
rcu_assign_pointer(new_node->child[next_bit], node);
rcu_assign_pointer(*slot, new_node);
@@ -428,6 +431,7 @@ static long trie_update_elem(struct bpf_map *map,
goto out;
}
+ trie->n_entries++;
im_node->prefixlen = matchlen;
im_node->flags |= LPM_TREE_NODE_FLAG_IM;
memcpy(im_node->data, node->data, trie->data_size);
@@ -445,12 +449,8 @@ static long trie_update_elem(struct bpf_map *map,
rcu_assign_pointer(*slot, im_node);
out:
- if (ret) {
- if (new_node)
- trie->n_entries--;
+ if (ret)
kfree(new_node);
- }
-
spin_unlock_irqrestore(&trie->lock, irq_flags);
kfree_rcu(free_node, rcu);
--
2.29.2
* [PATCH bpf-next 05/10] bpf: Fix exact match conditions in trie_get_next_key()
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
` (3 preceding siblings ...)
2024-11-18 1:08 ` [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-21 11:01 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 06/10] bpf: Add bpf_mem_cache_is_mergeable() helper Hou Tao
` (6 subsequent siblings)
11 siblings, 1 reply; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
an exact match. However, this is incorrect: when the target key does
not fully match the found node (i.e., node->prefixlen != matchlen),
the two may still have the same prefixlen. The function returns the
expected result when the passed key exists in the trie, but when a
recently-deleted or nonexistent key is passed, it may skip keys and
return an incorrect result.
Fix it by using node->prefixlen == matchlen to identify exact matches.
When this condition holds after the search, it also implies that
node->prefixlen equals key->prefixlen; otherwise, the search would
have returned NULL.
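The failure mode can be reproduced with a toy one-byte model (a
simplification of the kernel's longest_prefix_match(); all names here
are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Bit-by-bit longest common prefix of two one-byte keys, capped at
 * @limit bits.
 */
static unsigned int lpm_match(unsigned char a, unsigned char b,
			      unsigned int limit)
{
	unsigned int i;

	for (i = 0; i < limit; i++)
		if (((a >> (7 - i)) & 1) != ((b >> (7 - i)) & 1))
			break;
	return i;
}

/* Old condition: equal prefix lengths pass as an "exact match". */
static bool exact_old(unsigned int node_plen, unsigned int key_plen)
{
	return node_plen == key_plen;
}

/* New condition: the node must actually be fully matched. */
static bool exact_new(unsigned int node_plen, unsigned int matchlen)
{
	return node_plen == matchlen;
}
```

For node 0x00/1 and a nonexistent search key 0x80/1, matchlen is 0:
the old check wrongly reports an exact match (1 == 1), while the new
check correctly rejects it (1 != 0).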
Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/lpm_trie.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index ff57e35357ae..d447a6dab83b 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -647,7 +647,7 @@ static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
struct lpm_trie_node **node_stack = NULL;
int err = 0, stack_ptr = -1;
unsigned int next_bit;
- size_t matchlen;
+ size_t matchlen = 0;
/* The get_next_key follows postorder. For the 4 node example in
* the top of this file, the trie_get_next_key() returns the following
@@ -686,7 +686,7 @@ static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
next_bit = extract_bit(key->data, node->prefixlen);
node = rcu_dereference(node->child[next_bit]);
}
- if (!node || node->prefixlen != key->prefixlen ||
+ if (!node || node->prefixlen != matchlen ||
(node->flags & LPM_TREE_NODE_FLAG_IM))
goto find_leftmost;
--
2.29.2
* [PATCH bpf-next 06/10] bpf: Add bpf_mem_cache_is_mergeable() helper
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
` (4 preceding siblings ...)
2024-11-18 1:08 ` [PATCH bpf-next 05/10] bpf: Fix exact match conditions in trie_get_next_key() Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-18 13:29 ` Thomas Weißschuh
2024-11-18 1:08 ` [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
` (5 subsequent siblings)
11 siblings, 1 reply; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
Add bpf_mem_cache_is_mergeable() to check whether two bpf memory
allocators for fixed-size objects are mergeable. Merging them reduces
the memory overhead of the bpf memory allocator.
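A hedged userspace sketch of the non-percpu case: the real helper uses
kmalloc_size_roundup(), whose result depends on the slab configuration
(96- and 192-byte caches exist too); a power-of-two roundup and an
assumed 8-byte llist_node header are enough for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LLIST_NODE_SZ 8	/* assumption: freelist header size on 64-bit */

/* Simplified stand-in for kmalloc_size_roundup(). */
static size_t roundup_pow2(size_t size)
{
	size_t s = 8;

	while (s < size)
		s <<= 1;
	return s;
}

/* Two fixed-size (non-percpu) caches can be merged when their objects,
 * including the freelist header, land in the same slab size.
 */
static bool mergeable(size_t size, size_t new_size)
{
	if (!size || !new_size)
		return false;	/* only for fixed-size object allocators */
	return roundup_pow2(size + LLIST_NODE_SZ) ==
	       roundup_pow2(new_size + LLIST_NODE_SZ);
}
```

For example, 20- and 24-byte objects both round up to a 32-byte slab
object and are mergeable, whereas 20 and 40 bytes are not.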
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
include/linux/bpf_mem_alloc.h | 1 +
kernel/bpf/memalloc.c | 12 ++++++++++++
2 files changed, 13 insertions(+)
diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index e45162ef59bb..faa54b9c7a04 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -47,5 +47,6 @@ void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
void bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
void bpf_mem_cache_raw_free(void *ptr);
void *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags);
+bool bpf_mem_cache_is_mergeable(size_t size, size_t new_size, bool percpu);
#endif /* _BPF_MEM_ALLOC_H */
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 889374722d0a..49dd08ad1d4f 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -1014,3 +1014,15 @@ int bpf_mem_alloc_check_size(bool percpu, size_t size)
return 0;
}
+
+bool bpf_mem_cache_is_mergeable(size_t size, size_t new_size, bool percpu)
+{
+ /* Only for fixed-size object allocator */
+ if (!size || !new_size)
+ return false;
+
+ return (percpu && ALIGN(size, PCPU_MIN_ALLOC_SIZE) ==
+ ALIGN(new_size, PCPU_MIN_ALLOC_SIZE)) ||
+ (!percpu && kmalloc_size_roundup(size + LLIST_NODE_SZ) ==
+ kmalloc_size_roundup(new_size + LLIST_NODE_SZ));
+}
--
2.29.2
* [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
` (5 preceding siblings ...)
2024-11-18 1:08 ` [PATCH bpf-next 06/10] bpf: Add bpf_mem_cache_is_mergeable() helper Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-18 13:30 ` Sebastian Andrzej Siewior
` (2 more replies)
2024-11-18 1:08 ` [PATCH bpf-next 08/10] bpf: Use raw_spinlock_t " Hou Tao
` (4 subsequent siblings)
11 siblings, 3 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
Multiple syzbot warnings have been reported. These warnings are mainly
about the lock order between trie->lock and kmalloc()'s internal
locks. See report [1] as an example:
======================================================
WARNING: possible circular locking dependency detected
6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted
------------------------------------------------------
syz.3.2069/15008 is trying to acquire lock:
ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ...
but task is already holding lock:
ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ...
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&trie->lock){-.-.}-{2:2}:
__raw_spin_lock_irqsave
_raw_spin_lock_irqsave+0x3a/0x60
trie_delete_elem+0xb0/0x820
___bpf_prog_run+0x3e51/0xabd0
__bpf_prog_run32+0xc1/0x100
bpf_dispatcher_nop_func
......
bpf_trace_run2+0x231/0x590
__bpf_trace_contention_end+0xca/0x110
trace_contention_end.constprop.0+0xea/0x170
__pv_queued_spin_lock_slowpath+0x28e/0xcc0
pv_queued_spin_lock_slowpath
queued_spin_lock_slowpath
queued_spin_lock
do_raw_spin_lock+0x210/0x2c0
__raw_spin_lock_irqsave
_raw_spin_lock_irqsave+0x42/0x60
__put_partials+0xc3/0x170
qlink_free
qlist_free_all+0x4e/0x140
kasan_quarantine_reduce+0x192/0x1e0
__kasan_slab_alloc+0x69/0x90
kasan_slab_alloc
slab_post_alloc_hook
slab_alloc_node
kmem_cache_alloc_node_noprof+0x153/0x310
__alloc_skb+0x2b1/0x380
......
-> #0 (&n->list_lock){-.-.}-{2:2}:
check_prev_add
check_prevs_add
validate_chain
__lock_acquire+0x2478/0x3b30
lock_acquire
lock_acquire+0x1b1/0x560
__raw_spin_lock_irqsave
_raw_spin_lock_irqsave+0x3a/0x60
get_partial_node.part.0+0x20/0x350
get_partial_node
get_partial
___slab_alloc+0x65b/0x1870
__slab_alloc.constprop.0+0x56/0xb0
__slab_alloc_node
slab_alloc_node
__do_kmalloc_node
__kmalloc_node_noprof+0x35c/0x440
kmalloc_node_noprof
bpf_map_kmalloc_node+0x98/0x4a0
lpm_trie_node_alloc
trie_update_elem+0x1ef/0xe00
bpf_map_update_value+0x2c1/0x6c0
map_update_elem+0x623/0x910
__sys_bpf+0x90c/0x49a0
...
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&trie->lock);
lock(&n->list_lock);
lock(&trie->lock);
lock(&n->list_lock);
*** DEADLOCK ***
[1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7
A bpf program attached to trace_contention_end() triggers after
acquiring &n->list_lock. The program invokes trie_delete_elem(), which
then acquires trie->lock. However, another process may be invoking
trie_update_elem() concurrently: trie_update_elem() acquires
trie->lock first, then invokes kmalloc_node(), which may call
get_partial_node() and try to acquire &n->list_lock (not necessarily
the same lock object). Therefore, lockdep warns about the circular
locking dependency.
Fix these warnings by replacing kmalloc()/kfree()/kfree_rcu() with the
equivalent bpf memory allocator APIs. Since intermediate nodes and
leaf nodes have fixed sizes, the fixed-size allocation APIs are used.
Two aspects of this change require explanation:
1. A new flag, LPM_TREE_NODE_FLAG_ALLOC_LEAF, is added to track the
original allocator. This is necessary because, after a deletion, a
leaf node may be reused as an intermediate node, and such nodes must
be freed through the leaf allocator.
2. The intermediate-node allocator and the leaf-node allocator may be
merged, because value_size for an LPM trie is usually small. Merging
reduces the memory overhead of the bpf memory allocator.
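Point 1 can be sketched as follows (a userspace model; the flag values
match the patch, everything else is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define LPM_TREE_NODE_FLAG_IM		(1U << 0)
#define LPM_TREE_NODE_FLAG_ALLOC_LEAF	(1U << 1)

enum { MA_IM, MA_LEAF };

/* On allocation, leaf nodes (those carrying a value) are tagged so
 * that a later free picks the right allocator.
 */
static unsigned int alloc_flags(bool has_value)
{
	return has_value ? LPM_TREE_NODE_FLAG_ALLOC_LEAF : 0;
}

/* On free, only the ALLOC_LEAF tag matters: a deleted leaf that was
 * demoted to an intermediate node (IM flag set later) still goes back
 * to the leaf allocator it came from.
 */
static int free_ma(unsigned int flags)
{
	return (flags & LPM_TREE_NODE_FLAG_ALLOC_LEAF) ? MA_LEAF : MA_IM;
}
```

Without the ALLOC_LEAF tag, a demoted leaf would look like an
intermediate node at free time and be returned to the wrong cache.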
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/lpm_trie.c | 90 +++++++++++++++++++++++++++++++++++--------
1 file changed, 74 insertions(+), 16 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index d447a6dab83b..d8995acecedf 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -15,23 +15,34 @@
#include <net/ipv6.h>
#include <uapi/linux/btf.h>
#include <linux/btf_ids.h>
+#include <linux/bpf_mem_alloc.h>
/* Intermediate node */
#define LPM_TREE_NODE_FLAG_IM BIT(0)
+/* Allocated as leaf node. It may be used as intermediate node after deletion */
+#define LPM_TREE_NODE_FLAG_ALLOC_LEAF BIT(1)
struct lpm_trie_node;
struct lpm_trie_node {
- struct rcu_head rcu;
struct lpm_trie_node __rcu *child[2];
u32 prefixlen;
u32 flags;
u8 data[];
};
+enum {
+ LPM_TRIE_MA_IM = 0,
+ LPM_TRIE_MA_LEAF,
+ LPM_TRIE_MA_CNT,
+};
+
struct lpm_trie {
struct bpf_map map;
struct lpm_trie_node __rcu *root;
+ struct bpf_mem_alloc ma[LPM_TRIE_MA_CNT];
+ struct bpf_mem_alloc *im_ma;
+ struct bpf_mem_alloc *leaf_ma;
size_t n_entries;
size_t max_prefixlen;
size_t data_size;
@@ -287,25 +298,21 @@ static void *trie_lookup_elem(struct bpf_map *map, void *_key)
return found->data + trie->data_size;
}
-static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
- const void *value)
+static struct lpm_trie_node *lpm_trie_node_alloc(struct lpm_trie *trie, const void *value)
{
+ struct bpf_mem_alloc *ma = value ? trie->leaf_ma : trie->im_ma;
struct lpm_trie_node *node;
- size_t size = sizeof(struct lpm_trie_node) + trie->data_size;
-
- if (value)
- size += trie->map.value_size;
- node = bpf_map_kmalloc_node(&trie->map, size, GFP_NOWAIT | __GFP_NOWARN,
- trie->map.numa_node);
+ node = bpf_mem_cache_alloc(ma);
if (!node)
return NULL;
node->flags = 0;
-
- if (value)
+ if (value) {
+ node->flags |= LPM_TREE_NODE_FLAG_ALLOC_LEAF;
memcpy(node->data + trie->data_size, value,
trie->map.value_size);
+ }
return node;
}
@@ -319,6 +326,25 @@ static int trie_check_noreplace_update(const struct lpm_trie *trie, u64 flags)
return 0;
}
+static void lpm_trie_node_free(struct lpm_trie *trie,
+ struct lpm_trie_node *node, bool defer)
+{
+ struct bpf_mem_alloc *ma;
+
+ if (!node)
+ return;
+
+ ma = (node->flags & LPM_TREE_NODE_FLAG_ALLOC_LEAF) ? trie->leaf_ma :
+ trie->im_ma;
+
+ migrate_disable();
+ if (defer)
+ bpf_mem_cache_free_rcu(ma, node);
+ else
+ bpf_mem_cache_free(ma, node);
+ migrate_enable();
+}
+
/* Called from syscall or from eBPF program */
static long trie_update_elem(struct bpf_map *map,
void *_key, void *value, u64 flags)
@@ -450,9 +476,10 @@ static long trie_update_elem(struct bpf_map *map,
out:
if (ret)
- kfree(new_node);
+ bpf_mem_cache_free(trie->leaf_ma, new_node);
+
spin_unlock_irqrestore(&trie->lock, irq_flags);
- kfree_rcu(free_node, rcu);
+ lpm_trie_node_free(trie, free_node, true);
return ret;
}
@@ -550,8 +577,8 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
out:
spin_unlock_irqrestore(&trie->lock, irq_flags);
- kfree_rcu(free_parent, rcu);
- kfree_rcu(free_node, rcu);
+ lpm_trie_node_free(trie, free_parent, true);
+ lpm_trie_node_free(trie, free_node, true);
return ret;
}
@@ -572,7 +599,9 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
static struct bpf_map *trie_alloc(union bpf_attr *attr)
{
+ size_t size, leaf_size;
struct lpm_trie *trie;
+ int err;
/* check sanity of attributes */
if (attr->max_entries == 0 ||
@@ -597,7 +626,30 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
spin_lock_init(&trie->lock);
+ size = sizeof(struct lpm_trie_node) + trie->data_size;
+ err = bpf_mem_alloc_init(&trie->ma[LPM_TRIE_MA_IM], size, false);
+ if (err)
+ goto free_out;
+ trie->im_ma = &trie->ma[LPM_TRIE_MA_IM];
+
+ leaf_size = size + trie->map.value_size;
+ if (bpf_mem_cache_is_mergeable(size, leaf_size, false)) {
+ trie->leaf_ma = trie->im_ma;
+ } else {
+ err = bpf_mem_alloc_init(&trie->ma[LPM_TRIE_MA_LEAF],
+ leaf_size, false);
+ if (err)
+ goto destroy_ma_out;
+ trie->leaf_ma = &trie->ma[LPM_TRIE_MA_LEAF];
+ }
+
return &trie->map;
+
+destroy_ma_out:
+ bpf_mem_alloc_destroy(trie->im_ma);
+free_out:
+ bpf_map_area_free(trie);
+ return ERR_PTR(err);
}
static void trie_free(struct bpf_map *map)
@@ -629,13 +681,19 @@ static void trie_free(struct bpf_map *map)
continue;
}
- kfree(node);
+ /* No bpf program may access the map, so free the
+ * node without waiting for the extra RCU GP.
+ */
+ lpm_trie_node_free(trie, node, false);
RCU_INIT_POINTER(*slot, NULL);
break;
}
}
out:
+ bpf_mem_alloc_destroy(trie->im_ma);
+ if (trie->leaf_ma != trie->im_ma)
+ bpf_mem_alloc_destroy(trie->leaf_ma);
bpf_map_area_free(trie);
}
--
2.29.2
* [PATCH bpf-next 08/10] bpf: Use raw_spinlock_t for LPM trie
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
` (6 preceding siblings ...)
2024-11-18 1:08 ` [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-18 1:08 ` [PATCH bpf-next 09/10] selftests/bpf: Move test_lpm_map.c to map_tests Hou Tao
` (3 subsequent siblings)
11 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
After switching from kmalloc() to the bpf memory allocator, there is
no blocking operation during the update of an LPM trie. Therefore,
change trie->lock from spinlock_t to raw_spinlock_t to make the LPM
trie usable in atomic context, even on RT kernels.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/lpm_trie.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index d8995acecedf..f8ff2bfcd704 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -46,7 +46,7 @@ struct lpm_trie {
size_t n_entries;
size_t max_prefixlen;
size_t data_size;
- spinlock_t lock;
+ raw_spinlock_t lock;
};
/* This trie implements a longest prefix match algorithm that can be used to
@@ -365,7 +365,7 @@ static long trie_update_elem(struct bpf_map *map,
if (key->prefixlen > trie->max_prefixlen)
return -EINVAL;
- spin_lock_irqsave(&trie->lock, irq_flags);
+ raw_spin_lock_irqsave(&trie->lock, irq_flags);
/* Allocate and fill a new node */
new_node = lpm_trie_node_alloc(trie, value);
@@ -478,7 +478,7 @@ static long trie_update_elem(struct bpf_map *map,
if (ret)
bpf_mem_cache_free(trie->leaf_ma, new_node);
- spin_unlock_irqrestore(&trie->lock, irq_flags);
+ raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
lpm_trie_node_free(trie, free_node, true);
return ret;
@@ -500,7 +500,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
if (key->prefixlen > trie->max_prefixlen)
return -EINVAL;
- spin_lock_irqsave(&trie->lock, irq_flags);
+ raw_spin_lock_irqsave(&trie->lock, irq_flags);
/* Walk the tree looking for an exact key/length match and keeping
* track of the path we traverse. We will need to know the node
@@ -576,7 +576,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
free_node = node;
out:
- spin_unlock_irqrestore(&trie->lock, irq_flags);
+ raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
lpm_trie_node_free(trie, free_parent, true);
lpm_trie_node_free(trie, free_node, true);
@@ -624,7 +624,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
offsetof(struct bpf_lpm_trie_key_u8, data);
trie->max_prefixlen = trie->data_size * 8;
- spin_lock_init(&trie->lock);
+ raw_spin_lock_init(&trie->lock);
size = sizeof(struct lpm_trie_node) + trie->data_size;
err = bpf_mem_alloc_init(&trie->ma[LPM_TRIE_MA_IM], size, false);
--
2.29.2
* [PATCH bpf-next 09/10] selftests/bpf: Move test_lpm_map.c to map_tests
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
` (7 preceding siblings ...)
2024-11-18 1:08 ` [PATCH bpf-next 08/10] bpf: Use raw_spinlock_t " Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-18 1:08 ` [PATCH bpf-next 10/10] selftests/bpf: Add more test cases for LPM trie Hou Tao
` (2 subsequent siblings)
11 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
Move test_lpm_map.c to map_tests/ to include the LPM trie test cases
in the regular test_maps run. Most code remains unchanged, including
the use of assert(). Only n_lookups is reduced from 64K to 512, which
decreases the test_lpm_map runtime from 37s to 0.7s.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
tools/testing/selftests/bpf/.gitignore | 1 -
tools/testing/selftests/bpf/Makefile | 2 +-
.../lpm_trie_map_basic_ops.c} | 10 +++-------
3 files changed, 4 insertions(+), 9 deletions(-)
rename tools/testing/selftests/bpf/{test_lpm_map.c => map_tests/lpm_trie_map_basic_ops.c} (99%)
diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index c2a1842c3d8b..e9c377001f93 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -5,7 +5,6 @@ bpf-syscall*
test_verifier
test_maps
test_lru_map
-test_lpm_map
test_tag
FEATURE-DUMP.libbpf
FEATURE-DUMP.selftests
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 6ad3b1ba1920..7eeb3cbe18c7 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -83,7 +83,7 @@ CLANG_CPUV4 := 1
endif
# Order correspond to 'make run_tests' order
-TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
+TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_progs \
test_sockmap \
test_tcpnotify_user test_sysctl \
test_progs-no_alu32
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
similarity index 99%
rename from tools/testing/selftests/bpf/test_lpm_map.c
rename to tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
index d98c72dc563e..f375c89d78a4 100644
--- a/tools/testing/selftests/bpf/test_lpm_map.c
+++ b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
@@ -223,7 +223,7 @@ static void test_lpm_map(int keysize)
n_matches = 0;
n_matches_after_delete = 0;
n_nodes = 1 << 8;
- n_lookups = 1 << 16;
+ n_lookups = 1 << 9;
data = alloca(keysize);
memset(data, 0, keysize);
@@ -770,16 +770,13 @@ static void test_lpm_multi_thread(void)
close(map_fd);
}
-int main(void)
+void test_lpm_trie_map_basic_ops(void)
{
int i;
/* we want predictable, pseudo random tests */
srand(0xf00ba1);
- /* Use libbpf 1.0 API mode */
- libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
-
test_lpm_basic();
test_lpm_order();
@@ -792,6 +789,5 @@ int main(void)
test_lpm_get_next_key();
test_lpm_multi_thread();
- printf("test_lpm: OK\n");
- return 0;
+ printf("%s: PASS\n", __func__);
}
--
2.29.2
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next 10/10] selftests/bpf: Add more test cases for LPM trie
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
` (8 preceding siblings ...)
2024-11-18 1:08 ` [PATCH bpf-next 09/10] selftests/bpf: Move test_lpm_map.c to map_tests Hou Tao
@ 2024-11-18 1:08 ` Hou Tao
2024-11-18 17:46 ` Daniel Borkmann
[not found] ` <46268aa9ef13a24388af833b17f6cef8bdd3a7be8402fec7640e65a2f1118468@mail.kernel.org>
2024-11-18 15:39 ` Sebastian Andrzej Siewior
11 siblings, 1 reply; 40+ messages in thread
From: Hou Tao @ 2024-11-18 1:08 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
From: Hou Tao <houtao1@huawei.com>
Add more test cases for LPM trie in test_maps:
1) test_lpm_trie_update_flags()
It constructs various use cases for BPF_EXIST and BPF_NOEXIST and checks
whether the return value of the update operation is as expected.
2) test_lpm_trie_reuse_leaf_node()
It tests whether the reuse of a leaf node as an intermediate node is
handled correctly when freeing the intermediate node.
3) test_lpm_trie_iterate_strs() & test_lpm_trie_iterate_ints()
These two test cases check whether iteration through get_next_key is
sorted as expected. Both delete the minimal key after each iteration and
check whether the next iteration returns the second minimal key. The only
difference between them is that the former saves strings in the LPM trie
and the latter saves integers.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
.../bpf/map_tests/lpm_trie_map_basic_ops.c | 391 ++++++++++++++++++
1 file changed, 391 insertions(+)
diff --git a/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
index f375c89d78a4..eb17c711b4c8 100644
--- a/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
+++ b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c
@@ -20,10 +20,12 @@
#include <string.h>
#include <time.h>
#include <unistd.h>
+#include <endian.h>
#include <arpa/inet.h>
#include <sys/time.h>
#include <bpf/bpf.h>
+#include <test_maps.h>
#include "bpf_util.h"
@@ -33,6 +35,22 @@ struct tlpm_node {
uint8_t key[];
};
+struct lpm_trie_bytes_key {
+ union {
+ struct bpf_lpm_trie_key_hdr hdr;
+ __u32 prefixlen;
+ };
+ unsigned char data[8];
+};
+
+struct lpm_trie_int_key {
+ union {
+ struct bpf_lpm_trie_key_hdr hdr;
+ __u32 prefixlen;
+ };
+ unsigned int data;
+};
+
static struct tlpm_node *tlpm_match(struct tlpm_node *list,
const uint8_t *key,
size_t n_bits);
@@ -770,6 +788,374 @@ static void test_lpm_multi_thread(void)
close(map_fd);
}
+static int lpm_trie_create(unsigned int key_size, unsigned int value_size, unsigned int max_entries)
+{
+ LIBBPF_OPTS(bpf_map_create_opts, opts);
+ int fd;
+
+ opts.map_flags = BPF_F_NO_PREALLOC;
+ fd = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE, "lpm_trie", key_size, value_size, max_entries,
+ &opts);
+ CHECK(fd < 0, "bpf_map_create", "error %d\n", errno);
+
+ return fd;
+}
+
+static void test_lpm_trie_update_flags(void)
+{
+ struct lpm_trie_int_key key;
+ unsigned int value, got;
+ int fd, err;
+
+ fd = lpm_trie_create(sizeof(key), sizeof(value), 3);
+
+ /* invalid flags (Error) */
+ key.prefixlen = 32;
+ key.data = 0;
+ value = 0;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_F_LOCK);
+ CHECK(err != -EINVAL, "invalid update flag", "error %d\n", err);
+
+ /* invalid flags (Error) */
+ key.prefixlen = 32;
+ key.data = 0;
+ value = 0;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST | BPF_EXIST);
+ CHECK(err != -EINVAL, "invalid update flag", "error %d\n", err);
+
+ /* overwrite an empty LPM trie (Error) */
+ key.prefixlen = 32;
+ key.data = 0;
+ value = 2;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+ CHECK(err != -ENOENT, "overwrite empty LPM trie", "error %d\n", err);
+
+ /* add a new node */
+ key.prefixlen = 16;
+ key.data = 0;
+ value = 1;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+ CHECK(err, "add new elem", "error %d\n", err);
+ got = 0;
+ err = bpf_map_lookup_elem(fd, &key, &got);
+ CHECK(err, "lookup elem", "error %d\n", err);
+ CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+ /* add the same node as new node (Error) */
+ err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+ CHECK(err != -EEXIST, "add new elem again", "error %d\n", err);
+
+ /* overwrite the existed node */
+ value = 4;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+ CHECK(err, "overwrite elem", "error %d\n", err);
+ got = 0;
+ err = bpf_map_lookup_elem(fd, &key, &got);
+ CHECK(err, "lookup elem", "error %d\n", err);
+ CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+ /* overwrite the node */
+ value = 1;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_ANY);
+ CHECK(err, "update elem", "error %d\n", err);
+ got = 0;
+ err = bpf_map_lookup_elem(fd, &key, &got);
+ CHECK(err, "lookup elem", "error %d\n", err);
+ CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+ /* overwrite a non-existent node which is the prefix of the first
+ * node (Error).
+ */
+ key.prefixlen = 8;
+ key.data = 0;
+ value = 2;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+ CHECK(err != -ENOENT, "overwrite nonexistent elem", "error %d\n", err);
+
+ /* add a new node which is the prefix of the first node */
+ err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+ CHECK(err, "add new elem", "error %d\n", err);
+ got = 0;
+ err = bpf_map_lookup_elem(fd, &key, &got);
+ CHECK(err, "lookup key", "error %d\n", err);
+ CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+ /* add another new node which will be the sibling of the first node */
+ key.prefixlen = 9;
+ key.data = htobe32(1 << 23);
+ value = 5;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+ CHECK(err, "add new elem", "error %d\n", err);
+ got = 0;
+ err = bpf_map_lookup_elem(fd, &key, &got);
+ CHECK(err, "lookup key", "error %d\n", err);
+ CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+ /* overwrite the third node */
+ value = 3;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_ANY);
+ CHECK(err, "overwrite elem", "error %d\n", err);
+ got = 0;
+ err = bpf_map_lookup_elem(fd, &key, &got);
+ CHECK(err, "lookup key", "error %d\n", err);
+ CHECK(got != value, "check value", "got %d exp %d\n", got, value);
+
+ /* delete the second node to make it an intermediate node */
+ key.prefixlen = 8;
+ key.data = 0;
+ err = bpf_map_delete_elem(fd, &key);
+ CHECK(err, "del elem", "error %d\n", err);
+
+ /* overwrite the intermediate node (Error) */
+ value = 2;
+ err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+ CHECK(err != -ENOENT, "overwrite nonexistent elem", "error %d\n", err);
+
+ close(fd);
+}
+
+static void test_lpm_trie_reuse_leaf_node(void)
+{
+ /* Use a big value_size to prevent the merging of intermediate node
+ * allocator and leaf node allocator.
+ */
+ char value[64] = {}, got[64] = {};
+ struct lpm_trie_int_key key;
+ int fd, err;
+
+ fd = lpm_trie_create(sizeof(key), sizeof(value), 3);
+
+ /* add a new node */
+ key.prefixlen = 16;
+ key.data = 0;
+ memset(value, 'a', sizeof(value));
+ err = bpf_map_update_elem(fd, &key, value, BPF_NOEXIST);
+ CHECK(err, "add new elem", "error %d\n", err);
+ memset(got, 0, sizeof(got));
+ err = bpf_map_lookup_elem(fd, &key, got);
+ CHECK(err, "lookup elem", "error %d\n", err);
+ CHECK(memcmp(value, got, sizeof(value)), "check value", "got %.*s exp %.*s\n",
+ (int)sizeof(value), value, (int)sizeof(got), got);
+
+ /* add a new node which is the prefix of the first node */
+ key.prefixlen = 8;
+ key.data = 0;
+ memset(value, 'b', sizeof(value));
+ err = bpf_map_update_elem(fd, &key, value, BPF_NOEXIST);
+ CHECK(err, "add new elem", "error %d\n", err);
+ memset(got, 0, sizeof(got));
+ err = bpf_map_lookup_elem(fd, &key, got);
+ CHECK(err, "lookup elem", "error %d\n", err);
+ CHECK(memcmp(value, got, sizeof(value)), "check value", "got %.*s exp %.*s\n",
+ (int)sizeof(value), value, (int)sizeof(got), got);
+
+ /* add another new node which will be the sibling of the first node */
+ key.prefixlen = 9;
+ key.data = htobe32(1 << 23);
+ memset(value, 'c', sizeof(value));
+ err = bpf_map_update_elem(fd, &key, value, BPF_NOEXIST);
+ CHECK(err, "add new elem", "error %d\n", err);
+ memset(got, 0, sizeof(got));
+ err = bpf_map_lookup_elem(fd, &key, got);
+ CHECK(err, "lookup elem", "error %d\n", err);
+ CHECK(memcmp(value, got, sizeof(value)), "check value", "got %.*s exp %.*s\n",
+ (int)sizeof(value), value, (int)sizeof(got), got);
+
+ /* try to add more node (Error) */
+ key.prefixlen = 32;
+ key.data = 0;
+ memset(value, 'd', sizeof(value));
+ err = bpf_map_update_elem(fd, &key, value, BPF_NOEXIST);
+ CHECK(err != -ENOSPC, "add to full trie", "error %d\n", err);
+
+ /* delete the second node to make it an intermediate node */
+ key.prefixlen = 8;
+ key.data = 0;
+ err = bpf_map_delete_elem(fd, &key);
+ CHECK(err, "del elem", "error %d\n", err);
+
+ /* delete the first node to free the intermediate node */
+ key.prefixlen = 16;
+ key.data = 0;
+ err = bpf_map_delete_elem(fd, &key);
+ CHECK(err, "del elem", "error %d\n", err);
+
+ close(fd);
+}
+
+static int cmp_str(const void *a, const void *b)
+{
+ const char *str_a = *(const char **)a, *str_b = *(const char **)b;
+
+ return strcmp(str_a, str_b);
+}
+
+/* Save strings in the LPM trie. The trailing '\0' of each string is
+ * accounted for in the prefixlen. The strings returned during iteration
+ * should come back in sorted order.
+ */
+static void test_lpm_trie_iterate_strs(void)
+{
+ static const char * const keys[] = {
+ "ab", "abc", "abo", "abS", "abcd",
+ };
+ const char *sorted_keys[ARRAY_SIZE(keys)];
+ struct lpm_trie_bytes_key key, next_key;
+ unsigned int value, got, i, j, len;
+ struct lpm_trie_bytes_key *cur;
+ int fd, err;
+
+ fd = lpm_trie_create(sizeof(key), sizeof(value), ARRAY_SIZE(keys));
+
+ for (i = 0; i < ARRAY_SIZE(keys); i++) {
+ unsigned int flags;
+
+ /* add i-th element */
+ flags = i % 2 ? BPF_NOEXIST : 0;
+ len = strlen(keys[i]);
+ /* include the trailing '\0' */
+ key.prefixlen = (len + 1) * 8;
+ memset(key.data, 0, sizeof(key.data));
+ memcpy(key.data, keys[i], len);
+ value = i + 100;
+ err = bpf_map_update_elem(fd, &key, &value, flags);
+ CHECK(err, "add elem", "#%u error %d\n", i, err);
+
+ err = bpf_map_lookup_elem(fd, &key, &got);
+ CHECK(err, "lookup elem", "#%u error %d\n", i, err);
+ CHECK(got != value, "lookup elem", "#%u expect %u got %u\n", i, value, got);
+
+ /* re-add i-th element (Error) */
+ err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+ CHECK(err != -EEXIST, "re-add elem", "#%u error %d\n", i, err);
+
+ /* Overwrite i-th element */
+ flags = i % 2 ? 0 : BPF_EXIST;
+ value = i;
+ err = bpf_map_update_elem(fd, &key, &value, flags);
+ CHECK(err, "update elem", "error %d\n", err);
+
+ /* Lookup #[0~i] elements */
+ for (j = 0; j <= i; j++) {
+ len = strlen(keys[j]);
+ key.prefixlen = (len + 1) * 8;
+ memset(key.data, 0, sizeof(key.data));
+ memcpy(key.data, keys[j], len);
+ err = bpf_map_lookup_elem(fd, &key, &got);
+ CHECK(err, "lookup elem", "#%u/%u error %d\n", i, j, err);
+ CHECK(got != j, "lookup elem", "#%u/%u expect %u got %u\n",
+ i, j, j, got);
+ }
+ }
+
+ /* Add element to a full LPM trie (Error) */
+ key.prefixlen = sizeof(key.data) * 8;
+ memset(key.data, 0, sizeof(key.data));
+ value = 0;
+ err = bpf_map_update_elem(fd, &key, &value, 0);
+ CHECK(err != -ENOSPC, "add to full LPM trie", "error %d\n", err);
+
+ /* Iterate sorted elements: no deletion */
+ memcpy(sorted_keys, keys, sizeof(keys));
+ qsort(sorted_keys, ARRAY_SIZE(sorted_keys), sizeof(sorted_keys[0]), cmp_str);
+ cur = NULL;
+ for (i = 0; i < ARRAY_SIZE(sorted_keys); i++) {
+ len = strlen(sorted_keys[i]);
+ err = bpf_map_get_next_key(fd, cur, &next_key);
+ CHECK(err, "iterate", "#%u error %d\n", i, err);
+ CHECK(next_key.prefixlen != (len + 1) * 8, "iterate",
+ "#%u invalid len %u expect %u\n",
+ i, next_key.prefixlen, (len + 1) * 8);
+ CHECK(memcmp(sorted_keys[i], next_key.data, len + 1), "iterate",
+ "#%u got %.*s exp %.*s\n", i, len, next_key.data, len, sorted_keys[i]);
+
+ cur = &next_key;
+ }
+ err = bpf_map_get_next_key(fd, cur, &next_key);
+ CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+ /* Iterate sorted elements: delete the found key after each iteration */
+ cur = NULL;
+ for (i = 0; i < ARRAY_SIZE(sorted_keys); i++) {
+ len = strlen(sorted_keys[i]);
+ err = bpf_map_get_next_key(fd, cur, &next_key);
+ CHECK(err, "iterate", "#%u error %d\n", i, err);
+ CHECK(next_key.prefixlen != (len + 1) * 8, "iterate",
+ "#%u invalid len %u expect %u\n",
+ i, next_key.prefixlen, (len + 1) * 8);
+ CHECK(memcmp(sorted_keys[i], next_key.data, len + 1), "iterate",
+ "#%u got %.*s exp %.*s\n", i, len, next_key.data, len, sorted_keys[i]);
+
+ cur = &next_key;
+
+ err = bpf_map_delete_elem(fd, cur);
+ CHECK(err, "delete", "#%u error %d\n", i, err);
+ }
+ err = bpf_map_get_next_key(fd, cur, &next_key);
+ CHECK(err != -ENOENT, "non-empty LPM trie", "error %d\n", err);
+
+ close(fd);
+}
+
+/* Use the fixed prefixlen (32) and save integers in the LPM trie. Iteration
+ * over the LPM trie returns these integers in big-endian (byte-wise) order,
+ * therefore convert them to big-endian before the update. After each
+ * iteration, delete the found key (the smallest integer) and expect the
+ * next iteration to return the second smallest number.
+ */
+static void test_lpm_trie_iterate_ints(void)
+{
+ struct lpm_trie_int_key key, next_key;
+ unsigned int i, max_entries;
+ struct lpm_trie_int_key *cur;
+ unsigned int *data_set;
+ int fd, err;
+ bool value;
+
+ max_entries = 4096;
+ data_set = calloc(max_entries, sizeof(*data_set));
+ CHECK(!data_set, "calloc", "no mem\n");
+ for (i = 0; i < max_entries; i++)
+ data_set[i] = i;
+
+ fd = lpm_trie_create(sizeof(key), sizeof(value), max_entries);
+ value = true;
+ for (i = 0; i < max_entries; i++) {
+ key.prefixlen = 32;
+ key.data = htobe32(data_set[i]);
+
+ err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+ CHECK(err, "add elem", "#%u error %d\n", i, err);
+ }
+
+ cur = NULL;
+ for (i = 0; i < max_entries; i++) {
+ err = bpf_map_get_next_key(fd, cur, &next_key);
+ CHECK(err, "iterate", "#%u error %d\n", i, err);
+ CHECK(next_key.prefixlen != 32, "iterate", "#%u invalid len %u\n",
+ i, next_key.prefixlen);
+ CHECK(be32toh(next_key.data) != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+ i, be32toh(next_key.data), data_set[i]);
+ cur = &next_key;
+
+ /*
+ * Delete the minimal key; the next call of bpf_map_get_next_key()
+ * will return the second minimal key.
+ */
+ err = bpf_map_delete_elem(fd, &next_key);
+ CHECK(err, "del elem", "#%u elem error %d\n", i, err);
+ }
+ err = bpf_map_get_next_key(fd, cur, &next_key);
+ CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+ err = bpf_map_get_next_key(fd, NULL, &next_key);
+ CHECK(err != -ENOENT, "non-empty LPM trie", "error %d\n", err);
+
+ free(data_set);
+
+ close(fd);
+}
+
void test_lpm_trie_map_basic_ops(void)
{
int i;
@@ -789,5 +1175,10 @@ void test_lpm_trie_map_basic_ops(void)
test_lpm_get_next_key();
test_lpm_multi_thread();
+ test_lpm_trie_update_flags();
+ test_lpm_trie_reuse_leaf_node();
+ test_lpm_trie_iterate_strs();
+ test_lpm_trie_iterate_ints();
+
printf("%s: PASS\n", __func__);
}
--
2.29.2
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 00/10] Fixes for LPM trie
[not found] ` <46268aa9ef13a24388af833b17f6cef8bdd3a7be8402fec7640e65a2f1118468@mail.kernel.org>
@ 2024-11-18 6:20 ` Hou Tao
2024-11-18 17:43 ` Daniel Xu
0 siblings, 1 reply; 40+ messages in thread
From: Hou Tao @ 2024-11-18 6:20 UTC (permalink / raw)
To: bot+bpf-ci; +Cc: kernel-ci, andrii, daniel, martin.lau, bpf
Hi,
On 11/18/2024 9:42 AM, bot+bpf-ci@kernel.org wrote:
> Dear patch submitter,
>
> CI has tested the following submission:
> Status: SUCCESS
> Name: [bpf-next,00/10] Fixes for LPM trie
> Patchwork: https://patchwork.kernel.org/project/netdevbpf/list/?series=910440&state=*
> Matrix: https://github.com/kernel-patches/bpf/actions/runs/11884065937
>
> No further action is necessary on your part.
>
>
> Please note: this email is coming from an unmonitored mailbox. If you have
> questions or feedback, please reach out to the Meta Kernel CI team at
> kernel-ci@meta.com.
I am curious about the reason why test_maps on s390 is disabled. If I
remember correctly, test_maps on s390 was still enabled last year [1].
[1]: https://github.com/kernel-patches/bpf/actions/runs/7164372250
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly
2024-11-18 1:08 ` [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly Hou Tao
@ 2024-11-18 13:13 ` Sebastian Andrzej Siewior
2024-11-19 1:05 ` Hou Tao
2024-11-21 10:53 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 40+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-18 13:13 UTC (permalink / raw)
To: Hou Tao
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Thomas Gleixner, Thomas Weißschuh, houtao1,
xukuohai
On 2024-11-18 09:08:02 [+0800], Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> When a LPM trie is full, in-place updates of existing elements
> incorrectly return -ENOSPC.
>
> Fix this by deferring the check of trie->n_entries. For new insertions,
> n_entries must not exceed max_entries. However, in-place updates are
> allowed even when the trie is full.
This and the previous patch look like a fix to an existing problem.
Shouldn't both have a fixes tag?
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Sebastian
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 06/10] bpf: Add bpf_mem_cache_is_mergeable() helper
2024-11-18 1:08 ` [PATCH bpf-next 06/10] bpf: Add bpf_mem_cache_is_mergeable() helper Hou Tao
@ 2024-11-18 13:29 ` Thomas Weißschuh
2024-11-19 1:06 ` Hou Tao
0 siblings, 1 reply; 40+ messages in thread
From: Thomas Weißschuh @ 2024-11-18 13:29 UTC (permalink / raw)
To: Hou Tao
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
On Mon, Nov 18, 2024 at 09:08:04AM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Add bpf_mem_cache_is_mergeable() to check whether two bpf mem allocator
> for fixed-size objects are mergeable or not. The merging could reduce
> the memory overhead of bpf mem allocator.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> include/linux/bpf_mem_alloc.h | 1 +
> kernel/bpf/memalloc.c | 12 ++++++++++++
> 2 files changed, 13 insertions(+)
>
> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
> index e45162ef59bb..faa54b9c7a04 100644
> --- a/include/linux/bpf_mem_alloc.h
> +++ b/include/linux/bpf_mem_alloc.h
> @@ -47,5 +47,6 @@ void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
> void bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
> void bpf_mem_cache_raw_free(void *ptr);
> void *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags);
> +bool bpf_mem_cache_is_mergeable(size_t size, size_t new_size, bool percpu);
>
> #endif /* _BPF_MEM_ALLOC_H */
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 889374722d0a..49dd08ad1d4f 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -1014,3 +1014,15 @@ int bpf_mem_alloc_check_size(bool percpu, size_t size)
>
> return 0;
> }
> +
> +bool bpf_mem_cache_is_mergeable(size_t size, size_t new_size, bool percpu)
> +{
> + /* Only for fixed-size object allocator */
> + if (!size || !new_size)
> + return false;
> +
> + return (percpu && ALIGN(size, PCPU_MIN_ALLOC_SIZE) ==
> + ALIGN(new_size, PCPU_MIN_ALLOC_SIZE)) ||
> + (!percpu && kmalloc_size_roundup(size + LLIST_NODE_SZ) ==
> + kmalloc_size_roundup(new_size + LLIST_NODE_SZ));
This would be easier to read:
if (percpu)
return ALIGN() == ALIGN();
else
return kmalloc_size_roundup() == kmalloc_size_roundup();
> +}
> --
> 2.29.2
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-18 1:08 ` [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
@ 2024-11-18 13:30 ` Sebastian Andrzej Siewior
2024-11-18 16:56 ` Yonghong Song
2024-11-20 1:16 ` Alexei Starovoitov
2024-11-21 11:39 ` Toke Høiland-Jørgensen
2 siblings, 1 reply; 40+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-18 13:30 UTC (permalink / raw)
To: Hou Tao
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Thomas Gleixner, Thomas Weißschuh, houtao1,
xukuohai
On 2024-11-18 09:08:05 [+0800], Hou Tao wrote:
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index d447a6dab83b..d8995acecedf 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -319,6 +326,25 @@ static int trie_check_noreplace_update(const struct lpm_trie *trie, u64 flags)
> return 0;
> }
>
> +static void lpm_trie_node_free(struct lpm_trie *trie,
> + struct lpm_trie_node *node, bool defer)
> +{
> + struct bpf_mem_alloc *ma;
> +
> + if (!node)
> + return;
> +
> + ma = (node->flags & LPM_TREE_NODE_FLAG_ALLOC_LEAF) ? trie->leaf_ma :
> + trie->im_ma;
> +
> + migrate_disable();
> + if (defer)
> + bpf_mem_cache_free_rcu(ma, node);
> + else
> + bpf_mem_cache_free(ma, node);
> + migrate_enable();
I guess a preempt_disable() here instead wouldn't hurt much. The inner
pieces of the allocator (unit_free()) does local_irq_save() for the
entire function so we don't win much with migrate_disable().
> +}
> +
Sebastian
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
2024-11-18 1:08 ` [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
@ 2024-11-18 13:39 ` Thomas Weißschuh
2024-11-19 1:08 ` Hou Tao
2024-11-21 10:32 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 40+ messages in thread
From: Thomas Weißschuh @ 2024-11-18 13:39 UTC (permalink / raw)
To: Hou Tao
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
On Mon, Nov 18, 2024 at 09:08:01AM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> There is exact match during the update of LPM trie, therefore, add the
> missed handling for BPF_EXIST and BPF_NOEXIST flags.
"There is" can be interpreted as "this can be true" and "this will
always be true".
Maybe:
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST
flags, as these can be specified by users.
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> kernel/bpf/lpm_trie.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index c6f036e3044b..4300bd51ec6e 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -375,6 +375,10 @@ static long trie_update_elem(struct bpf_map *map,
> * simply assign the @new_node to that slot and be done.
> */
> if (!node) {
> + if (flags == BPF_EXIST) {
> + ret = -ENOENT;
> + goto out;
> + }
> rcu_assign_pointer(*slot, new_node);
> goto out;
> }
> @@ -383,18 +387,31 @@ static long trie_update_elem(struct bpf_map *map,
> * which already has the correct data array set.
> */
> if (node->prefixlen == matchlen) {
> + if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) {
> + if (flags == BPF_NOEXIST) {
> + ret = -EEXIST;
> + goto out;
> + }
> + trie->n_entries--;
> + } else if (flags == BPF_EXIST) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> new_node->child[0] = node->child[0];
> new_node->child[1] = node->child[1];
>
> - if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
> - trie->n_entries--;
> -
> rcu_assign_pointer(*slot, new_node);
> free_node = node;
>
> goto out;
> }
>
> + if (flags == BPF_EXIST) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> /* If the new node matches the prefix completely, it must be inserted
> * as an ancestor. Simply insert it between @node and *@slot.
> */
> --
> 2.29.2
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 00/10] Fixes for LPM trie
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
` (10 preceding siblings ...)
[not found] ` <46268aa9ef13a24388af833b17f6cef8bdd3a7be8402fec7640e65a2f1118468@mail.kernel.org>
@ 2024-11-18 15:39 ` Sebastian Andrzej Siewior
2024-11-19 1:35 ` Hou Tao
11 siblings, 1 reply; 40+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-18 15:39 UTC (permalink / raw)
To: Hou Tao
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Thomas Gleixner, Thomas Weißschuh, houtao1,
xukuohai
On 2024-11-18 09:07:58 [+0800], Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
Hi,
> Please see individual patches for more details. Comments are always
> welcome.
This might be a coincidence but it seems I get
| helper_fill_hashmap(282):FAIL:can't update hashmap err: Unknown error -12
| test_maps: test_maps.c:1379: __run_parallel: Assertion `status == 0' failed.
more often with the series when I do ./test_maps. I never managed to
pass the test with the series applied, while it passed on v6.12. I'm not
blaming the series, just pointing this out in case it is already known…
In 08/10 you switch the locks to raw_spinlock_t. I was a little worried
that a lot of elements will make the while() loop go for a long time. Is
there a test for this? While running "test_progs -a map_kptr" I noticed
something else…
Sebastian
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-18 13:30 ` Sebastian Andrzej Siewior
@ 2024-11-18 16:56 ` Yonghong Song
0 siblings, 0 replies; 40+ messages in thread
From: Yonghong Song @ 2024-11-18 16:56 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Hou Tao
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
On 11/18/24 5:30 AM, Sebastian Andrzej Siewior wrote:
> On 2024-11-18 09:08:05 [+0800], Hou Tao wrote:
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index d447a6dab83b..d8995acecedf 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -319,6 +326,25 @@ static int trie_check_noreplace_update(const struct lpm_trie *trie, u64 flags)
>> return 0;
>> }
>>
>> +static void lpm_trie_node_free(struct lpm_trie *trie,
>> + struct lpm_trie_node *node, bool defer)
>> +{
>> + struct bpf_mem_alloc *ma;
>> +
>> + if (!node)
>> + return;
>> +
>> + ma = (node->flags & LPM_TREE_NODE_FLAG_ALLOC_LEAF) ? trie->leaf_ma :
>> + trie->im_ma;
>> +
>> + migrate_disable();
>> + if (defer)
>> + bpf_mem_cache_free_rcu(ma, node);
>> + else
>> + bpf_mem_cache_free(ma, node);
>> + migrate_enable();
> I guess a preempt_disable() here instead wouldn't hurt much. The inner
> pieces of the allocator (unit_free()) does local_irq_save() for the
> entire function so we don't win much with migrate_disable().
Typically, bpf_mem_*() functions are surrounded directly
or indirectly by migrate_disable/enable. Let us just keep
this pattern to be consistent with other similar usage.
One close example is in kernel/bpf/cpumask.c.
>
>> +}
>> +
> Sebastian
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 00/10] Fixes for LPM trie
2024-11-18 6:20 ` [PATCH bpf-next 00/10] Fixes " Hou Tao
@ 2024-11-18 17:43 ` Daniel Xu
2024-11-19 1:09 ` Hou Tao
0 siblings, 1 reply; 40+ messages in thread
From: Daniel Xu @ 2024-11-18 17:43 UTC (permalink / raw)
To: Hou Tao, bot+bpf-ci@kernel.org
Cc: kernel-ci, andrii@kernel.org, daniel@iogearbox.net,
martin.lau@linux.dev, bpf
Hi Hou,
On 11/17/24 22:20, Hou Tao wrote:
> Hi,
>
> On 11/18/2024 9:42 AM, bot+bpf-ci@kernel.org wrote:
>> Dear patch submitter,
>>
>> CI has tested the following submission:
>> Status: SUCCESS
>> Name: [bpf-next,00/10] Fixes for LPM trie
>> Patchwork: https://patchwork.kernel.org/project/netdevbpf/list/?series=910440&state=*
>> Matrix: https://github.com/kernel-patches/bpf/actions/runs/11884065937
>>
>> No further action is necessary on your part.
>>
>>
>> Please note: this email is coming from an unmonitored mailbox. If you have
>> questions or feedback, please reach out to the Meta Kernel CI team at
>> kernel-ci@meta.com.
> I am curious about the reason why test_maps on s390 is disabled. If I
> remember correctly, test_maps on s390 was still enabled last year [1].
>
> [1]: https://github.com/kernel-patches/bpf/actions/runs/7164372250
>
It was disabled in
https://github.com/kernel-patches/vmtest/commit/5489ef2bc1fed11e841d2d3485ab80ce4b390d94
.
I recall it was kinda flakey and low signal.
Thanks,
Daniel
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 10/10] selftests/bpf: Add more test cases for LPM trie
2024-11-18 1:08 ` [PATCH bpf-next 10/10] selftests/bpf: Add more test cases for LPM trie Hou Tao
@ 2024-11-18 17:46 ` Daniel Borkmann
2024-11-19 1:10 ` Hou Tao
0 siblings, 1 reply; 40+ messages in thread
From: Daniel Borkmann @ 2024-11-18 17:46 UTC (permalink / raw)
To: Hou Tao, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend,
Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
houtao1, xukuohai
Hi Hou,
On 11/18/24 2:08 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Add more test cases for LPM trie in test_maps:
>
> 1) test_lpm_trie_update_flags()
> It constructs various use cases for BPF_EXIST and BPF_NOEXIST and checks
> whether the return value of the update operation is as expected.
> 2) test_lpm_trie_reuse_leaf_node
> It tests whether the reuse of a leaf node as an intermediate node is
> handled correctly when freeing the intermediate node.
> 3) test_lpm_trie_iterate_strs & test_lpm_trie_iterate_ints
> These two test cases check whether the iteration through get_next_key
> is sorted as expected. Both delete the minimal key after each iteration
> and check whether the next iteration returns the second minimal key.
> The only difference between the two is that the former saves strings in
> the LPM trie while the latter saves integers.
Just to clarify, the added tests in 3) exercise the fix in patch 5, correct?
Thanks,
Daniel
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly
2024-11-18 13:13 ` Sebastian Andrzej Siewior
@ 2024-11-19 1:05 ` Hou Tao
0 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-19 1:05 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Thomas Gleixner, Thomas Weißschuh, houtao1,
xukuohai
On 11/18/2024 9:13 PM, Sebastian Andrzej Siewior wrote:
> On 2024-11-18 09:08:02 [+0800], Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When a LPM trie is full, in-place updates of existing elements
>> incorrectly return -ENOSPC.
>>
>> Fix this by deferring the check of trie->n_entries. For new insertions,
>> n_entries must not exceed max_entries. However, in-place updates are
>> allowed even when the trie is full.
> This and the previous patch look like a fix to an existing problem.
> Shouldn't both have a fixes tag?
Will add in v2. Thanks for the suggestion.
>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> Sebastian
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 06/10] bpf: Add bpf_mem_cache_is_mergeable() helper
2024-11-18 13:29 ` Thomas Weißschuh
@ 2024-11-19 1:06 ` Hou Tao
0 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-19 1:06 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hi,
On 11/18/2024 9:29 PM, Thomas Weißschuh wrote:
> On Mon, Nov 18, 2024 at 09:08:04AM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Add bpf_mem_cache_is_mergeable() to check whether two bpf mem allocator
>> for fixed-size objects are mergeable or not. The merging could reduce
>> the memory overhead of bpf mem allocator.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>> include/linux/bpf_mem_alloc.h | 1 +
>> kernel/bpf/memalloc.c | 12 ++++++++++++
>> 2 files changed, 13 insertions(+)
>>
>> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
>> index e45162ef59bb..faa54b9c7a04 100644
>> --- a/include/linux/bpf_mem_alloc.h
>> +++ b/include/linux/bpf_mem_alloc.h
>> @@ -47,5 +47,6 @@ void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
>> void bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
>> void bpf_mem_cache_raw_free(void *ptr);
>> void *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags);
>> +bool bpf_mem_cache_is_mergeable(size_t size, size_t new_size, bool percpu);
>>
>> #endif /* _BPF_MEM_ALLOC_H */
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 889374722d0a..49dd08ad1d4f 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -1014,3 +1014,15 @@ int bpf_mem_alloc_check_size(bool percpu, size_t size)
>>
>> return 0;
>> }
>> +
>> +bool bpf_mem_cache_is_mergeable(size_t size, size_t new_size, bool percpu)
>> +{
>> + /* Only for fixed-size object allocator */
>> + if (!size || !new_size)
>> + return false;
>> +
>> + return (percpu && ALIGN(size, PCPU_MIN_ALLOC_SIZE) ==
>> + ALIGN(new_size, PCPU_MIN_ALLOC_SIZE)) ||
>> + (!percpu && kmalloc_size_roundup(size + LLIST_NODE_SZ) ==
>> + kmalloc_size_roundup(new_size + LLIST_NODE_SZ));
> This would be easier to read:
>
> if (percpu)
> return ALIGN() == ALIGN();
> else
> return kmalloc_size_roundup() == kmalloc_size_roundup();
Indeed. Thanks for the suggestion. Will do in v2.
>> +}
>> --
>> 2.29.2
>>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
2024-11-18 13:39 ` Thomas Weißschuh
@ 2024-11-19 1:08 ` Hou Tao
0 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-19 1:08 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hi Thomas,
On 11/18/2024 9:39 PM, Thomas Weißschuh wrote:
> On Mon, Nov 18, 2024 at 09:08:01AM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> There is exact match during the update of LPM trie, therefore, add the
>> missed handling for BPF_EXIST and BPF_NOEXIST flags.
> "There is" can be interpreted as "this can be true" and "this will
> always be true".
>
> Maybe:
>
> Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST
> flags, as these can be specified by users.
Will fix it in v2. Thanks.
>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>> kernel/bpf/lpm_trie.c | 23 ++++++++++++++++++++---
>> 1 file changed, 20 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index c6f036e3044b..4300bd51ec6e 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -375,6 +375,10 @@ static long trie_update_elem(struct bpf_map *map,
>> * simply assign the @new_node to that slot and be done.
>> */
>> if (!node) {
>> + if (flags == BPF_EXIST) {
>> + ret = -ENOENT;
>> + goto out;
>> + }
>> rcu_assign_pointer(*slot, new_node);
>> goto out;
>> }
>> @@ -383,18 +387,31 @@ static long trie_update_elem(struct bpf_map *map,
>> * which already has the correct data array set.
>> */
>> if (node->prefixlen == matchlen) {
>> + if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) {
>> + if (flags == BPF_NOEXIST) {
>> + ret = -EEXIST;
>> + goto out;
>> + }
>> + trie->n_entries--;
>> + } else if (flags == BPF_EXIST) {
>> + ret = -ENOENT;
>> + goto out;
>> + }
>> +
>> new_node->child[0] = node->child[0];
>> new_node->child[1] = node->child[1];
>>
>> - if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
>> - trie->n_entries--;
>> -
>> rcu_assign_pointer(*slot, new_node);
>> free_node = node;
>>
>> goto out;
>> }
>>
>> + if (flags == BPF_EXIST) {
>> + ret = -ENOENT;
>> + goto out;
>> + }
>> +
>> /* If the new node matches the prefix completely, it must be inserted
>> * as an ancestor. Simply insert it between @node and *@slot.
>> */
>> --
>> 2.29.2
>>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 00/10] Fixes for LPM trie
2024-11-18 17:43 ` Daniel Xu
@ 2024-11-19 1:09 ` Hou Tao
0 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-19 1:09 UTC (permalink / raw)
To: Daniel Xu, bot+bpf-ci@kernel.org
Cc: kernel-ci, andrii@kernel.org, daniel@iogearbox.net,
martin.lau@linux.dev, bpf
Hi,
On 11/19/2024 1:43 AM, Daniel Xu wrote:
> Hi Hou,
>
> On 11/17/24 22:20, Hou Tao wrote:
>> Hi,
>>
>> On 11/18/2024 9:42 AM, bot+bpf-ci@kernel.org wrote:
>>> Dear patch submitter,
>>>
>>> CI has tested the following submission:
>>> Status: SUCCESS
>>> Name: [bpf-next,00/10] Fixes for LPM trie
>>> Patchwork: https://patchwork.kernel.org/project/netdevbpf/list/?series=910440&state=*
>>> Matrix: https://github.com/kernel-patches/bpf/actions/runs/11884065937
>>>
>>> No further action is necessary on your part.
>>>
>>>
>>> Please note: this email is coming from an unmonitored mailbox. If you have
>>> questions or feedback, please reach out to the Meta Kernel CI team at
>>> kernel-ci@meta.com.
>> I am curious about the reason why test_maps on s390 is disabled. If I
>> remember correctly, test_maps on s390 was still enabled last year [1].
>>
>> [1]: https://github.com/kernel-patches/bpf/actions/runs/7164372250
>>
> It was disabled in
> https://github.com/kernel-patches/vmtest/commit/5489ef2bc1fed11e841d2d3485ab80ce4b390d94
> .
>
> I recall it was kinda flaky and low-signal.
I see. Thanks for the information.
>
> Thanks,
>
> Daniel
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 10/10] selftests/bpf: Add more test cases for LPM trie
2024-11-18 17:46 ` Daniel Borkmann
@ 2024-11-19 1:10 ` Hou Tao
0 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-19 1:10 UTC (permalink / raw)
To: Daniel Borkmann, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend,
Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
houtao1, xukuohai
Hi,
On 11/19/2024 1:46 AM, Daniel Borkmann wrote:
> Hi Hou,
>
> On 11/18/24 2:08 AM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Add more test cases for LPM trie in test_maps:
>>
>> 1) test_lpm_trie_update_flags()
>> It constructs various use cases for BPF_EXIST and BPF_NOEXIST and checks
>> whether the return value of the update operation is as expected.
>> 2) test_lpm_trie_reuse_leaf_node
>> It tests whether the reuse of a leaf node as an intermediate node is
>> handled correctly when freeing the intermediate node.
>> 3) test_lpm_trie_iterate_strs & test_lpm_trie_iterate_ints
>> These two test cases check whether the iteration through get_next_key
>> is sorted as expected. Both delete the minimal key after each iteration
>> and check whether the next iteration returns the second minimal key.
>> The only difference between the two is that the former saves strings in
>> the LPM trie while the latter saves integers.
>
> Just to clarify, the added tests in 3) exercise the fix in patch 5,
> correct?
Yes. Sorry for failing to mention it in the commit message. Will update
the commit message accordingly in the next revision.
>
> Thanks,
> Daniel
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 00/10] Fixes for LPM trie
2024-11-18 15:39 ` Sebastian Andrzej Siewior
@ 2024-11-19 1:35 ` Hou Tao
2024-11-19 14:15 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 40+ messages in thread
From: Hou Tao @ 2024-11-19 1:35 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Thomas Gleixner, Thomas Weißschuh, houtao1,
xukuohai
Hi Sebastian,
On 11/18/2024 11:39 PM, Sebastian Andrzej Siewior wrote:
> On 2024-11-18 09:07:58 [+0800], Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Hi,
> Hi,
>
>> Please see individual patches for more details. Comments are always
>> welcome.
> This might be a coincidence but it seems I get
>
> | helper_fill_hashmap(282):FAIL:can't update hashmap err: Unknown error -12
> | test_maps: test_maps.c:1379: __run_parallel: Assertion `status == 0' failed.
>
> more often with this series when I run ./test_maps. I never managed to
> pass the test with the series applied, while it passed on v6.12. I'm not
> blaming the series, just pointing this out in case it is already known…
Thanks for the information. Error 12 is ENOMEM, so the hash map failed
to allocate an element. There are multiple possible reasons for the ENOMEM:
1) something is wrong in the bpf memory allocator, e.g., it could not
refill the free list in time. This is possible when running under RT,
because the irq work is threaded under RT.
2) the series causes a shortage of memory (e.g., it allocates a lot of
memory and then frees it, but the reclaim of that memory is slow).
Could you please share the kernel config file and the VM setup, so I
could try to reproduce the problem?
>
> In 08/10 you switch the locks to raw_spinlock_t. I was a little worried
> that a lot of elements will make the while() loop go for a long time. Is
> there a test for this? I run into "test_progs -a map_kptr" and noticed
> something else…
The concern is the possibility of a hard lockup, right? The total time
used for an update or deletion is determined by the max_prefixlen. The
typical use case will use 32 or 128 as the max_prefixlen. The max value
of max_prefixlen is LPM_DATA_SIZE_MAX * 8 = 2048, so I think the loop
time will be fine. I will try to construct some test cases for it.
>
> Sebastian
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 00/10] Fixes for LPM trie
2024-11-19 1:35 ` Hou Tao
@ 2024-11-19 14:15 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 40+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-19 14:15 UTC (permalink / raw)
To: Hou Tao
Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Thomas Gleixner, Thomas Weißschuh, houtao1,
xukuohai
On 2024-11-19 09:35:51 [+0800], Hou Tao wrote:
> Hi Sebastian,
Hi Hou,
> On 11/18/2024 11:39 PM, Sebastian Andrzej Siewior wrote:
> > On 2024-11-18 09:07:58 [+0800], Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Hi,
> > Hi,
> >
> >> Please see individual patches for more details. Comments are always
> >> welcome.
> > This might be a coincidence but it seems I get
> >
> > | helper_fill_hashmap(282):FAIL:can't update hashmap err: Unknown error -12
> > | test_maps: test_maps.c:1379: __run_parallel: Assertion `status == 0' failed.
> >
> > more often with this series when I run ./test_maps. I never managed to
> > pass the test with the series applied, while it passed on v6.12. I'm not
> > blaming the series, just pointing this out in case it is already known…
>
> Thanks for the information. Error 12 is ENOMEM, so the hash map failed
> to allocate an element. There are multiple possible reasons for the ENOMEM:
>
> 1) something is wrong in the bpf memory allocator, e.g., it could not
> refill the free list in time. This is possible when running under RT,
> because the irq work is threaded under RT.
Right, I forgot to switch that one off. I had it on for the initial test…
> 2) the series causes a shortage of memory (e.g., it allocates a lot of
> memory and then frees it, but the reclaim of that memory is slow).
> Could you please share the kernel config file and the VM setup, so I
> could try to reproduce the problem?
I very much think this is due to the RT switch. The irq work and the
test case run on different CPUs, so…
> > In 08/10 you switch the locks to raw_spinlock_t. I was a little worried
> > that a lot of elements will make the while() loop go for a long time. Is
> > there a test for this? I run into "test_progs -a map_kptr" and noticed
> > something else…
>
> The concern is the possibility of a hard lockup, right? The total time
Not a lockup, but spending a "visible amount of time" on the lookup.
> used for an update or deletion is determined by the max_prefixlen. The
> typical use case will use 32 or 128 as the max_prefixlen. The max value
> of max_prefixlen is LPM_DATA_SIZE_MAX * 8 = 2048, so I think the loop
> time will be fine. I will try to construct some test cases for it.
Okay, thank you.
Sebastian
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-18 1:08 ` [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
2024-11-18 13:30 ` Sebastian Andrzej Siewior
@ 2024-11-20 1:16 ` Alexei Starovoitov
2024-11-21 1:20 ` Hou Tao
2024-11-21 11:39 ` Toke Høiland-Jørgensen
2 siblings, 1 reply; 40+ messages in thread
From: Alexei Starovoitov @ 2024-11-20 1:16 UTC (permalink / raw)
To: Hou Tao
Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend,
Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
Hou Tao, Xu Kuohai
On Sun, Nov 17, 2024 at 4:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
>
> +enum {
> + LPM_TRIE_MA_IM = 0,
> + LPM_TRIE_MA_LEAF,
> + LPM_TRIE_MA_CNT,
> +};
> +
> struct lpm_trie {
> struct bpf_map map;
> struct lpm_trie_node __rcu *root;
> + struct bpf_mem_alloc ma[LPM_TRIE_MA_CNT];
> + struct bpf_mem_alloc *im_ma;
> + struct bpf_mem_alloc *leaf_ma;
We cannot use bpf_ma-s liberally like that.
Freelists are not huge, but we shouldn't be adding new bpf_ma
in every map and every use case.
bpf_mem_cache_is_mergeable() in the previous patch also
leaks implementation details.
Can you use bpf_global_ma for all nodes?
pw-bot: cr
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-20 1:16 ` Alexei Starovoitov
@ 2024-11-21 1:20 ` Hou Tao
2024-11-23 3:29 ` Alexei Starovoitov
0 siblings, 1 reply; 40+ messages in thread
From: Hou Tao @ 2024-11-21 1:20 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend,
Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
Hou Tao, Xu Kuohai
Hi Alexei,
On 11/20/2024 9:16 AM, Alexei Starovoitov wrote:
> On Sun, Nov 17, 2024 at 4:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>> +enum {
>> + LPM_TRIE_MA_IM = 0,
>> + LPM_TRIE_MA_LEAF,
>> + LPM_TRIE_MA_CNT,
>> +};
>> +
>> struct lpm_trie {
>> struct bpf_map map;
>> struct lpm_trie_node __rcu *root;
>> + struct bpf_mem_alloc ma[LPM_TRIE_MA_CNT];
>> + struct bpf_mem_alloc *im_ma;
>> + struct bpf_mem_alloc *leaf_ma;
> We cannot use bpf_ma-s liberally like that.
> Freelists are not huge, but we shouldn't be adding new bpf_ma
> in every map and every use case.
>
> bpf_mem_cache_is_mergeable() in the previous patch also
> leaks implementation details.
>
> Can you use bpf_global_ma for all nodes?
Will try. However, there are two main differences between bpf_global_ma
and a map-specific bpf_mem_alloc. The first is the memory accounting
problem. All memory allocated from bpf_global_ma is accounted to the
root memory cgroup instead of the current memory cgroup (due to the
return value of get_memcg()). I think we could fix this partially by
returning NULL instead of root_mem_cgroup when c->objcg is NULL.
However, even with that fix, the memory accounting is still inaccurate,
because the pre-allocated objects may be used by maps other than the one
that triggered the pre-allocation. The second difference is the freeing
of objects when destroying a map. With a map-specific bpf_mem_alloc,
most freed objects can be returned to the slub immediately. That is not
true for bpf_global_ma, because we cannot tell whether an object belongs
to a to-be-freed map or not, and we also cannot drain bpf_global_ma the
way we drain a per-map bpf_mem_alloc.
>
> pw-bot: cr
> .
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 01/10] bpf: Remove unnecessary check when updating LPM trie
2024-11-18 1:07 ` [PATCH bpf-next 01/10] bpf: Remove unnecessary check when updating " Hou Tao
@ 2024-11-21 10:22 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 40+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-11-21 10:22 UTC (permalink / raw)
To: Hou Tao, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hou Tao <houtao@huaweicloud.com> writes:
> From: Hou Tao <houtao1@huawei.com>
>
> When "node->prefixlen == matchlen" is true, it means that the node is
> fully matched. If "node->prefixlen == key->prefixlen" is false, it means
> the prefix length of the key is greater than the prefix length of the
> node; otherwise, matchlen would not be equal to node->prefixlen. However, it
> also implies that the prefix length of node must be less than
> max_prefixlen.
>
> Therefore, "node->prefixlen == trie->max_prefixlen" will always be false
> when the check of "node->prefixlen == key->prefixlen" returns false.
> Remove this unnecessary comparison.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 02/10] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
2024-11-18 1:08 ` [PATCH bpf-next 02/10] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem Hou Tao
@ 2024-11-21 10:25 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 40+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-11-21 10:25 UTC (permalink / raw)
To: Hou Tao, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hou Tao <houtao@huaweicloud.com> writes:
> From: Hou Tao <houtao1@huawei.com>
>
> There is no need to call kfree(im_node) when updating element fails,
> because im_node must be NULL. Remove the unnecessary kfree() for
> im_node.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
2024-11-18 1:08 ` [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
2024-11-18 13:39 ` Thomas Weißschuh
@ 2024-11-21 10:32 ` Toke Høiland-Jørgensen
1 sibling, 0 replies; 40+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-11-21 10:32 UTC (permalink / raw)
To: Hou Tao, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hou Tao <houtao@huaweicloud.com> writes:
> From: Hou Tao <houtao1@huawei.com>
>
> There is exact match during the update of LPM trie, therefore, add the
> missed handling for BPF_EXIST and BPF_NOEXIST flags.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly
2024-11-18 1:08 ` [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly Hou Tao
2024-11-18 13:13 ` Sebastian Andrzej Siewior
@ 2024-11-21 10:53 ` Toke Høiland-Jørgensen
2024-11-22 2:06 ` Hou Tao
1 sibling, 1 reply; 40+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-11-21 10:53 UTC (permalink / raw)
To: Hou Tao, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hou Tao <houtao@huaweicloud.com> writes:
> From: Hou Tao <houtao1@huawei.com>
>
> When a LPM trie is full, in-place updates of existing elements
> incorrectly return -ENOSPC.
>
> Fix this by deferring the check of trie->n_entries. For new insertions,
> n_entries must not exceed max_entries. However, in-place updates are
> allowed even when the trie is full.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> kernel/bpf/lpm_trie.c | 46 +++++++++++++++++++++----------------------
> 1 file changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 4300bd51ec6e..ff57e35357ae 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -310,6 +310,15 @@ static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
> return node;
> }
>
> +static int trie_check_noreplace_update(const struct lpm_trie *trie, u64 flags)
I think this function name is hard to parse (took me a few tries). How
about trie_check_add_entry() instead?
> +{
> + if (flags == BPF_EXIST)
> + return -ENOENT;
> + if (trie->n_entries == trie->map.max_entries)
> + return -ENOSPC;
The calls to this function are always paired with a trie->n_entries++; -
so how about moving that into the function after the checks? You'll have
to then add a decrement if the im_node allocation fails, but I think
that is still clearer than having the n_entries++ statements scattered
around the function.
-Toke
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 05/10] bpf: Fix exact match conditions in trie_get_next_key()
2024-11-18 1:08 ` [PATCH bpf-next 05/10] bpf: Fix exact match conditions in trie_get_next_key() Hou Tao
@ 2024-11-21 11:01 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 40+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-11-21 11:01 UTC (permalink / raw)
To: Hou Tao, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hou Tao <houtao@huaweicloud.com> writes:
> From: Hou Tao <houtao1@huawei.com>
>
> trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
> an exact match, However, it is incorrect because when the target key
> doesn't fully match the found node (e.g., node->prefixlen != matchlen),
> these two nodes may also have the same prefixlen. It will return the
> expected result when the passed key exists in the trie. However, when a
> recently-deleted or nonexistent key is passed to
> trie_get_next_key(), it may skip keys and return an incorrect result.
>
> Fix it by using node->prefixlen == matchlen to identify exact matches.
> When the condition is true after the search, it also implies
> node->prefixlen equals key->prefixlen, otherwise, the search would
> return NULL instead.
>
> Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-18 1:08 ` [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
2024-11-18 13:30 ` Sebastian Andrzej Siewior
2024-11-20 1:16 ` Alexei Starovoitov
@ 2024-11-21 11:39 ` Toke Høiland-Jørgensen
2024-11-21 11:52 ` Thomas Weißschuh
2 siblings, 1 reply; 40+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-11-21 11:39 UTC (permalink / raw)
To: Hou Tao, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hou Tao <houtao@huaweicloud.com> writes:
> Fix these warnings by replacing kmalloc()/kfree()/kfree_rcu() with
> equivalent bpf memory allocator APIs. Since intermediate node and leaf
> node have fixed sizes, fixed-size allocation APIs are used.
>
> Two aspects of this change require explanation:
>
> 1. A new flag LPM_TREE_NODE_FLAG_ALLOC_LEAF is added to track the
> original allocator. This is necessary because during deletion, a leaf
> node may be used as an intermediate node. These nodes must be freed
> through the leaf allocator.
> 2. The intermediate node allocator and leaf node allocator may be merged
> because value_size for LPM trie is usually small. The merging reduces
> the memory overhead of bpf memory allocator.
This seems like an awfully complicated way to fix this. Couldn't we just
move the node allocations in trie_update_elem() out so they happen
before the trie lock is taken instead?
-Toke
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-21 11:39 ` Toke Høiland-Jørgensen
@ 2024-11-21 11:52 ` Thomas Weißschuh
2024-11-21 12:50 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 40+ messages in thread
From: Thomas Weißschuh @ 2024-11-21 11:52 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Hou Tao, bpf, Martin KaFai Lau, Alexei Starovoitov,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Hao Luo,
Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
Jiri Olsa, John Fastabend, Sebastian Andrzej Siewior,
Thomas Gleixner, Thomas Weißschuh, houtao1, xukuohai
On Thu, Nov 21, 2024 at 12:39:08PM +0100, Toke Høiland-Jørgensen wrote:
> Hou Tao <houtao@huaweicloud.com> writes:
>
> > Fix these warnings by replacing kmalloc()/kfree()/kfree_rcu() with
> > equivalent bpf memory allocator APIs. Since intermediate node and leaf
> > node have fixed sizes, fixed-size allocation APIs are used.
> >
> > Two aspects of this change require explanation:
> >
> > 1. A new flag LPM_TREE_NODE_FLAG_ALLOC_LEAF is added to track the
> > original allocator. This is necessary because during deletion, a leaf
> > node may be used as an intermediate node. These nodes must be freed
> > through the leaf allocator.
> > 2. The intermediate node allocator and leaf node allocator may be merged
> > because value_size for LPM trie is usually small. The merging reduces
> > the memory overhead of bpf memory allocator.
>
> This seems like an awfully complicated way to fix this. Couldn't we just
> move the node allocations in trie_update_elem() out so they happen
> before the trie lock is taken instead?
The problematic lock nesting is not between the trie lock and the
allocator lock but between each of them and any other lock in the kernel.
BPF programs can be called from any context through tracepoints.
In this specific case the issue was a tracepoint executed under the
workqueue lock.
Thomas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-21 11:52 ` Thomas Weißschuh
@ 2024-11-21 12:50 ` Toke Høiland-Jørgensen
2024-11-22 3:36 ` Hou Tao
0 siblings, 1 reply; 40+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-11-21 12:50 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: Hou Tao, bpf, Martin KaFai Lau, Alexei Starovoitov,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Hao Luo,
Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
Jiri Olsa, John Fastabend, Sebastian Andrzej Siewior,
Thomas Gleixner, Thomas Weißschuh, houtao1, xukuohai
Thomas Weißschuh <thomas.weissschuh@linutronix.de> writes:
> On Thu, Nov 21, 2024 at 12:39:08PM +0100, Toke Høiland-Jørgensen wrote:
>> Hou Tao <houtao@huaweicloud.com> writes:
>>
>> > Fix these warnings by replacing kmalloc()/kfree()/kfree_rcu() with
>> > equivalent bpf memory allocator APIs. Since intermediate node and leaf
>> > node have fixed sizes, fixed-size allocation APIs are used.
>> >
>> > Two aspects of this change require explanation:
>> >
>> > 1. A new flag LPM_TREE_NODE_FLAG_ALLOC_LEAF is added to track the
>> > original allocator. This is necessary because during deletion, a leaf
>> > node may be used as an intermediate node. These nodes must be freed
>> > through the leaf allocator.
>> > 2. The intermediate node allocator and leaf node allocator may be merged
>> > because value_size for LPM trie is usually small. The merging reduces
>> > the memory overhead of bpf memory allocator.
>>
>> This seems like an awfully complicated way to fix this. Couldn't we just
>> move the node allocations in trie_update_elem() out so they happen
>> before the trie lock is taken instead?
>
> The problematic lock nesting is not between the trie lock and the
> allocator lock but between each of them and any other lock in the kernel.
> BPF programs can be called from any context through tracepoints.
> In this specific case the issue was a tracepoint executed under the
> workqueue lock.
That is not the issue described in the commit message, though. If the
goal is to make the lpm_trie map usable in any context, the commit
message should be rewritten to reflect this, instead of mentioning a
specific deadlock between the trie lock and the allocator lock.
And in that case, I think it's better to use a single 'struct
bpf_mem_alloc' per map (like hashmaps do). This will waste some memory
for intermediate nodes, but that seems like an acceptable trade-off to
avoid all the complexity around two different allocators.
Not sure if Alexei's comment about too many allocators was aimed solely
at this, or if he has issues even with having a single allocator per map
as well; but in that case, that seems like something that should be
fixed for hashmaps as well?
-Toke
* Re: [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly
2024-11-21 10:53 ` Toke Høiland-Jørgensen
@ 2024-11-22 2:06 ` Hou Tao
0 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-22 2:06 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
John Fastabend, Sebastian Andrzej Siewior, Thomas Gleixner,
Thomas Weißschuh, houtao1, xukuohai
Hi,
On 11/21/2024 6:53 PM, Toke Høiland-Jørgensen wrote:
> Hou Tao <houtao@huaweicloud.com> writes:
>
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When a LPM trie is full, in-place updates of existing elements
>> incorrectly return -ENOSPC.
>>
>> Fix this by deferring the check of trie->n_entries. For new insertions,
>> n_entries must not exceed max_entries. However, in-place updates are
>> allowed even when the trie is full.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>> kernel/bpf/lpm_trie.c | 46 +++++++++++++++++++++----------------------
>> 1 file changed, 23 insertions(+), 23 deletions(-)
>>
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 4300bd51ec6e..ff57e35357ae 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -310,6 +310,15 @@ static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
>> return node;
>> }
>>
>> +static int trie_check_noreplace_update(const struct lpm_trie *trie, u64 flags)
> I think this function name is hard to parse (took me a few tries). How
> about trie_check_add_entry() instead?
Better. However, "entry" is not commonly used. The commonly used noun is
either "node" or "element". Will use trie_check_add_elem().
>
>> +{
>> + if (flags == BPF_EXIST)
>> + return -ENOENT;
>> + if (trie->n_entries == trie->map.max_entries)
>> + return -ENOSPC;
> The calls to this function are always paired with a trie->n_entries++; -
> so how about moving that into the function after the checks? You'll have
> to then add a decrement if the im_node allocation fails, but I think
> that is still clearer than having the n_entries++ statements scattered
> around the function.
Got it. Will update it in v2. One motivation for incrementing n_entries
only when necessary was to prevent trie_mem_usage() from reading an
inaccurate n_entries, but thinking about it again, I don't think it
matters much.
>
> -Toke
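For illustration, a minimal user-space sketch of the shape suggested above,
with the check and the n_entries increment combined. The struct layout is a
simplified stand-in for the kernel's struct lpm_trie (max_entries really
lives in trie->map.max_entries), and only the flag value of BPF_EXIST is
taken from the UAPI:

```c
#include <assert.h>
#include <errno.h>

#define BPF_EXIST 2	/* UAPI value; BPF_ANY = 0, BPF_NOEXIST = 1 */

/* Simplified stand-in for the kernel's struct lpm_trie. */
struct lpm_trie {
	unsigned int n_entries;
	unsigned int max_entries;	/* trie->map.max_entries in the kernel */
};

/*
 * Sketch of trie_check_add_elem(): reject BPF_EXIST when the key is not
 * present, reject new insertions into a full trie, and on success account
 * for the new entry immediately. A caller whose later node allocation
 * fails must decrement n_entries again.
 */
static int trie_check_add_elem(struct lpm_trie *trie, unsigned long flags)
{
	if (flags == BPF_EXIST)
		return -ENOENT;
	if (trie->n_entries == trie->max_entries)
		return -ENOSPC;
	trie->n_entries++;
	return 0;
}
```

This keeps the full-trie check out of the in-place-update path: the function
is only called for insertions, so updating an existing element never sees
-ENOSPC.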
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-21 12:50 ` Toke Høiland-Jørgensen
@ 2024-11-22 3:36 ` Hou Tao
0 siblings, 0 replies; 40+ messages in thread
From: Hou Tao @ 2024-11-22 3:36 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Thomas Weißschuh,
Alexei Starovoitov
Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend,
Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
houtao1, xukuohai
Hi,
On 11/21/2024 8:50 PM, Toke Høiland-Jørgensen wrote:
> Thomas Weißschuh <thomas.weissschuh@linutronix.de> writes:
>
>> On Thu, Nov 21, 2024 at 12:39:08PM +0100, Toke Høiland-Jørgensen wrote:
>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>
>>>> Fix these warnings by replacing kmalloc()/kfree()/kfree_rcu() with
>>>> equivalent bpf memory allocator APIs. Since intermediate node and leaf
>>>> node have fixed sizes, fixed-size allocation APIs are used.
>>>>
>>>> Two aspects of this change require explanation:
>>>>
>>>> 1. A new flag LPM_TREE_NODE_FLAG_ALLOC_LEAF is added to track the
>>>> original allocator. This is necessary because during deletion, a leaf
>>>> node may be used as an intermediate node. These nodes must be freed
>>>> through the leaf allocator.
>>>> 2. The intermediate node allocator and leaf node allocator may be merged
>>>> because value_size for LPM trie is usually small. The merging reduces
>>>> the memory overhead of bpf memory allocator.
>>> This seems like an awfully complicated way to fix this. Couldn't we just
>>> move the node allocations in trie_update_elem() out so they happen
>>> before the trie lock is taken instead?
I had considered that. However, it would allocate two nodes for each
update, which I think may be too expensive.
>> The problematic lock nesting is not between the trie lock and the
>> allocator lock but between each of them and any other lock in the kernel.
>> BPF programs can be called from any context through tracepoints.
>> In this specific case the issue was a tracepoint executed under the
>> workqueue lock.
> That is not the issue described in the commit message, though. If the
> goal is to make the lpm_trie map usable in any context, the commit
> message should be rewritten to reflect this, instead of mentioning a
> specific deadlock between the trie lock and the allocator lock.
The original intention of the patch set was to resolve the multiple
deadlock reports from syzbot [1]. All of these reports involve
trie->lock and the internal lock in kmalloc(). However, as Thomas
pointed out, fixing only the ordering between trie->lock and the
internal lock in kmalloc() is not enough. Will update the commit
message to reflect that.
[1]:
https://lore.kernel.org/bpf/e14d8882-4760-7c9c-0cfc-db04eda494ee@huaweicloud.com/
>
> And in that case, I think it's better to use a single 'struct
> bpf_mem_alloc' per map (like hashmaps do). This will waste some memory
> for intermediate nodes, but that seems like an acceptable trade-off to
> avoid all the complexity around two different allocators.
>
> Not sure if Alexei's comment about too many allocators was aimed solely
> at this, or if he has issues even with having a single allocator per map
> as well; but in that case, that seems like something that should be
> fixed for hashmaps as well?
In my understanding, the motivation for using a shared bpf_mem_alloc
instead of a per-map one is to reduce the memory overhead of the
per-cpu object freelists in each bpf memory allocator. The hash map is
a bit different: objects returned to bpf_mem_alloc may be reused
immediately, so if a freed object is reused by another call-site (e.g.,
not a hash map) and its hlist_nulls_node is overwritten, the hash map
lookup procedure may oops.
>
> -Toke
* Re: [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie
2024-11-21 1:20 ` Hou Tao
@ 2024-11-23 3:29 ` Alexei Starovoitov
0 siblings, 0 replies; 40+ messages in thread
From: Alexei Starovoitov @ 2024-11-23 3:29 UTC (permalink / raw)
To: Hou Tao
Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend,
Sebastian Andrzej Siewior, Thomas Gleixner, Thomas Weißschuh,
Hou Tao, Xu Kuohai
On Wed, Nov 20, 2024 at 5:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi Alexei,
>
> On 11/20/2024 9:16 AM, Alexei Starovoitov wrote:
> > On Sun, Nov 17, 2024 at 4:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>
> >> +enum {
> >> + LPM_TRIE_MA_IM = 0,
> >> + LPM_TRIE_MA_LEAF,
> >> + LPM_TRIE_MA_CNT,
> >> +};
> >> +
> >> struct lpm_trie {
> >> struct bpf_map map;
> >> struct lpm_trie_node __rcu *root;
> >> + struct bpf_mem_alloc ma[LPM_TRIE_MA_CNT];
> >> + struct bpf_mem_alloc *im_ma;
> >> + struct bpf_mem_alloc *leaf_ma;
> > We cannot use bpf_ma-s liberally like that.
> > Freelists are not huge, but we shouldn't be adding new bpf_ma
> > in every map and every use case.
> >
> > bpf_mem_cache_is_mergeable() in the previous patch also
> > leaks implementation details.
> >
> > Can you use bpf_global_ma for all nodes?
>
> Will try. However, there are mainly two differences between
> bpf_global_ma and a map-specific bpf_mem_alloc. The first one is the
> memory accounting problem. All memory allocated from bpf_global_ma
> will be accounted to the root memory cgroup instead of the current
> memory cgroup (due to the return value of get_memcg()). I think we could
> partially fix this by returning NULL instead of root_mem_cgroup when
> c->objcg is NULL. However, even with that fix, the memory accounting is
> still inaccurate, because these pre-allocated objects may be used by
> other maps instead of the map which triggered the pre-allocation.
That's a valid point.
Though we ignore this issue in bpf_obj_new() and other places,
if we can account into memcg correctly we should do it.
> The second one is the freeing of objects when destroying the map. For
> a map-specific bpf_mem_alloc, most freed objects can be returned
> immediately to slub. However, that is not true for bpf_global_ma,
> because we cannot tell whether an object belongs to a to-be-freed map
> or not, and we also cannot drain bpf_global_ma the way we drain a
> map-specific bpf_mem_alloc.
I don't think it's a big issue here. Optimizing delays in the free path
is imo premature; the extra complexity is not worth it.
Let's use one bpf_ma for the LPM trie, sized for LPM_TRIE_MA_LEAF.
Inner nodes may waste some memory, and that's ok.
The whole LPM trie is not efficient anyway.
Micro-optimizing at the bpf_ma level is a small improvement compared
to rewriting the whole LPM map as a more performant and
memory-efficient algorithm.
end of thread, other threads:[~2024-11-23 3:30 UTC | newest]
Thread overview: 40+ messages
2024-11-18 1:07 [PATCH bpf-next 00/10] Fixes for LPM trie Hou Tao
2024-11-18 1:07 ` [PATCH bpf-next 01/10] bpf: Remove unnecessary check when updating " Hou Tao
2024-11-21 10:22 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 02/10] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem Hou Tao
2024-11-21 10:25 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 03/10] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
2024-11-18 13:39 ` Thomas Weißschuh
2024-11-19 1:08 ` Hou Tao
2024-11-21 10:32 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 04/10] bpf: Handle in-place update for full LPM trie correctly Hou Tao
2024-11-18 13:13 ` Sebastian Andrzej Siewior
2024-11-19 1:05 ` Hou Tao
2024-11-21 10:53 ` Toke Høiland-Jørgensen
2024-11-22 2:06 ` Hou Tao
2024-11-18 1:08 ` [PATCH bpf-next 05/10] bpf: Fix exact match conditions in trie_get_next_key() Hou Tao
2024-11-21 11:01 ` Toke Høiland-Jørgensen
2024-11-18 1:08 ` [PATCH bpf-next 06/10] bpf: Add bpf_mem_cache_is_mergeable() helper Hou Tao
2024-11-18 13:29 ` Thomas Weißschuh
2024-11-19 1:06 ` Hou Tao
2024-11-18 1:08 ` [PATCH bpf-next 07/10] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
2024-11-18 13:30 ` Sebastian Andrzej Siewior
2024-11-18 16:56 ` Yonghong Song
2024-11-20 1:16 ` Alexei Starovoitov
2024-11-21 1:20 ` Hou Tao
2024-11-23 3:29 ` Alexei Starovoitov
2024-11-21 11:39 ` Toke Høiland-Jørgensen
2024-11-21 11:52 ` Thomas Weißschuh
2024-11-21 12:50 ` Toke Høiland-Jørgensen
2024-11-22 3:36 ` Hou Tao
2024-11-18 1:08 ` [PATCH bpf-next 08/10] bpf: Use raw_spinlock_t " Hou Tao
2024-11-18 1:08 ` [PATCH bpf-next 09/10] selftests/bpf: Move test_lpm_map.c to map_tests Hou Tao
2024-11-18 1:08 ` [PATCH bpf-next 10/10] selftests/bpf: Add more test cases for LPM trie Hou Tao
2024-11-18 17:46 ` Daniel Borkmann
2024-11-19 1:10 ` Hou Tao
[not found] ` <46268aa9ef13a24388af833b17f6cef8bdd3a7be8402fec7640e65a2f1118468@mail.kernel.org>
2024-11-18 6:20 ` [PATCH bpf-next 00/10] Fixes " Hou Tao
2024-11-18 17:43 ` Daniel Xu
2024-11-19 1:09 ` Hou Tao
2024-11-18 15:39 ` Sebastian Andrzej Siewior
2024-11-19 1:35 ` Hou Tao
2024-11-19 14:15 ` Sebastian Andrzej Siewior