* [PATCH v1 1/5] fib6: fix tbl8 reservation drift in trie
2026-05-22 14:58 ` [PATCH v1 0/5] fib6: fix tbl8 reservation drift Maxime Leroy
@ 2026-05-22 14:58 ` Maxime Leroy
2026-06-05 13:03 ` [PATCH 1/3] " Vladimir Medvedkin
2026-05-22 14:58 ` [PATCH v1 2/5] test/fib6: add reproducer for tbl8 reservation drift Maxime Leroy
` (4 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Maxime Leroy @ 2026-05-22 14:58 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: dev, stable, Maxime Leroy
trie_modify() maintained rsvd_tbl8s by computing a depth_diff from
the current RIB topology at both ADD and DEL. The two values diverge
when the RIB changes between an ADD and its later DEL (a covering
parent added or removed), and rsvd_tbl8s eventually wraps to
UINT32_MAX, rejecting all subsequent /25+ ADDs with -ENOSPC.
Replace the depth_diff arithmetic with the dir24_8 approach
generalised to multiple byte boundaries: for each byte boundary L
below depth, the supernet at level L needs a tbl8 if and only if at
least one prefix with depth > L exists in that supernet. The supernet
identity at level L is stable across unrelated RIB modifications, so
ADD/DEL pairs are symmetric by construction.
A helper count_empty_levels() scans byte boundaries from 24 upward,
stopping at the first level where the supernet has no descendant. A
NULL answer at level L propagates upward because narrower supernets
are subsets, so the remaining boundaries can be tallied in one shot
without further queries.
- On ADD, count the empty levels and increment rsvd_tbl8s by that
count. The capacity pre-check uses the exact post-operation
envelope.
- On DEL, after removing the prefix, count the levels that became
empty and decrement.
Fixes: c3e12e0f0354 ("fib: add dataplane algorithm for IPv6")
Cc: stable@dpdk.org
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
lib/fib/trie.c | 71 +++++++++++++++++++++++---------------------------
1 file changed, 32 insertions(+), 39 deletions(-)
diff --git a/lib/fib/trie.c b/lib/fib/trie.c
index fa5d9ec6b0..44b90f72ff 100644
--- a/lib/fib/trie.c
+++ b/lib/fib/trie.c
@@ -534,19 +534,43 @@ modify_dp(struct rte_trie_tbl *dp, struct rte_rib6 *rib,
return 0;
}
+/*
+ * Count byte boundaries between 24 and CEIL(depth, 8) where the
+ * supernet of ip has no descendant in the RIB. This is the number of
+ * new tbl8 levels an ADD of ip/depth would introduce, or the number
+ * to free at DEL once the prefix has been removed from the RIB.
+ *
+ * A NULL answer at level L propagates upwards: narrower supernets at
+ * L+8, L+16, ... are subsets of S_L and cannot contain descendants
+ * either. The loop stops at the first NULL and tallies the remaining
+ * boundaries in one shot.
+ */
+static uint8_t
+count_empty_levels(struct rte_rib6 *rib, const struct rte_ipv6_addr *ip,
+ uint8_t depth)
+{
+ uint8_t level, top = RTE_ALIGN_CEIL(depth, 8);
+
+ for (level = 24; level < top; level += 8) {
+ if (rte_rib6_get_nxt(rib, ip, level, NULL,
+ RTE_RIB6_GET_NXT_COVER) == NULL)
+ return (top - level) >> 3;
+ }
+ return 0;
+}
+
int
trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
uint8_t depth, uint64_t next_hop, int op)
{
struct rte_trie_tbl *dp;
struct rte_rib6 *rib;
- struct rte_rib6_node *tmp = NULL;
struct rte_rib6_node *node;
struct rte_rib6_node *parent;
- struct rte_ipv6_addr ip_masked, tmp_ip;
+ struct rte_ipv6_addr ip_masked;
int ret = 0;
uint64_t par_nh, node_nh;
- uint8_t tmp_depth, depth_diff = 0, parent_depth = 24;
+ uint8_t new_levels;
if ((fib == NULL) || (ip == NULL) || (depth > RTE_IPV6_MAX_DEPTH))
return -EINVAL;
@@ -559,37 +583,6 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
ip_masked = *ip;
rte_ipv6_addr_mask(&ip_masked, depth);
- if (depth > 24) {
- tmp = rte_rib6_get_nxt(rib, &ip_masked,
- RTE_ALIGN_FLOOR(depth, 8), NULL,
- RTE_RIB6_GET_NXT_ALL);
- if (tmp && op == RTE_FIB6_DEL) {
- /* in case of delete operation, skip the prefix we are going to delete */
- rte_rib6_get_ip(tmp, &tmp_ip);
- rte_rib6_get_depth(tmp, &tmp_depth);
- if (rte_ipv6_addr_eq(&ip_masked, &tmp_ip) && depth == tmp_depth)
- tmp = rte_rib6_get_nxt(rib, &ip_masked,
- RTE_ALIGN_FLOOR(depth, 8), tmp, RTE_RIB6_GET_NXT_ALL);
- }
-
- if (tmp == NULL) {
- tmp = rte_rib6_lookup(rib, ip);
- /**
- * in case of delete operation, lookup returns the prefix
- * we are going to delete. Find the parent.
- */
- if (tmp && op == RTE_FIB6_DEL)
- tmp = rte_rib6_lookup_parent(tmp);
-
- if (tmp != NULL) {
- rte_rib6_get_depth(tmp, &tmp_depth);
- parent_depth = RTE_MAX(tmp_depth, 24);
- }
- depth_diff = RTE_ALIGN_CEIL(depth, 8) -
- RTE_ALIGN_CEIL(parent_depth, 8);
- depth_diff = depth_diff >> 3;
- }
- }
node = rte_rib6_lookup_exact(rib, &ip_masked, depth);
switch (op) {
case RTE_FIB6_ADD:
@@ -603,7 +596,8 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
return 0;
}
- if ((depth > 24) && (dp->rsvd_tbl8s + depth_diff > dp->number_tbl8s))
+ new_levels = count_empty_levels(rib, &ip_masked, depth);
+ if (dp->rsvd_tbl8s + new_levels > dp->number_tbl8s)
return -ENOSPC;
node = rte_rib6_insert(rib, &ip_masked, depth);
@@ -622,7 +616,7 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
return ret;
}
successfully_added:
- dp->rsvd_tbl8s += depth_diff;
+ dp->rsvd_tbl8s += new_levels;
return 0;
case RTE_FIB6_DEL:
if (node == NULL)
@@ -640,9 +634,8 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
if (ret != 0)
return ret;
- rte_rib6_remove(rib, ip, depth);
-
- dp->rsvd_tbl8s -= depth_diff;
+ rte_rib6_remove(rib, &ip_masked, depth);
+ dp->rsvd_tbl8s -= count_empty_levels(rib, &ip_masked, depth);
return 0;
default:
break;
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* [PATCH 1/3] fib6: fix tbl8 reservation drift in trie
2026-05-22 14:58 ` [PATCH v1 1/5] fib6: fix tbl8 reservation drift in trie Maxime Leroy
@ 2026-06-05 13:03 ` Vladimir Medvedkin
0 siblings, 0 replies; 14+ messages in thread
From: Vladimir Medvedkin @ 2026-06-05 13:03 UTC (permalink / raw)
To: dev; +Cc: maxime, stable
From: Maxime Leroy <maxime@leroys.fr>
trie_modify() maintained rsvd_tbl8s by computing a depth_diff from
the current RIB topology at both ADD and DEL. The two values diverge
when the RIB changes between an ADD and its later DEL (a covering
parent added or removed), and rsvd_tbl8s eventually wraps to
UINT32_MAX, rejecting all subsequent /25+ ADDs with -ENOSPC.
A helper count_empty_levels() was added to fix the issue.
Fixes: c3e12e0f0354 ("fib: add dataplane algorithm for IPv6")
Cc: stable@dpdk.org
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Signed-off-by: Vladimir Medvedkin <vladimir.medvedkin@intel.com>
---
lib/fib/trie.c | 81 +++++++++++++++++++++--------------------
lib/rib/rib6_internal.h | 24 ++++++++++++
lib/rib/rte_rib6.c | 12 +-----
3 files changed, 66 insertions(+), 51 deletions(-)
create mode 100644 lib/rib/rib6_internal.h
diff --git a/lib/fib/trie.c b/lib/fib/trie.c
index fa5d9ec6b0..e9f1141cef 100644
--- a/lib/fib/trie.c
+++ b/lib/fib/trie.c
@@ -15,6 +15,7 @@
#include <rte_fib6.h>
#include "fib_log.h"
#include "trie.h"
+#include <rib6_internal.h>
#ifdef CC_AVX512_SUPPORT
@@ -534,19 +535,45 @@ modify_dp(struct rte_trie_tbl *dp, struct rte_rib6 *rib,
return 0;
}
+/*
+ * Count bumber of TBL8s that can be freed after deleting a prefix or allocated
+ * after adding a prefix.
+ */
+static uint8_t
+count_empty_levels(struct rte_rib6 *rib, const struct rte_ipv6_addr *ip, uint8_t depth)
+{
+ struct rte_rib6_node *cur = rte_rib6_lookup_exact(rib, ip, depth);
+ /* expect prefix exists */
+ if (cur == NULL)
+ return 0;
+
+ /* more specifics present */
+ if (cur->left != NULL || cur->right != NULL)
+ return 0;
+
+ struct rte_rib6_node *parent = cur->parent;
+ /* we know parent->depth lt a target cur->depth
+ * also, there exists tbl8 path up to RTE_ALIGN_CEIL(parent->depth, 8)
+ */
+ depth = RTE_MAX(depth, 24);
+ uint8_t parent_depth = (parent) ? RTE_MAX(parent->depth, 24) : 24;
+ uint8_t depth_diff = (RTE_ALIGN_CEIL(depth, 8) - RTE_ALIGN_CEIL(parent_depth, 8)) >> 3;
+
+ return depth_diff;
+}
+
int
trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
uint8_t depth, uint64_t next_hop, int op)
{
struct rte_trie_tbl *dp;
struct rte_rib6 *rib;
- struct rte_rib6_node *tmp = NULL;
struct rte_rib6_node *node;
struct rte_rib6_node *parent;
- struct rte_ipv6_addr ip_masked, tmp_ip;
+ struct rte_ipv6_addr ip_masked;
int ret = 0;
uint64_t par_nh, node_nh;
- uint8_t tmp_depth, depth_diff = 0, parent_depth = 24;
+ uint8_t new_levels;
if ((fib == NULL) || (ip == NULL) || (depth > RTE_IPV6_MAX_DEPTH))
return -EINVAL;
@@ -559,37 +586,6 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
ip_masked = *ip;
rte_ipv6_addr_mask(&ip_masked, depth);
- if (depth > 24) {
- tmp = rte_rib6_get_nxt(rib, &ip_masked,
- RTE_ALIGN_FLOOR(depth, 8), NULL,
- RTE_RIB6_GET_NXT_ALL);
- if (tmp && op == RTE_FIB6_DEL) {
- /* in case of delete operation, skip the prefix we are going to delete */
- rte_rib6_get_ip(tmp, &tmp_ip);
- rte_rib6_get_depth(tmp, &tmp_depth);
- if (rte_ipv6_addr_eq(&ip_masked, &tmp_ip) && depth == tmp_depth)
- tmp = rte_rib6_get_nxt(rib, &ip_masked,
- RTE_ALIGN_FLOOR(depth, 8), tmp, RTE_RIB6_GET_NXT_ALL);
- }
-
- if (tmp == NULL) {
- tmp = rte_rib6_lookup(rib, ip);
- /**
- * in case of delete operation, lookup returns the prefix
- * we are going to delete. Find the parent.
- */
- if (tmp && op == RTE_FIB6_DEL)
- tmp = rte_rib6_lookup_parent(tmp);
-
- if (tmp != NULL) {
- rte_rib6_get_depth(tmp, &tmp_depth);
- parent_depth = RTE_MAX(tmp_depth, 24);
- }
- depth_diff = RTE_ALIGN_CEIL(depth, 8) -
- RTE_ALIGN_CEIL(parent_depth, 8);
- depth_diff = depth_diff >> 3;
- }
- }
node = rte_rib6_lookup_exact(rib, &ip_masked, depth);
switch (op) {
case RTE_FIB6_ADD:
@@ -603,12 +599,16 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
return 0;
}
- if ((depth > 24) && (dp->rsvd_tbl8s + depth_diff > dp->number_tbl8s))
- return -ENOSPC;
-
node = rte_rib6_insert(rib, &ip_masked, depth);
if (node == NULL)
return -rte_errno;
+
+ new_levels = count_empty_levels(rib, &ip_masked, depth);
+ if (dp->rsvd_tbl8s + new_levels > dp->number_tbl8s) {
+ rte_rib6_remove(rib, &ip_masked, depth);
+ return -ENOSPC;
+ }
+
rte_rib6_set_nh(node, next_hop);
parent = rte_rib6_lookup_parent(node);
if (parent != NULL) {
@@ -622,7 +622,7 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
return ret;
}
successfully_added:
- dp->rsvd_tbl8s += depth_diff;
+ dp->rsvd_tbl8s += new_levels;
return 0;
case RTE_FIB6_DEL:
if (node == NULL)
@@ -640,9 +640,10 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
if (ret != 0)
return ret;
- rte_rib6_remove(rib, ip, depth);
- dp->rsvd_tbl8s -= depth_diff;
+ dp->rsvd_tbl8s -= count_empty_levels(rib, &ip_masked, depth);
+ rte_rib6_remove(rib, &ip_masked, depth);
+
return 0;
default:
break;
diff --git a/lib/rib/rib6_internal.h b/lib/rib/rib6_internal.h
new file mode 100644
index 0000000000..674befc152
--- /dev/null
+++ b/lib/rib/rib6_internal.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _RIB6_INTERNAL_H_
+#define _RIB6_INTERNAL_H_
+
+#include <stdint.h>
+
+#include <rte_ip6.h>
+
+struct rte_rib6_node {
+ struct rte_rib6_node *left;
+ struct rte_rib6_node *right;
+ struct rte_rib6_node *parent;
+ uint64_t nh;
+ struct rte_ipv6_addr ip;
+ uint8_t depth;
+ uint8_t flag;
+ uint64_t ext[];
+};
+
+#endif /* _RIB6_INTERNAL_H_ */
diff --git a/lib/rib/rte_rib6.c b/lib/rib/rte_rib6.c
index ec8ff68e87..f9023fca59 100644
--- a/lib/rib/rte_rib6.c
+++ b/lib/rib/rte_rib6.c
@@ -19,6 +19,7 @@
#include <rte_rib6.h>
#include "rib_log.h"
+#include "rib6_internal.h"
#define RTE_RIB_VALID_NODE 1
/* Maximum length of a RIB6 name. */
@@ -30,17 +31,6 @@ static struct rte_tailq_elem rte_rib6_tailq = {
};
EAL_REGISTER_TAILQ(rte_rib6_tailq)
-struct rte_rib6_node {
- struct rte_rib6_node *left;
- struct rte_rib6_node *right;
- struct rte_rib6_node *parent;
- uint64_t nh;
- struct rte_ipv6_addr ip;
- uint8_t depth;
- uint8_t flag;
- uint64_t ext[];
-};
-
struct rte_rib6 {
char name[RTE_RIB6_NAMESIZE];
struct rte_rib6_node *tree;
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v1 2/5] test/fib6: add reproducer for tbl8 reservation drift
2026-05-22 14:58 ` [PATCH v1 0/5] fib6: fix tbl8 reservation drift Maxime Leroy
2026-05-22 14:58 ` [PATCH v1 1/5] fib6: fix tbl8 reservation drift in trie Maxime Leroy
@ 2026-05-22 14:58 ` Maxime Leroy
2026-05-22 14:58 ` [PATCH v1 3/5] test/fib6: extended drift test cases Maxime Leroy
` (3 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Maxime Leroy @ 2026-05-22 14:58 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: dev, stable, Maxime Leroy
test_drift covers the asymmetric ADD parent / ADD children / DEL
parent / DEL children sequence that wraps rsvd_tbl8s past zero in a
single iteration. After the wrap the next /25+ ADD is rejected with
-ENOSPC even though the tbl8 pool is empty. With the preceding fix
in place the final ADD succeeds.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
app/test/test_fib6.c | 74 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)
diff --git a/app/test/test_fib6.c b/app/test/test_fib6.c
index fffb590dbf..c4283f3f2d 100644
--- a/app/test/test_fib6.c
+++ b/app/test/test_fib6.c
@@ -25,6 +25,7 @@ static int32_t test_get_invalid(void);
static int32_t test_lookup(void);
static int32_t test_invalid_rcu(void);
static int32_t test_fib_rcu_sync_rw(void);
+static int32_t test_drift(void);
#define MAX_ROUTES (1 << 16)
/** Maximum number of tbl8 for 2-byte entries */
@@ -599,6 +600,78 @@ test_fib_rcu_sync_rw(void)
return status == 0 ? TEST_SUCCESS : TEST_FAILED;
}
+/*
+ * Reproducer for the rsvd_tbl8s drift bug. depth_diff used to maintain
+ * rsvd_tbl8s is computed from the current RIB state, so it is not
+ * invariant between the ADD of a prefix and its later DEL when a
+ * covering parent prefix is removed in between.
+ *
+ * Layout: one /28 parent (fcde::/28) and three /48 siblings under it
+ * (fcde:0:6000::/48, fcde:1:6000::/48, fcde:2:6000::/48). The second
+ * hextet's high 12 bits are zero, so the three /48 IPs all fall inside
+ * the /28.
+ *
+ * One asymmetric sequence is enough to wrap the counter:
+ * ADD /28 rsvd_tbl8s += 1
+ * ADD /48 child_0,1,2 (with /28 parent) rsvd_tbl8s += 2 each (+6)
+ * DEL /28 (sibling /48 found) rsvd_tbl8s -= 0
+ * DEL /48 child_0,1,2 (no parent left) rsvd_tbl8s -= 3 each (-9)
+ */
+static int32_t
+test_drift(void)
+{
+ struct rte_fib6_conf config = { 0 };
+ struct rte_fib6 *fib;
+ struct rte_ipv6_addr parent =
+ RTE_IPV6(0xfcde, 0, 0, 0, 0, 0, 0, 0);
+ struct rte_ipv6_addr child[3] = {
+ RTE_IPV6(0xfcde, 0, 0x6000, 0, 0, 0, 0, 0),
+ RTE_IPV6(0xfcde, 1, 0x6000, 0, 0, 0, 0, 0),
+ RTE_IPV6(0xfcde, 2, 0x6000, 0, 0, 0, 0, 0),
+ };
+ unsigned int c;
+ int ret;
+
+ config.max_routes = 1024;
+ config.rib_ext_sz = 0;
+ config.default_nh = 0;
+ config.type = RTE_FIB6_TRIE;
+ config.trie.nh_sz = RTE_FIB6_TRIE_2B;
+ config.trie.num_tbl8 = 256;
+
+ fib = rte_fib6_create(__func__, SOCKET_ID_ANY, &config);
+ RTE_TEST_ASSERT(fib != NULL, "Failed to create FIB\n");
+
+ ret = rte_fib6_add(fib, &parent, 28, 0xa);
+ RTE_TEST_ASSERT(ret == 0, "ADD /28 failed (ret=%d)\n", ret);
+
+ for (c = 0; c < 3; c++) {
+ ret = rte_fib6_add(fib, &child[c], 48, 0xb + c);
+ RTE_TEST_ASSERT(ret == 0,
+ "ADD /48 child %u failed (ret=%d)\n", c, ret);
+ }
+
+ ret = rte_fib6_delete(fib, &parent, 28);
+ RTE_TEST_ASSERT(ret == 0, "DEL /28 failed (ret=%d)\n", ret);
+
+ for (c = 0; c < 3; c++) {
+ ret = rte_fib6_delete(fib, &child[c], 48);
+ RTE_TEST_ASSERT(ret == 0,
+ "DEL /48 child %u failed (ret=%d)\n", c, ret);
+ }
+
+ /* Pre-fix: -ENOSPC. Post-fix: succeeds. */
+ ret = rte_fib6_add(fib, &parent, 28, 0xe);
+ RTE_TEST_ASSERT(ret == 0,
+ "Fresh ADD /28 spuriously failed (ret=%d)\n", ret);
+
+ ret = rte_fib6_delete(fib, &parent, 28);
+ RTE_TEST_ASSERT(ret == 0, "Final DEL /28 failed (ret=%d)\n", ret);
+
+ rte_fib6_free(fib);
+ return TEST_SUCCESS;
+}
+
static struct unit_test_suite fib6_fast_tests = {
.suite_name = "fib6 autotest",
.setup = NULL,
@@ -611,6 +684,7 @@ static struct unit_test_suite fib6_fast_tests = {
TEST_CASE(test_lookup),
TEST_CASE(test_invalid_rcu),
TEST_CASE(test_fib_rcu_sync_rw),
+ TEST_CASE(test_drift),
TEST_CASES_END()
}
};
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* [PATCH v1 3/5] test/fib6: extended drift test cases
2026-05-22 14:58 ` [PATCH v1 0/5] fib6: fix tbl8 reservation drift Maxime Leroy
2026-05-22 14:58 ` [PATCH v1 1/5] fib6: fix tbl8 reservation drift in trie Maxime Leroy
2026-05-22 14:58 ` [PATCH v1 2/5] test/fib6: add reproducer for tbl8 reservation drift Maxime Leroy
@ 2026-05-22 14:58 ` Maxime Leroy
2026-05-22 14:58 ` [PATCH v1 4/5] rib: track valid descendant count per node Maxime Leroy
` (2 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Maxime Leroy @ 2026-05-22 14:58 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: dev, stable, Maxime Leroy
Four additional test cases exercise scenarios touched by the
multi-level supernet counting:
- test_drift_compression: parent + compressed child, DEL parent
forces decompression, re-ADD ancestor re-compresses.
- test_drift_multilevel: /28 + /48 + /96 chain with mixed
compressed and non-compressed links, then DEL of the middle
prefix.
- test_drift_stress: pseudo-random ADD/DEL sequence checking that
no operation returns -ENOSPC under a leaked rsvd_tbl8s.
- test_drift_tight_pool: pool sized to exactly the legitimate
envelope, re-ADD after decompression must succeed.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
app/test/test_fib6.c | 269 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 265 insertions(+), 4 deletions(-)
diff --git a/app/test/test_fib6.c b/app/test/test_fib6.c
index c4283f3f2d..ad68645428 100644
--- a/app/test/test_fib6.c
+++ b/app/test/test_fib6.c
@@ -26,6 +26,10 @@ static int32_t test_lookup(void);
static int32_t test_invalid_rcu(void);
static int32_t test_fib_rcu_sync_rw(void);
static int32_t test_drift(void);
+static int32_t test_drift_compression(void);
+static int32_t test_drift_multilevel(void);
+static int32_t test_drift_stress(void);
+static int32_t test_drift_tight_pool(void);
#define MAX_ROUTES (1 << 16)
/** Maximum number of tbl8 for 2-byte entries */
@@ -601,10 +605,9 @@ test_fib_rcu_sync_rw(void)
}
/*
- * Reproducer for the rsvd_tbl8s drift bug. depth_diff used to maintain
- * rsvd_tbl8s is computed from the current RIB state, so it is not
- * invariant between the ADD of a prefix and its later DEL when a
- * covering parent prefix is removed in between.
+ * Reproducer for the rsvd_tbl8s drift bug. The tbl8 reservation
+ * accounting must remain balanced even when a covering parent prefix
+ * is removed between an ADD and its later matching DEL.
*
* Layout: one /28 parent (fcde::/28) and three /48 siblings under it
* (fcde:0:6000::/48, fcde:1:6000::/48, fcde:2:6000::/48). The second
@@ -672,6 +675,260 @@ test_drift(void)
return TEST_SUCCESS;
}
+/*
+ * Exercise compression (same nh as parent), forced decompression on
+ * DEL parent, then re-compression after re-adding the same ancestor.
+ * The tbl8 reservation accounting must remain balanced even though
+ * the child is physically decompressed/recompressed in the dataplane.
+ *
+ * Layout: parent fcde::/28 and child fcde:0:6000::/48, both nh=1.
+ *
+ * ADD /28 (no ancestor) rsvd_tbl8s += 1
+ * ADD /48 (compressed under /28) rsvd_tbl8s += 2
+ * DEL /28 (decompresses /48) rsvd unchanged (/48 keeps)
+ * re-ADD /28 (re-compresses /48) rsvd unchanged
+ * DEL /48 rsvd_tbl8s -= 2
+ * DEL /28 rsvd_tbl8s -= 1
+ */
+static int32_t
+test_drift_compression(void)
+{
+ struct rte_fib6_conf config = { 0 };
+ struct rte_fib6 *fib;
+ struct rte_ipv6_addr parent = RTE_IPV6(0xfcde, 0, 0, 0, 0, 0, 0, 0);
+ struct rte_ipv6_addr child = RTE_IPV6(0xfcde, 0, 0x6000, 0, 0, 0, 0, 0);
+ int ret;
+
+ config.max_routes = 1024;
+ config.rib_ext_sz = 0;
+ config.default_nh = 0;
+ config.type = RTE_FIB6_TRIE;
+ config.trie.nh_sz = RTE_FIB6_TRIE_2B;
+ config.trie.num_tbl8 = 256;
+
+ fib = rte_fib6_create(__func__, SOCKET_ID_ANY, &config);
+ RTE_TEST_ASSERT(fib != NULL, "Failed to create FIB\n");
+
+ /* Compressed: child shares the parent's nh, modify_dp is skipped */
+ ret = rte_fib6_add(fib, &parent, 28, 1);
+ RTE_TEST_ASSERT(ret == 0, "ADD /28 failed\n");
+ ret = rte_fib6_add(fib, &child, 48, 1);
+ RTE_TEST_ASSERT(ret == 0, "ADD /48 (compressed) failed\n");
+
+ /* DEL parent forces decompression: child must be materialized */
+ ret = rte_fib6_delete(fib, &parent, 28);
+ RTE_TEST_ASSERT(ret == 0, "DEL /28 (decompression) failed\n");
+
+ /* Re-add parent with same nh: child becomes compressed again */
+ ret = rte_fib6_add(fib, &parent, 28, 1);
+ RTE_TEST_ASSERT(ret == 0, "Re-ADD /28 failed\n");
+
+ ret = rte_fib6_delete(fib, &child, 48);
+ RTE_TEST_ASSERT(ret == 0, "DEL /48 failed\n");
+ ret = rte_fib6_delete(fib, &parent, 28);
+ RTE_TEST_ASSERT(ret == 0, "DEL /28 final failed\n");
+
+ rte_fib6_free(fib);
+ return TEST_SUCCESS;
+}
+
+/*
+ * Three-level nesting with compressed and non-compressed paths, then
+ * DEL of the middle prefix. The byte-boundary supernet accounting
+ * must remain balanced through the chain.
+ *
+ * Layout: grand fcde::/28 nh=1, mid fcde:0:6000::/48 nh=1 (compressed
+ * under grand), leaf fcde:0:6000::4000::/96 nh=2 (not compressed).
+ *
+ * ADD /28 (no ancestor) rsvd_tbl8s += 1
+ * ADD /48 (compressed under /28) rsvd_tbl8s += 2
+ * ADD /96 (not compressed under /48) rsvd_tbl8s += 6
+ * DEL /48 (leaf /96 still covers 32, 40) rsvd_tbl8s -= 0
+ * DEL /28 (only level 24 was solely /28's) rsvd_tbl8s -= 0
+ * DEL /96 (last route gone, all freed) rsvd_tbl8s -= 9
+ *
+ * Boundaries get refunded only on the DEL that makes them empty;
+ * intermediate DELs that leave a covering descendant are refund-free.
+ */
+static int32_t
+test_drift_multilevel(void)
+{
+ struct rte_fib6_conf config = { 0 };
+ struct rte_fib6 *fib;
+ struct rte_ipv6_addr grand = RTE_IPV6(0xfcde, 0, 0, 0, 0, 0, 0, 0);
+ struct rte_ipv6_addr mid = RTE_IPV6(0xfcde, 0, 0x6000, 0, 0, 0, 0, 0);
+ struct rte_ipv6_addr leaf = RTE_IPV6(0xfcde, 0, 0x6000, 0, 0, 0x4000, 0, 0);
+ int ret;
+
+ config.max_routes = 1024;
+ config.rib_ext_sz = 0;
+ config.default_nh = 0;
+ config.type = RTE_FIB6_TRIE;
+ config.trie.nh_sz = RTE_FIB6_TRIE_2B;
+ config.trie.num_tbl8 = 256;
+
+ fib = rte_fib6_create(__func__, SOCKET_ID_ANY, &config);
+ RTE_TEST_ASSERT(fib != NULL, "Failed to create FIB\n");
+
+ ret = rte_fib6_add(fib, &grand, 28, 1);
+ RTE_TEST_ASSERT(ret == 0, "ADD /28 failed\n");
+ ret = rte_fib6_add(fib, &mid, 48, 1); /* compressed under /28 */
+ RTE_TEST_ASSERT(ret == 0, "ADD /48 failed\n");
+ ret = rte_fib6_add(fib, &leaf, 96, 2); /* non-compressed under /48 */
+ RTE_TEST_ASSERT(ret == 0, "ADD /96 failed\n");
+
+ /* DEL the middle prefix: byte-boundary accounting must stay
+ * coherent so the subsequent operations succeed.
+ */
+ ret = rte_fib6_delete(fib, &mid, 48);
+ RTE_TEST_ASSERT(ret == 0, "DEL /48 failed\n");
+
+ ret = rte_fib6_delete(fib, &grand, 28);
+ RTE_TEST_ASSERT(ret == 0, "DEL /28 failed\n");
+ ret = rte_fib6_delete(fib, &leaf, 96);
+ RTE_TEST_ASSERT(ret == 0, "DEL /96 failed\n");
+
+ rte_fib6_free(fib);
+ return TEST_SUCCESS;
+}
+
+/*
+ * Pseudo-random ADD/DEL sequence over 8 prefixes with varying depths
+ * and next-hops. A hand-rolled LCG (not rte_rand) makes the sequence
+ * reproducible across runs and DPDK versions. After all prefixes are
+ * removed, a final ADD/DEL pair must succeed - it would fail under a
+ * leaked rsvd_tbl8s.
+ *
+ * depths[1] and depths[6] both use /36 on purpose: ips[1] and ips[6]
+ * are distinct prefixes, so this exercises two parallel /36 ADD/DEL
+ * paths that share byte boundaries 24 and 32.
+ */
+static int32_t
+test_drift_stress(void)
+{
+ uint8_t depths[8] = { 28, 36, 40, 48, 64, 80, 36, 128 };
+ struct rte_fib6_conf config = { 0 };
+ struct rte_ipv6_addr ips[8] = {
+ RTE_IPV6(0xfcde, 0, 0, 0, 0, 0, 0, 0),
+ RTE_IPV6(0xfcde, 0x1, 0, 0, 0, 0, 0, 0),
+ RTE_IPV6(0xfcde, 0x2, 0, 0, 0, 0, 0, 0),
+ RTE_IPV6(0xfcde, 0x2, 0x4000, 0, 0, 0, 0, 0),
+ RTE_IPV6(0xfcde, 0x2, 0x4000, 0x1000, 0, 0, 0, 0),
+ RTE_IPV6(0xfcde, 0x2, 0x4000, 0x1000, 0x1, 0, 0, 0),
+ RTE_IPV6(0xfcde, 0x3, 0, 0, 0, 0, 0, 0),
+ RTE_IPV6(0xfcde, 0x3, 0, 0, 0, 0, 0, 0x1),
+ };
+ uint8_t live[8] = { 0 };
+ struct rte_fib6 *fib;
+ uint32_t seed = 0x4242;
+ unsigned int i, idx;
+ int ret;
+
+ config.max_routes = 64;
+ config.rib_ext_sz = 0;
+ config.default_nh = 0;
+ config.type = RTE_FIB6_TRIE;
+ config.trie.nh_sz = RTE_FIB6_TRIE_2B;
+ config.trie.num_tbl8 = 256;
+
+ fib = rte_fib6_create(__func__, SOCKET_ID_ANY, &config);
+ RTE_TEST_ASSERT(fib != NULL, "Failed to create FIB\n");
+
+ for (i = 0; i < 2000; i++) {
+ seed = seed * 1103515245u + 12345u;
+ idx = (seed >> 8) & 7;
+ if (live[idx]) {
+ ret = rte_fib6_delete(fib, &ips[idx], depths[idx]);
+ RTE_TEST_ASSERT(ret == 0,
+ "DEL idx %u (depth /%u) failed (ret=%d)\n",
+ idx, depths[idx], ret);
+ live[idx] = 0;
+ } else {
+ uint64_t nh = ((seed >> 16) & 0xff) + 1;
+ ret = rte_fib6_add(fib, &ips[idx], depths[idx], nh);
+ RTE_TEST_ASSERT(ret == 0,
+ "ADD idx %u (depth /%u nh=%" PRIu64 ") failed (ret=%d)\n",
+ idx, depths[idx], nh, ret);
+ live[idx] = 1;
+ }
+ }
+
+ /* Drain everything */
+ for (i = 0; i < RTE_DIM(live); i++) {
+ if (live[i]) {
+ ret = rte_fib6_delete(fib, &ips[i], depths[i]);
+ RTE_TEST_ASSERT(ret == 0,
+ "final drain DEL idx %u failed (ret=%d)\n",
+ i, ret);
+ }
+ }
+
+ /* If rsvd_tbl8s had leaked, this fresh ADD would fail */
+ ret = rte_fib6_add(fib, &ips[0], depths[0], 0xff);
+ RTE_TEST_ASSERT(ret == 0,
+ "post-drain ADD failed (rsvd leaked?) (ret=%d)\n", ret);
+ ret = rte_fib6_delete(fib, &ips[0], depths[0]);
+ RTE_TEST_ASSERT(ret == 0, "post-drain DEL failed\n");
+
+ rte_fib6_free(fib);
+ return TEST_SUCCESS;
+}
+
+/* Tight-pool re-compression scenario. Pool sized to exactly the
+ * highest legitimate envelope: an ADD that becomes a closer ancestor
+ * of an existing descendant must succeed because the byte-boundary
+ * supernet accounting reports the same envelope post-operation.
+ *
+ * num_tbl8 = 3
+ * ADD /28 nh=1 rsvd = 1
+ * ADD /48 nh=1 (compr.) rsvd = 3 (/48 reserves 2 new boundaries)
+ * DEL /28 rsvd unchanged (/48 still holds them)
+ * RE-ADD /28 nh=1 rsvd unchanged (already reserved)
+ * (pre-fix: pre-check rejects)
+ */
+static int32_t
+test_drift_tight_pool(void)
+{
+ struct rte_fib6_conf config = { 0 };
+ struct rte_fib6 *fib;
+ struct rte_ipv6_addr parent = RTE_IPV6(0xfcde, 0, 0, 0, 0, 0, 0, 0);
+ struct rte_ipv6_addr child = RTE_IPV6(0xfcde, 0, 0x6000, 0, 0, 0, 0, 0);
+ int ret;
+
+ config.max_routes = 16;
+ config.rib_ext_sz = 0;
+ config.default_nh = 0;
+ config.type = RTE_FIB6_TRIE;
+ config.trie.nh_sz = RTE_FIB6_TRIE_2B;
+ config.trie.num_tbl8 = 3;
+
+ fib = rte_fib6_create(__func__, SOCKET_ID_ANY, &config);
+ RTE_TEST_ASSERT(fib != NULL, "Failed to create FIB\n");
+
+ ret = rte_fib6_add(fib, &parent, 28, 1);
+ RTE_TEST_ASSERT(ret == 0, "ADD /28 failed (ret=%d)\n", ret);
+ ret = rte_fib6_add(fib, &child, 48, 1);
+ RTE_TEST_ASSERT(ret == 0, "ADD /48 failed (ret=%d)\n", ret);
+ ret = rte_fib6_delete(fib, &parent, 28);
+ RTE_TEST_ASSERT(ret == 0, "DEL /28 failed (ret=%d)\n", ret);
+
+ /* Re-add /28: byte boundary 24 is already occupied by the /48,
+ * so the re-added /28 introduces no new reservation. The
+ * envelope stays at 3 and still fits the pool of 3.
+ */
+ ret = rte_fib6_add(fib, &parent, 28, 1);
+ RTE_TEST_ASSERT(ret == 0,
+ "Re-ADD /28 spuriously failed (ret=%d)\n", ret);
+
+ ret = rte_fib6_delete(fib, &child, 48);
+ RTE_TEST_ASSERT(ret == 0, "DEL /48 failed (ret=%d)\n", ret);
+ ret = rte_fib6_delete(fib, &parent, 28);
+ RTE_TEST_ASSERT(ret == 0, "Final DEL /28 failed (ret=%d)\n", ret);
+
+ rte_fib6_free(fib);
+ return TEST_SUCCESS;
+}
+
static struct unit_test_suite fib6_fast_tests = {
.suite_name = "fib6 autotest",
.setup = NULL,
@@ -685,6 +942,10 @@ static struct unit_test_suite fib6_fast_tests = {
TEST_CASE(test_invalid_rcu),
TEST_CASE(test_fib_rcu_sync_rw),
TEST_CASE(test_drift),
+ TEST_CASE(test_drift_compression),
+ TEST_CASE(test_drift_multilevel),
+ TEST_CASE(test_drift_stress),
+ TEST_CASE(test_drift_tight_pool),
TEST_CASES_END()
}
};
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* [PATCH v1 4/5] rib: track valid descendant count per node
2026-05-22 14:58 ` [PATCH v1 0/5] fib6: fix tbl8 reservation drift Maxime Leroy
` (2 preceding siblings ...)
2026-05-22 14:58 ` [PATCH v1 3/5] test/fib6: extended drift test cases Maxime Leroy
@ 2026-05-22 14:58 ` Maxime Leroy
2026-05-22 14:58 ` [PATCH v1 5/5] fib6: speed up tbl8 reservation accounting Maxime Leroy
2026-06-05 13:04 ` [PATCH v1 0/5] fib6: fix tbl8 reservation drift Medvedkin, Vladimir
5 siblings, 0 replies; 14+ messages in thread
From: Maxime Leroy @ 2026-05-22 14:58 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: dev, stable, Maxime Leroy
Add a uint32_t field to struct rte_rib6_node that counts valid prefixes
in the node's subtree (excluding self). Maintained incrementally on
insert (when a node becomes valid) and remove (when a node becomes
invalid), in O(tree depth) per operation.
The field fits in the padding before ext[]; the node size and ext[]
offset are unchanged. uint32_t is required because a single subtree
may hold more than 65535 BGP routes.
Expose an __rte_internal helper rte_rib6_count_empty_supernets()
via a new private header lib/rib/rib6_internal.h that descends the
tree once and reports how many byte boundaries below a given prefix
have no valid descendant. lib/fib uses this to maintain tbl8
reservation accounting in a single tree walk instead of one
rte_rib6_get_nxt() call per byte boundary.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
app/test/test_rib6.c | 92 +++++++++++++++++++++++++++++++++++++++++
lib/rib/rib6_internal.h | 37 +++++++++++++++++
lib/rib/rte_rib6.c | 80 +++++++++++++++++++++++++++++++++++
3 files changed, 209 insertions(+)
create mode 100644 lib/rib/rib6_internal.h
diff --git a/app/test/test_rib6.c b/app/test/test_rib6.c
index 0295a9640c..a1a6ab17f4 100644
--- a/app/test/test_rib6.c
+++ b/app/test/test_rib6.c
@@ -8,6 +8,7 @@
#include <stdlib.h>
#include <rte_ip6.h>
#include <rte_rib6.h>
+#include <rib6_internal.h>
#include "test.h"
@@ -20,6 +21,7 @@ static int32_t test_insert_invalid(void);
static int32_t test_get_fn(void);
static int32_t test_basic(void);
static int32_t test_tree_traversal(void);
+static int32_t test_empty_supernets(void);
#define MAX_DEPTH 128
#define MAX_RULES (1 << 22)
@@ -322,6 +324,95 @@ test_tree_traversal(void)
return TEST_SUCCESS;
}
+/*
+ * Exercise rte_rib6_count_empty_supernets() which depends on the
+ * valid_descendants counter maintained on insert/remove.
+ */
+int32_t
+test_empty_supernets(void)
+{
+ struct rte_rib6 *rib = NULL;
+ struct rte_rib6_conf config;
+ struct rte_ipv6_addr ip = RTE_IPV6(0xfcde, 0, 0, 0, 0, 0, 0, 0);
+ struct rte_ipv6_addr sibling = RTE_IPV6(0xfcde, 0, 0, 0, 0, 0, 0, 1);
+ uint8_t cnt;
+
+ config.max_nodes = 64;
+ config.ext_sz = 0;
+
+ rib = rte_rib6_create(__func__, SOCKET_ID_ANY, &config);
+ RTE_TEST_ASSERT(rib != NULL, "Failed to create RIB\n");
+
+ /* depth <= 24: no byte boundaries above to inspect. */
+ cnt = rte_rib6_count_empty_supernets(rib, &ip, 24);
+ RTE_TEST_ASSERT(cnt == 0, "depth 24 must return 0, got %u\n", cnt);
+
+ /* Empty RIB, /128 query: 13 byte boundaries (24..120) all empty. */
+ cnt = rte_rib6_count_empty_supernets(rib, &ip, 128);
+ RTE_TEST_ASSERT(cnt == 13, "empty RIB /128 must return 13, got %u\n", cnt);
+
+ /* Insert a /32 ancestor: level 24 now has a descendant, levels
+ * 32..120 still empty -> 12.
+ */
+ RTE_TEST_ASSERT(rte_rib6_insert(rib, &ip, 32) != NULL,
+ "Failed to insert /32\n");
+ cnt = rte_rib6_count_empty_supernets(rib, &ip, 128);
+ RTE_TEST_ASSERT(cnt == 12, "after /32 ADD: expected 12, got %u\n", cnt);
+
+ /* Insert a /48 below: levels 24, 32 and 40 see /48 as descendant,
+ * 48..120 empty -> 10.
+ */
+ RTE_TEST_ASSERT(rte_rib6_insert(rib, &ip, 48) != NULL,
+ "Failed to insert /48\n");
+ cnt = rte_rib6_count_empty_supernets(rib, &ip, 128);
+ RTE_TEST_ASSERT(cnt == 10, "after /48 ADD: expected 10, got %u\n", cnt);
+
+ /* Insert a sibling /128 that shares a long common prefix with ip
+ * but differs in the last bits. This forces creation of a
+ * common_node and exercises the inherited valid_descendants of
+ * that synthesized intermediate.
+ */
+ RTE_TEST_ASSERT(rte_rib6_insert(rib, &sibling, 128) != NULL,
+ "Failed to insert sibling /128\n");
+ /* sibling shares prefix with ip down to bit 127; for the query on
+ * ip/128, all byte boundaries 24..120 have descendants (the /32,
+ * /48 and the sibling /128 chain) -> 0 empty levels.
+ */
+ cnt = rte_rib6_count_empty_supernets(rib, &ip, 128);
+ RTE_TEST_ASSERT(cnt == 0, "fully populated chain: expected 0, got %u\n",
+ cnt);
+
+ /* Remove the /32 ancestor: /48 and /128 sibling still cover all
+ * byte boundaries 24..120 -> still 0 empty levels.
+ */
+ rte_rib6_remove(rib, &ip, 32);
+ cnt = rte_rib6_count_empty_supernets(rib, &ip, 128);
+ RTE_TEST_ASSERT(cnt == 0,
+ "after /32 DEL still covered: expected 0, got %u\n", cnt);
+
+ /* Remove the /48: only the /128 sibling remains. For an ip/128
+ * query, levels 24..120 each see the sibling as descendant, so
+ * count is still 0.
+ */
+ rte_rib6_remove(rib, &ip, 48);
+ cnt = rte_rib6_count_empty_supernets(rib, &ip, 128);
+ RTE_TEST_ASSERT(cnt == 0,
+ "after /48 DEL still covered: expected 0, got %u\n", cnt);
+
+ /* Remove the sibling /128: RIB now empty again. */
+ rte_rib6_remove(rib, &sibling, 128);
+ cnt = rte_rib6_count_empty_supernets(rib, &ip, 128);
+ RTE_TEST_ASSERT(cnt == 13,
+ "after final DEL: expected 13, got %u\n", cnt);
+
+ /* Invalid input: NULL rib. */
+ cnt = rte_rib6_count_empty_supernets(NULL, &ip, 128);
+ RTE_TEST_ASSERT(cnt == 0, "NULL rib must return 0, got %u\n", cnt);
+
+ rte_rib6_free(rib);
+ return TEST_SUCCESS;
+}
+
static struct unit_test_suite rib6_tests = {
.suite_name = "rib6 autotest",
.setup = NULL,
@@ -333,6 +424,7 @@ static struct unit_test_suite rib6_tests = {
TEST_CASE(test_get_fn),
TEST_CASE(test_basic),
TEST_CASE(test_tree_traversal),
+ TEST_CASE(test_empty_supernets),
TEST_CASES_END()
}
};
diff --git a/lib/rib/rib6_internal.h b/lib/rib/rib6_internal.h
new file mode 100644
index 0000000000..8246626de9
--- /dev/null
+++ b/lib/rib/rib6_internal.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Maxime Leroy, Free Mobile
+ */
+
+#ifndef _RIB6_INTERNAL_H_
+#define _RIB6_INTERNAL_H_
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+#include <rte_ip6.h>
+
+struct rte_rib6;
+
+/**
+ * @internal
+ * Count byte boundaries L in {24, 32, 40, ..., RTE_ALIGN_CEIL(depth, 8) - 8}
+ * for which the supernet of ip at level L has no valid descendant with
+ * depth > L. Used by lib/fib to maintain tbl8 reservation accounting in
+ * a single descent of the binary tree.
+ *
+ * @param rib
+ * RIB object handle
+ * @param ip
+ * IPv6 prefix address
+ * @param depth
+ * prefix length
+ * @return
+ * number of empty byte boundaries (0 if all levels have descendants
+ * or depth <= 24)
+ */
+__rte_internal
+uint8_t
+rte_rib6_count_empty_supernets(struct rte_rib6 *rib,
+ const struct rte_ipv6_addr *ip, uint8_t depth);
+
+#endif /* _RIB6_INTERNAL_H_ */
diff --git a/lib/rib/rte_rib6.c b/lib/rib/rte_rib6.c
index ec8ff68e87..ae8aba6563 100644
--- a/lib/rib/rte_rib6.c
+++ b/lib/rib/rte_rib6.c
@@ -18,6 +18,7 @@
#include <rte_rib6.h>
+#include "rib6_internal.h"
#include "rib_log.h"
#define RTE_RIB_VALID_NODE 1
@@ -36,6 +37,7 @@ struct rte_rib6_node {
struct rte_rib6_node *parent;
uint64_t nh;
struct rte_ipv6_addr ip;
+ uint32_t valid_descendants;
uint8_t depth;
uint8_t flag;
uint64_t ext[];
@@ -104,10 +106,35 @@ node_alloc(struct rte_rib6 *rib)
ret = rte_mempool_get(rib->node_pool, (void *)&ent);
if (unlikely(ret != 0))
return NULL;
+ ent->valid_descendants = 0;
++rib->cur_nodes;
return ent;
}
+/* Increment valid_descendants along the parent chain when node becomes valid. */
+static inline void
+inc_valid_descendants(struct rte_rib6_node *node)
+{
+ struct rte_rib6_node *p = node->parent;
+
+ while (p != NULL) {
+ p->valid_descendants++;
+ p = p->parent;
+ }
+}
+
+/* Decrement valid_descendants along the parent chain when node becomes invalid. */
+static inline void
+dec_valid_descendants(struct rte_rib6_node *node)
+{
+ struct rte_rib6_node *p = node->parent;
+
+ while (p != NULL) {
+ p->valid_descendants--;
+ p = p->parent;
+ }
+}
+
static void
node_free(struct rte_rib6 *rib, struct rte_rib6_node *ent)
{
@@ -250,6 +277,7 @@ rte_rib6_remove(struct rte_rib6 *rib,
--rib->cur_routes;
cur->flag &= ~RTE_RIB_VALID_NODE;
+ dec_valid_descendants(cur);
while (!is_valid_node(cur)) {
if ((cur->left != NULL) && (cur->right != NULL))
return;
@@ -320,6 +348,7 @@ rte_rib6_insert(struct rte_rib6 *rib,
*tmp = new_node;
new_node->parent = prev;
++rib->cur_routes;
+ inc_valid_descendants(new_node);
return *tmp;
}
/*
@@ -332,6 +361,7 @@ rte_rib6_insert(struct rte_rib6 *rib,
node_free(rib, new_node);
(*tmp)->flag |= RTE_RIB_VALID_NODE;
++rib->cur_routes;
+ inc_valid_descendants(*tmp);
return *tmp;
}
@@ -371,6 +401,9 @@ rte_rib6_insert(struct rte_rib6 *rib,
new_node->left = *tmp;
new_node->parent = (*tmp)->parent;
(*tmp)->parent = new_node;
+ /* new_node inherits *tmp's subtree */
+ new_node->valid_descendants = (is_valid_node(*tmp) ? 1 : 0) +
+ (*tmp)->valid_descendants;
*tmp = new_node;
} else {
/* create intermediate node */
@@ -386,6 +419,11 @@ rte_rib6_insert(struct rte_rib6 *rib,
common_node->parent = (*tmp)->parent;
new_node->parent = common_node;
(*tmp)->parent = common_node;
+ /* common_node inherits *tmp's subtree (new_node will be
+ * counted by inc_valid_descendants below).
+ */
+ common_node->valid_descendants = (is_valid_node(*tmp) ? 1 : 0) +
+ (*tmp)->valid_descendants;
if (get_dir(&(*tmp)->ip, common_depth) == 1) {
common_node->left = new_node;
common_node->right = *tmp;
@@ -396,6 +434,7 @@ rte_rib6_insert(struct rte_rib6 *rib,
*tmp = common_node;
}
++rib->cur_routes;
+ inc_valid_descendants(new_node);
return new_node;
}
@@ -606,3 +645,44 @@ rte_rib6_free(struct rte_rib6 *rib)
rte_free(rib);
rte_free(te);
}
+
+RTE_EXPORT_INTERNAL_SYMBOL(rte_rib6_count_empty_supernets)
+uint8_t
+rte_rib6_count_empty_supernets(struct rte_rib6 *rib,
+ const struct rte_ipv6_addr *ip, uint8_t depth)
+{
+ uint8_t top = RTE_ALIGN_CEIL(depth, 8);
+ struct rte_rib6_node *cur;
+ bool has_descendant;
+ uint8_t level;
+
+ if (unlikely(rib == NULL || ip == NULL || depth > RTE_IPV6_MAX_DEPTH))
+ return 0;
+ if (depth <= 24)
+ return 0;
+
+ cur = rib->tree;
+
+ /* Single descent through the binary tree, checking each byte
+ * boundary on the way. NULL at level L propagates upward, so we
+ * stop at the first empty supernet and tally the remaining levels.
+ */
+ for (level = 24; level < top; level += 8) {
+ while (cur != NULL && cur->depth < level)
+ cur = get_nxt_node(cur, ip);
+
+ if (cur == NULL || !rte_ipv6_addr_eq_prefix(&cur->ip, ip, level))
+ return (top - level) >> 3;
+
+ if (cur->depth > level)
+ has_descendant = is_valid_node(cur) ||
+ cur->valid_descendants > 0;
+ else
+ has_descendant = cur->valid_descendants > 0;
+
+ if (!has_descendant)
+ return (top - level) >> 3;
+ }
+ return 0;
+}
+
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* [PATCH v1 5/5] fib6: speed up tbl8 reservation accounting
2026-05-22 14:58 ` [PATCH v1 0/5] fib6: fix tbl8 reservation drift Maxime Leroy
` (3 preceding siblings ...)
2026-05-22 14:58 ` [PATCH v1 4/5] rib: track valid descendant count per node Maxime Leroy
@ 2026-05-22 14:58 ` Maxime Leroy
2026-06-05 13:04 ` [PATCH v1 0/5] fib6: fix tbl8 reservation drift Medvedkin, Vladimir
5 siblings, 0 replies; 14+ messages in thread
From: Maxime Leroy @ 2026-05-22 14:58 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: dev, stable, Maxime Leroy
Replace the local count_empty_levels() helper, which issued up to 13
rte_rib6_get_nxt() calls each descending the binary tree from root,
with the new __rte_internal rte_rib6_count_empty_supernets(). The
latter descends once and consults the valid_descendants counter
maintained inside rte_rib6, answering all byte boundary questions
in O(tree depth) with O(1) work per boundary.
For a /128 ADD with ancestors at every byte boundary, this reduces
the worst-case query cost from 13 trie descents to 1.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
lib/fib/trie.c | 30 +++---------------------------
1 file changed, 3 insertions(+), 27 deletions(-)
diff --git a/lib/fib/trie.c b/lib/fib/trie.c
index 44b90f72ff..b6ef626fd4 100644
--- a/lib/fib/trie.c
+++ b/lib/fib/trie.c
@@ -13,6 +13,7 @@
#include <rte_rib6.h>
#include <rte_fib6.h>
+#include <rib6_internal.h>
#include "fib_log.h"
#include "trie.h"
@@ -534,31 +535,6 @@ modify_dp(struct rte_trie_tbl *dp, struct rte_rib6 *rib,
return 0;
}
-/*
- * Count byte boundaries between 24 and CEIL(depth, 8) where the
- * supernet of ip has no descendant in the RIB. This is the number of
- * new tbl8 levels an ADD of ip/depth would introduce, or the number
- * to free at DEL once the prefix has been removed from the RIB.
- *
- * A NULL answer at level L propagates upwards: narrower supernets at
- * L+8, L+16, ... are subsets of S_L and cannot contain descendants
- * either. The loop stops at the first NULL and tallies the remaining
- * boundaries in one shot.
- */
-static uint8_t
-count_empty_levels(struct rte_rib6 *rib, const struct rte_ipv6_addr *ip,
- uint8_t depth)
-{
- uint8_t level, top = RTE_ALIGN_CEIL(depth, 8);
-
- for (level = 24; level < top; level += 8) {
- if (rte_rib6_get_nxt(rib, ip, level, NULL,
- RTE_RIB6_GET_NXT_COVER) == NULL)
- return (top - level) >> 3;
- }
- return 0;
-}
-
int
trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
uint8_t depth, uint64_t next_hop, int op)
@@ -596,7 +572,7 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
return 0;
}
- new_levels = count_empty_levels(rib, &ip_masked, depth);
+ new_levels = rte_rib6_count_empty_supernets(rib, &ip_masked, depth);
if (dp->rsvd_tbl8s + new_levels > dp->number_tbl8s)
return -ENOSPC;
@@ -635,7 +611,7 @@ trie_modify(struct rte_fib6 *fib, const struct rte_ipv6_addr *ip,
if (ret != 0)
return ret;
rte_rib6_remove(rib, &ip_masked, depth);
- dp->rsvd_tbl8s -= count_empty_levels(rib, &ip_masked, depth);
+ dp->rsvd_tbl8s -= rte_rib6_count_empty_supernets(rib, &ip_masked, depth);
return 0;
default:
break;
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* Re: [PATCH v1 0/5] fib6: fix tbl8 reservation drift
2026-05-22 14:58 ` [PATCH v1 0/5] fib6: fix tbl8 reservation drift Maxime Leroy
` (4 preceding siblings ...)
2026-05-22 14:58 ` [PATCH v1 5/5] fib6: speed up tbl8 reservation accounting Maxime Leroy
@ 2026-06-05 13:04 ` Medvedkin, Vladimir
5 siblings, 0 replies; 14+ messages in thread
From: Medvedkin, Vladimir @ 2026-06-05 13:04 UTC (permalink / raw)
To: Maxime Leroy; +Cc: dev, stable
Hi Maxime,
Thanks for the patches.
For patches 2 and 3 introducing unit tests
Acked-by: Vladimir Medvedkin <vladimir.medvedkin@intel.com>
For patch 1, 4 and 5,
I think both implementations are over-complicated.
I suggest more optimal implementation for the count_empty_levels() (from
the patch 1/5):
here is a draft based on your 1/5:
https://patches.dpdk.org/project/dpdk/patch/20260605130317.896413-1-vladimir.medvedkin@intel.com/
This implementation can be backported. The function logically belongs to
trie.c, since it reflects its specifics, and not to the RIB library as
it's more generic.
You may take this patch and integrate it into v2 replacing your 1/5.
On 5/22/2026 3:58 PM, Maxime Leroy wrote:
> This v1 supersedes the earlier RFC. The RFC dropped rsvd_tbl8s and
> used tbl8_pool_pos in the pre-check, which loses the worst-case
> envelope: a compressed /48 under a /28 allocates zero tbl8s but must
> reserve the boundaries the /48 would need if the /28 is later
> removed (DEL forces mid-flight decompression in modify_dp() with no
> rollback).
>
> This v1 keeps rsvd_tbl8s and computes it the way dir24_8 already
> does for IPv4. dir24_8 counts /24 supernets that contain at least
> one /25..32 prefix: that count is invariant under unrelated RIB
> changes, so the counter cannot drift. trie6 has the same need at
> 13 levels instead of 1 (byte boundaries 24, 32, ..., 120), so v1
> counts, for each L in that set, the /L supernets containing at
> least one prefix with depth > L. ADD/DEL pairs are symmetric by
> construction.
>
> Patch 1 is the minimal self-contained fix (Fixes: + Cc: stable).
> Patches 2-3 add the reproducer and extended regression tests.
> Patches 4-5 are an optimization (not for stable): valid_descendants
> in rte_rib6 + single-descent helper, so trie_modify() walks once
> instead of up to 13 times per ADD/DEL.
>
> Validated on a live BGP router (grout + FRR, 127 IPv6 prefixes):
> RSVD_TBL8 returned to its pre-cycle value after a zebra-kill /
> reconverge cycle.
>
> Maxime Leroy (5):
> fib6: fix tbl8 reservation drift in trie
> test/fib6: add reproducer for tbl8 reservation drift
> test/fib6: extended drift test cases
> rib: track valid descendant count per node
> fib6: speed up tbl8 reservation accounting
>
> app/test/test_fib6.c | 335 ++++++++++++++++++++++++++++++++++++++++
> app/test/test_rib6.c | 92 +++++++++++
> lib/fib/trie.c | 47 +-----
> lib/rib/rib6_internal.h | 37 +++++
> lib/rib/rte_rib6.c | 80 ++++++++++
> 5 files changed, 552 insertions(+), 39 deletions(-)
> create mode 100644 lib/rib/rib6_internal.h
>
> ---
> v1:
> * Keep rsvd_tbl8s; recompute it via topology-stable empty-supernet
> count (dir24_8 pattern at 13 levels) instead of RIB-derived
> depth_diff.
> * Drop RFC patch 3/3 (no longer needed).
> * Add extended regression tests.
> * Add patches 4-5: RIB valid_descendants + single-descent helper
> (optional perf optimization; not for stable).
> * Production-validated on a live BGP router.
>
> --
> 2.43.0
--
Regards,
Vladimir
^ permalink raw reply [flat|nested] 14+ messages in thread