* [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal
@ 2025-11-18 11:16 Florian Westphal
2025-11-18 11:16 ` [PATCH nf-next 1/3] netfilter: nft_set_rbtree: prepare for two rbtrees Florian Westphal
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Florian Westphal @ 2025-11-18 11:16 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
This series fixes false negative lookup bug in rbtree set backend that
can occur during transaction.
First two patches prepare for actual fix, which is coming in last patch.
All inserts/removals will now occur in a cloned copy, so packetpath can
no longer observe the problematic mixed-bag of old, current and new
elements.
The live tree will only have reachable elements that are active in the
current generation or were active in the previous generation (but are still
valid while packetpath holds rcu read lock). The latter case is only
temporary, as new lookups already observe the updated tree).
Florian Westphal (3):
netfilter: nft_set_rbtree: prepare for two rbtrees
netfilter: nft_set_rbtree: factor out insert helper
netfilter: nft_set_rbtree: do not modifiy live tree
net/netfilter/nft_set_rbtree.c | 279 +++++++++++++++++++++++++--------
1 file changed, 211 insertions(+), 68 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH nf-next 1/3] netfilter: nft_set_rbtree: prepare for two rbtrees 2025-11-18 11:16 [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Florian Westphal @ 2025-11-18 11:16 ` Florian Westphal 2025-11-18 11:16 ` [PATCH nf-next 2/3] netfilter: nft_set_rbtree: factor out insert helper Florian Westphal ` (2 subsequent siblings) 3 siblings, 0 replies; 14+ messages in thread From: Florian Westphal @ 2025-11-18 11:16 UTC (permalink / raw) To: netfilter-devel; +Cc: Florian Westphal Preparation patch that has no intended side effects. 1. add a second rb_root to the set 2. allow elements to be part of two trees One tree will be the live copy used for matching from packet path. Other tree will be used for deletions and insertions from control plane. As-is, new elements are exposed to the packetpath. Once the commit phase increments the gencursor (but before it erased old elements from the tree), there is a short time period where such now inactive-and-about-to-be-erased elements shadow new, valid entries. This situation only exists for a very short time window, but it is unwanted. If the tree stores ranges a-b (being removed) and a-c (getting added), then a lookup for a key that resides in the a-b or a-c interval must find the former or the latter range. To resolve this, followup patches will do all modifications in a hidden tree, not exposed to packet path: Lookups either observe still-valid a-b in the old tree or the already-valid a-c range in the replaced tree. Signed-off-by: Florian Westphal <fw@strlen.de> --- net/netfilter/nft_set_rbtree.c | 105 +++++++++++++++++++++------------ 1 file changed, 66 insertions(+), 39 deletions(-) diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index ca594161b840..3d4db2b93d6a 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -16,7 +16,7 @@ #include <net/netfilter/nf_tables_core.h> struct nft_rbtree { - struct rb_root root; + struct rb_root root[2]; rwlock_t lock; seqcount_rwlock_t count; unsigned long last_gc; @@ -24,7 +24,7 @@ struct nft_rbtree { struct nft_rbtree_elem { struct nft_elem_priv priv; - struct rb_node node; + struct rb_node node[2]; struct nft_set_ext ext; }; @@ -60,14 +60,15 @@ __nft_rbtree_lookup(const struct net *net, const struct nft_set *set, const struct nft_rbtree_elem *rbe, *interval = NULL; u8 genmask = nft_genmask_cur(net); const struct rb_node *parent; + u8 genbit = 0; int d; - parent = rcu_dereference_raw(priv->root.rb_node); + parent = rcu_dereference_raw(priv->root[genbit].rb_node); while (parent != NULL) { if (read_seqcount_retry(&priv->count, seq)) return NULL; - rbe = rb_entry(parent, struct nft_rbtree_elem, node); + rbe = rb_entry(parent, struct nft_rbtree_elem, node[genbit]); d = memcmp(nft_set_ext_key(&rbe->ext), key, set->klen); if (d < 0) { @@ -139,14 +140,15 @@ static bool __nft_rbtree_get(const struct net *net, const struct nft_set *set, struct nft_rbtree *priv = nft_set_priv(set); const struct rb_node *parent; const void *this; + u8 genbit = 0; int d; - parent = rcu_dereference_raw(priv->root.rb_node); + parent = rcu_dereference_raw(priv->root[genbit].rb_node); while (parent != NULL) { if (read_seqcount_retry(&priv->count, seq)) return false; - rbe = rb_entry(parent, struct nft_rbtree_elem, node); + rbe = rb_entry(parent, struct nft_rbtree_elem, node[genbit]); this = nft_set_ext_key(&rbe->ext); d = memcmp(this, key, set->klen); @@ -223,11 +225,12 @@ nft_rbtree_get(const struct net *net, const struct nft_set *set, static void nft_rbtree_gc_elem_remove(struct net *net, struct nft_set *set, struct nft_rbtree *priv, - struct nft_rbtree_elem *rbe) + struct nft_rbtree_elem *rbe, + u8 genbit) { lockdep_assert_held_write(&priv->lock); nft_setelem_data_deactivate(net, set, &rbe->priv); - rb_erase(&rbe->node, &priv->root); + rb_erase(&rbe->node[genbit], &priv->root[genbit]); } static const struct nft_rbtree_elem * @@ -235,10 +238,13 @@ nft_rbtree_gc_elem(const struct nft_set *__set, struct nft_rbtree *priv, struct nft_rbtree_elem *rbe) { struct nft_set *set = (struct nft_set *)__set; - struct rb_node *prev = rb_prev(&rbe->node); struct net *net = read_pnet(&set->net); struct nft_rbtree_elem *rbe_prev; struct nft_trans_gc *gc; + struct rb_node *prev; + u8 genbit = 0; + + prev = rb_prev(&rbe->node[genbit]); gc = nft_trans_gc_alloc(set, 0, GFP_ATOMIC); if (!gc) @@ -249,7 +255,7 @@ nft_rbtree_gc_elem(const struct nft_set *__set, struct nft_rbtree *priv, * are coupled with the interval start element. */ while (prev) { - rbe_prev = rb_entry(prev, struct nft_rbtree_elem, node); + rbe_prev = rb_entry(prev, struct nft_rbtree_elem, node[genbit]); if (nft_rbtree_interval_end(rbe_prev) && nft_set_elem_active(&rbe_prev->ext, NFT_GENMASK_ANY)) break; @@ -259,8 +265,8 @@ nft_rbtree_gc_elem(const struct nft_set *__set, struct nft_rbtree *priv, rbe_prev = NULL; if (prev) { - rbe_prev = rb_entry(prev, struct nft_rbtree_elem, node); - nft_rbtree_gc_elem_remove(net, set, priv, rbe_prev); + rbe_prev = rb_entry(prev, struct nft_rbtree_elem, node[genbit]); + nft_rbtree_gc_elem_remove(net, set, priv, rbe_prev, genbit); /* There is always room in this trans gc for this element, * memory allocation never actually happens, hence, the warning @@ -274,7 +280,7 @@ nft_rbtree_gc_elem(const struct nft_set *__set, struct nft_rbtree *priv, nft_trans_gc_elem_add(gc, rbe_prev); } - nft_rbtree_gc_elem_remove(net, set, priv, rbe); + nft_rbtree_gc_elem_remove(net, set, priv, rbe, genbit); gc = nft_trans_gc_queue_sync(gc, GFP_ATOMIC); if (WARN_ON_ONCE(!gc)) return ERR_PTR(-ENOMEM); @@ -291,8 +297,9 @@ static bool nft_rbtree_update_first(const struct nft_set *set, struct rb_node *first) { struct nft_rbtree_elem *first_elem; + u8 genbit = 0; - first_elem = rb_entry(first, struct nft_rbtree_elem, node); + first_elem = rb_entry(first, struct nft_rbtree_elem, node[genbit]); /* this element is closest to where the new element is to be inserted: * update the first element for the node list path. */ @@ -312,6 +319,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, u8 cur_genmask = nft_genmask_cur(net); u8 genmask = nft_genmask_next(net); u64 tstamp = nft_net_tstamp(net); + u8 genbit = 0; int d; /* Descend the tree to search for an existing element greater than the @@ -319,10 +327,10 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, * first element to walk the ordered elements to find possible overlap. */ parent = NULL; - p = &priv->root.rb_node; + p = &priv->root[genbit].rb_node; while (*p != NULL) { parent = *p; - rbe = rb_entry(parent, struct nft_rbtree_elem, node); + rbe = rb_entry(parent, struct nft_rbtree_elem, node[genbit]); d = nft_rbtree_cmp(set, rbe, new); if (d < 0) { @@ -330,7 +338,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, } else if (d > 0) { if (!first || nft_rbtree_update_first(set, rbe, first)) - first = &rbe->node; + first = &rbe->node[genbit]; p = &parent->rb_right; } else { @@ -342,7 +350,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, } if (!first) - first = rb_first(&priv->root); + first = rb_first(&priv->root[genbit]); /* Detect overlap by going through the list of valid tree nodes. * Values stored in the tree are in reversed order, starting from @@ -351,7 +359,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, for (node = first; node != NULL; node = next) { next = rb_next(node); - rbe = rb_entry(node, struct nft_rbtree_elem, node); + rbe = rb_entry(node, struct nft_rbtree_elem, node[genbit]); if (!nft_set_elem_active(&rbe->ext, genmask)) continue; @@ -460,10 +468,10 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, /* Accepted element: pick insertion point depending on key value */ parent = NULL; - p = &priv->root.rb_node; + p = &priv->root[genbit].rb_node; while (*p != NULL) { parent = *p; - rbe = rb_entry(parent, struct nft_rbtree_elem, node); + rbe = rb_entry(parent, struct nft_rbtree_elem, node[genbit]); d = nft_rbtree_cmp(set, rbe, new); if (d < 0) @@ -476,11 +484,15 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, p = &parent->rb_right; } - rb_link_node_rcu(&new->node, parent, p); - rb_insert_color(&new->node, &priv->root); + rb_link_node_rcu(&new->node[genbit], parent, p); + rb_insert_color(&new->node[genbit], &priv->root[genbit]); return 0; } +static void nft_rbtree_maybe_clone(const struct net *net, const struct nft_set *set) +{ +} + static int nft_rbtree_insert(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem, struct nft_elem_priv **elem_priv) @@ -489,6 +501,8 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set, struct nft_rbtree *priv = nft_set_priv(set); int err; + nft_rbtree_maybe_clone(net, set); + do { if (fatal_signal_pending(current)) return -EINTR; @@ -507,9 +521,11 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set, static void nft_rbtree_erase(struct nft_rbtree *priv, struct nft_rbtree_elem *rbe) { + u8 genbit = 0; + write_lock_bh(&priv->lock); write_seqcount_begin(&priv->count); - rb_erase(&rbe->node, &priv->root); + rb_erase(&rbe->node[genbit], &priv->root[genbit]); write_seqcount_end(&priv->count); write_unlock_bh(&priv->lock); } @@ -548,13 +564,16 @@ nft_rbtree_deactivate(const struct net *net, const struct nft_set *set, { struct nft_rbtree_elem *rbe, *this = nft_elem_priv_cast(elem->priv); const struct nft_rbtree *priv = nft_set_priv(set); - const struct rb_node *parent = priv->root.rb_node; u8 genmask = nft_genmask_next(net); u64 tstamp = nft_net_tstamp(net); + const struct rb_node *parent; int d; + nft_rbtree_maybe_clone(net, set); + + parent = priv->root[0].rb_node; while (parent != NULL) { - rbe = rb_entry(parent, struct nft_rbtree_elem, node); + rbe = rb_entry(parent, struct nft_rbtree_elem, node[0]); d = memcmp(nft_set_ext_key(&rbe->ext), &elem->key.val, set->klen); @@ -586,14 +605,15 @@ nft_rbtree_deactivate(const struct net *net, const struct nft_set *set, static void nft_rbtree_do_walk(const struct nft_ctx *ctx, struct nft_set *set, - struct nft_set_iter *iter) + struct nft_set_iter *iter, + u8 genbit) { struct nft_rbtree *priv = nft_set_priv(set); struct nft_rbtree_elem *rbe; struct rb_node *node; - for (node = rb_first(&priv->root); node != NULL; node = rb_next(node)) { - rbe = rb_entry(node, struct nft_rbtree_elem, node); + for (node = rb_first(&priv->root[genbit]); node; node = rb_next(node)) { + rbe = rb_entry(node, struct nft_rbtree_elem, node[genbit]); if (iter->count < iter->skip) goto cont; @@ -611,15 +631,18 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx, struct nft_set_iter *iter) { struct nft_rbtree *priv = nft_set_priv(set); + struct net *net = read_pnet(&set->net); switch (iter->type) { case NFT_ITER_UPDATE: lockdep_assert_held(&nft_pernet(ctx->net)->commit_mutex); - nft_rbtree_do_walk(ctx, set, iter); + + nft_rbtree_maybe_clone(net, set); + nft_rbtree_do_walk(ctx, set, iter, 0); break; case NFT_ITER_READ: read_lock_bh(&priv->lock); - nft_rbtree_do_walk(ctx, set, iter); + nft_rbtree_do_walk(ctx, set, iter, 0); read_unlock_bh(&priv->lock); break; default: @@ -645,6 +668,7 @@ static void nft_rbtree_gc(struct nft_set *set) u64 tstamp = nft_net_tstamp(net); struct rb_node *node, *next; struct nft_trans_gc *gc; + u8 genbit = 0; set = nft_set_container_of(priv); net = read_pnet(&set->net); @@ -653,10 +677,10 @@ static void nft_rbtree_gc(struct nft_set *set) if (!gc) return; - for (node = rb_first(&priv->root); node ; node = next) { + for (node = rb_first(&priv->root[genbit]); node ; node = next) { next = rb_next(node); - rbe = rb_entry(node, struct nft_rbtree_elem, node); + rbe = rb_entry(node, struct nft_rbtree_elem, node[genbit]); /* elements are reversed in the rbtree for historical reasons, * from highest to lowest value, that is why end element is @@ -715,7 +739,8 @@ static int nft_rbtree_init(const struct nft_set *set, rwlock_init(&priv->lock); seqcount_rwlock_init(&priv->count, &priv->lock); - priv->root = RB_ROOT; + priv->root[0] = RB_ROOT; + priv->root[1] = RB_ROOT; return 0; } @@ -726,10 +751,11 @@ static void nft_rbtree_destroy(const struct nft_ctx *ctx, struct nft_rbtree *priv = nft_set_priv(set); struct nft_rbtree_elem *rbe; struct rb_node *node; + u8 genbit = 0; - while ((node = priv->root.rb_node) != NULL) { - rb_erase(node, &priv->root); - rbe = rb_entry(node, struct nft_rbtree_elem, node); + while ((node = priv->root[genbit].rb_node) != NULL) { + rb_erase(node, &priv->root[genbit]); + rbe = rb_entry(node, struct nft_rbtree_elem, node[genbit]); nf_tables_set_elem_destroy(ctx, set, &rbe->priv); } } @@ -790,12 +816,13 @@ static u32 nft_rbtree_adjust_maxsize(const struct nft_set *set) struct nft_rbtree_elem *rbe; struct rb_node *node; const void *key; + u8 genbit = 0; - node = rb_last(&priv->root); + node = rb_last(&priv->root[genbit]); if (!node) return 0; - rbe = rb_entry(node, struct nft_rbtree_elem, node); + rbe = rb_entry(node, struct nft_rbtree_elem, node[genbit]); if (!nft_rbtree_interval_end(rbe)) return 0; -- 2.51.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH nf-next 2/3] netfilter: nft_set_rbtree: factor out insert helper 2025-11-18 11:16 [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Florian Westphal 2025-11-18 11:16 ` [PATCH nf-next 1/3] netfilter: nft_set_rbtree: prepare for two rbtrees Florian Westphal @ 2025-11-18 11:16 ` Florian Westphal 2025-11-18 11:16 ` [PATCH nf-next 3/3] netfilter: nft_set_rbtree: do not modifiy live tree Florian Westphal 2025-11-18 16:07 ` [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Fernando Fernandez Mancera 3 siblings, 0 replies; 14+ messages in thread From: Florian Westphal @ 2025-11-18 11:16 UTC (permalink / raw) To: netfilter-devel; +Cc: Florian Westphal Before __nft_rbtree_insert() adds a new node it performs overlap checks and refuses to insert such conflicting elements. Upcoming change needs to duplicate the live tree once before the first modifications can be done on the newly cloned tree. Such copy doesn't need any of the duplicate checks as all inserted elements have been validated earlier. As the cloned tree isn't exüosed to other cpus we won't need to hold any extra lock either. So split the final insertion step into its own function so it can be reused in followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> --- net/netfilter/nft_set_rbtree.c | 57 ++++++++++++++++++++++------------ 1 file changed, 38 insertions(+), 19 deletions(-) diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index 3d4db2b93d6a..a420addedc27 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -28,6 +28,11 @@ struct nft_rbtree_elem { struct nft_set_ext ext; }; +static inline u8 nft_rbtree_genbit_copy(const struct nft_rbtree *priv) +{ + return 0; +} + static bool nft_rbtree_interval_end(const struct nft_rbtree_elem *rbe) { return nft_set_ext_exists(&rbe->ext, NFT_SET_EXT_FLAGS) && @@ -309,6 +314,38 @@ static bool nft_rbtree_update_first(const struct nft_set *set, return false; } +/* insert into tree without any timeout or overlap checks. */ +static void __nft_rbtree_insert_do(const struct nft_set *set, + struct nft_rbtree_elem *new) +{ + struct nft_rbtree *priv = nft_set_priv(set); + u8 genbit = nft_rbtree_genbit_copy(priv); + struct rb_node *parent, **p; + int d; + + parent = NULL; + p = &priv->root[genbit].rb_node; + while (*p) { + struct nft_rbtree_elem *rbe; + + parent = *p; + rbe = rb_entry(parent, struct nft_rbtree_elem, node[genbit]); + d = nft_rbtree_cmp(set, rbe, new); + + if (d < 0) + p = &parent->rb_left; + else if (d > 0) + p = &parent->rb_right; + else if (nft_rbtree_interval_end(rbe)) + p = &parent->rb_left; + else + p = &parent->rb_right; + } + + rb_link_node_rcu(&new->node[genbit], parent, p); + rb_insert_color(&new->node[genbit], &priv->root[genbit]); +} + static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, struct nft_rbtree_elem *new, struct nft_elem_priv **elem_priv) @@ -467,25 +504,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, return -ENOTEMPTY; /* Accepted element: pick insertion point depending on key value */ - parent = NULL; - p = &priv->root[genbit].rb_node; - while (*p != NULL) { - parent = *p; - rbe = rb_entry(parent, struct nft_rbtree_elem, node[genbit]); - d = nft_rbtree_cmp(set, rbe, new); - - if (d < 0) - p = &parent->rb_left; - else if (d > 0) - p = &parent->rb_right; - else if (nft_rbtree_interval_end(rbe)) - p = &parent->rb_left; - else - p = &parent->rb_right; - } - - rb_link_node_rcu(&new->node[genbit], parent, p); - rb_insert_color(&new->node[genbit], &priv->root[genbit]); + __nft_rbtree_insert_do(set, new); return 0; } -- 2.51.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH nf-next 3/3] netfilter: nft_set_rbtree: do not modifiy live tree 2025-11-18 11:16 [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Florian Westphal 2025-11-18 11:16 ` [PATCH nf-next 1/3] netfilter: nft_set_rbtree: prepare for two rbtrees Florian Westphal 2025-11-18 11:16 ` [PATCH nf-next 2/3] netfilter: nft_set_rbtree: factor out insert helper Florian Westphal @ 2025-11-18 11:16 ` Florian Westphal 2025-11-19 8:29 ` kernel test robot 2025-11-18 16:07 ` [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Fernando Fernandez Mancera 3 siblings, 1 reply; 14+ messages in thread From: Florian Westphal @ 2025-11-18 11:16 UTC (permalink / raw) To: netfilter-devel; +Cc: Florian Westphal The earlier attempt to resolve the false negative lookups are not sufficient for the rbtree set. Given existing tree with a matching range interval a-b, following race exists: 1. control plane marks a-b for removal. 2. control plane adds a-c as new range. 3. control plane passes all checks, and increments the gencursor. 4. packet path starts a lookup for key that matches range a-b and a-c 5. control plane erases a-b (and other to-be-freed elements). If step 4 has picked up the new gencursor, it may find a-b but ignores it as its marked inactive already, while range a-c might not be found until after step 5 in case its hidden in the wrong subtree. Avoid this by using two trees, one for matching and one for control plane updates. Above sequence changes as follows: In the new step 6 (post removal), tree genbits are swapped so the updated (removed-and-about-to-be-freed elements are not reachable) tree becomes the new live tree. In step 4, packet path can now elide generation check for elements, because newly-added-but-not-yet-valid entries are not reachable from the live tree, so it will either find range a-b (if it picked up the old tree) or a-c (if it picked up the new one). In case it picked up the old tree and the a-b range was already removed by the ongoing transaction, no match is found. But in that case the lookup also observes seqcount mismatch and relookup is done in new tree. GC happens during insertion and right before commit. In both cases we operate on the copied tree. However, we must also erase the entry from the live version since the element will be freed after next grace period, which might happen before the live/copy swap happens. Note that we could rework GC after this change: We could move the gc from commit phase to clone step. This would then avoid the need for expiration check at insert time, because copied tree only has non-expired nodes. That in turn gets rid of the -EAGAIN retry loop in nft_rbtree_insert. Fixes: a60f7bf4a152 ("netfilter: nft_set_rbtree: continue traversal if element is inactive") Signed-off-by: Florian Westphal <fw@strlen.de> --- net/netfilter/nft_set_rbtree.c | 157 ++++++++++++++++++++++++++------- 1 file changed, 127 insertions(+), 30 deletions(-) diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index a420addedc27..a2396cd03f71 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -17,6 +17,8 @@ struct nft_rbtree { struct rb_root root[2]; + u8 genbit; + bool cloned; rwlock_t lock; seqcount_rwlock_t count; unsigned long last_gc; @@ -28,9 +30,14 @@ struct nft_rbtree_elem { struct nft_set_ext ext; }; +static inline u8 nft_rbtree_genbit_live(const struct nft_rbtree *priv) +{ + return READ_ONCE(priv->genbit); +} + static inline u8 nft_rbtree_genbit_copy(const struct nft_rbtree *priv) { - return 0; + return !nft_rbtree_genbit_live(priv); } static bool nft_rbtree_interval_end(const struct nft_rbtree_elem *rbe) @@ -63,11 +70,11 @@ __nft_rbtree_lookup(const struct net *net, const struct nft_set *set, { struct nft_rbtree *priv = nft_set_priv(set); const struct nft_rbtree_elem *rbe, *interval = NULL; - u8 genmask = nft_genmask_cur(net); const struct rb_node *parent; - u8 genbit = 0; + u8 genbit; int d; + genbit = nft_rbtree_genbit_live(priv); parent = rcu_dereference_raw(priv->root[genbit].rb_node); while (parent != NULL) { if (read_seqcount_retry(&priv->count, seq)) @@ -83,17 +90,11 @@ __nft_rbtree_lookup(const struct net *net, const struct nft_set *set, nft_rbtree_interval_end(rbe) && nft_rbtree_interval_start(interval)) continue; - if (nft_set_elem_active(&rbe->ext, genmask) && - !nft_rbtree_elem_expired(rbe)) + if (!nft_rbtree_elem_expired(rbe)) interval = rbe; } else if (d > 0) parent = rcu_dereference_raw(parent->rb_right); else { - if (!nft_set_elem_active(&rbe->ext, genmask)) { - parent = rcu_dereference_raw(parent->rb_left); - continue; - } - if (nft_rbtree_elem_expired(rbe)) return NULL; @@ -145,9 +146,10 @@ static bool __nft_rbtree_get(const struct net *net, const struct nft_set *set, struct nft_rbtree *priv = nft_set_priv(set); const struct rb_node *parent; const void *this; - u8 genbit = 0; + u8 genbit; int d; + genbit = nft_rbtree_genbit_live(priv); parent = rcu_dereference_raw(priv->root[genbit].rb_node); while (parent != NULL) { if (read_seqcount_retry(&priv->count, seq)) @@ -236,18 +238,18 @@ static void nft_rbtree_gc_elem_remove(struct net *net, struct nft_set *set, lockdep_assert_held_write(&priv->lock); nft_setelem_data_deactivate(net, set, &rbe->priv); rb_erase(&rbe->node[genbit], &priv->root[genbit]); + rb_erase(&rbe->node[!genbit], &priv->root[!genbit]); } static const struct nft_rbtree_elem * nft_rbtree_gc_elem(const struct nft_set *__set, struct nft_rbtree *priv, - struct nft_rbtree_elem *rbe) + struct nft_rbtree_elem *rbe, u8 genbit) { struct nft_set *set = (struct nft_set *)__set; struct net *net = read_pnet(&set->net); struct nft_rbtree_elem *rbe_prev; struct nft_trans_gc *gc; struct rb_node *prev; - u8 genbit = 0; prev = rb_prev(&rbe->node[genbit]); @@ -299,10 +301,9 @@ nft_rbtree_gc_elem(const struct nft_set *__set, struct nft_rbtree *priv, static bool nft_rbtree_update_first(const struct nft_set *set, struct nft_rbtree_elem *rbe, - struct rb_node *first) + struct rb_node *first, u8 genbit) { struct nft_rbtree_elem *first_elem; - u8 genbit = 0; first_elem = rb_entry(first, struct nft_rbtree_elem, node[genbit]); /* this element is closest to where the new element is to be inserted: @@ -353,10 +354,10 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, struct nft_rbtree_elem *rbe, *rbe_le = NULL, *rbe_ge = NULL; struct rb_node *node, *next, *parent, **p, *first = NULL; struct nft_rbtree *priv = nft_set_priv(set); + u8 genbit = nft_rbtree_genbit_copy(priv); u8 cur_genmask = nft_genmask_cur(net); u8 genmask = nft_genmask_next(net); u64 tstamp = nft_net_tstamp(net); - u8 genbit = 0; int d; /* Descend the tree to search for an existing element greater than the @@ -374,7 +375,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, p = &parent->rb_left; } else if (d > 0) { if (!first || - nft_rbtree_update_first(set, rbe, first)) + nft_rbtree_update_first(set, rbe, first, genbit)) first = &rbe->node[genbit]; p = &parent->rb_right; @@ -408,7 +409,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, nft_set_elem_active(&rbe->ext, cur_genmask)) { const struct nft_rbtree_elem *removed_end; - removed_end = nft_rbtree_gc_elem(set, priv, rbe); + removed_end = nft_rbtree_gc_elem(set, priv, rbe, genbit); if (IS_ERR(removed_end)) return PTR_ERR(removed_end); @@ -510,6 +511,26 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, static void nft_rbtree_maybe_clone(const struct net *net, const struct nft_set *set) { + struct nft_rbtree *priv = nft_set_priv(set); + u8 genbit = nft_rbtree_genbit_live(priv); + struct nft_rbtree_elem *rbe; + struct rb_node *node, *next; + + lockdep_assert_held_once(&nft_pernet(net)->commit_mutex); + + if (priv->cloned) + return; + + for (node = rb_first(&priv->root[genbit]); node ; node = next) { + next = rb_next(node); + rbe = rb_entry(node, struct nft_rbtree_elem, node[genbit]); + + /* No need to acquire a lock, this is the future tree, not + * exposed to packetpath. + */ + __nft_rbtree_insert_do(set, rbe); + } + priv->cloned = true; } static int nft_rbtree_insert(const struct net *net, const struct nft_set *set, @@ -538,10 +559,8 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set, return err; } -static void nft_rbtree_erase(struct nft_rbtree *priv, struct nft_rbtree_elem *rbe) +static void nft_rbtree_erase(struct nft_rbtree *priv, struct nft_rbtree_elem *rbe, u8 genbit) { - u8 genbit = 0; - write_lock_bh(&priv->lock); write_seqcount_begin(&priv->count); rb_erase(&rbe->node[genbit], &priv->root[genbit]); @@ -555,8 +574,9 @@ static void nft_rbtree_remove(const struct net *net, { struct nft_rbtree_elem *rbe = nft_elem_priv_cast(elem_priv); struct nft_rbtree *priv = nft_set_priv(set); + u8 genbit = nft_rbtree_genbit_copy(priv); - nft_rbtree_erase(priv, rbe); + nft_rbtree_erase(priv, rbe, genbit); } static void nft_rbtree_activate(const struct net *net, @@ -583,6 +603,7 @@ nft_rbtree_deactivate(const struct net *net, const struct nft_set *set, { struct nft_rbtree_elem *rbe, *this = nft_elem_priv_cast(elem->priv); const struct nft_rbtree *priv = nft_set_priv(set); + u8 genbit = nft_rbtree_genbit_copy(priv); u8 genmask = nft_genmask_next(net); u64 tstamp = nft_net_tstamp(net); const struct rb_node *parent; @@ -590,9 +611,9 @@ nft_rbtree_deactivate(const struct net *net, const struct nft_set *set, nft_rbtree_maybe_clone(net, set); - parent = priv->root[0].rb_node; + parent = priv->root[genbit].rb_node; while (parent != NULL) { - rbe = rb_entry(parent, struct nft_rbtree_elem, node[0]); + rbe = rb_entry(parent, struct nft_rbtree_elem, node[genbit]); d = memcmp(nft_set_ext_key(&rbe->ext), &elem->key.val, set->klen); @@ -657,11 +678,13 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx, lockdep_assert_held(&nft_pernet(ctx->net)->commit_mutex); nft_rbtree_maybe_clone(net, set); - nft_rbtree_do_walk(ctx, set, iter, 0); + nft_rbtree_do_walk(ctx, set, iter, + nft_rbtree_genbit_copy(priv)); break; case NFT_ITER_READ: read_lock_bh(&priv->lock); - nft_rbtree_do_walk(ctx, set, iter, 0); + nft_rbtree_do_walk(ctx, set, iter, + nft_rbtree_genbit_live(priv)); read_unlock_bh(&priv->lock); break; default: @@ -675,8 +698,15 @@ static void nft_rbtree_gc_remove(struct net *net, struct nft_set *set, struct nft_rbtree *priv, struct nft_rbtree_elem *rbe) { + u8 genbit = nft_rbtree_genbit_copy(priv); + nft_setelem_data_deactivate(net, set, &rbe->priv); - nft_rbtree_erase(priv, rbe); + + /* first remove from live copy */ + nft_rbtree_erase(priv, rbe, !genbit); + + /* then from private copy */ + rb_erase(&rbe->node[genbit], &priv->root[genbit]); } static void nft_rbtree_gc(struct nft_set *set) @@ -687,7 +717,7 @@ static void nft_rbtree_gc(struct nft_set *set) u64 tstamp = nft_net_tstamp(net); struct rb_node *node, *next; struct nft_trans_gc *gc; - u8 genbit = 0; + u8 genbit; set = nft_set_container_of(priv); net = read_pnet(&set->net); @@ -696,6 +726,7 @@ static void nft_rbtree_gc(struct nft_set *set) if (!gc) return; + genbit = nft_rbtree_genbit_copy(priv); for (node = rb_first(&priv->root[genbit]); node ; node = next) { next = rb_next(node); @@ -760,6 +791,8 @@ static int nft_rbtree_init(const struct nft_set *set, seqcount_rwlock_init(&priv->count, &priv->lock); priv->root[0] = RB_ROOT; priv->root[1] = RB_ROOT; + priv->cloned = false; + priv->genbit = 0; return 0; } @@ -770,7 +803,21 @@ static void nft_rbtree_destroy(const struct nft_ctx *ctx, struct nft_rbtree *priv = nft_set_priv(set); struct nft_rbtree_elem *rbe; struct rb_node *node; - u8 genbit = 0; + u8 genbit; + + if (priv->cloned) { + genbit = nft_rbtree_genbit_copy(priv); + priv->cloned = false; + + /* live version is outdated: + * Still contains elements that have been + * removed already and are queued for freeing. + * Doesn't contain new elements. + */ + } else { + /* no changes? Tear down live version. */ + genbit = nft_rbtree_genbit_live(priv); + } while ((node = priv->root[genbit].rb_node) != NULL) { rb_erase(node, &priv->root[genbit]); @@ -797,12 +844,56 @@ static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features, return true; } +static void nft_rbtree_abort(const struct nft_set *set) +{ + struct nft_rbtree *priv = nft_set_priv(set); + + if (priv->cloned) { + u8 genbit = nft_rbtree_genbit_copy(priv); + + priv->root[genbit] = RB_ROOT; + priv->cloned = false; + } +} + static void nft_rbtree_commit(struct nft_set *set) { struct nft_rbtree *priv = nft_set_priv(set); + u8 genbit; if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set))) nft_rbtree_gc(set); + + priv->cloned = false; + + write_lock_bh(&priv->lock); + write_seqcount_begin(&priv->count); + + genbit = nft_rbtree_genbit_live(priv); + priv->genbit = !genbit; + + /* genbit is now the old generation. priv->cloned as been set to + * false. Future call to nft_rbtree_maybe_clone() will create a + * new on-demand copy of the live version. + * + * Elements new in the committed transaction and all unchanged + * elements are now visible to other CPUs. + * + * Removed elements are now only reachable via their + * DELSETELEM transaction entry, they will be free'd after + * rcu grace period. + * + * A cpu still doing a lookup in the old (genbit) tree will + * either find a match, or, if it did not find a result, will + * obseve the altered sequence count. + * + * In the latter case, it will spin on priv->lock and then performs + * a new lookup in the current tree. + */ + rcu_assign_pointer(priv->root[genbit].rb_node, NULL); + + write_seqcount_end(&priv->count); + write_unlock_bh(&priv->lock); } static void nft_rbtree_gc_init(const struct nft_set *set) @@ -835,7 +926,12 @@ static u32 nft_rbtree_adjust_maxsize(const struct nft_set *set) struct nft_rbtree_elem *rbe; struct rb_node *node; const void *key; - u8 genbit = 0; + u8 genbit; + + if (priv->cloned) + genbit = nft_rbtree_genbit_copy(priv); + else + genbit = nft_rbtree_genbit_live(priv); node = rb_last(&priv->root[genbit]); if (!node) @@ -867,6 +963,7 @@ const struct nft_set_type nft_set_rbtree_type = { .flush = nft_rbtree_flush, .activate = nft_rbtree_activate, .commit = nft_rbtree_commit, + .abort = nft_rbtree_abort, .gc_init = nft_rbtree_gc_init, .lookup = nft_rbtree_lookup, .walk = nft_rbtree_walk, -- 2.51.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 3/3] netfilter: nft_set_rbtree: do not modifiy live tree 2025-11-18 11:16 ` [PATCH nf-next 3/3] netfilter: nft_set_rbtree: do not modifiy live tree Florian Westphal @ 2025-11-19 8:29 ` kernel test robot 2025-11-19 10:48 ` Florian Westphal 0 siblings, 1 reply; 14+ messages in thread From: kernel test robot @ 2025-11-19 8:29 UTC (permalink / raw) To: Florian Westphal, netfilter-devel; +Cc: oe-kbuild-all, Florian Westphal Hi Florian, kernel test robot noticed the following build warnings: [auto build test WARNING on netfilter-nf/main] [also build test WARNING on linus/master v6.18-rc6 next-20251119] [cannot apply to nf-next/master] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Florian-Westphal/netfilter-nft_set_rbtree-prepare-for-two-rbtrees/20251118-191851 base: https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git main patch link: https://lore.kernel.org/r/20251118111657.12003-4-fw%40strlen.de patch subject: [PATCH nf-next 3/3] netfilter: nft_set_rbtree: do not modifiy live tree config: arm64-randconfig-r122-20251119 (https://download.01.org/0day-ci/archive/20251119/202511191544.ss8W56uH-lkp@intel.com/config) compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 0bba1e76581bad04e7d7f09f5115ae5e2989e0d9) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251119/202511191544.ss8W56uH-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202511191544.ss8W56uH-lkp@intel.com/ sparse warnings: (new ones prefixed by >>) >> net/netfilter/nft_set_rbtree.c:893:9: sparse: sparse: incompatible types in comparison expression (different address spaces): net/netfilter/nft_set_rbtree.c:893:9: sparse: struct rb_node [noderef] __rcu * net/netfilter/nft_set_rbtree.c:893:9: sparse: struct rb_node * net/netfilter/nft_set_rbtree.c: note: in included file (through include/linux/mm_types.h, include/linux/page-flags.h, arch/arm64/include/asm/mte.h, ...): include/linux/rbtree.h:74:9: sparse: sparse: incompatible types in comparison expression (different address spaces): include/linux/rbtree.h:74:9: sparse: struct rb_node [noderef] __rcu * include/linux/rbtree.h:74:9: sparse: struct rb_node * vim +893 net/netfilter/nft_set_rbtree.c 858 859 static void nft_rbtree_commit(struct nft_set *set) 860 { 861 struct nft_rbtree *priv = nft_set_priv(set); 862 u8 genbit; 863 864 if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set))) 865 nft_rbtree_gc(set); 866 867 priv->cloned = false; 868 869 write_lock_bh(&priv->lock); 870 write_seqcount_begin(&priv->count); 871 872 genbit = nft_rbtree_genbit_live(priv); 873 priv->genbit = !genbit; 874 875 /* genbit is now the old generation. priv->cloned as been set to 876 * false. Future call to nft_rbtree_maybe_clone() will create a 877 * new on-demand copy of the live version. 878 * 879 * Elements new in the committed transaction and all unchanged 880 * elements are now visible to other CPUs. 881 * 882 * Removed elements are now only reachable via their 883 * DELSETELEM transaction entry, they will be free'd after 884 * rcu grace period. 885 * 886 * A cpu still doing a lookup in the old (genbit) tree will 887 * either find a match, or, if it did not find a result, will 888 * obseve the altered sequence count. 889 * 890 * In the latter case, it will spin on priv->lock and then performs 891 * a new lookup in the current tree. 892 */ > 893 rcu_assign_pointer(priv->root[genbit].rb_node, NULL); 894 895 write_seqcount_end(&priv->count); 896 write_unlock_bh(&priv->lock); 897 } 898 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 3/3] netfilter: nft_set_rbtree: do not modifiy live tree 2025-11-19 8:29 ` kernel test robot @ 2025-11-19 10:48 ` Florian Westphal 0 siblings, 0 replies; 14+ messages in thread From: Florian Westphal @ 2025-11-19 10:48 UTC (permalink / raw) To: kernel test robot; +Cc: netfilter-devel, oe-kbuild-all kernel test robot <lkp@intel.com> wrote: > sparse warnings: (new ones prefixed by >>) > >> net/netfilter/nft_set_rbtree.c:893:9: sparse: sparse: incompatible types in comparison expression (different address spaces): > net/netfilter/nft_set_rbtree.c:893:9: sparse: struct rb_node [noderef] __rcu * > net/netfilter/nft_set_rbtree.c:893:9: sparse: struct rb_node * > net/netfilter/nft_set_rbtree.c: note: in included file (through include/linux/mm_types.h, include/linux/page-flags.h, arch/arm64/include/asm/mte.h, ...): > include/linux/rbtree.h:74:9: sparse: sparse: incompatible types in comparison expression (different address spaces): > include/linux/rbtree.h:74:9: sparse: struct rb_node [noderef] __rcu * > include/linux/rbtree.h:74:9: sparse: struct rb_node * Sigh. > > 893 rcu_assign_pointer(priv->root[genbit].rb_node, NULL); This can be replaced by WRITE_ONCE(priv->root[genbit].rb_node, NULL); ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal 2025-11-18 11:16 [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Florian Westphal ` (2 preceding siblings ...) 2025-11-18 11:16 ` [PATCH nf-next 3/3] netfilter: nft_set_rbtree: do not modifiy live tree Florian Westphal @ 2025-11-18 16:07 ` Fernando Fernandez Mancera 2025-11-18 16:46 ` Florian Westphal 2025-11-19 12:52 ` Phil Sutter 3 siblings, 2 replies; 14+ messages in thread From: Fernando Fernandez Mancera @ 2025-11-18 16:07 UTC (permalink / raw) To: Florian Westphal, netfilter-devel On 11/18/25 12:16 PM, Florian Westphal wrote: > This series fixes false negative lookup bug in rbtree set backend that > can occur during transaction. > > First two patches prepare for actual fix, which is coming in last patch. > > All inserts/removals will now occur in a cloned copy, so packetpath can > no longer observe the problematic mixed-bag of old, current and new > elements. > > The live tree will only have reachable elements that are active in the > current generation or were active in the previous generation (but are still > valid while packetpath holds rcu read lock). The latter case is only > temporary, as new lookups already observe the updated tree). > Hi Florian, I have been taking a look to the series and testing it with different set sizes. The implementation looks good to me but I noticed that due the new "clone" mechanism there is an impact on performance when modifying an existing rbtree set (inserts or deletions). According to my number the impact would be around 40~%. Usually that isn't problematic but if big sets are used.. then it is a bit more. e.g 200K elements ~ avg. time insertion before 510ms after 744ms 500K elements ~ avg. time insertion before 5460ms after 7730ms I was wondering if there could be a different solution.. what if we keep both trees synchronize? Would that be possible? Initially root[0] and root[1] can be identical and root[0] is the "live" tree. When adding a new element, it is inserted into the cloned copy and we swap the genbit so root[1] is now live. Then, when we are sure that the operation was successful, we update root[0] with the same operation. Therefore, root[0] and root[1] are now identical. This way we can avoid the clone operation which is quite expensive.. of course, it would require to do the insert/removal operation twice.. but that is cheaper if I am not wrong. Maybe I am asking for too much (?). Also it brings some problems.. like what if the sync operation fails, should we re-do the cloning? Thanks, Fernando. > Florian Westphal (3): > netfilter: nft_set_rbtree: prepare for two rbtrees > netfilter: nft_set_rbtree: factor out insert helper > netfilter: nft_set_rbtree: do not modifiy live tree > > net/netfilter/nft_set_rbtree.c | 279 +++++++++++++++++++++++++-------- > 1 file changed, 211 insertions(+), 68 deletions(-) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal 2025-11-18 16:07 ` [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Fernando Fernandez Mancera @ 2025-11-18 16:46 ` Florian Westphal 2025-11-18 17:01 ` Fernando Fernandez Mancera 2025-11-19 12:52 ` Phil Sutter 1 sibling, 1 reply; 14+ messages in thread From: Florian Westphal @ 2025-11-18 16:46 UTC (permalink / raw) To: Fernando Fernandez Mancera; +Cc: netfilter-devel Fernando Fernandez Mancera <fmancera@suse.de> wrote: > When adding a new element, it is inserted into the cloned copy and we > swap the genbit so root[1] is now live. Then, when we are sure that the > operation was successful, we update root[0] with the same operation. > Therefore, root[0] and root[1] are now identical. I don't understand this. The swap can't be done before ->commit(). Else, how do you deal with a rollback (failing transaction)? Not exposing any of the new elements to the data path until the entire transaction has moved past the point-of-no-return is large part of the patch series. After commit, yes, we can do a walk of the old tree, purge old elements, then walk the new tree, add new elements to the old tree so they are identical again. But that doesn't sound faster than duplicating everything on next insert/removal. > This way we can avoid the clone operation which is quite expensive.. of > course, it would require to do the insert/removal operation twice.. but > that is cheaper if I am not wrong. How do I know what to re-insert and to remove from the old live tree without a walk of the new tree? > Maybe I am asking for too much (?). Also it brings some problems.. like > what if the sync operation fails, should we re-do the cloning? How can the sync op fail? Can you elaborate? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal 2025-11-18 16:46 ` Florian Westphal @ 2025-11-18 17:01 ` Fernando Fernandez Mancera 0 siblings, 0 replies; 14+ messages in thread From: Fernando Fernandez Mancera @ 2025-11-18 17:01 UTC (permalink / raw) To: Florian Westphal; +Cc: netfilter-devel On 11/18/25 5:46 PM, Florian Westphal wrote: > Fernando Fernandez Mancera <fmancera@suse.de> wrote: >> When adding a new element, it is inserted into the cloned copy and we >> swap the genbit so root[1] is now live. Then, when we are sure that the >> operation was successful, we update root[0] with the same operation. >> Therefore, root[0] and root[1] are now identical. > > I don't understand this. The swap can't be done before ->commit(). > Else, how do you deal with a rollback (failing transaction)? > > Not exposing any of the new elements to the data path until > the entire transaction has moved past the point-of-no-return > is large part of the patch series. > Right, that is good point, the swap must happen after commit. > After commit, yes, we can do a walk of the old tree, purge > old elements, then walk the new tree, add new elements to the old > tree so they are identical again. > > But that doesn't sound faster than duplicating everything > on next insert/removal. > Of course, if walking the old/new tree is required then it is not better. I guess, a possibility would be to keep track of the performed operations (elements added/deleted) to perform the updates after commit but to be honest that would be quite cumbersome and would require infrastructure for bookkeeping. I think we should not consider it. >> This way we can avoid the clone operation which is quite expensive.. of >> course, it would require to do the insert/removal operation twice.. but >> that is cheaper if I am not wrong. > > How do I know what to re-insert and to remove from the old live tree > without a walk of the new tree? > >> Maybe I am asking for too much (?). Also it brings some problems.. like >> what if the sync operation fails, should we re-do the cloning? > > How can the sync op fail? Can you elaborate? > I was thinking in not enough memory situations (unlikely but possible). Anyway, given the constraints we have on hands I do not think my idea is suitable. In addition, the performance impact happens on control-plane operations so it should be fine I guess as it isn't a hot path. So never mind, the series is fine and after some testing on it no serious issues found so far. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal 2025-11-18 16:07 ` [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Fernando Fernandez Mancera 2025-11-18 16:46 ` Florian Westphal @ 2025-11-19 12:52 ` Phil Sutter 2025-11-19 15:56 ` Florian Westphal 1 sibling, 1 reply; 14+ messages in thread From: Phil Sutter @ 2025-11-19 12:52 UTC (permalink / raw) To: Fernando Fernandez Mancera; +Cc: Florian Westphal, netfilter-devel Hi, On Tue, Nov 18, 2025 at 05:07:56PM +0100, Fernando Fernandez Mancera wrote: > On 11/18/25 12:16 PM, Florian Westphal wrote: > > This series fixes false negative lookup bug in rbtree set backend that > > can occur during transaction. > > > > First two patches prepare for actual fix, which is coming in last patch. > > > > All inserts/removals will now occur in a cloned copy, so packetpath can > > no longer observe the problematic mixed-bag of old, current and new > > elements. > > > > The live tree will only have reachable elements that are active in the > > current generation or were active in the previous generation (but are still > > valid while packetpath holds rcu read lock). The latter case is only > > temporary, as new lookups already observe the updated tree). > > > > I have been taking a look to the series and testing it with different > set sizes. The implementation looks good to me but I noticed that due > the new "clone" mechanism there is an impact on performance when > modifying an existing rbtree set (inserts or deletions). According to my > number the impact would be around 40~%. Usually that isn't problematic > but if big sets are used.. then it is a bit more. e.g > > 200K elements ~ avg. time insertion before 510ms after 744ms > 500K elements ~ avg. time insertion before 5460ms after 7730ms I wonder if nft_rbtree_maybe_clone() could run a simpler copying algorithm than properly inserting every element from the old tree into the new one since the old tree is already correctly organized - basically leveraging the existing knowledge of every element's correct position. Or is there a need to traverse the new tree with each element instead of copying the whole thing as-is? Cheers, Phil ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal 2025-11-19 12:52 ` Phil Sutter @ 2025-11-19 15:56 ` Florian Westphal 2025-11-19 22:14 ` Phil Sutter 0 siblings, 1 reply; 14+ messages in thread From: Florian Westphal @ 2025-11-19 15:56 UTC (permalink / raw) To: Phil Sutter, Fernando Fernandez Mancera, netfilter-devel Phil Sutter <phil@nwl.cc> wrote: > > 200K elements ~ avg. time insertion before 510ms after 744ms > > 500K elements ~ avg. time insertion before 5460ms after 7730ms > > I wonder if nft_rbtree_maybe_clone() could run a simpler copying > algorithm than properly inserting every element from the old tree into > the new one since the old tree is already correctly organized - > basically leveraging the existing knowledge of every element's correct > position. Yes, but I doubt its going to help much. And I don't see how this can be done without relying on implementation details of rb_node struct. > Or is there a need to traverse the new tree with each element instead of > copying the whole thing as-is? What do you mean? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal 2025-11-19 15:56 ` Florian Westphal @ 2025-11-19 22:14 ` Phil Sutter 2025-11-20 10:28 ` Florian Westphal 2025-11-20 11:39 ` Fernando Fernandez Mancera 0 siblings, 2 replies; 14+ messages in thread From: Phil Sutter @ 2025-11-19 22:14 UTC (permalink / raw) To: Florian Westphal; +Cc: Fernando Fernandez Mancera, netfilter-devel [-- Attachment #1: Type: text/plain, Size: 1094 bytes --] On Wed, Nov 19, 2025 at 04:56:34PM +0100, Florian Westphal wrote: > Phil Sutter <phil@nwl.cc> wrote: > > > 200K elements ~ avg. time insertion before 510ms after 744ms > > > 500K elements ~ avg. time insertion before 5460ms after 7730ms > > > > I wonder if nft_rbtree_maybe_clone() could run a simpler copying > > algorithm than properly inserting every element from the old tree into > > the new one since the old tree is already correctly organized - > > basically leveraging the existing knowledge of every element's correct > > position. > > Yes, but I doubt its going to help much. > > And I don't see how this can be done without relying on implementation > details of rb_node struct. > > > Or is there a need to traverse the new tree with each element instead of > > copying the whole thing as-is? > > What do you mean? So I gave it a try and to my big surprise nftables test suite does not cause a kernel crash and inserting elements into a large set seems to be faster (0.07s vs. 0.1s per 'nft add element' call). Where's the rub? Fernando, care to give it a try? Thanks, Phil [-- Attachment #2: quick_tree_copy.diff --] [-- Type: text/plain, Size: 2205 bytes --] diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index a2396cd03f71..33f2ec84d150 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -10,6 +10,7 @@ #include <linux/module.h> #include <linux/list.h> #include <linux/rbtree.h> +#include <linux/rbtree_augmented.h> #include <linux/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/nf_tables.h> @@ -509,26 +510,44 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, return 0; } +static void nft_rbtree_copy(const struct nft_set *set, struct rb_node *parent, + struct rb_node **pos, struct nft_rbtree_elem *elem) +{ + struct nft_rbtree *priv = nft_set_priv(set); + u8 genbit = nft_rbtree_genbit_copy(priv); + + rb_link_node_rcu(&elem->node[genbit], parent, pos); + rb_set_parent_color(&elem->node[genbit], parent, + rb_color(&elem->node[!genbit])); + + if (elem->node[!genbit].rb_left) + nft_rbtree_copy(set, &elem->node[genbit], + &elem->node[genbit].rb_left, + rb_entry(elem->node[!genbit].rb_left, + struct nft_rbtree_elem, + node[!genbit])); + if (elem->node[!genbit].rb_right) + nft_rbtree_copy(set, &elem->node[genbit], + &elem->node[genbit].rb_right, + rb_entry(elem->node[!genbit].rb_right, + struct nft_rbtree_elem, + node[!genbit])); +} + static void nft_rbtree_maybe_clone(const struct net *net, const struct nft_set *set) { struct nft_rbtree *priv = nft_set_priv(set); u8 genbit = nft_rbtree_genbit_live(priv); - struct nft_rbtree_elem *rbe; - struct rb_node *node, *next; lockdep_assert_held_once(&nft_pernet(net)->commit_mutex); if (priv->cloned) return; - for (node = rb_first(&priv->root[genbit]); node ; node = next) { - next = rb_next(node); - rbe = rb_entry(node, struct nft_rbtree_elem, node[genbit]); - - /* No need to acquire a lock, this is the future tree, not - * exposed to packetpath. - */ - __nft_rbtree_insert_do(set, rbe); + if (priv->root[genbit].rb_node) { + nft_rbtree_copy(set, NULL, &priv->root[!genbit].rb_node, + rb_entry(priv->root[genbit].rb_node, + struct nft_rbtree_elem, node[genbit])); } priv->cloned = true; } ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal 2025-11-19 22:14 ` Phil Sutter @ 2025-11-20 10:28 ` Florian Westphal 2025-11-20 11:39 ` Fernando Fernandez Mancera 1 sibling, 0 replies; 14+ messages in thread From: Florian Westphal @ 2025-11-20 10:28 UTC (permalink / raw) To: Phil Sutter, Fernando Fernandez Mancera, netfilter-devel Phil Sutter <phil@nwl.cc> wrote: Thanks for doing this but I'm not sold on this. This also traverses the entire tree, which is afaics not avoidable and also the expensive part. I don't get much better results with this version (10 inserts into 800k tree: mainline: ~3m, patchset, 3m47m, your version 3m47s). > +static void nft_rbtree_copy(const struct nft_set *set, struct rb_node *parent, > + struct rb_node **pos, struct nft_rbtree_elem *elem) > +{ > + struct nft_rbtree *priv = nft_set_priv(set); > + u8 genbit = nft_rbtree_genbit_copy(priv); > + > + rb_link_node_rcu(&elem->node[genbit], parent, pos); > + rb_set_parent_color(&elem->node[genbit], parent, > + rb_color(&elem->node[!genbit])); > + > + if (elem->node[!genbit].rb_left) > + nft_rbtree_copy(set, &elem->node[genbit], > + &elem->node[genbit].rb_left, > + rb_entry(elem->node[!genbit].rb_left, > + struct nft_rbtree_elem, > + node[!genbit])); ... and that makes recursive calls, i am reluctant here. It should be fine given its limited by tree height, but it makes me uneasy to have this. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal 2025-11-19 22:14 ` Phil Sutter 2025-11-20 10:28 ` Florian Westphal @ 2025-11-20 11:39 ` Fernando Fernandez Mancera 1 sibling, 0 replies; 14+ messages in thread From: Fernando Fernandez Mancera @ 2025-11-20 11:39 UTC (permalink / raw) To: Phil Sutter, Florian Westphal, netfilter-devel On 11/19/25 11:14 PM, Phil Sutter wrote: > On Wed, Nov 19, 2025 at 04:56:34PM +0100, Florian Westphal wrote: >> Phil Sutter <phil@nwl.cc> wrote: >>>> 200K elements ~ avg. time insertion before 510ms after 744ms >>>> 500K elements ~ avg. time insertion before 5460ms after 7730ms >>> >>> I wonder if nft_rbtree_maybe_clone() could run a simpler copying >>> algorithm than properly inserting every element from the old tree into >>> the new one since the old tree is already correctly organized - >>> basically leveraging the existing knowledge of every element's correct >>> position. >> >> Yes, but I doubt its going to help much. >> >> And I don't see how this can be done without relying on implementation >> details of rb_node struct. >> >>> Or is there a need to traverse the new tree with each element instead of >>> copying the whole thing as-is? >> >> What do you mean? > > So I gave it a try and to my big surprise nftables test suite does not > cause a kernel crash and inserting elements into a large set seems to be > faster (0.07s vs. 0.1s per 'nft add element' call). > > Where's the rub? Fernando, care to give it a try? > It improved very little on my side.. 200K elements ~ avg. time insertion mainline 510ms patchset 744ms this patch 703ms 500K elements ~ avg. time insertion mainline 5460ms patchset 7730ms this patch 7654ms Please, notice that for calculate this I create a big set and then perform 20 insert operations calculating the average time. Thanks, Fernando. ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-11-20 11:39 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-11-18 11:16 [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Florian Westphal 2025-11-18 11:16 ` [PATCH nf-next 1/3] netfilter: nft_set_rbtree: prepare for two rbtrees Florian Westphal 2025-11-18 11:16 ` [PATCH nf-next 2/3] netfilter: nft_set_rbtree: factor out insert helper Florian Westphal 2025-11-18 11:16 ` [PATCH nf-next 3/3] netfilter: nft_set_rbtree: do not modifiy live tree Florian Westphal 2025-11-19 8:29 ` kernel test robot 2025-11-19 10:48 ` Florian Westphal 2025-11-18 16:07 ` [PATCH nf-next 0/3] netfilter: nft_set_rbtree: use cloned tree for insertions and removal Fernando Fernandez Mancera 2025-11-18 16:46 ` Florian Westphal 2025-11-18 17:01 ` Fernando Fernandez Mancera 2025-11-19 12:52 ` Phil Sutter 2025-11-19 15:56 ` Florian Westphal 2025-11-19 22:14 ` Phil Sutter 2025-11-20 10:28 ` Florian Westphal 2025-11-20 11:39 ` Fernando Fernandez Mancera
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.