Linux Netfilter development
* [PATCH nf-next,v1 0/3] convert rbtree to binary search array
@ 2026-01-15 12:43 Pablo Neira Ayuso
  2026-01-15 12:43 ` [PATCH nf-next,v1 1/3] netfilter: nf_tables: add .abort_skip_removal flag for set types Pablo Neira Ayuso
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Pablo Neira Ayuso @ 2026-01-15 12:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: fw

Hi,

Florian reported an issue with interval matching in the rbtree that
allows false negative lookups during transactions.

This series addresses the issue by translating the rbtree, which keeps
the intervals in order, into an array for binary search. The array is
published to the packet path through RCU. The idea is to keep using the
rbtree data structure for the control plane, which needs to deal with
updates, and to generate an array from this rbtree for binary search
lookups.

There is a corner case worth mentioning: anonymous maps result in
consecutive start elements in the case of contiguous intervals. This
needs special handling, which is documented in the new .commit function.

With 100k elements, building this array shows a little more overhead,
mainly due to the memory allocation.

This approach allows the start and end elements to be compacted into a
single interval entry in the array, which speeds up packet lookups by
~50% according to the existing performance script in selftests.

This approach no longer requires the spinlock, which is only needed by
the control plane. I think it can be removed entirely if .walk is
converted to dump the set content using the new array.

This also allows to remove the seqcount_t in a later patch.

Patch #1 allows .remove to be called when .abort is defined, which is
needed by this new approach. Only pipapo needs to skip .remove, for
speed.

Patch #2 adds the binary search array approach for interval matching.

Patch #3 updates .get to use the binary search array to find the
(closest or exact) matching interval.

This series passes tests/shell successfully, but more tests would be
good to have because this adds a fair amount of new code.

This series is an alternative proposal to Florian's approach.

Pablo Neira Ayuso (3):
  netfilter: nf_tables: add .abort_skip_removal flag for set types
  netfilter: nft_set_rbtree: translate rbtree to array for binary search
  netfilter: nft_set_rbtree: use binary search array in get command

 include/net/netfilter/nf_tables.h |   1 +
 net/netfilter/nf_tables_api.c     |   2 +-
 net/netfilter/nft_set_pipapo.c    |   2 +
 net/netfilter/nft_set_rbtree.c    | 421 ++++++++++++++++++++----------
 4 files changed, 287 insertions(+), 139 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH nf-next,v1 1/3] netfilter: nf_tables: add .abort_skip_removal flag for set types
  2026-01-15 12:43 [PATCH nf-next,v1 0/3] convert rbtree to binary search array Pablo Neira Ayuso
@ 2026-01-15 12:43 ` Pablo Neira Ayuso
  2026-01-15 14:59   ` Florian Westphal
  2026-01-15 12:43 ` [PATCH nf-next,v1 2/3] netfilter: nft_set_rbtree: translate rbtree to array for binary search Pablo Neira Ayuso
  2026-01-15 12:43 ` [PATCH nf-next,v1 3/3] netfilter: nft_set_rbtree: use binary search array in get command Pablo Neira Ayuso
  2 siblings, 1 reply; 6+ messages in thread
From: Pablo Neira Ayuso @ 2026-01-15 12:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: fw

The pipapo set backend is the only user of the .abort interface so far.
To speed up the pipapo abort path, removals are skipped.

The follow-up patch updates the rbtree to build an array of ordered
elements, then use binary search. This needs the .abort interface but,
unlike pipapo, it also needs to undo/remove elements.

Add a flag and use it from the pipapo set backend.
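A minimal sketch of the abort-path decision this flag changes (names
mirror the kernel ones, but this is an illustrative model, not the
kernel code):

```c
#include <stdbool.h>

/* Previously, defining .abort implied skipping element removal; with
 * the new flag, only backends that opt in (pipapo) skip it, so rbtree
 * can provide an .abort callback and still have new elements removed
 * on transaction abort.
 */
struct set_ops {
	void (*abort)(void);		/* optional abort callback */
	bool abort_skip_removal;	/* skip removals on abort? */
};

static bool must_remove_on_abort(const struct set_ops *ops)
{
	/* New semantics: the flag, not .abort's presence, decides. */
	return !ops->abort_skip_removal;
}

static const struct set_ops pipapo_like = { .abort_skip_removal = true };
static const struct set_ops rbtree_like = { .abort_skip_removal = false };
```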

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h | 1 +
 net/netfilter/nf_tables_api.c     | 2 +-
 net/netfilter/nft_set_pipapo.c    | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index fab7dc73f738..21af1a2a6d3d 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -509,6 +509,7 @@ struct nft_set_ops {
 						   const struct nft_set *set);
 	void				(*gc_init)(const struct nft_set *set);
 
+	bool				abort_skip_removal;
 	unsigned int			elemsize;
 };
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index f3de2f9bbebf..8f5ad4e89715 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -7744,7 +7744,7 @@ static bool nft_trans_elems_new_abort(const struct nft_ctx *ctx,
 			continue;
 		}
 
-		if (!te->set->ops->abort || nft_setelem_is_catchall(te->set, te->elems[i].priv))
+		if (!te->set->ops->abort_skip_removal || nft_setelem_is_catchall(te->set, te->elems[i].priv))
 			nft_setelem_remove(ctx->net, te->set, te->elems[i].priv);
 
 		if (!nft_setelem_is_catchall(te->set, te->elems[i].priv))
diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index 112fe46788b6..412a1f1eafd8 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -2370,6 +2370,7 @@ const struct nft_set_type nft_set_pipapo_type = {
 		.gc_init	= nft_pipapo_gc_init,
 		.commit		= nft_pipapo_commit,
 		.abort		= nft_pipapo_abort,
+		.abort_skip_removal = true,
 		.elemsize	= offsetof(struct nft_pipapo_elem, ext),
 	},
 };
@@ -2394,6 +2395,7 @@ const struct nft_set_type nft_set_pipapo_avx2_type = {
 		.gc_init	= nft_pipapo_gc_init,
 		.commit		= nft_pipapo_commit,
 		.abort		= nft_pipapo_abort,
+		.abort_skip_removal = true,
 		.elemsize	= offsetof(struct nft_pipapo_elem, ext),
 	},
 };
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH nf-next,v1 2/3] netfilter: nft_set_rbtree: translate rbtree to array for binary search
  2026-01-15 12:43 [PATCH nf-next,v1 0/3] convert rbtree to binary search array Pablo Neira Ayuso
  2026-01-15 12:43 ` [PATCH nf-next,v1 1/3] netfilter: nf_tables: add .abort_skip_removal flag for set types Pablo Neira Ayuso
@ 2026-01-15 12:43 ` Pablo Neira Ayuso
  2026-01-16  4:21   ` kernel test robot
  2026-01-15 12:43 ` [PATCH nf-next,v1 3/3] netfilter: nft_set_rbtree: use binary search array in get command Pablo Neira Ayuso
  2 siblings, 1 reply; 6+ messages in thread
From: Pablo Neira Ayuso @ 2026-01-15 12:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: fw

The rbtree can temporarily store overlapping inactive elements during
transaction processing, leading to false negative lookups.

To address this issue, this patch adds a .commit function that walks
the rbtree to build an array of ordered intervals. This conversion
compacts the two singleton elements that represent the start and the
end of an interval into a single interval object for space efficiency.

Binary search is O(log n), similar to rbtree lookup time, so
performance numbers should be similar. An implementation is available
under lib/bsearch.c and include/linux/bsearch.h, which is used for
this purpose.
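The compaction walk can be sketched roughly as follows (an illustrative
userspace model, not the kernel code; element layout and names are
assumptions). Two consecutive start elements model the anonymous-set
case where the end of an adjacent interval is elided, so the next start
element closes the previous interval:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct elem {
	uint32_t key;
	bool end;	/* true for an interval-end element */
};

struct interval { uint32_t from, to; };

/* Walk ordered start/end elements, emit compacted intervals. */
static size_t build_intervals(const struct elem *e, size_t n,
			      struct interval *out)
{
	bool open = false;	/* a start element awaits its end */
	size_t i, m = 0;

	for (i = 0; i < n; i++) {
		if (!e[i].end) {
			if (open)	/* adjacent start closes previous */
				out[m++].to = e[i].key;
			out[m].from = e[i].key;
			open = true;
		} else {
			out[m++].to = e[i].key;
			open = false;
		}
	}
	if (open)	/* trailing open-ended interval */
		out[m++].to = UINT32_MAX;
	return m;	/* number of compacted intervals */
}

/* [10, 20), then adjacent [20, 30) with its end elided, then [30, 40). */
static size_t demo(void)
{
	static const struct elem e[] = {
		{ 10, false }, { 20, true },
		{ 20, false },
		{ 30, false }, { 40, true },
	};
	struct interval out[4];

	return build_intervals(e, 5, out);
}
```

In the demo, five ordered elements compact to three interval entries,
which is why the array size can be bounded at roughly nelems/2 plus a
couple of slots for non-anonymous sets.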

This slightly increases memory consumption, since the new array stores
pointers to the start and the end of each interval.

With this patch:

 # time nft -f 100k-intervals-set.nft

 real    0m4.218s
 user    0m3.544s
 sys     0m0.400s

Without this patch:

 # time nft -f 100k-intervals-set.nft

 real    0m3.920s
 user    0m3.547s
 sys     0m0.276s

With this patch, with IPv4 intervals:

  baseline rbtree (match on first field only):   15254954pps

Without this patch:

  baseline rbtree (match on first field only):   10256119pps

This provides a ~50% improvement in matching intervals from the packet
path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_set_rbtree.c | 339 +++++++++++++++++++++++++--------
 1 file changed, 255 insertions(+), 84 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index ca594161b840..aed2784ea12a 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -10,14 +10,29 @@
 #include <linux/module.h>
 #include <linux/list.h>
 #include <linux/rbtree.h>
+#include <linux/bsearch.h>
 #include <linux/netlink.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables_core.h>
 
+struct nft_array_interval {
+	struct nft_set_ext	*from;
+	struct nft_set_ext	*to;
+};
+
+struct nft_array {
+	u32			max_intervals;
+	u32			num_intervals;
+	struct nft_array_interval *intervals;
+	struct rcu_head		rcu_head;
+};
+
 struct nft_rbtree {
 	struct rb_root		root;
 	rwlock_t		lock;
+	struct nft_array __rcu	*array;
+	struct nft_array	*array_next;
 	seqcount_rwlock_t	count;
 	unsigned long		last_gc;
 };
@@ -47,90 +62,6 @@ static int nft_rbtree_cmp(const struct nft_set *set,
 		      set->klen);
 }
 
-static bool nft_rbtree_elem_expired(const struct nft_rbtree_elem *rbe)
-{
-	return nft_set_elem_expired(&rbe->ext);
-}
-
-static const struct nft_set_ext *
-__nft_rbtree_lookup(const struct net *net, const struct nft_set *set,
-		    const u32 *key, unsigned int seq)
-{
-	struct nft_rbtree *priv = nft_set_priv(set);
-	const struct nft_rbtree_elem *rbe, *interval = NULL;
-	u8 genmask = nft_genmask_cur(net);
-	const struct rb_node *parent;
-	int d;
-
-	parent = rcu_dereference_raw(priv->root.rb_node);
-	while (parent != NULL) {
-		if (read_seqcount_retry(&priv->count, seq))
-			return NULL;
-
-		rbe = rb_entry(parent, struct nft_rbtree_elem, node);
-
-		d = memcmp(nft_set_ext_key(&rbe->ext), key, set->klen);
-		if (d < 0) {
-			parent = rcu_dereference_raw(parent->rb_left);
-			if (interval &&
-			    !nft_rbtree_cmp(set, rbe, interval) &&
-			    nft_rbtree_interval_end(rbe) &&
-			    nft_rbtree_interval_start(interval))
-				continue;
-			if (nft_set_elem_active(&rbe->ext, genmask) &&
-			    !nft_rbtree_elem_expired(rbe))
-				interval = rbe;
-		} else if (d > 0)
-			parent = rcu_dereference_raw(parent->rb_right);
-		else {
-			if (!nft_set_elem_active(&rbe->ext, genmask)) {
-				parent = rcu_dereference_raw(parent->rb_left);
-				continue;
-			}
-
-			if (nft_rbtree_elem_expired(rbe))
-				return NULL;
-
-			if (nft_rbtree_interval_end(rbe)) {
-				if (nft_set_is_anonymous(set))
-					return NULL;
-				parent = rcu_dereference_raw(parent->rb_left);
-				interval = NULL;
-				continue;
-			}
-
-			return &rbe->ext;
-		}
-	}
-
-	if (set->flags & NFT_SET_INTERVAL && interval != NULL &&
-	    nft_rbtree_interval_start(interval))
-		return &interval->ext;
-
-	return NULL;
-}
-
-INDIRECT_CALLABLE_SCOPE
-const struct nft_set_ext *
-nft_rbtree_lookup(const struct net *net, const struct nft_set *set,
-		  const u32 *key)
-{
-	struct nft_rbtree *priv = nft_set_priv(set);
-	unsigned int seq = read_seqcount_begin(&priv->count);
-	const struct nft_set_ext *ext;
-
-	ext = __nft_rbtree_lookup(net, set, key, seq);
-	if (ext || !read_seqcount_retry(&priv->count, seq))
-		return ext;
-
-	read_lock_bh(&priv->lock);
-	seq = read_seqcount_begin(&priv->count);
-	ext = __nft_rbtree_lookup(net, set, key, seq);
-	read_unlock_bh(&priv->lock);
-
-	return ext;
-}
-
 static bool __nft_rbtree_get(const struct net *net, const struct nft_set *set,
 			     const u32 *key, struct nft_rbtree_elem **elem,
 			     unsigned int seq, unsigned int flags, u8 genmask)
@@ -221,6 +152,60 @@ nft_rbtree_get(const struct net *net, const struct nft_set *set,
 	return &rbe->priv;
 }
 
+struct nft_array_lookup_ctx {
+	const u32	*key;
+	u32		klen;
+};
+
+static int nft_array_lookup_cmp(const void *pkey, const void *entry)
+{
+	const struct nft_array_interval *interval = entry;
+	const struct nft_array_lookup_ctx *ctx = pkey;
+	int a, b;
+
+	if (!interval->from)
+		return 1;
+
+	a = memcmp(ctx->key, nft_set_ext_key(interval->from), ctx->klen);
+	if (!interval->to)
+		b = -1;
+	else
+		b = memcmp(ctx->key, nft_set_ext_key(interval->to), ctx->klen);
+
+	if (a >= 0 && b < 0)
+		return 0;
+
+	if (a < 0)
+		return -1;
+
+	return 1;
+}
+
+INDIRECT_CALLABLE_SCOPE
+const struct nft_set_ext *
+nft_rbtree_lookup(const struct net *net, const struct nft_set *set,
+		  const u32 *key)
+{
+	struct nft_rbtree *priv = nft_set_priv(set);
+	struct nft_array *array = rcu_dereference(priv->array);
+	const struct nft_array_interval *interval;
+	struct nft_array_lookup_ctx ctx = {
+		.key	= key,
+		.klen	= set->klen,
+	};
+
+	if (!array)
+		return NULL;
+
+	interval = bsearch(&ctx, array->intervals, array->num_intervals,
+			   sizeof(struct nft_array_interval),
+			   nft_array_lookup_cmp);
+	if (!interval || nft_set_elem_expired(interval->from))
+		return NULL;
+
+	return interval->from;
+}
+
 static void nft_rbtree_gc_elem_remove(struct net *net, struct nft_set *set,
 				      struct nft_rbtree *priv,
 				      struct nft_rbtree_elem *rbe)
@@ -481,6 +466,87 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 	return 0;
 }
 
+static int nft_array_intervals_alloc(struct nft_array *array, u32 max_intervals)
+{
+	struct nft_array_interval *intervals;
+
+	intervals = kvcalloc(max_intervals, sizeof(struct nft_array_interval),
+			     GFP_KERNEL_ACCOUNT);
+	if (!intervals)
+		return -ENOMEM;
+
+	if (array->intervals)
+		kvfree(array->intervals);
+
+	array->intervals = intervals;
+	array->max_intervals = max_intervals;
+
+	return 0;
+}
+
+static struct nft_array *nft_array_alloc(u32 max_intervals)
+{
+	struct nft_array *array;
+
+	array = kzalloc(sizeof(*array), GFP_KERNEL_ACCOUNT);
+	if (!array)
+		return NULL;
+
+	if (nft_array_intervals_alloc(array, max_intervals) < 0) {
+		kfree(array);
+		return NULL;
+	}
+
+	return array;
+}
+
+#define NFT_ARRAY_EXTRA_SIZE	10240
+
+/* Similar to nft_rbtree_{u,k}size to hide details to userspace, but consider
+ * packed representation coming from userspace for anonymous sets too.
+ */
+static u32 nft_array_elems(const struct nft_set *set)
+{
+	u32 nelems = atomic_read(&set->nelems);
+
+	/* Adjacent intervals are represented with a single start element in
+	 * anonymous sets, use the current element counter as is.
+	 */
+	if (nft_set_is_anonymous(set))
+		return nelems;
+
+	/* Add extra room for never matching interval at the beginning and open
+	 * interval at the end which only use a single element to represent it.
+	 * The conversion to array will compact intervals, this allows reduce
+	 * memory consumption.
+	 */
+	return (nelems / 2) + 2;
+}
+
+static int nft_array_may_resize(const struct nft_set *set)
+{
+	u32 nelems = nft_array_elems(set), new_max_intervals;
+	struct nft_rbtree *priv = nft_set_priv(set);
+	struct nft_array *array;
+
+	if (!priv->array_next) {
+		array = nft_array_alloc(nelems + NFT_ARRAY_EXTRA_SIZE);
+		if (!array)
+			return -ENOMEM;
+
+		priv->array_next = array;
+	}
+
+	if (nelems < priv->array_next->max_intervals)
+		return 0;
+
+	new_max_intervals = priv->array_next->max_intervals + NFT_ARRAY_EXTRA_SIZE;
+	if (nft_array_intervals_alloc(priv->array_next, new_max_intervals) < 0)
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 			     const struct nft_set_elem *elem,
 			     struct nft_elem_priv **elem_priv)
@@ -489,6 +555,9 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 	struct nft_rbtree *priv = nft_set_priv(set);
 	int err;
 
+	if (nft_array_may_resize(set) < 0)
+		return -ENOMEM;
+
 	do {
 		if (fatal_signal_pending(current))
 			return -EINTR;
@@ -553,6 +622,9 @@ nft_rbtree_deactivate(const struct net *net, const struct nft_set *set,
 	u64 tstamp = nft_net_tstamp(net);
 	int d;
 
+	if (nft_array_may_resize(set) < 0)
+		return NULL;
+
 	while (parent != NULL) {
 		rbe = rb_entry(parent, struct nft_rbtree_elem, node);
 
@@ -615,6 +687,11 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx,
 	switch (iter->type) {
 	case NFT_ITER_UPDATE:
 		lockdep_assert_held(&nft_pernet(ctx->net)->commit_mutex);
+
+		if (nft_array_may_resize(set) < 0) {
+			iter->err = -ENOMEM;
+			break;
+		}
 		nft_rbtree_do_walk(ctx, set, iter);
 		break;
 	case NFT_ITER_READ:
@@ -717,9 +794,18 @@ static int nft_rbtree_init(const struct nft_set *set,
 	seqcount_rwlock_init(&priv->count, &priv->lock);
 	priv->root = RB_ROOT;
 
+	priv->array = NULL;
+	priv->array_next = NULL;
+
 	return 0;
 }
 
+static void __nft_array_free(struct nft_array *array)
+{
+	kvfree(array->intervals);
+	kfree(array);
+}
+
 static void nft_rbtree_destroy(const struct nft_ctx *ctx,
 			       const struct nft_set *set)
 {
@@ -732,6 +818,11 @@ static void nft_rbtree_destroy(const struct nft_ctx *ctx,
 		rbe = rb_entry(node, struct nft_rbtree_elem, node);
 		nf_tables_set_elem_destroy(ctx, set, &rbe->priv);
 	}
+
+	if (priv->array)
+		__nft_array_free(priv->array);
+	if (priv->array_next)
+		__nft_array_free(priv->array_next);
 }
 
 static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,
@@ -752,12 +843,91 @@ static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,
 	return true;
 }
 
+static void nft_array_free_rcu(struct rcu_head *rcu_head)
+{
+	struct nft_array *array = container_of(rcu_head, struct nft_array, rcu_head);
+
+	__nft_array_free(array);
+}
+
 static void nft_rbtree_commit(struct nft_set *set)
 {
 	struct nft_rbtree *priv = nft_set_priv(set);
+	struct nft_rbtree_elem *rbe, *prev_rbe;
+	struct nft_array *old;
+	u32 num_intervals = 0;
+	struct rb_node *node;
 
 	if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set)))
 		nft_rbtree_gc(set);
+
+	/* No changes, skip, eg. elements updates only. */
+	if (!priv->array_next)
+		return;
+
+	/* Reverse walk to create an array from smaller to largest interval. */
+	node = rb_last(&priv->root);
+	if (node)
+		prev_rbe = rb_entry(node, struct nft_rbtree_elem, node);
+	else
+		prev_rbe = NULL;
+
+	while (prev_rbe) {
+		rbe = prev_rbe;
+
+		if (nft_rbtree_interval_start(rbe))
+			priv->array_next->intervals[num_intervals].from = &rbe->ext;
+		else if (nft_rbtree_interval_end(rbe))
+			priv->array_next->intervals[num_intervals++].to = &rbe->ext;
+
+		if (num_intervals >= priv->array_next->max_intervals) {
+			pr_warn_once("malformed interval set from userspace?");
+			goto err_out;
+		}
+
+		node = rb_prev(node);
+		if (!node)
+			break;
+
+		prev_rbe = rb_entry(node, struct nft_rbtree_elem, node);
+
+		/* For anonymous sets, when adjacent ranges are found,
+		 * the end element is not added to the set to pack the set
+		 * representation. Use next start element to complete this
+		 * interval.
+		 */
+		if (nft_rbtree_interval_start(rbe) &&
+		    nft_rbtree_interval_start(prev_rbe) &&
+		    priv->array_next->intervals[num_intervals].from)
+			priv->array_next->intervals[num_intervals++].to = &prev_rbe->ext;
+
+		if (num_intervals >= priv->array_next->max_intervals) {
+			pr_warn_once("malformed interval set from userspace?");
+			goto err_out;
+		}
+	}
+
+	if (priv->array_next->intervals[num_intervals].from)
+		num_intervals++;
+err_out:
+	priv->array_next->num_intervals = num_intervals;
+	old = rcu_replace_pointer(priv->array, priv->array_next, true);
+	priv->array_next = NULL;
+	if (old)
+		call_rcu(&old->rcu_head, nft_array_free_rcu);
+}
+
+static void nft_rbtree_abort(const struct nft_set *set)
+{
+	struct nft_rbtree *priv = nft_set_priv(set);
+	struct nft_array *array_next;
+
+	if (!priv->array_next)
+		return;
+
+	array_next = priv->array_next;
+	priv->array_next = NULL;
+	__nft_array_free(array_next);
 }
 
 static void nft_rbtree_gc_init(const struct nft_set *set)
@@ -821,6 +991,7 @@ const struct nft_set_type nft_set_rbtree_type = {
 		.flush		= nft_rbtree_flush,
 		.activate	= nft_rbtree_activate,
 		.commit		= nft_rbtree_commit,
+		.abort		= nft_rbtree_abort,
 		.gc_init	= nft_rbtree_gc_init,
 		.lookup		= nft_rbtree_lookup,
 		.walk		= nft_rbtree_walk,
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH nf-next,v1 3/3] netfilter: nft_set_rbtree: use binary search array in get command
  2026-01-15 12:43 [PATCH nf-next,v1 0/3] convert rbtree to binary search array Pablo Neira Ayuso
  2026-01-15 12:43 ` [PATCH nf-next,v1 1/3] netfilter: nf_tables: add .abort_skip_removal flag for set types Pablo Neira Ayuso
  2026-01-15 12:43 ` [PATCH nf-next,v1 2/3] netfilter: nft_set_rbtree: translate rbtree to array for binary search Pablo Neira Ayuso
@ 2026-01-15 12:43 ` Pablo Neira Ayuso
  2 siblings, 0 replies; 6+ messages in thread
From: Pablo Neira Ayuso @ 2026-01-15 12:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: fw

Rework the .get interface to use the binary search array. This needs a
specific lookup function that matches interval end elements inclusively
(<=). Packet path lookup is slightly different because the match is on
lesser values only, not equal (i.e. <).
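The two match modes can be sketched like this (illustrative names and
plain integer keys; an assumption-laden model rather than the kernel
comparators):

```c
#include <stdbool.h>
#include <stdint.h>

struct interval { uint32_t from, to; };

/* Packet path: interval end is exclusive. */
static bool match_lookup(const struct interval *iv, uint32_t key)
{
	return key >= iv->from && key < iv->to;		/* [from, to) */
}

/* .get with NFT_SET_ELEM_INTERVAL_END: the end element itself matches. */
static bool match_get_end(const struct interval *iv, uint32_t key)
{
	return key >= iv->from && key <= iv->to;	/* [from, to] */
}

static const struct interval sample = { 10, 20 };
```

For the sample interval, 20 matches only in the .get (<=) mode.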

This also allows the seqcount to be removed in a follow-up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_set_rbtree.c | 154 ++++++++++++++-------------------
 1 file changed, 64 insertions(+), 90 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index aed2784ea12a..d2e8ebfeb660 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -62,96 +62,6 @@ static int nft_rbtree_cmp(const struct nft_set *set,
 		      set->klen);
 }
 
-static bool __nft_rbtree_get(const struct net *net, const struct nft_set *set,
-			     const u32 *key, struct nft_rbtree_elem **elem,
-			     unsigned int seq, unsigned int flags, u8 genmask)
-{
-	struct nft_rbtree_elem *rbe, *interval = NULL;
-	struct nft_rbtree *priv = nft_set_priv(set);
-	const struct rb_node *parent;
-	const void *this;
-	int d;
-
-	parent = rcu_dereference_raw(priv->root.rb_node);
-	while (parent != NULL) {
-		if (read_seqcount_retry(&priv->count, seq))
-			return false;
-
-		rbe = rb_entry(parent, struct nft_rbtree_elem, node);
-
-		this = nft_set_ext_key(&rbe->ext);
-		d = memcmp(this, key, set->klen);
-		if (d < 0) {
-			parent = rcu_dereference_raw(parent->rb_left);
-			if (!(flags & NFT_SET_ELEM_INTERVAL_END))
-				interval = rbe;
-		} else if (d > 0) {
-			parent = rcu_dereference_raw(parent->rb_right);
-			if (flags & NFT_SET_ELEM_INTERVAL_END)
-				interval = rbe;
-		} else {
-			if (!nft_set_elem_active(&rbe->ext, genmask)) {
-				parent = rcu_dereference_raw(parent->rb_left);
-				continue;
-			}
-
-			if (nft_set_elem_expired(&rbe->ext))
-				return false;
-
-			if (!nft_set_ext_exists(&rbe->ext, NFT_SET_EXT_FLAGS) ||
-			    (*nft_set_ext_flags(&rbe->ext) & NFT_SET_ELEM_INTERVAL_END) ==
-			    (flags & NFT_SET_ELEM_INTERVAL_END)) {
-				*elem = rbe;
-				return true;
-			}
-
-			if (nft_rbtree_interval_end(rbe))
-				interval = NULL;
-
-			parent = rcu_dereference_raw(parent->rb_left);
-		}
-	}
-
-	if (set->flags & NFT_SET_INTERVAL && interval != NULL &&
-	    nft_set_elem_active(&interval->ext, genmask) &&
-	    !nft_set_elem_expired(&interval->ext) &&
-	    ((!nft_rbtree_interval_end(interval) &&
-	      !(flags & NFT_SET_ELEM_INTERVAL_END)) ||
-	     (nft_rbtree_interval_end(interval) &&
-	      (flags & NFT_SET_ELEM_INTERVAL_END)))) {
-		*elem = interval;
-		return true;
-	}
-
-	return false;
-}
-
-static struct nft_elem_priv *
-nft_rbtree_get(const struct net *net, const struct nft_set *set,
-	       const struct nft_set_elem *elem, unsigned int flags)
-{
-	struct nft_rbtree *priv = nft_set_priv(set);
-	unsigned int seq = read_seqcount_begin(&priv->count);
-	struct nft_rbtree_elem *rbe = ERR_PTR(-ENOENT);
-	const u32 *key = (const u32 *)&elem->key.val;
-	u8 genmask = nft_genmask_cur(net);
-	bool ret;
-
-	ret = __nft_rbtree_get(net, set, key, &rbe, seq, flags, genmask);
-	if (ret || !read_seqcount_retry(&priv->count, seq))
-		return &rbe->priv;
-
-	read_lock_bh(&priv->lock);
-	seq = read_seqcount_begin(&priv->count);
-	ret = __nft_rbtree_get(net, set, key, &rbe, seq, flags, genmask);
-	read_unlock_bh(&priv->lock);
-
-	if (!ret)
-		return ERR_PTR(-ENOENT);
-
-	return &rbe->priv;
-}
-
 struct nft_array_lookup_ctx {
 	const u32	*key;
 	u32		klen;
@@ -206,6 +116,70 @@ nft_rbtree_lookup(const struct net *net, const struct nft_set *set,
 	return interval->from;
 }
 
+struct nft_array_get_ctx {
+	const u32	*key;
+	unsigned int	flags;
+	u32		klen;
+};
+
+static int nft_array_get_cmp(const void *pkey, const void *entry)
+{
+	const struct nft_array_interval *interval = entry;
+	const struct nft_array_get_ctx *ctx = pkey;
+	int a, b;
+
+	if (!interval->from)
+		return 1;
+
+	a = memcmp(ctx->key, nft_set_ext_key(interval->from), ctx->klen);
+	if (!interval->to)
+		b = -1;
+	else
+		b = memcmp(ctx->key, nft_set_ext_key(interval->to), ctx->klen);
+
+	if (a >= 0) {
+		if (ctx->flags & NFT_SET_ELEM_INTERVAL_END && b <= 0)
+			return 0;
+		else if (b < 0)
+			return 0;
+	}
+
+	if (a < 0)
+		return -1;
+
+	return 1;
+}
+
+static struct nft_elem_priv *
+nft_rbtree_get(const struct net *net, const struct nft_set *set,
+	       const struct nft_set_elem *elem, unsigned int flags)
+{
+	struct nft_rbtree *priv = nft_set_priv(set);
+	struct nft_array *array = rcu_dereference(priv->array);
+	const struct nft_array_interval *interval;
+	struct nft_array_get_ctx ctx = {
+		.key	= (const u32 *)&elem->key.val,
+		.flags	= flags,
+		.klen	= set->klen,
+	};
+	struct nft_rbtree_elem *rbe;
+
+	if (!array)
+		return ERR_PTR(-ENOENT);
+
+	interval = bsearch(&ctx, array->intervals, array->num_intervals,
+			   sizeof(struct nft_array_interval), nft_array_get_cmp);
+	if (!interval || nft_set_elem_expired(interval->from))
+		return ERR_PTR(-ENOENT);
+
+	if (flags & NFT_SET_ELEM_INTERVAL_END)
+		rbe = container_of(interval->to, struct nft_rbtree_elem, ext);
+	else
+		rbe = container_of(interval->from, struct nft_rbtree_elem, ext);
+
+	return &rbe->priv;
+}
+
 static void nft_rbtree_gc_elem_remove(struct net *net, struct nft_set *set,
 				      struct nft_rbtree *priv,
 				      struct nft_rbtree_elem *rbe)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH nf-next,v1 1/3] netfilter: nf_tables: add .abort_skip_removal flag for set types
  2026-01-15 12:43 ` [PATCH nf-next,v1 1/3] netfilter: nf_tables: add .abort_skip_removal flag for set types Pablo Neira Ayuso
@ 2026-01-15 14:59   ` Florian Westphal
  0 siblings, 0 replies; 6+ messages in thread
From: Florian Westphal @ 2026-01-15 14:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> The pipapo set backend is the only user of the .abort interface so far.
> To speed up the pipapo abort path, removals are skipped.
> 
> The follow-up patch updates the rbtree to build an array of ordered
> elements, then use binary search. This needs the .abort interface but,
> unlike pipapo, it also needs to undo/remove elements.
> 
> Add a flag and use it from the pipapo set backend.
> 
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
>  include/net/netfilter/nf_tables.h | 1 +
>  net/netfilter/nf_tables_api.c     | 2 +-
>  net/netfilter/nft_set_pipapo.c    | 2 ++
>  3 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
> index fab7dc73f738..21af1a2a6d3d 100644
> --- a/include/net/netfilter/nf_tables.h
> +++ b/include/net/netfilter/nf_tables.h
> @@ -509,6 +509,7 @@ struct nft_set_ops {
>  						   const struct nft_set *set);
>  	void				(*gc_init)(const struct nft_set *set);
>  
> +	bool				abort_skip_removal;
>  	unsigned int			elemsize;
>  };
>  

This gives kdoc warning:

Warning: include/net/netfilter/nf_tables.h:513 struct member 'abort_skip_removal' not described in 'nft_set_ops'

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH nf-next,v1 2/3] netfilter: nft_set_rbtree: translate rbtree to array for binary search
  2026-01-15 12:43 ` [PATCH nf-next,v1 2/3] netfilter: nft_set_rbtree: translate rbtree to array for binary search Pablo Neira Ayuso
@ 2026-01-16  4:21   ` kernel test robot
  0 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2026-01-16  4:21 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel; +Cc: oe-kbuild-all, fw

Hi Pablo,

kernel test robot noticed the following build warnings:

[auto build test WARNING on netfilter-nf/main]
[also build test WARNING on linus/master v6.19-rc5 next-20260115]
[cannot apply to nf-next/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pablo-Neira-Ayuso/netfilter-nf_tables-add-abort_skip_removal-flag-for-set-types/20260115-204555
base:   https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git main
patch link:    https://lore.kernel.org/r/20260115124322.90712-3-pablo%40netfilter.org
patch subject: [PATCH nf-next,v1 2/3] netfilter: nft_set_rbtree: translate rbtree to array for binary search
config: sh-randconfig-r121-20260116 (https://download.01.org/0day-ci/archive/20260116/202601161133.QwCjx0vn-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260116/202601161133.QwCjx0vn-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601161133.QwCjx0vn-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> net/netfilter/nft_set_rbtree.c:823:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct nft_array *array @@     got struct nft_array [noderef] __rcu *array @@
   net/netfilter/nft_set_rbtree.c:823:38: sparse:     expected struct nft_array *array
   net/netfilter/nft_set_rbtree.c:823:38: sparse:     got struct nft_array [noderef] __rcu *array
   net/netfilter/nft_set_rbtree.c: note: in included file (through include/linux/mm_types.h, include/linux/mmzone.h, include/linux/gfp.h, ...):
   include/linux/rbtree.h:102:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   include/linux/rbtree.h:102:9: sparse:    struct rb_node [noderef] __rcu *
   include/linux/rbtree.h:102:9: sparse:    struct rb_node *

vim +823 net/netfilter/nft_set_rbtree.c

   808	
   809	static void nft_rbtree_destroy(const struct nft_ctx *ctx,
   810				       const struct nft_set *set)
   811	{
   812		struct nft_rbtree *priv = nft_set_priv(set);
   813		struct nft_rbtree_elem *rbe;
   814		struct rb_node *node;
   815	
   816		while ((node = priv->root.rb_node) != NULL) {
   817			rb_erase(node, &priv->root);
   818			rbe = rb_entry(node, struct nft_rbtree_elem, node);
   819			nf_tables_set_elem_destroy(ctx, set, &rbe->priv);
   820		}
   821	
   822		if (priv->array)
 > 823			__nft_array_free(priv->array);
   824		if (priv->array_next)
   825			__nft_array_free(priv->array_next);
   826	}
   827	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-01-16  4:21 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-15 12:43 [PATCH nf-next,v1 0/3] convert rbtree to binary search array Pablo Neira Ayuso
2026-01-15 12:43 ` [PATCH nf-next,v1 1/3] netfilter: nf_tables: add .abort_skip_removal flag for set types Pablo Neira Ayuso
2026-01-15 14:59   ` Florian Westphal
2026-01-15 12:43 ` [PATCH nf-next,v1 2/3] netfilter: nft_set_rbtree: translate rbtree to array for binary search Pablo Neira Ayuso
2026-01-16  4:21   ` kernel test robot
2026-01-15 12:43 ` [PATCH nf-next,v1 3/3] netfilter: nft_set_rbtree: use binary search array in get command Pablo Neira Ayuso
