* [PATCH nf-next v2 0/2] netfilter: nf_tables: avoid atomic allocations for set flush
@ 2025-08-22 8:15 Florian Westphal
2025-08-22 8:15 ` [PATCH nf-next v2 1/2] netfilter: nf_tables: allow iter callbacks to sleep Florian Westphal
2025-08-22 8:15 ` [PATCH nf-next v2 2/2] netfilter: nf_tables: all transaction allocations can now sleep Florian Westphal
0 siblings, 2 replies; 6+ messages in thread
From: Florian Westphal @ 2025-08-22 8:15 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
Sven Auhagen reports memory allocation errors during set flush.
This is because of GFP_ATOMIC allocations because rhashtable walker
uses rcu and cannot sleep.
Build a linear list in rhashtable walker, drop rcu read lock and
then call the iter callback in a second loop.
This allows use of GFP_KERNEL allocations.
The second loop has no noticeable impact on set flush durations, even
for large (800k entries) sets.
Florian Westphal (2):
netfilter: nf_tables: allow iter callbacks to sleep
netfilter: nf_tables: all transaction allocations can now sleep
include/net/netfilter/nf_tables.h | 2 +
net/netfilter/nf_tables_api.c | 47 ++++++--------
net/netfilter/nft_set_hash.c | 102 +++++++++++++++++++++++++++++-
net/netfilter/nft_set_rbtree.c | 35 +++++++---
4 files changed, 147 insertions(+), 39 deletions(-)
--
2.49.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH nf-next v2 1/2] netfilter: nf_tables: allow iter callbacks to sleep
2025-08-22 8:15 [PATCH nf-next v2 0/2] netfilter: nf_tables: avoid atomic allocations for set flush Florian Westphal
@ 2025-08-22 8:15 ` Florian Westphal
2025-08-28 14:30 ` Pablo Neira Ayuso
2025-08-22 8:15 ` [PATCH nf-next v2 2/2] netfilter: nf_tables: all transaction allocations can now sleep Florian Westphal
1 sibling, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2025-08-22 8:15 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal, Sven Auhagen
Quoting Sven Auhagen:
we do see on occasions that we get the following error message, more so on
x86 systems than on arm64:
Error: Could not process rule: Cannot allocate memory delete table inet filter
It is not a consistent error and does not happen all the time.
We are on Kernel 6.6.80, seems to me like we have something along the lines
of the nf_tables: allow clone callbacks to sleep problem using GFP_ATOMIC.
As hinted at by Sven, this is because of GFP_ATOMIC allocations during
set flush.
When set is flushed, all elements are deactivated. This triggers a set
walk and each element gets added to the transaction list.
The rbtree and rhashtable sets don't allow the iter callback to sleep:
rbtree walk acquires read side of an rwlock with bh disabled, rhashtable
walk happens with rcu read lock held.
Rbtree is simple enough to resolve:
When the walk context is ITER_READ, no change is needed (the iter
callback must not deactivate elements; we're not in a transaction).
When the iter type is ITER_UPDATE, the rwlock isn't needed because the
caller holds the transaction mutex, this prevents any and all changes to
the ruleset, including add/remove of set elements.
Rhashtable is slightly more complex.
When the iter type is ITER_READ, no change is needed, like rbtree.
For ITER_UPDATE, we hold transaction mutex which prevents elements from
getting free'd, even outside of rcu read lock section.
So build a temporary list of all elements while doing the rcu iteration
and then call the iterator in a second pass.
The disadvantage is the need to iterate twice, but this cost comes with
the benefit to allow the iter callback to use GFP_KERNEL allocations in
a followup patch.
The new list based logic makes it necessary to catch recursive calls to
the same set earlier.
Such walk -> iter -> walk recursion for the same set can happen during
ruleset validation in case userspace gave us a bogus (cyclic) ruleset
where verdict map m jumps to chain that sooner or later also calls
"vmap @m".
Before the new ->in_update_walk test, the ruleset is rejected because the
infinite recursion causes ctx->level to exceed the allowed maximum.
But with the new logic added here, elements would get skipped:
nft_rhash_walk_update would see elements that are on the walk_list of
an older stack frame.
As all recursive calls into same map results in -EMLINK, we can avoid this
problem by using the new in_update_walk flag and reject immediately.
Next patch converts the problematic GFP_ATOMIC allocations.
Reported-by: Sven Auhagen <Sven.Auhagen@belden.com>
Closes: https://lore.kernel.org/netfilter-devel/BY1PR18MB5874110CAFF1ED098D0BC4E7E07BA@BY1PR18MB5874.namprd18.prod.outlook.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/netfilter/nf_tables.h | 2 +
net/netfilter/nft_set_hash.c | 102 +++++++++++++++++++++++++++++-
net/netfilter/nft_set_rbtree.c | 35 +++++++---
3 files changed, 128 insertions(+), 11 deletions(-)
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 891e43a01bdc..e2128663b160 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -556,6 +556,7 @@ struct nft_set_elem_expr {
* @size: maximum set size
* @field_len: length of each field in concatenation, bytes
* @field_count: number of concatenated fields in element
+ * @in_update_walk: true during ->walk() in transaction phase
* @use: number of rules references to this set
* @nelems: number of elements
* @ndeact: number of deactivated elements queued for removal
@@ -590,6 +591,7 @@ struct nft_set {
u32 size;
u8 field_len[NFT_REG32_COUNT];
u8 field_count;
+ bool in_update_walk;
u32 use;
atomic_t nelems;
u32 ndeact;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 266d0c637225..cc302543c2e4 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -30,6 +30,7 @@ struct nft_rhash {
struct nft_rhash_elem {
struct nft_elem_priv priv;
struct rhash_head node;
+ struct llist_node walk_node;
u32 wq_gc_seq;
struct nft_set_ext ext;
};
@@ -144,6 +145,7 @@ nft_rhash_update(struct nft_set *set, const u32 *key,
goto err1;
he = nft_elem_priv_cast(elem_priv);
+ init_llist_node(&he->walk_node);
prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
nft_rhash_params);
if (IS_ERR(prev))
@@ -180,6 +182,7 @@ static int nft_rhash_insert(const struct net *net, const struct nft_set *set,
};
struct nft_rhash_elem *prev;
+ init_llist_node(&he->walk_node);
prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
nft_rhash_params);
if (IS_ERR(prev))
@@ -261,12 +264,12 @@ static bool nft_rhash_delete(const struct nft_set *set,
return true;
}
-static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
- struct nft_set_iter *iter)
+static void nft_rhash_walk_ro(const struct nft_ctx *ctx, struct nft_set *set,
+ struct nft_set_iter *iter)
{
struct nft_rhash *priv = nft_set_priv(set);
- struct nft_rhash_elem *he;
struct rhashtable_iter hti;
+ struct nft_rhash_elem *he;
rhashtable_walk_enter(&priv->ht, &hti);
rhashtable_walk_start(&hti);
@@ -295,6 +298,99 @@ static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
rhashtable_walk_exit(&hti);
}
+static void nft_rhash_walk_update(const struct nft_ctx *ctx,
+ struct nft_set *set,
+ struct nft_set_iter *iter)
+{
+ struct nft_rhash *priv = nft_set_priv(set);
+ struct nft_rhash_elem *he, *tmp;
+ struct llist_node *first_node;
+ struct rhashtable_iter hti;
+ LLIST_HEAD(walk_list);
+
+ lockdep_assert_held(&nft_pernet(ctx->net)->commit_mutex);
+
+ if (set->in_update_walk) {
+ /* This can happen with bogus rulesets during ruleset validation
+ * when a verdict map causes a jump back to the same map.
+ *
+ * Without this extra check the walk_next loop below will see
+ * elems on the callers walk_list and skip (not validate) them.
+ */
+ iter->err = -EMLINK;
+ return;
+ }
+
+ /* walk happens under RCU.
+ *
+ * We create a snapshot list so ->iter callback can sleep.
+ * commit_mutex is held, elements can ...
+ * .. be added in parallel from dataplane (dynset)
+ * .. be marked as dead in parallel from dataplane (dynset).
+ * .. be queued for removal in parallel (gc timeout).
+ * .. not be freed: transaction mutex is held.
+ */
+ rhashtable_walk_enter(&priv->ht, &hti);
+ rhashtable_walk_start(&hti);
+
+ while ((he = rhashtable_walk_next(&hti))) {
+ if (IS_ERR(he)) {
+ if (PTR_ERR(he) != -EAGAIN) {
+ iter->err = PTR_ERR(he);
+ break;
+ }
+
+ continue;
+ }
+
+ /* rhashtable resized during walk, skip */
+ if (llist_on_list(&he->walk_node))
+ continue;
+
+ if (iter->count < iter->skip) {
+ iter->count++;
+ continue;
+ }
+
+ llist_add(&he->walk_node, &walk_list);
+ }
+ rhashtable_walk_stop(&hti);
+ rhashtable_walk_exit(&hti);
+
+ first_node = __llist_del_all(&walk_list);
+ set->in_update_walk = true;
+ llist_for_each_entry_safe(he, tmp, first_node, walk_node) {
+ if (iter->err == 0) {
+ iter->err = iter->fn(ctx, set, iter, &he->priv);
+ if (iter->err == 0)
+ iter->count++;
+ }
+
+ /* all entries must be cleared again, else next ->walk iteration
+ * will skip entries.
+ */
+ init_llist_node(&he->walk_node);
+ }
+ set->in_update_walk = false;
+}
+
+static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
+ struct nft_set_iter *iter)
+{
+ switch (iter->type) {
+ case NFT_ITER_UPDATE:
+ nft_rhash_walk_update(ctx, set, iter);
+ break;
+ case NFT_ITER_READ:
+ nft_rhash_walk_ro(ctx, set, iter);
+ break;
+ default:
+ iter->err = -EINVAL;
+ WARN_ON_ONCE(1);
+ break;
+ }
+}
+
static bool nft_rhash_expr_needs_gc_run(const struct nft_set *set,
struct nft_set_ext *ext)
{
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 938a257c069e..b311b66df3e9 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -584,15 +584,14 @@ nft_rbtree_deactivate(const struct net *net, const struct nft_set *set,
return NULL;
}
-static void nft_rbtree_walk(const struct nft_ctx *ctx,
- struct nft_set *set,
- struct nft_set_iter *iter)
+static void nft_rbtree_do_walk(const struct nft_ctx *ctx,
+ struct nft_set *set,
+ struct nft_set_iter *iter)
{
struct nft_rbtree *priv = nft_set_priv(set);
struct nft_rbtree_elem *rbe;
struct rb_node *node;
- read_lock_bh(&priv->lock);
for (node = rb_first(&priv->root); node != NULL; node = rb_next(node)) {
rbe = rb_entry(node, struct nft_rbtree_elem, node);
@@ -600,14 +599,34 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx,
goto cont;
iter->err = iter->fn(ctx, set, iter, &rbe->priv);
- if (iter->err < 0) {
- read_unlock_bh(&priv->lock);
+ if (iter->err < 0)
return;
- }
cont:
iter->count++;
}
- read_unlock_bh(&priv->lock);
+}
+
+static void nft_rbtree_walk(const struct nft_ctx *ctx,
+ struct nft_set *set,
+ struct nft_set_iter *iter)
+{
+ struct nft_rbtree *priv = nft_set_priv(set);
+
+ switch (iter->type) {
+ case NFT_ITER_UPDATE:
+ lockdep_assert_held(&nft_pernet(ctx->net)->commit_mutex);
+ nft_rbtree_do_walk(ctx, set, iter);
+ break;
+ case NFT_ITER_READ:
+ read_lock_bh(&priv->lock);
+ nft_rbtree_do_walk(ctx, set, iter);
+ read_unlock_bh(&priv->lock);
+ break;
+ default:
+ iter->err = -EINVAL;
+ WARN_ON_ONCE(1);
+ break;
+ }
}
static void nft_rbtree_gc_remove(struct net *net, struct nft_set *set,
--
2.49.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH nf-next v2 2/2] netfilter: nf_tables: all transaction allocations can now sleep
2025-08-22 8:15 [PATCH nf-next v2 0/2] netfilter: nf_tables: avoid atomic allocations for set flush Florian Westphal
2025-08-22 8:15 ` [PATCH nf-next v2 1/2] netfilter: nf_tables: allow iter callbacks to sleep Florian Westphal
@ 2025-08-22 8:15 ` Florian Westphal
1 sibling, 0 replies; 6+ messages in thread
From: Florian Westphal @ 2025-08-22 8:15 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
Now that nft_setelem_flush is not called with rcu read lock held or
disabled softinterrupts anymore this can now use GFP_KERNEL too.
This is the last atomic allocation of transaction elements, so remove
all gfp_t arguments and the wrapper function.
This makes attempts to delete large sets much more reliable, before
this was prone to transient memory allocation failures.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nf_tables_api.c | 47 ++++++++++++++---------------------
1 file changed, 19 insertions(+), 28 deletions(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 58c5425d61c2..54519b3d2868 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -151,12 +151,12 @@ static void nft_ctx_init(struct nft_ctx *ctx,
bitmap_zero(ctx->reg_inited, NFT_REG32_NUM);
}
-static struct nft_trans *nft_trans_alloc_gfp(const struct nft_ctx *ctx,
- int msg_type, u32 size, gfp_t gfp)
+static struct nft_trans *nft_trans_alloc(const struct nft_ctx *ctx,
+ int msg_type, u32 size)
{
struct nft_trans *trans;
- trans = kzalloc(size, gfp);
+ trans = kzalloc(size, GFP_KERNEL);
if (trans == NULL)
return NULL;
@@ -172,12 +172,6 @@ static struct nft_trans *nft_trans_alloc_gfp(const struct nft_ctx *ctx,
return trans;
}
-static struct nft_trans *nft_trans_alloc(const struct nft_ctx *ctx,
- int msg_type, u32 size)
-{
- return nft_trans_alloc_gfp(ctx, msg_type, size, GFP_KERNEL);
-}
-
static struct nft_trans_binding *nft_trans_get_binding(struct nft_trans *trans)
{
switch (trans->msg_type) {
@@ -442,8 +436,7 @@ static bool nft_trans_collapse_set_elem_allowed(const struct nft_trans_elem *a,
static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
struct nft_trans_elem *tail,
- struct nft_trans_elem *trans,
- gfp_t gfp)
+ struct nft_trans_elem *trans)
{
unsigned int nelems, old_nelems = tail->nelems;
struct nft_trans_elem *new_trans;
@@ -466,9 +459,11 @@ static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
/* krealloc might free tail which invalidates list pointers */
list_del_init(&tail->nft_trans.list);
- new_trans = krealloc(tail, struct_size(tail, elems, nelems), gfp);
+ new_trans = krealloc(tail, struct_size(tail, elems, nelems),
+ GFP_KERNEL);
if (!new_trans) {
- list_add_tail(&tail->nft_trans.list, &nft_net->commit_list);
+ list_add_tail(&tail->nft_trans.list,
+ &nft_net->commit_list);
return false;
}
@@ -484,7 +479,7 @@ static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
}
static bool nft_trans_try_collapse(struct nftables_pernet *nft_net,
- struct nft_trans *trans, gfp_t gfp)
+ struct nft_trans *trans)
{
struct nft_trans *tail;
@@ -501,7 +496,7 @@ static bool nft_trans_try_collapse(struct nftables_pernet *nft_net,
case NFT_MSG_DELSETELEM:
return nft_trans_collapse_set_elem(nft_net,
nft_trans_container_elem(tail),
- nft_trans_container_elem(trans), gfp);
+ nft_trans_container_elem(trans));
}
return false;
@@ -537,17 +532,14 @@ static void nft_trans_commit_list_add_tail(struct net *net, struct nft_trans *tr
}
}
-static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans,
- gfp_t gfp)
+static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans)
{
struct nftables_pernet *nft_net = nft_pernet(net);
WARN_ON_ONCE(trans->msg_type != NFT_MSG_NEWSETELEM &&
trans->msg_type != NFT_MSG_DELSETELEM);
- might_alloc(gfp);
-
- if (nft_trans_try_collapse(nft_net, trans, gfp)) {
+ if (nft_trans_try_collapse(nft_net, trans)) {
kfree(trans);
return;
}
@@ -7549,7 +7541,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
}
ue->priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
goto err_elem_free;
}
}
@@ -7573,7 +7565,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
}
nft_trans_container_elem(trans)->elems[0].priv = elem.priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
err_set_full:
@@ -7839,7 +7831,7 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
nft_setelem_data_deactivate(ctx->net, set, elem.priv);
nft_trans_container_elem(trans)->elems[0].priv = elem.priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
fail_ops:
@@ -7864,9 +7856,8 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
if (!nft_set_elem_active(ext, iter->genmask))
return 0;
- trans = nft_trans_alloc_gfp(ctx, NFT_MSG_DELSETELEM,
- struct_size_t(struct nft_trans_elem, elems, 1),
- GFP_ATOMIC);
+ trans = nft_trans_alloc(ctx, NFT_MSG_DELSETELEM,
+ struct_size_t(struct nft_trans_elem, elems, 1));
if (!trans)
return -ENOMEM;
@@ -7877,7 +7868,7 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
nft_trans_elem_set(trans) = set;
nft_trans_container_elem(trans)->nelems = 1;
nft_trans_container_elem(trans)->elems[0].priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_ATOMIC);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
}
@@ -7894,7 +7885,7 @@ static int __nft_set_catchall_flush(const struct nft_ctx *ctx,
nft_setelem_data_deactivate(ctx->net, set, elem_priv);
nft_trans_container_elem(trans)->elems[0].priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
}
--
2.49.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH nf-next v2 1/2] netfilter: nf_tables: allow iter callbacks to sleep
2025-08-22 8:15 ` [PATCH nf-next v2 1/2] netfilter: nf_tables: allow iter callbacks to sleep Florian Westphal
@ 2025-08-28 14:30 ` Pablo Neira Ayuso
2025-08-28 15:10 ` Florian Westphal
0 siblings, 1 reply; 6+ messages in thread
From: Pablo Neira Ayuso @ 2025-08-28 14:30 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel, Sven Auhagen
Hi Florian,
On Fri, Aug 22, 2025 at 10:15:37AM +0200, Florian Westphal wrote:
[...]
> diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
> index 266d0c637225..cc302543c2e4 100644
> --- a/net/netfilter/nft_set_hash.c
> +++ b/net/netfilter/nft_set_hash.c
> @@ -30,6 +30,7 @@ struct nft_rhash {
> struct nft_rhash_elem {
> struct nft_elem_priv priv;
> struct rhash_head node;
> + struct llist_node walk_node;
> u32 wq_gc_seq;
> struct nft_set_ext ext;
> };
> @@ -144,6 +145,7 @@ nft_rhash_update(struct nft_set *set, const u32 *key,
> goto err1;
>
> he = nft_elem_priv_cast(elem_priv);
> + init_llist_node(&he->walk_node);
> prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
> nft_rhash_params);
> if (IS_ERR(prev))
> @@ -180,6 +182,7 @@ static int nft_rhash_insert(const struct net *net, const struct nft_set *set,
> };
> struct nft_rhash_elem *prev;
>
> + init_llist_node(&he->walk_node);
> prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
> nft_rhash_params);
> if (IS_ERR(prev))
> @@ -261,12 +264,12 @@ static bool nft_rhash_delete(const struct nft_set *set,
> return true;
> }
>
> -static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
> - struct nft_set_iter *iter)
> +static void nft_rhash_walk_ro(const struct nft_ctx *ctx, struct nft_set *set,
> + struct nft_set_iter *iter)
> {
> struct nft_rhash *priv = nft_set_priv(set);
> - struct nft_rhash_elem *he;
> struct rhashtable_iter hti;
> + struct nft_rhash_elem *he;
>
> rhashtable_walk_enter(&priv->ht, &hti);
> rhashtable_walk_start(&hti);
> @@ -295,6 +298,99 @@ static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
> rhashtable_walk_exit(&hti);
> }
>
> +static void nft_rhash_walk_update(const struct nft_ctx *ctx,
> + struct nft_set *set,
> + struct nft_set_iter *iter)
> +{
> + struct nft_rhash *priv = nft_set_priv(set);
> + struct nft_rhash_elem *he, *tmp;
> + struct llist_node *first_node;
> + struct rhashtable_iter hti;
> + LLIST_HEAD(walk_list);
> +
> + lockdep_assert_held(&nft_pernet(ctx->net)->commit_mutex);
> +
> + if (set->in_update_walk) {
> + /* This can happen with bogus rulesets during ruleset validation
> + * when a verdict map causes a jump back to the same map.
> + *
> + * Without this extra check the walk_next loop below will see
> + * elems on the callers walk_list and skip (not validate) them.
> + */
> + iter->err = -EMLINK;
> + return;
> + }
> +
> + /* walk happens under RCU.
> + *
> + * We create a snapshot list so ->iter callback can sleep.
> + * commit_mutex is held, elements can ...
> + * .. be added in parallel from dataplane (dynset)
> + * .. be marked as dead in parallel from dataplane (dynset).
> + * .. be queued for removal in parallel (gc timeout).
> + * .. not be freed: transaction mutex is held.
> + */
> + rhashtable_walk_enter(&priv->ht, &hti);
> + rhashtable_walk_start(&hti);
> +
> + while ((he = rhashtable_walk_next(&hti))) {
> + if (IS_ERR(he)) {
> + if (PTR_ERR(he) != -EAGAIN) {
> + iter->err = PTR_ERR(he);
> + break;
> + }
> +
> + continue;
> + }
> +
> + /* rhashtable resized during walk, skip */
> + if (llist_on_list(&he->walk_node))
> + continue;
> +
> + if (iter->count < iter->skip) {
> + iter->count++;
> + continue;
> + }
Not causing any harm, but is iter->count useful for this
NFT_ITER_UPDATE variant?
I think iter->count is only used for netlink dumps, to resume from the
last netlink message.
> + llist_add(&he->walk_node, &walk_list);
> + }
> + rhashtable_walk_stop(&hti);
> + rhashtable_walk_exit(&hti);
> +
> + first_node = __llist_del_all(&walk_list);
> + set->in_update_walk = true;
> + llist_for_each_entry_safe(he, tmp, first_node, walk_node) {
> + if (iter->err == 0) {
> + iter->err = iter->fn(ctx, set, iter, &he->priv);
> + if (iter->err == 0)
> + iter->count++;
> + }
> +
> + /* all entries must be cleared again, else next ->walk iteration
> + * will skip entries.
> + */
> + init_llist_node(&he->walk_node);
> + }
> + set->in_update_walk = false;
> +}
> +
> +static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
> + struct nft_set_iter *iter)
> +{
> + switch (iter->type) {
> + case NFT_ITER_UPDATE:
> + nft_rhash_walk_update(ctx, set, iter);
> + break;
> + case NFT_ITER_READ:
> + nft_rhash_walk_ro(ctx, set, iter);
> + break;
> + default:
> + iter->err = -EINVAL;
> + WARN_ON_ONCE(1);
> + break;
> + }
> +}
> +
> static bool nft_rhash_expr_needs_gc_run(const struct nft_set *set,
> struct nft_set_ext *ext)
> {
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH nf-next v2 1/2] netfilter: nf_tables: allow iter callbacks to sleep
2025-08-28 14:30 ` Pablo Neira Ayuso
@ 2025-08-28 15:10 ` Florian Westphal
2025-08-28 23:46 ` Florian Westphal
0 siblings, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2025-08-28 15:10 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel, Sven Auhagen
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > + if (iter->count < iter->skip) {
> > + iter->count++;
> > + continue;
> > + }
>
> Not causing any harm, but is iter->count useful for this
> NFT_ITER_UPDATE variant?
>
> I think iter->count is only used for netlink dumps, to resume from the
> last netlink message.
Yes. Should I just remove the above or also add a WARN_ON_ONCE(iter->skip) ...
> > + switch (iter->type) {
> > + case NFT_ITER_UPDATE:
... here?
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH nf-next v2 1/2] netfilter: nf_tables: allow iter callbacks to sleep
2025-08-28 15:10 ` Florian Westphal
@ 2025-08-28 23:46 ` Florian Westphal
0 siblings, 0 replies; 6+ messages in thread
From: Florian Westphal @ 2025-08-28 23:46 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel, Sven Auhagen
Florian Westphal <fw@strlen.de> wrote:
> > Not causing any harm, but is iter->count useful for this
> > NFT_ITER_UPDATE variant?
> >
> > I think iter->count is only used for netlink dumps, to resume from the
> > last netlink message.
>
> Yes. Should I just remove the above or also add a WARN_ON_ONCE(iter->skip) ...
FTR, I removed the iterator and placed the WARN_ON_ONCE
> > > + switch (iter->type) {
> > > + case NFT_ITER_UPDATE:
here, then pushed this to nf-next:testing.
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-08-28 23:46 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-22 8:15 [PATCH nf-next v2 0/2] netfilter: nf_tables: avoid atomic allocations for set flush Florian Westphal
2025-08-22 8:15 ` [PATCH nf-next v2 1/2] netfilter: nf_tables: allow iter callbacks to sleep Florian Westphal
2025-08-28 14:30 ` Pablo Neira Ayuso
2025-08-28 15:10 ` Florian Westphal
2025-08-28 23:46 ` Florian Westphal
2025-08-22 8:15 ` [PATCH nf-next v2 2/2] netfilter: nf_tables: all transaction allocations can now sleep Florian Westphal
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.