* [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Florian Westphal @ 2025-07-04 12:30 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
Removal of many set elements, e.g. during set flush or ruleset
deletion, can sometimes fail due to memory pressure.
Reduce the likelihood of this happening and enable sleeping allocations
for this.
Florian Westphal (2):
netfilter: nf_tables: allow iter callbacks to sleep
netfilter: nf_tables: all transaction allocations can now sleep
include/net/netfilter/nf_tables.h | 2 +
net/netfilter/nf_tables_api.c | 43 +++++--------
net/netfilter/nft_set_hash.c | 102 +++++++++++++++++++++++++++++-
net/netfilter/nft_set_rbtree.c | 35 +++++++---
4 files changed, 144 insertions(+), 38 deletions(-)
--
2.50.0
* [nf-next 1/2] netfilter: nf_tables: allow iter callbacks to sleep
From: Florian Westphal @ 2025-07-04 12:30 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal, Sven Auhagen
Quoting Sven Auhagen:
we do see on occasions that we get the following error message, more so on
x86 systems than on arm64:
Error: Could not process rule: Cannot allocate memory delete table inet filter
It is not a consistent error and does not happen all the time.
We are on Kernel 6.6.80, seems to me like we have something along the lines
of the nf_tables: allow clone callbacks to sleep problem using GFP_ATOMIC.
As hinted at by Sven, this is because of GFP_ATOMIC allocations during set
flush.
When a set is flushed, all elements are deactivated. This triggers a set walk,
and each element gets added to the transaction list.
The rbtree and rhashtable sets don't allow the iter callback to sleep: the
rbtree walk acquires the read side of an rwlock with bh disabled, and the
rhashtable walk happens with the rcu read lock held.
rbtree is simple:
When the walk context is ITER_READ, no change is needed (the iter
callback must not deactivate elements; we're not in a transaction).
When the iter type is ITER_UPDATE, the rwlock isn't needed because the
caller holds the transaction mutex, which prevents any and all changes to
the ruleset, including add/remove of set elements.
Unlike rhashtable, rbtree doesn't support insertions/removals from the
datapath, and timeouts are handled only from the control plane.
rhashtable is more complex. Just like rbtree, no change is needed for
ITER_READ.
For ITER_UPDATE, we hold the transaction mutex, which prevents elements
from being freed, even outside of the rcu read-lock section.
Build a temporary list of all elements while doing the rcu iteration
and then call the iterator in a second pass.
The disadvantage is the need to iterate twice, but it allows the iter
callback to use GFP_KERNEL allocations in a follow-up patch.
The new list-based logic makes it necessary to catch recursive calls to
the same set earlier.
Such walk -> iter -> walk recursion for the same set can happen during
ruleset validation in case userspace gave us a bogus (cyclic) ruleset
where verdict map m jumps to a chain that sooner or later also calls
"vmap @m".
Before the new ->in_update_walk test, the ruleset is rejected because the
infinite recursion makes ctx->level hit the allowed maximum.
But with the new logic added here, elements would get skipped as
nft_rhash_walk_update sees elements that are on the walk_list of an
earlier, nested call.
Use a per-set "in_update_walk" flag and just return -EMLINK immediately
to avoid this.
Next patch converts the problematic GFP_ATOMIC allocations.
Reported-by: Sven Auhagen <Sven.Auhagen@belden.com>
Closes: https://lore.kernel.org/netfilter-devel/aGO89KaXVNuToRJg@strlen.de/
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/netfilter/nf_tables.h | 2 +
net/netfilter/nft_set_hash.c | 102 +++++++++++++++++++++++++++++-
net/netfilter/nft_set_rbtree.c | 35 +++++++---
3 files changed, 128 insertions(+), 11 deletions(-)
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index e4d8e451e935..0cdaac6cc006 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -562,6 +562,7 @@ struct nft_set_elem_expr {
* @size: maximum set size
* @field_len: length of each field in concatenation, bytes
* @field_count: number of concatenated fields in element
+ * @in_update_walk: true during ->walk() in transaction phase
* @use: number of rules references to this set
* @nelems: number of elements
* @ndeact: number of deactivated elements queued for removal
@@ -596,6 +597,7 @@ struct nft_set {
u32 size;
u8 field_len[NFT_REG32_COUNT];
u8 field_count;
+ bool in_update_walk;
u32 use;
atomic_t nelems;
u32 ndeact;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index abb0c8ec6371..3d235f4e4f60 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -30,6 +30,7 @@ struct nft_rhash {
struct nft_rhash_elem {
struct nft_elem_priv priv;
struct rhash_head node;
+ struct llist_node walk_node;
u32 wq_gc_seq;
struct nft_set_ext ext;
};
@@ -148,6 +149,7 @@ static bool nft_rhash_update(struct nft_set *set, const u32 *key,
goto err1;
he = nft_elem_priv_cast(elem_priv);
+ init_llist_node(&he->walk_node);
prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
nft_rhash_params);
if (IS_ERR(prev))
@@ -185,6 +187,7 @@ static int nft_rhash_insert(const struct net *net, const struct nft_set *set,
};
struct nft_rhash_elem *prev;
+ init_llist_node(&he->walk_node);
prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
nft_rhash_params);
if (IS_ERR(prev))
@@ -266,12 +269,12 @@ static bool nft_rhash_delete(const struct nft_set *set,
return true;
}
-static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
- struct nft_set_iter *iter)
+static void nft_rhash_walk_ro(const struct nft_ctx *ctx, struct nft_set *set,
+ struct nft_set_iter *iter)
{
struct nft_rhash *priv = nft_set_priv(set);
- struct nft_rhash_elem *he;
struct rhashtable_iter hti;
+ struct nft_rhash_elem *he;
rhashtable_walk_enter(&priv->ht, &hti);
rhashtable_walk_start(&hti);
@@ -300,6 +303,99 @@ static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
rhashtable_walk_exit(&hti);
}
+static void nft_rhash_walk_update(const struct nft_ctx *ctx, struct nft_set *set,
+ struct nft_set_iter *iter)
+{
+ struct nft_rhash *priv = nft_set_priv(set);
+ struct nft_rhash_elem *he, *tmp;
+ struct llist_node *first_node;
+ struct rhashtable_iter hti;
+ LLIST_HEAD(walk_list);
+
+ lockdep_assert_held(&nft_pernet(read_pnet(&set->net))->commit_mutex);
+
+ if (set->in_update_walk) {
+ /* This can happen with bogus rulesets during ruleset validation
+ * when a verdict map causes a jump back to the same verdict map.
+ *
+ * Without this extra check the walk_next loop below will see
+ * elems on the calling function's walk_list and skip (not validate)
+ * them.
+ */
+ iter->err = -EMLINK;
+ return;
+ }
+
+ /* walk happens under RCU.
+ *
+ * We create a snapshot list so ->iter callback can sleep.
+ * commit_mutex is held, elements can ...
+ * .. be added in parallel from dataplane (dynset)
+ * .. be marked as dead in parallel from dataplane (dynset).
+ * .. be queued for removal in parallel (gc timeout).
+ * .. not be freed: transaction mutex is held.
+ */
+ rhashtable_walk_enter(&priv->ht, &hti);
+ rhashtable_walk_start(&hti);
+
+ while ((he = rhashtable_walk_next(&hti))) {
+ if (IS_ERR(he)) {
+ if (PTR_ERR(he) != -EAGAIN) {
+ iter->err = PTR_ERR(he);
+ break;
+ }
+
+ continue;
+ }
+
+ /* rhashtable resized during walk, skip */
+ if (llist_on_list(&he->walk_node))
+ continue;
+
+ if (iter->count < iter->skip) {
+ iter->count++;
+ continue;
+ }
+
+ llist_add(&he->walk_node, &walk_list);
+ }
+ rhashtable_walk_stop(&hti);
+ rhashtable_walk_exit(&hti);
+
+ first_node = __llist_del_all(&walk_list);
+ set->in_update_walk = true;
+ llist_for_each_entry_safe(he, tmp, first_node, walk_node) {
+ if (iter->err == 0) {
+ iter->err = iter->fn(ctx, set, iter, &he->priv);
+ if (iter->err == 0)
+ iter->count++;
+ }
+
+ /* all entries must be cleared again, else next ->walk iteration
+ * will skip entries.
+ */
+ init_llist_node(&he->walk_node);
+ }
+ set->in_update_walk = false;
+}
+
+static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
+ struct nft_set_iter *iter)
+{
+ switch (iter->type) {
+ case NFT_ITER_UPDATE:
+ nft_rhash_walk_update(ctx, set, iter);
+ break;
+ case NFT_ITER_READ:
+ nft_rhash_walk_ro(ctx, set, iter);
+ break;
+ default:
+ iter->err = -EINVAL;
+ WARN_ON_ONCE(1);
+ break;
+ }
+}
+
static bool nft_rhash_expr_needs_gc_run(const struct nft_set *set,
struct nft_set_ext *ext)
{
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 2e8ef16ff191..5c55840d4ebd 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -586,15 +586,14 @@ nft_rbtree_deactivate(const struct net *net, const struct nft_set *set,
return NULL;
}
-static void nft_rbtree_walk(const struct nft_ctx *ctx,
- struct nft_set *set,
- struct nft_set_iter *iter)
+static void nft_rbtree_do_walk(const struct nft_ctx *ctx,
+ struct nft_set *set,
+ struct nft_set_iter *iter)
{
struct nft_rbtree *priv = nft_set_priv(set);
struct nft_rbtree_elem *rbe;
struct rb_node *node;
- read_lock_bh(&priv->lock);
for (node = rb_first(&priv->root); node != NULL; node = rb_next(node)) {
rbe = rb_entry(node, struct nft_rbtree_elem, node);
@@ -602,14 +601,34 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx,
goto cont;
iter->err = iter->fn(ctx, set, iter, &rbe->priv);
- if (iter->err < 0) {
- read_unlock_bh(&priv->lock);
+ if (iter->err < 0)
return;
- }
cont:
iter->count++;
}
- read_unlock_bh(&priv->lock);
+}
+
+static void nft_rbtree_walk(const struct nft_ctx *ctx,
+ struct nft_set *set,
+ struct nft_set_iter *iter)
+{
+ struct nft_rbtree *priv = nft_set_priv(set);
+
+ switch (iter->type) {
+ case NFT_ITER_UPDATE:
+ lockdep_assert_held(&nft_pernet(read_pnet(&set->net))->commit_mutex);
+ nft_rbtree_do_walk(ctx, set, iter);
+ break;
+ case NFT_ITER_READ:
+ read_lock_bh(&priv->lock);
+ nft_rbtree_do_walk(ctx, set, iter);
+ read_unlock_bh(&priv->lock);
+ break;
+ default:
+ iter->err = -EINVAL;
+ WARN_ON_ONCE(1);
+ break;
+ }
}
static void nft_rbtree_gc_remove(struct net *net, struct nft_set *set,
--
2.50.0
* [nf-next 2/2] netfilter: nf_tables: all transaction allocations can now sleep
From: Florian Westphal @ 2025-07-04 12:30 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
Now that nft_setelem_flush is no longer called with the rcu read lock held
or with softinterrupts disabled, it can use GFP_KERNEL too.
This is the last atomic allocation of transaction elements, so remove
all gfp_t arguments and the wrapper function.
This makes attempts to delete large sets much more reliable; before,
this was prone to transient memory allocation failures.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nf_tables_api.c | 43 +++++++++++++----------------------
1 file changed, 16 insertions(+), 27 deletions(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 24c71ecb2179..9615d0a873c9 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -151,12 +151,12 @@ static void nft_ctx_init(struct nft_ctx *ctx,
bitmap_zero(ctx->reg_inited, NFT_REG32_NUM);
}
-static struct nft_trans *nft_trans_alloc_gfp(const struct nft_ctx *ctx,
- int msg_type, u32 size, gfp_t gfp)
+static struct nft_trans *nft_trans_alloc(const struct nft_ctx *ctx,
+ int msg_type, u32 size)
{
struct nft_trans *trans;
- trans = kzalloc(size, gfp);
+ trans = kzalloc(size, GFP_KERNEL);
if (trans == NULL)
return NULL;
@@ -172,12 +172,6 @@ static struct nft_trans *nft_trans_alloc_gfp(const struct nft_ctx *ctx,
return trans;
}
-static struct nft_trans *nft_trans_alloc(const struct nft_ctx *ctx,
- int msg_type, u32 size)
-{
- return nft_trans_alloc_gfp(ctx, msg_type, size, GFP_KERNEL);
-}
-
static struct nft_trans_binding *nft_trans_get_binding(struct nft_trans *trans)
{
switch (trans->msg_type) {
@@ -442,8 +436,7 @@ static bool nft_trans_collapse_set_elem_allowed(const struct nft_trans_elem *a,
static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
struct nft_trans_elem *tail,
- struct nft_trans_elem *trans,
- gfp_t gfp)
+ struct nft_trans_elem *trans)
{
unsigned int nelems, old_nelems = tail->nelems;
struct nft_trans_elem *new_trans;
@@ -466,7 +459,7 @@ static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
/* krealloc might free tail which invalidates list pointers */
list_del_init(&tail->nft_trans.list);
- new_trans = krealloc(tail, struct_size(tail, elems, nelems), gfp);
+ new_trans = krealloc(tail, struct_size(tail, elems, nelems), GFP_KERNEL);
if (!new_trans) {
list_add_tail(&tail->nft_trans.list, &nft_net->commit_list);
return false;
@@ -484,7 +477,7 @@ static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
}
static bool nft_trans_try_collapse(struct nftables_pernet *nft_net,
- struct nft_trans *trans, gfp_t gfp)
+ struct nft_trans *trans)
{
struct nft_trans *tail;
@@ -501,7 +494,7 @@ static bool nft_trans_try_collapse(struct nftables_pernet *nft_net,
case NFT_MSG_DELSETELEM:
return nft_trans_collapse_set_elem(nft_net,
nft_trans_container_elem(tail),
- nft_trans_container_elem(trans), gfp);
+ nft_trans_container_elem(trans));
}
return false;
@@ -537,17 +530,14 @@ static void nft_trans_commit_list_add_tail(struct net *net, struct nft_trans *tr
}
}
-static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans,
- gfp_t gfp)
+static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans)
{
struct nftables_pernet *nft_net = nft_pernet(net);
WARN_ON_ONCE(trans->msg_type != NFT_MSG_NEWSETELEM &&
trans->msg_type != NFT_MSG_DELSETELEM);
- might_alloc(gfp);
-
- if (nft_trans_try_collapse(nft_net, trans, gfp)) {
+ if (nft_trans_try_collapse(nft_net, trans)) {
kfree(trans);
return;
}
@@ -7529,7 +7519,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
}
ue->priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
goto err_elem_free;
}
}
@@ -7553,7 +7543,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
}
nft_trans_container_elem(trans)->elems[0].priv = elem.priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
err_set_full:
@@ -7819,7 +7809,7 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
nft_setelem_data_deactivate(ctx->net, set, elem.priv);
nft_trans_container_elem(trans)->elems[0].priv = elem.priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
fail_ops:
@@ -7844,9 +7834,8 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
if (!nft_set_elem_active(ext, iter->genmask))
return 0;
- trans = nft_trans_alloc_gfp(ctx, NFT_MSG_DELSETELEM,
- struct_size_t(struct nft_trans_elem, elems, 1),
- GFP_ATOMIC);
+ trans = nft_trans_alloc(ctx, NFT_MSG_DELSETELEM,
+ struct_size_t(struct nft_trans_elem, elems, 1));
if (!trans)
return -ENOMEM;
@@ -7857,7 +7846,7 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
nft_trans_elem_set(trans) = set;
nft_trans_container_elem(trans)->nelems = 1;
nft_trans_container_elem(trans)->elems[0].priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_ATOMIC);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
}
@@ -7874,7 +7863,7 @@ static int __nft_set_catchall_flush(const struct nft_ctx *ctx,
nft_setelem_data_deactivate(ctx->net, set, elem_priv);
nft_trans_container_elem(trans)->elems[0].priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
}
--
2.50.0
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Pablo Neira Ayuso @ 2025-07-24 23:19 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
Hi Florian,
On Fri, Jul 04, 2025 at 02:30:16PM +0200, Florian Westphal wrote:
> Removal of many set elements, e.g. during set flush or ruleset
> deletion, can sometimes fail due to memory pressure.
> Reduce likelyhood of this happening and enable sleeping allocations
> for this.
I am exploring skipping the allocation of the transaction objects for
this case. This needs a closer look to deal with batches like:
delelem + flush set + abort
flush set + del set + abort
Special care needs to be taken to avoid restoring the state of the
element twice on abort.
This would allow saving the memory allocation entirely, as well as
speeding up the transaction handling.
From userspace, the idea would be to print this event:
flush set inet x y
to skip a large burst of events when a set is flushed.
Is this worth pursuing?
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Florian Westphal @ 2025-07-25 0:24 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Fri, Jul 04, 2025 at 02:30:16PM +0200, Florian Westphal wrote:
> > Removal of many set elements, e.g. during set flush or ruleset
> > deletion, can sometimes fail due to memory pressure.
> > Reduce likelyhood of this happening and enable sleeping allocations
> > for this.
>
> I am exploring to skip the allocation of the transaction objects for
> this case. This needs a closer look to deal with batches like:
>
> delelem + flush set + abort
> flush set + del set + abort
>
> Special care need to be taken to avoid restoring the state of the
> element twice on abort.
It's possible to defer the flush until after we've reached the
point of no return.
But I was worried about delete/add from the datapath, since it can
happen in parallel.
Also, I think for:
flush set x + delelem x y
You get an error, as the flush marks the element as invalid in
the new generation. Can we handle this with a flag
in nft_set, that disallows all del elem operations on
the set after a flush was seen?
And, is that safe from a backwards-compat point of view?
I thought the answer was: no.
Maybe we can turn delsetelem after flush into a no-op
in case the element existed. Not sure.
Which then means that we either can't do it, or
need to make sure that the "del elem x" is always
handled before the flush-set.
For maps it becomes even more problematic as we
would elide the deactivate step on chains.
And given the walk isn't stable for rhashtable at the
moment, I don't think we can rely on a "two walks" scheme.
Right now it's fine because even if elements get inserted
during or after the delset operation has done the walk+deactivate,
those elements are not on the transaction list so we don't run into
trouble on abort and always undo only what the walk placed on the
transaction log.
> This would allow to save the memory allocation entirely, as well as
> speeding up the transaction handling.
Sure, it sounds tempting to pursue this.
> From userspace, the idea would be to print this event:
>
> flush set inet x y
>
> to skip a large burst of events when a set is flushed.
I think that's fine.
> Is this worth to be pursued?
Yes, but I am not sure it is doable without
breaking some existing behaviour.
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Pablo Neira Ayuso @ 2025-07-25 10:10 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
On Fri, Jul 25, 2025 at 02:24:04AM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Fri, Jul 04, 2025 at 02:30:16PM +0200, Florian Westphal wrote:
> > > Removal of many set elements, e.g. during set flush or ruleset
> > > deletion, can sometimes fail due to memory pressure.
> > > Reduce likelyhood of this happening and enable sleeping allocations
> > > for this.
> >
> > I am exploring to skip the allocation of the transaction objects for
> > this case. This needs a closer look to deal with batches like:
> >
> > delelem + flush set + abort
> > flush set + del set + abort
> >
> > Special care need to be taken to avoid restoring the state of the
> > element twice on abort.
>
> Its possible to defer the flush to until after we've reached the
> point of no return.
>
> But I was worried about delete/add from datapath, since it can
> happen in parallel.
>
> Also, I think for:
> flush set x + delelem x y
>
> You get an error, as the flush marks the element as invalid in
> the new generation. Can we handle this with a flag
> in nft_set, that disallows all del elem operations on
> the set after a flush was seen?
>
> And, is that safe from a backwards-compat point of view?
> I tought the answer was: no.
> Maybe we can turn delsetelem after flush into a no-op
> in case the element existed. Not sure.
>
> Which then means that we either can't do it, or
> need to make sure that the "del elem x" is always
> handled before the flush-set.
>
> For maps it becomes even more problematic as we
> would elide the deactivate step on chains.
>
> And given walk isn't stable for rhashtable at the
> moment, I don't think we can rely on "two walks" scheme.
>
> Right now its fine because even if elements get inserted
> during or after the delset operation has done the walk+deactivate,
> those elements are not on the transaction list so we don't run into
> trouble on abort and always undo only what the walk placed on the
> transaction log.
I think the key is to be able to identify what elements have been
flushed by what flush command, so the abort path can just restore/undo the
state for the given elements.
Because this also is possible:
flush set x + [...] + flush set x
And [...] includes possible new/delete elements in x.
It should be possible to store a flush command id in the set element
(this increases the memory consumption of the set element, which your
series already does) to identify what flush command has deleted it.
This is needed because the transaction object won't be in place but I
think it is a fair tradeoff. The flush command id can be incremental
in the batch (the netlink sequence number cannot be used for this
purpose).
> > This would allow to save the memory allocation entirely, as well as
> > speeding up the transaction handling.
>
> Sure, it sounds tempting to pursue this.
>
> > From userspace, the idea would be to print this event:
> >
> > flush set inet x y
> >
> > to skip a large burst of events when a set is flushed.
>
> I think thats fine.
>
> > Is this worth to be pursued?
>
> Yes, but I am not sure it is doable without
> breaking some existing behaviour.
Of course, this needs careful look, but if the set element can be used
to annotate the information that allows us to restore to previous
state before flush (given the transaction object is not used anymore),
then it should work. Your series is extending the set element size for
a different purpose, so I think the extra memory should not be an
issue.
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Florian Westphal @ 2025-07-25 11:15 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> I think the key is to be able to identify what elements have been
> flushed by what flush command, so abort path can just restore/undo the
> state for the given elements.
>
> Because this also is possible:
>
> flush set x + [...] + flush set x
>
> And [...] includes possible new/delete elements in x.
>
> It should be possible to store an flush command id in the set element
> (this increases the memory consumption of the set element, which your
> series already does it) to identify what flush command has deleted it.
> This is needed because the transaction object won't be in place but I
> think it is a fair tradeoff. The flush command id can be incremental
> in the batch (the netlink sequence number cannot be used for this
> purpose).
OK, that might work. So the idea is to do the set walk as-is, but
instead of allocating a NFT_MSG_DELSETELEM transaction object for each
element, introduce a NFT_MSG_FLUSHSET transaction.
Then, for a DELSETELEM request with no elements (== flush),
allocate *one* NFT_MSG_FLUSHSET transaction.
The NFT_MSG_FLUSHSET transaction holds the set being flushed
and an id that increments sequentially, once for each flush.
Then, do the walk as-is:
static int nft_setelem_flush(const struct nft_ctx *ctx,
			     struct nft_set *set,
			     const struct nft_set_iter *iter,
			     struct nft_elem_priv *elem_priv)
{
	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv);

	/* previous delsetelem or earlier flush marked it inactive */
	if (!nft_set_elem_active(ext, iter->genmask))
		return 0;

	/* No allocation per set element anymore:
	 * trans = nft_trans_alloc_gfp(ctx, NFT_MSG_DELSETELEM, ...) is gone.
	 */

	/* trans_flush could be obtained from the tail of
	 * the transaction list. Or placed in *iter.
	 */
	elem_priv->flush_id = trans_flush->flush_id;
	set->ops->flush(ctx->net, set, elem_priv);
	set->ndeact++;

	nft_setelem_data_deactivate(ctx->net, set, elem_priv);

	return 0;
}
On abort, NFT_MSG_FLUSHSET would do another walk over all set elements
and then call nft_setelem_data_activate/nft_setelem_activate in case
elem_priv->flush_id == trans_flush->flush_id.
Did I get that right? I don't see any major issues with this, except
the need to add u32 flush_id to struct nft_elem_priv.
Or perhaps struct nft_set_ext would be a better fit as nft_elem_priv is
just a proxy.
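For the abort side, a rough sketch of the matching walk callback could look
like this (purely illustrative; flush_id, trans_flush and the function name
are the hypothetical names from the discussion above, not existing code):

static int nft_setelem_unflush(const struct nft_ctx *ctx,
			       struct nft_set *set,
			       const struct nft_set_iter *iter,
			       struct nft_elem_priv *elem_priv)
{
	/* only undo elements that this particular flush deactivated;
	 * trans_flush would again come from the iter or the transaction list.
	 */
	if (elem_priv->flush_id != trans_flush->flush_id)
		return 0;

	/* undo what nft_setelem_flush() did at flush time */
	nft_setelem_data_activate(ctx->net, set, elem_priv);
	nft_setelem_activate(ctx->net, set, elem_priv);
	set->ndeact--;

	return 0;
}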
> Of course, this needs careful look, but if the set element can be used
> to annotate the information that allows us to restore to previous
> state before flush (given the transaction object is not used anymore),
> then it should work. Your series is extending the set element size for
> a different purpose, so I think the extra memory should not be an
> issue.
Agreed, it would need 4 bytes per element, which isn't much compared to the
transaction log savings.
Will you have a look or should I have a go at this?
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Pablo Neira Ayuso @ 2025-07-25 15:03 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
On Fri, Jul 25, 2025 at 01:15:27PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > I think the key is to be able to identify what elements have been
> > flushed by what flush command, so abort path can just restore/undo the
> > state for the given elements.
> >
> > Because this also is possible:
> >
> > flush set x + [...] + flush set x
> >
> > And [...] includes possible new/delete elements in x.
> >
> > It should be possible to store an flush command id in the set element
> > (this increases the memory consumption of the set element, which your
> > series already does it) to identify what flush command has deleted it.
> > This is needed because the transaction object won't be in place but I
> > think it is a fair tradeoff. The flush command id can be incremental
> > in the batch (the netlink sequence number cannot be used for this
> > purpose).
>
> OK, that might work. So the idea is to do the set walk as-is, but
> instead of allocating a NFT_MSG_DELSETELEM for each transaction
> object, introduce NFT_MSG_FLUSHSET transaction.
Or simply use NFT_MSG_DELSET and add a flag to note this is a flush.
> Then, for a DELSETELEM request with no elements (== flush),
> allocate *one* NFT_MSG_FLUSHSET transaction.
Yes.
> The NFT_MSG_FLUSHSET transaction holds the set being flushed
> and an id, that increments sequentially once for each flush.
Yes.
> Then, do the walk as-is:
>
> static int nft_setelem_flush(const struct nft_ctx *ctx,
> struct nft_set *set,
> const struct nft_set_iter *iter,
> struct nft_elem_priv *elem_priv)
> {
> const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv);
> struct nft_trans *trans;
>
> /* previous delsetelem or erlier flush marked it inactive */
> if (!nft_set_elem_active(ext, iter->genmask))
> return 0;
>
> /* No allocation per set elemenet anymore */
> /* trans = nft_trans_alloc_gfp(ctx, NFT_MSG_DELSETELEM, */
>
> /* trans_flush could be obtained from the tail of
> * the transaction list. Or placed in *iter.
> */
> elem_priv->flush_id = trans_flush->flush_id
> set->ops->flush(ctx->net, set, elem_priv);
> set->ndeact++;
>
> nft_setelem_data_deactivate(ctx->net, set, elem_priv);
>
> return 0;
Maybe use nft_map_deactivate() ?
> }
>
> On abort, NFT_MSG_FLUSHSET would do another walk, for all set elements,
> and then calls nft_setelem_data_activate/nft_setelem_activate in case
> elem_priv->flush_id == trans_flush->flush_id.
Exactly, maybe nft_map_activate() can help.
> Did I get that right? I don't see any major issues with this, except
> the need to add u32 flush_id to struct nft_elem_priv.
> Or perhaps struct nft_set_ext would be a better fit as nft_elem_priv is
> just a proxy.
Yes, a u32 flush_id (or trans_id) needs to be added, and the
transaction id is then set incrementally.
> > Of course, this needs careful look, but if the set element can be used
> > to annotate the information that allows us to restore to previous
> > state before flush (given the transaction object is not used anymore),
> > then it should work. Your series is extending the set element size for
> > a different purpose, so I think the extra memory should not be an
> > issue.
>
> Agree, it would need 4 bytes per elem which isn't much compared to the
> transaction log savings.
>
> Will you have a look or should I have a go at this?
Please, go for it.
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Florian Westphal @ 2025-07-28 21:28 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Yes, u32 flush_id (or trans_id) needs to be added, then set
> transaction id incrementally.
Not enough, unfortunately.
The key difference between flush (delete all elements) and delset
(remove the set and all elements) is that the set itself gets detached
from the dataplane. Then, when elements get freed, we can just iterate
the set and free all elements; they are all unreachable from the
dataplane.
But in case of a flush, that's not the case: releasing the elements would
cause a use-after-free. The current DELSETELEM method unlinks the elements
from the set and links them to the DELSETELEM transactional container.
Then, on abort they get re-linked to the set, or, in case of commit,
they can be freed after the final synchronize_rcu().
That leaves two options:
1. Use the first patchset, which makes delsetelem allocations sleepable.
2. Add a pointer and an id to nft_set_ext.
The drawback of 2) is the added memory cost for every set element (the
first patch series only forces it for rhashtable).
The major upside, however, is that DELSETELEM transaction objects are
simplified a lot: the to-be-deleted elements could be linked to them via
the then-always-available nft_set_ext pointer, i.e., each DELSETELEM
transaction object can take an arbitrary number of elements.
Unless you disagree, I will go for 2).
This will also allow to remove the krealloc() compaction for DELSETELEM,
so it should be a net code-removal patch.
Another option might be to replace a flush with delset+newset
internally, but this will get tricky because the set/map is still being
referenced by other rules; we'd have to fix up the ruleset internally to
use the new/empty set while still being able to roll back.
Probably too tricky / hard to get right, but I'll check it anyway.
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Jozsef Kadlecsik @ 2025-07-29 7:22 UTC (permalink / raw)
To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel
Hi,
On Mon, 28 Jul 2025, Florian Westphal wrote:
> Another option might be to replace a flush with delset+newset
> internally, but this will get tricky because the set/map still being
> referenced by other rules, we'd have to fixup the ruleset internally to
> use the new/empty set while still being able to roll back.
If "data" of struct nft_set would be a pointer to an allocated memory
area, then there'd be no need to fixup the references in the rules: it
would be enough to create-delete the data part. (All non-static, set data
related attributes could be move to the "data" as well, like nelems,
ndeact.) But it'd mean a serious redesign.
Best regards,
Jozsef
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Pablo Neira Ayuso @ 2025-07-29 10:27 UTC (permalink / raw)
To: Jozsef Kadlecsik; +Cc: Florian Westphal, netfilter-devel
Hi Jozsef,
On Tue, Jul 29, 2025 at 09:22:46AM +0200, Jozsef Kadlecsik wrote:
> Hi,
>
> On Mon, 28 Jul 2025, Florian Westphal wrote:
>
> > Another option might be to replace a flush with delset+newset
> > internally, but this will get tricky because the set/map still being
> > referenced by other rules, we'd have to fixup the ruleset internally to
> > use the new/empty set while still being able to roll back.
>
> If "data" of struct nft_set would be a pointer to an allocated memory area,
> then there'd be no need to fixup the references in the rules: it would be
> enough to create-delete the data part. (All non-static, set data related
> attributes could be move to the "data" as well, like nelems, ndeact.) But
> it'd mean a serious redesign.
Refcounting on objects is needed to detect deletion of chains that are
still in use; rules refer to chains either via a direct jump/goto or via a
verdict map. When handling the transaction batch, this is needed to know
what can be deleted or not.
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Pablo Neira Ayuso @ 2025-07-29 10:38 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
On Mon, Jul 28, 2025 at 11:28:50PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Yes, u32 flush_id (or trans_id) needs to be added, then set
> > transaction id incrementally.
>
> Not enough, unfortunately.
>
> The key difference between flush (delete all elements) and delset
> (remove the set and all elements) is that the set itself gets detached
> from the dataplane. Then, when elements get free'd, we can just iterate
> the set and free all elements, they are all unreachable from the
> dataplane.
>
> But in case of a flush, thats not the case, releasing the elements will
> cause use-after-free. Current DELSETELEM method unlinks the elements
> from the set, links them to the DELSETELEM transactional container.
DELSETELEM does not unlink elements from the set in the preparation phase;
instead, elements are marked as inactive in the next generation but
they still remain linked to the set. These elements are removed from
the set in either the commit or abort phase.
- flush should skip elements that are already inactive
- flush should not work on deleted sets.
- a flush command (elements are marked as inactive) followed by delete set
  skips those elements that are inactive. So the abort path can unwind
  accordingly using the transaction id marker that I am proposing.
I think the key is that no two different transactions release the same
object, hence the need for the transaction id for the flush command to
differentiate between the delete set and flush set commands.
I can take a look next week to see if all this is practical,
otherwise...
> Then, on abort they get re-linked to the set, or, in case of commit,
> they can be free'd after the final synchronize_rcu().
>
> That leaves two options:
> 1. Use the first patchset, that makes delsetelem allocations sleepable.
> 2. Add a pointer + and id to nft_set_ext.
>
> The drawback of 2) is the added mem cost for every set eleemnt (first
> patch series only forces it for rhashtable).
>
> The major upside however is that DELSETELEM transaction objects are
> simplified a lot, the to-be-deleted elements could be linked to it by
> the then-always-available nft_set_ext pointer, i.e., each DELSETELEM
> transaction object can take an arbitrary number of elements.
>
> Unless you disagree, I will go for 2).
> This will also allow to remove the krealloc() compaction for DELSETELEM,
> so it should be a net code-removal patch.
>
> Another option might be to replace a flush with delset+newset
> internally, but this will get tricky because the set/map still being
> referenced by other rules, we'd have to fixup the ruleset internally to
> use the new/empty set while still being able to roll back.
>
> Proably too tricky / hard to get right, but I'll check it anyway.
... if I don't find a way or I'm too slow, we can take your series in
the next merge window as is.
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Jozsef Kadlecsik @ 2025-07-29 10:50 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel
Hi Pablo,
On Tue, 29 Jul 2025, Pablo Neira Ayuso wrote:
> On Tue, Jul 29, 2025 at 09:22:46AM +0200, Jozsef Kadlecsik wrote:
>> Hi,
>>
>> On Mon, 28 Jul 2025, Florian Westphal wrote:
>>
>>> Another option might be to replace a flush with delset+newset
>>> internally, but this will get tricky because the set/map still being
>>> referenced by other rules, we'd have to fixup the ruleset internally to
>>> use the new/empty set while still being able to roll back.
>>
>> If "data" of struct nft_set would be a pointer to an allocated memory area,
>> then there'd be no need to fixup the references in the rules: it would be
>> enough to create-delete the data part. (All non-static, set data related
>> attributes could be move to the "data" as well, like nelems, ndeact.) But
>> it'd mean a serious redesign.
>
> refcounting on object is needed to detect deletion of chains that are
> still in used, rule refer to chains either via direct jump/goto or via
> verdict map. When handling the transaction batch is needed to know what
> can be deleted or not.
Does the private set data part of struct nft_set contain anything which is
directly referenced from rules or maps? What I wanted to suggest is to keep
the part of the set structure which is referenced/pointed to from rules etc.
intact, but separate the private set data part so that it could be handled
independently.
[It seems I am missing something.]
Best regards,
Jozsef
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Florian Westphal @ 2025-07-29 11:37 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> DELSETELEM does not unlink elements from set in the preparation phase,
> instead elements are marked as inactive in the next generation but
> they still remain linked to the set. These elements are removed from
> the set from either the commit/abort phase.
>
> - flush should skip elements that are already inactive
> - flush should not work on deleted sets.
> - flush command (elements are marked as inactive) then delete set
> skips those elements that are inactive. So abort path can unwind
> accordingly using the transaction id marker what I am proposing.
Yes, that part works, but we still need to kfree the elements after unlink.
When the commit phase does the unlink, the element becomes unreachable from
the set. At this time, the DELSETELEM object keeps a pointer to the
unlinked elements, and that allows us to kfree them after synchronize_rcu
from the worker. If we don't want DELSETELEM for flush, we need to
provide the address to free by other means, e.g. stick a pointer into
struct nft_set_ext.
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Pablo Neira Ayuso @ 2025-07-30 16:16 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
On Tue, Jul 29, 2025 at 01:37:19PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > DELSETELEM does not unlink elements from set in the preparation phase,
> > instead elements are marked as inactive in the next generation but
> > they still remain linked to the set. These elements are removed from
> > the set from either the commit/abort phase.
> >
> > - flush should skip elements that are already inactive
> > - flush should not work on deleted sets.
> > - flush command (elements are marked as inactive) then delete set
> > skips those elements that are inactive. So abort path can unwind
> > accordingly using the transaction id marker what I am proposing.
>
> Yes, that part works, but we still need to kfree the elements after unlink.
>
> When commit phase does the unlink, the element becomes unreachable from
> the set. At this time, the DELSETELEM object keeps a pointer to the
> unlinked elements, and that allows us to kfree after synchronize_rcu
> from the worker. If we don't want DELSETELEM for flush, we need to
> provide the address to free by other means, e.g. stick a pointer into
> struct nft_set_ext.
For the commit phase, I suggest adding a list of dying elements to the
transaction object. After unlinking the element from the (internal)
set data structure, add it to this transaction's dying list so it
remains reachable and can be released after the rcu grace period.
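A minimal structural sketch of that idea (all names below are invented for
illustration; nothing here is existing code):

struct nft_trans_set_flush {			/* hypothetical container */
	struct nft_trans nft_trans;
	struct nft_set *set;
	struct llist_head dying;		/* unlinked, not yet freed */
};

/* commit phase: the element was just removed from the set's internal
 * data structure, keep it reachable via the transaction.
 */
static void flush_queue_dying(struct nft_trans_set_flush *trans,
			      struct llist_node *dying_node)
{
	llist_add(dying_node, &trans->dying);
}

/* worker, after synchronize_rcu(): the datapath can no longer see them */
static void flush_release_dying(struct nft_trans_set_flush *trans)
{
	struct llist_node *node, *next;

	llist_for_each_safe(node, next, __llist_del_all(&trans->dying)) {
		/* container_of() back to the element, then destroy it with
		 * the usual element destruction helper.
		 */
	}
}

That would keep the per-flush cost at one transaction object regardless of
the number of elements.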
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Florian Westphal @ 2025-07-30 16:35 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Yes, that part works, but we still need to kfree the elements after unlink.
> >
> > When commit phase does the unlink, the element becomes unreachable from
> > the set. At this time, the DELSETELEM object keeps a pointer to the
> > unlinked elements, and that allows us to kfree after synchronize_rcu
> > from the worker. If we don't want DELSETELEM for flush, we need to
> > provide the address to free by other means, e.g. stick a pointer into
> > struct nft_set_ext.
>
> For the commit phase, I suggest to add a list of dying elements to the
> transaction object. After unlinking the element from the (internal)
> set data structure, add it to this transaction dying list so it
> remains reachable to be released after the rcu grace period.
That's what I meant by 'stick a pointer into struct nft_set_ext'.
It's awkward, but I should be able to get the priv pointer back
by doing the inverse of nft_set_elem_ext().
The cleaner solution would be to turn nft_elem_priv into a
'nft_elem_common', place a hlist_node into that and then
use container_of(). But it's too much code churn for my
liking.
So I'll extend each set element with a pointer and
add a removed_elements hlist_head to struct nft_trans_elem.
The transaction id isn't needed, I think, once that list exists:
it provides the needed info to undo previous operations
without the need to walk the set again.
We can probably even rework struct nft_trans_elem to always use
this pointer, even for inserts, and only use the
struct nft_trans_one_elem elems[]
member for elements that we update (no add or removal).
But that's something for a later time.
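Roughly, the direction would be something like this (a sketch only;
removed_node/removed_elements are illustrative names, not existing fields):

struct nft_set_ext {
	u8 genmask;
	u8 offset[NFT_SET_EXT_NUM];
	struct hlist_node removed_node;		/* added: links elem to its transaction */
	char data[];
};

struct nft_trans_elem {
	struct nft_trans nft_trans;
	struct nft_set *set;
	struct hlist_head removed_elements;	/* added: all elements this op removed */
	/* ... existing members, including nelems and elems[] ... */
};

The commit path would then hlist_add_head() each unlinked element to
removed_elements and free the whole list after the rcu grace period,
without growing the elems[] array via krealloc().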
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Florian Westphal @ 2025-08-19 19:10 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Florian Westphal <fw@strlen.de> wrote:
> > For the commit phase, I suggest to add a list of dying elements to the
> > transaction object. After unlinking the element from the (internal)
> > set data structure, add it to this transaction dying list so it
> > remains reachable to be released after the rcu grace period.
>
> Thats what I meant by 'stick a pointer into struct nft_set_ext'.
> Its awkward but I should be able to get the priv pointer back
> by doing the inverse of nft_set_elem_ext().
>
> The cleaner solution would be to turn nft_elem_priv into a
> 'nft_elem_common', place a hlist_node into that and then
> use container_of(). But its too much code churn for my
> liking.
>
> So I'll extend each set element with a pointer and
> add a removed_elements hlist_head to struct nft_trans_elem.
>
> The transacion id isn't needed I think once that list exist:
> it provides the needed info to undo previous operations
> without the need to walk the set again.
>
> We can probably even rework struct nft_trans_elem to always use
> this pointer, even for inserts, and only use the
>
> struct nft_trans_one_elem elems[]
>
> member for elements that we update (no add or removal).
> But thats something for a later time.
This doesn't work.
NEWSETELEM cannot (re)use the same list node as DELSETELEM.
The reason is that a set flush will also flush elements
added in the same batch.
But if NEWSETELEM uses a list (instead of a priv pointer
as we do now), then at the time of the set flush, the
encountered element is already on a NEWSETELEM trans_elem list.
I'll try doing:

struct nft_set_ext {
	u8 genmask;
	u8 offset[NFT_SET_EXT_NUM];
+	struct llist_node trans_list_new;
+	struct llist_node trans_list_del;
	char data[];
};

to avoid this problem.
* Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure
From: Pablo Neira Ayuso @ 2025-08-19 22:23 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
On Tue, Aug 19, 2025 at 09:10:42PM +0200, Florian Westphal wrote:
> Florian Westphal <fw@strlen.de> wrote:
> > > For the commit phase, I suggest to add a list of dying elements to the
> > > transaction object. After unlinking the element from the (internal)
> > > set data structure, add it to this transaction dying list so it
> > > remains reachable to be released after the rcu grace period.
> >
> > Thats what I meant by 'stick a pointer into struct nft_set_ext'.
> > Its awkward but I should be able to get the priv pointer back
> > by doing the inverse of nft_set_elem_ext().
> >
> > The cleaner solution would be to turn nft_elem_priv into a
> > 'nft_elem_common', place a hlist_node into that and then
> > use container_of(). But its too much code churn for my
> > liking.
> >
> > So I'll extend each set element with a pointer and
> > add a removed_elements hlist_head to struct nft_trans_elem.
> >
> > The transacion id isn't needed I think once that list exist:
> > it provides the needed info to undo previous operations
> > without the need to walk the set again.
> >
> > We can probably even rework struct nft_trans_elem to always use
> > this pointer, even for inserts, and only use the
> >
> > struct nft_trans_one_elem elems[]
> >
> > member for elements that we update (no add or removal).
> > But thats something for a later time.
>
> This doesn't work.
> NEWSETELEM cannot (re)use the same list node as DELSETELEM.
>
> Reason is that a set flush will also flush elements
> added in the same batch.
>
> But if NEWSETELEM uses a list (instead of priv pointer
> as we do now), then at the time of the set flush, the
> encountered element is already on a NEWSETELEM trans_elem list.
>
> I'll try doing:
>
> struct nft_set_ext {
> u8 genmask;
> u8 offset[NFT_SET_EXT_NUM];
> + struct llist_node trans_list_new;
> + struct llist_node trans_list_del;
> char data[];
>
> to avoid this problem.
Hm, I think this is not looking good.
I am considering that it is better to take your patches for now, and
postpone exploring further memory consumption reduction to a later stage.
Thanks for addressing my suggestion; let me know if you prefer this
path. I apologize for delaying your original proposal.