Linux Netfilter development
 help / color / mirror / Atom feed
* [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size
@ 2024-11-07 17:44 Florian Westphal
  2024-11-07 17:44 ` [PATCH nf-next v4 1/5] netfilter: nf_tables: add nft_trans_commit_list_add_elem helper Florian Westphal
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Florian Westphal @ 2024-11-07 17:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

v4: fix a typo in patch 3 commit message.
    rebase on nf-next:main.

No other changes.

Changes in v3:
I failed to realize that nft_audit leaks one implementation detail
to userspace: the length of the transaction log.
This gets fixed by patch 3 which adds needed helper to increment
the count variable by the number of elements carried by the compacted
set update.

Also fix up notifications, for update case, notifications were
skipped but currently newsetelem notifications are done even if
existing set element is updated.

Most patches are unchanged.
"prefer nft_trans_elem_alloc helper" is already upstreamed so
its dropped from this batch.

v2: only change is in patch 3, and by extension, the last one:
During transaction abort, we need to handle an aggregate container to
contain both new set elements and updates.  The latter must be
skipped, else we remove element that already existed at start of the
transaction.

original cover letter:

When doing a flush on a set or mass adding/removing elements from a
set, each element needs to allocate 96 bytes to hold the transactional
state.

In such cases, virtually all the information in struct nft_trans_elem
is the same.

Change nft_trans_elem to a flex-array, i.e. a single nft_trans_elem
can hold multiple set element pointers.

The number of elements that can be stored in one nft_trans_elem is limited
by the slab allocator, this series limits the compaction to at most 62
elements as it caps the reallocation to 2048 bytes of memory.

Florian Westphal (5):
  netfilter: nf_tables: add nft_trans_commit_list_add_elem helper
  netfilter: nf_tables: prepare for multiple elements in nft_trans_elem
    structure
  netfilter: nf_tables: preemptive fix for audit selftest failure
  netfilter: nf_tables: switch trans_elem to real flex array
  netfilter: nf_tables: allocate element update information dynamically

 include/net/netfilter/nf_tables.h |  25 +-
 net/netfilter/nf_tables_api.c     | 368 +++++++++++++++++++++++-------
 2 files changed, 304 insertions(+), 89 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH nf-next v4 1/5] netfilter: nf_tables: add nft_trans_commit_list_add_elem helper
  2024-11-07 17:44 [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Florian Westphal
@ 2024-11-07 17:44 ` Florian Westphal
  2024-11-07 17:44 ` [PATCH nf-next v4 2/5] netfilter: nf_tables: prepare for multiple elements in nft_trans_elem structure Florian Westphal
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2024-11-07 17:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Add and use a wrapper to append trans_elem structures to the
transaction log.

Unlike the existing helper, pass a gfp_t to indicate if sleeping
is allowed.

This will be used by a followup patch to realloc nft_trans_elem
structures after they gain a flexible array member to reduce
number of such container structures on the transaction list.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_tables_api.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index b7a817e483aa..e4a54c618760 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -421,6 +421,17 @@ static void nft_trans_commit_list_add_tail(struct net *net, struct nft_trans *tr
 	}
 }
 
+static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans,
+					   gfp_t gfp)
+{
+	WARN_ON_ONCE(trans->msg_type != NFT_MSG_NEWSETELEM &&
+		     trans->msg_type != NFT_MSG_DELSETELEM);
+
+	might_alloc(gfp);
+
+	nft_trans_commit_list_add_tail(net, trans);
+}
+
 static int nft_trans_table_add(struct nft_ctx *ctx, int msg_type)
 {
 	struct nft_trans *trans;
@@ -7204,7 +7215,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 					if (update_flags) {
 						nft_trans_elem_priv(trans) = elem_priv;
 						nft_trans_elem_update_flags(trans) = update_flags;
-						nft_trans_commit_list_add_tail(ctx->net, trans);
+						nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 						goto err_elem_free;
 					}
 				}
@@ -7228,7 +7239,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	}
 
 	nft_trans_elem_priv(trans) = elem.priv;
-	nft_trans_commit_list_add_tail(ctx->net, trans);
+	nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 	return 0;
 
 err_set_full:
@@ -7445,7 +7456,7 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
 	nft_setelem_data_deactivate(ctx->net, set, elem.priv);
 
 	nft_trans_elem_priv(trans) = elem.priv;
-	nft_trans_commit_list_add_tail(ctx->net, trans);
+	nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 	return 0;
 
 fail_ops:
@@ -7481,7 +7492,7 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
 	nft_setelem_data_deactivate(ctx->net, set, elem_priv);
 	nft_trans_elem_set(trans) = set;
 	nft_trans_elem_priv(trans) = elem_priv;
-	nft_trans_commit_list_add_tail(ctx->net, trans);
+	nft_trans_commit_list_add_elem(ctx->net, trans, GFP_ATOMIC);
 
 	return 0;
 }
@@ -7498,7 +7509,7 @@ static int __nft_set_catchall_flush(const struct nft_ctx *ctx,
 
 	nft_setelem_data_deactivate(ctx->net, set, elem_priv);
 	nft_trans_elem_priv(trans) = elem_priv;
-	nft_trans_commit_list_add_tail(ctx->net, trans);
+	nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 
 	return 0;
 }
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH nf-next v4 2/5] netfilter: nf_tables: prepare for multiple elements in nft_trans_elem structure
  2024-11-07 17:44 [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Florian Westphal
  2024-11-07 17:44 ` [PATCH nf-next v4 1/5] netfilter: nf_tables: add nft_trans_commit_list_add_elem helper Florian Westphal
@ 2024-11-07 17:44 ` Florian Westphal
  2024-11-07 17:44 ` [PATCH nf-next v4 3/5] netfilter: nf_tables: preemptive fix for audit selftest failure Florian Westphal
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2024-11-07 17:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Add helpers to release the individual elements contained in the
trans_elem container structure.

No functional change intended.

Followup patch will add 'nelems' member and will turn 'priv' into
a flexible array.

These helpers can then loop over all elements.
Care needs to be taken to handle a mix of new elements and existing
elements that are being updated (e.g. timeout refresh).

Before this patch, NEWSETELEM transaction with update is released
early so nft_trans_set_elem_destroy() won't get called, so we need
to skip elements marked as update.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables.h |  21 ++-
 net/netfilter/nf_tables_api.c     | 228 +++++++++++++++++++++---------
 2 files changed, 173 insertions(+), 76 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index c1513bd14568..14ccf1133a4c 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1755,28 +1755,25 @@ enum nft_trans_elem_flags {
 	NFT_TRANS_UPD_EXPIRATION	= (1 << 1),
 };
 
-struct nft_trans_elem {
-	struct nft_trans		nft_trans;
-	struct nft_set			*set;
-	struct nft_elem_priv		*elem_priv;
+struct nft_trans_one_elem {
+	struct nft_elem_priv		*priv;
 	u64				timeout;
 	u64				expiration;
 	u8				update_flags;
+};
+
+struct nft_trans_elem {
+	struct nft_trans		nft_trans;
+	struct nft_set			*set;
 	bool				bound;
+	unsigned int			nelems;
+	struct nft_trans_one_elem	elems[] __counted_by(nelems);
 };
 
 #define nft_trans_container_elem(t)			\
 	container_of(t, struct nft_trans_elem, nft_trans)
 #define nft_trans_elem_set(trans)			\
 	nft_trans_container_elem(trans)->set
-#define nft_trans_elem_priv(trans)			\
-	nft_trans_container_elem(trans)->elem_priv
-#define nft_trans_elem_update_flags(trans)		\
-	nft_trans_container_elem(trans)->update_flags
-#define nft_trans_elem_timeout(trans)			\
-	nft_trans_container_elem(trans)->timeout
-#define nft_trans_elem_expiration(trans)		\
-	nft_trans_container_elem(trans)->expiration
 #define nft_trans_elem_set_bound(trans)			\
 	nft_trans_container_elem(trans)->bound
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index e4a54c618760..5eab6f121684 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -6445,13 +6445,17 @@ static struct nft_trans *nft_trans_elem_alloc(const struct nft_ctx *ctx,
 					      int msg_type,
 					      struct nft_set *set)
 {
+	struct nft_trans_elem *te;
 	struct nft_trans *trans;
 
-	trans = nft_trans_alloc(ctx, msg_type, sizeof(struct nft_trans_elem));
+	trans = nft_trans_alloc(ctx, msg_type, struct_size(te, elems, 1));
 	if (trans == NULL)
 		return NULL;
 
-	nft_trans_elem_set(trans) = set;
+	te = nft_trans_container_elem(trans);
+	te->nelems = 1;
+	te->set = set;
+
 	return trans;
 }
 
@@ -6573,28 +6577,51 @@ static void nft_set_elem_expr_destroy(const struct nft_ctx *ctx,
 }
 
 /* Drop references and destroy. Called from gc, dynset and abort path. */
-void nft_set_elem_destroy(const struct nft_set *set,
-			  const struct nft_elem_priv *elem_priv,
-			  bool destroy_expr)
+static void __nft_set_elem_destroy(const struct nft_ctx *ctx,
+				   const struct nft_set *set,
+				   const struct nft_elem_priv *elem_priv,
+				   bool destroy_expr)
 {
 	struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv);
-	struct nft_ctx ctx = {
-		.net	= read_pnet(&set->net),
-		.family	= set->table->family,
-	};
 
 	nft_data_release(nft_set_ext_key(ext), NFT_DATA_VALUE);
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA))
 		nft_data_release(nft_set_ext_data(ext), set->dtype);
 	if (destroy_expr && nft_set_ext_exists(ext, NFT_SET_EXT_EXPRESSIONS))
-		nft_set_elem_expr_destroy(&ctx, nft_set_ext_expr(ext));
+		nft_set_elem_expr_destroy(ctx, nft_set_ext_expr(ext));
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF))
 		nft_use_dec(&(*nft_set_ext_obj(ext))->use);
 
 	kfree(elem_priv);
 }
+
+/* Drop references and destroy. Called from gc and dynset. */
+void nft_set_elem_destroy(const struct nft_set *set,
+			  const struct nft_elem_priv *elem_priv,
+			  bool destroy_expr)
+{
+	struct nft_ctx ctx = {
+		.net	= read_pnet(&set->net),
+		.family	= set->table->family,
+	};
+
+	__nft_set_elem_destroy(&ctx, set, elem_priv, destroy_expr);
+}
 EXPORT_SYMBOL_GPL(nft_set_elem_destroy);
 
+/* Drop references and destroy. Called from abort path. */
+static void nft_trans_set_elem_destroy(const struct nft_ctx *ctx, struct nft_trans_elem *te)
+{
+	int i;
+
+	for (i = 0; i < te->nelems; i++) {
+		if (te->elems[i].update_flags)
+			continue;
+
+		__nft_set_elem_destroy(ctx, te->set, te->elems[i].priv, true);
+	}
+}
+
 /* Destroy element. References have been already dropped in the preparation
  * path via nft_setelem_data_deactivate().
  */
@@ -6610,6 +6637,15 @@ void nf_tables_set_elem_destroy(const struct nft_ctx *ctx,
 	kfree(elem_priv);
 }
 
+static void nft_trans_elems_destroy(const struct nft_ctx *ctx,
+				    const struct nft_trans_elem *te)
+{
+	int i;
+
+	for (i = 0; i < te->nelems; i++)
+		nf_tables_set_elem_destroy(ctx, te->set, te->elems[i].priv);
+}
+
 int nft_set_elem_expr_clone(const struct nft_ctx *ctx, struct nft_set *set,
 			    struct nft_expr *expr_array[])
 {
@@ -6766,6 +6802,36 @@ static void nft_setelem_activate(struct net *net, struct nft_set *set,
 	}
 }
 
+static void nft_trans_elem_update(const struct nft_set *set,
+				  const struct nft_trans_one_elem *elem)
+{
+	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
+
+	if (elem->update_flags & NFT_TRANS_UPD_TIMEOUT)
+		WRITE_ONCE(nft_set_ext_timeout(ext)->timeout, elem->timeout);
+
+	if (elem->update_flags & NFT_TRANS_UPD_EXPIRATION)
+		WRITE_ONCE(nft_set_ext_timeout(ext)->expiration, get_jiffies_64() + elem->expiration);
+}
+
+static void nft_trans_elems_add(const struct nft_ctx *ctx,
+				struct nft_trans_elem *te)
+{
+	int i;
+
+	for (i = 0; i < te->nelems; i++) {
+		const struct nft_trans_one_elem *elem = &te->elems[i];
+
+		if (elem->update_flags)
+			nft_trans_elem_update(te->set, elem);
+		else
+			nft_setelem_activate(ctx->net, te->set, elem->priv);
+
+		nf_tables_setelem_notify(ctx, te->set, elem->priv,
+					 NFT_MSG_NEWSETELEM);
+	}
+}
+
 static int nft_setelem_catchall_deactivate(const struct net *net,
 					   struct nft_set *set,
 					   struct nft_set_elem *elem)
@@ -6848,6 +6914,24 @@ static void nft_setelem_remove(const struct net *net,
 		set->ops->remove(net, set, elem_priv);
 }
 
+static void nft_trans_elems_remove(const struct nft_ctx *ctx,
+				   const struct nft_trans_elem *te)
+{
+	int i;
+
+	for (i = 0; i < te->nelems; i++) {
+		nf_tables_setelem_notify(ctx, te->set,
+					 te->elems[i].priv,
+					 te->nft_trans.msg_type);
+
+		nft_setelem_remove(ctx->net, te->set, te->elems[i].priv);
+		if (!nft_setelem_is_catchall(te->set, te->elems[i].priv)) {
+			atomic_dec(&te->set->nelems);
+			te->set->ndeact--;
+		}
+	}
+}
+
 static bool nft_setelem_valid_key_end(const struct nft_set *set,
 				      struct nlattr **nla, u32 flags)
 {
@@ -7199,22 +7283,26 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 			else if (!(nlmsg_flags & NLM_F_EXCL)) {
 				err = 0;
 				if (nft_set_ext_exists(ext2, NFT_SET_EXT_TIMEOUT)) {
+					struct nft_trans_one_elem *update;
+
+					update = &nft_trans_container_elem(trans)->elems[0];
+
 					update_flags = 0;
 					if (timeout != nft_set_ext_timeout(ext2)->timeout) {
-						nft_trans_elem_timeout(trans) = timeout;
+						update->timeout = timeout;
 						if (expiration == 0)
 							expiration = timeout;
 
 						update_flags |= NFT_TRANS_UPD_TIMEOUT;
 					}
 					if (expiration) {
-						nft_trans_elem_expiration(trans) = expiration;
+						update->expiration = expiration;
 						update_flags |= NFT_TRANS_UPD_EXPIRATION;
 					}
 
 					if (update_flags) {
-						nft_trans_elem_priv(trans) = elem_priv;
-						nft_trans_elem_update_flags(trans) = update_flags;
+						update->priv = elem_priv;
+						update->update_flags = update_flags;
 						nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 						goto err_elem_free;
 					}
@@ -7238,7 +7326,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 		}
 	}
 
-	nft_trans_elem_priv(trans) = elem.priv;
+	nft_trans_container_elem(trans)->elems[0].priv = elem.priv;
 	nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 	return 0;
 
@@ -7376,6 +7464,50 @@ void nft_setelem_data_deactivate(const struct net *net,
 		nft_use_dec(&(*nft_set_ext_obj(ext))->use);
 }
 
+/* similar to nft_trans_elems_remove, but called from abort path to undo newsetelem.
+ * No notifications and no ndeact changes.
+ *
+ * Returns true if set had been added to (i.e., elements need to be removed again).
+ */
+static bool nft_trans_elems_new_abort(const struct nft_ctx *ctx,
+				      const struct nft_trans_elem *te)
+{
+	bool removed = false;
+	int i;
+
+	for (i = 0; i < te->nelems; i++) {
+		if (te->elems[i].update_flags)
+			continue;
+
+		if (!te->set->ops->abort || nft_setelem_is_catchall(te->set, te->elems[i].priv))
+			nft_setelem_remove(ctx->net, te->set, te->elems[i].priv);
+
+		if (!nft_setelem_is_catchall(te->set, te->elems[i].priv))
+			atomic_dec(&te->set->nelems);
+
+		removed = true;
+	}
+
+	return removed;
+}
+
+/* Called from abort path to undo DELSETELEM/DESTROYSETELEM. */
+static void nft_trans_elems_destroy_abort(const struct nft_ctx *ctx,
+					  const struct nft_trans_elem *te)
+{
+	int i;
+
+	for (i = 0; i < te->nelems; i++) {
+		if (!nft_setelem_active_next(ctx->net, te->set, te->elems[i].priv)) {
+			nft_setelem_data_activate(ctx->net, te->set, te->elems[i].priv);
+			nft_setelem_activate(ctx->net, te->set, te->elems[i].priv);
+		}
+
+		if (!nft_setelem_is_catchall(te->set, te->elems[i].priv))
+			te->set->ndeact--;
+	}
+}
+
 static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
 			   const struct nlattr *attr)
 {
@@ -7455,7 +7587,7 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
 
 	nft_setelem_data_deactivate(ctx->net, set, elem.priv);
 
-	nft_trans_elem_priv(trans) = elem.priv;
+	nft_trans_container_elem(trans)->elems[0].priv = elem.priv;
 	nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 	return 0;
 
@@ -7482,7 +7614,8 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
 		return 0;
 
 	trans = nft_trans_alloc_gfp(ctx, NFT_MSG_DELSETELEM,
-				    sizeof(struct nft_trans_elem), GFP_ATOMIC);
+				    struct_size_t(struct nft_trans_elem, elems, 1),
+				    GFP_ATOMIC);
 	if (!trans)
 		return -ENOMEM;
 
@@ -7491,7 +7624,8 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
 
 	nft_setelem_data_deactivate(ctx->net, set, elem_priv);
 	nft_trans_elem_set(trans) = set;
-	nft_trans_elem_priv(trans) = elem_priv;
+	nft_trans_container_elem(trans)->nelems = 1;
+	nft_trans_container_elem(trans)->elems[0].priv = elem_priv;
 	nft_trans_commit_list_add_elem(ctx->net, trans, GFP_ATOMIC);
 
 	return 0;
@@ -7508,7 +7642,7 @@ static int __nft_set_catchall_flush(const struct nft_ctx *ctx,
 		return -ENOMEM;
 
 	nft_setelem_data_deactivate(ctx->net, set, elem_priv);
-	nft_trans_elem_priv(trans) = elem_priv;
+	nft_trans_container_elem(trans)->elems[0].priv = elem_priv;
 	nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 
 	return 0;
@@ -9690,9 +9824,7 @@ static void nft_commit_release(struct nft_trans *trans)
 		break;
 	case NFT_MSG_DELSETELEM:
 	case NFT_MSG_DESTROYSETELEM:
-		nf_tables_set_elem_destroy(&ctx,
-					   nft_trans_elem_set(trans),
-					   nft_trans_elem_priv(trans));
+		nft_trans_elems_destroy(&ctx, nft_trans_container_elem(trans));
 		break;
 	case NFT_MSG_DELOBJ:
 	case NFT_MSG_DESTROYOBJ:
@@ -10545,25 +10677,8 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 		case NFT_MSG_NEWSETELEM:
 			te = nft_trans_container_elem(trans);
 
-			if (te->update_flags) {
-				const struct nft_set_ext *ext =
-					nft_set_elem_ext(te->set, te->elem_priv);
+			nft_trans_elems_add(&ctx, te);
 
-				if (te->update_flags & NFT_TRANS_UPD_TIMEOUT) {
-					WRITE_ONCE(nft_set_ext_timeout(ext)->timeout,
-						   te->timeout);
-				}
-				if (te->update_flags & NFT_TRANS_UPD_EXPIRATION) {
-					WRITE_ONCE(nft_set_ext_timeout(ext)->expiration,
-						   get_jiffies_64() + te->expiration);
-				}
-			} else {
-				nft_setelem_activate(net, te->set, te->elem_priv);
-			}
-
-			nf_tables_setelem_notify(&ctx, te->set,
-						 te->elem_priv,
-						 NFT_MSG_NEWSETELEM);
 			if (te->set->ops->commit &&
 			    list_empty(&te->set->pending_update)) {
 				list_add_tail(&te->set->pending_update,
@@ -10575,14 +10690,8 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 		case NFT_MSG_DESTROYSETELEM:
 			te = nft_trans_container_elem(trans);
 
-			nf_tables_setelem_notify(&ctx, te->set,
-						 te->elem_priv,
-						 trans->msg_type);
-			nft_setelem_remove(net, te->set, te->elem_priv);
-			if (!nft_setelem_is_catchall(te->set, te->elem_priv)) {
-				atomic_dec(&te->set->nelems);
-				te->set->ndeact--;
-			}
+			nft_trans_elems_remove(&ctx, te);
+
 			if (te->set->ops->commit &&
 			    list_empty(&te->set->pending_update)) {
 				list_add_tail(&te->set->pending_update,
@@ -10702,8 +10811,7 @@ static void nf_tables_abort_release(struct nft_trans *trans)
 		nft_set_destroy(&ctx, nft_trans_set(trans));
 		break;
 	case NFT_MSG_NEWSETELEM:
-		nft_set_elem_destroy(nft_trans_elem_set(trans),
-				     nft_trans_elem_priv(trans), true);
+		nft_trans_set_elem_destroy(&ctx, nft_trans_container_elem(trans));
 		break;
 	case NFT_MSG_NEWOBJ:
 		nft_obj_destroy(&ctx, nft_trans_obj(trans));
@@ -10860,18 +10968,15 @@ static int __nf_tables_abort(struct net *net, enum nfnl_abort_action action)
 			nft_trans_destroy(trans);
 			break;
 		case NFT_MSG_NEWSETELEM:
-			if (nft_trans_elem_update_flags(trans) ||
-			    nft_trans_elem_set_bound(trans)) {
+			if (nft_trans_elem_set_bound(trans)) {
 				nft_trans_destroy(trans);
 				break;
 			}
 			te = nft_trans_container_elem(trans);
-			if (!te->set->ops->abort ||
-			    nft_setelem_is_catchall(te->set, te->elem_priv))
-				nft_setelem_remove(net, te->set, te->elem_priv);
-
-			if (!nft_setelem_is_catchall(te->set, te->elem_priv))
-				atomic_dec(&te->set->nelems);
+			if (!nft_trans_elems_new_abort(&ctx, te)) {
+				nft_trans_destroy(trans);
+				break;
+			}
 
 			if (te->set->ops->abort &&
 			    list_empty(&te->set->pending_update)) {
@@ -10883,12 +10988,7 @@ static int __nf_tables_abort(struct net *net, enum nfnl_abort_action action)
 		case NFT_MSG_DESTROYSETELEM:
 			te = nft_trans_container_elem(trans);
 
-			if (!nft_setelem_active_next(net, te->set, te->elem_priv)) {
-				nft_setelem_data_activate(net, te->set, te->elem_priv);
-				nft_setelem_activate(net, te->set, te->elem_priv);
-			}
-			if (!nft_setelem_is_catchall(te->set, te->elem_priv))
-				te->set->ndeact--;
+			nft_trans_elems_destroy_abort(&ctx, te);
 
 			if (te->set->ops->abort &&
 			    list_empty(&te->set->pending_update)) {
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH nf-next v4 3/5] netfilter: nf_tables: preemptive fix for audit selftest failure
  2024-11-07 17:44 [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Florian Westphal
  2024-11-07 17:44 ` [PATCH nf-next v4 1/5] netfilter: nf_tables: add nft_trans_commit_list_add_elem helper Florian Westphal
  2024-11-07 17:44 ` [PATCH nf-next v4 2/5] netfilter: nf_tables: prepare for multiple elements in nft_trans_elem structure Florian Westphal
@ 2024-11-07 17:44 ` Florian Westphal
  2024-11-07 17:44 ` [PATCH nf-next v4 4/5] netfilter: nf_tables: switch trans_elem to real flex array Florian Westphal
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2024-11-07 17:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

nftables audit log format unfortunately leaks an implementation detail, the
transaction log size, to userspace:

    table=t1 family=2 entries=4 op=nft_register_set
                      ~~~~~~~~~

This 'entries' key is the number of transactions that will be applied.
The upcoming set element compression (add elem x to set s, add element y
to s would be placed in a single transaction request) would lower that
number to 3.

~ncrement the audit counter by the number of elements to keep the reported
entries value the same.

Without this, nft_audit.sh selftest fails because the recorded
(expected) entries key is smaller than the expected one.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_tables_api.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 5eab6f121684..bdf5ba21c76d 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -10397,9 +10397,26 @@ static void nf_tables_commit_audit_free(struct list_head *adl)
 	}
 }
 
+/* Silly, but existing test audit test cases require a count
+ * value derived from the (INTERNAL!) transaction log length.
+ *
+ * Thus, compaction of NEW/DELSETELEM breaks such tests.
+ */
+static unsigned int nf_tables_commit_audit_entrycount(const struct nft_trans *trans)
+{
+	switch (trans->msg_type) {
+	case NFT_MSG_NEWSETELEM:
+	case NFT_MSG_DELSETELEM:
+		return nft_trans_container_elem(trans)->nelems;
+	}
+
+	return 1;
+}
+
 static void nf_tables_commit_audit_collect(struct list_head *adl,
-					   struct nft_table *table, u32 op)
+					   const struct nft_trans *trans, u32 op)
 {
+	const struct nft_table *table = trans->table;
 	struct nft_audit_data *adp;
 
 	list_for_each_entry(adp, adl, list) {
@@ -10409,7 +10426,7 @@ static void nf_tables_commit_audit_collect(struct list_head *adl,
 	WARN_ONCE(1, "table=%s not expected in commit list", table->name);
 	return;
 found:
-	adp->entries++;
+	adp->entries += nf_tables_commit_audit_entrycount(trans);
 	if (!adp->op || adp->op > op)
 		adp->op = op;
 }
@@ -10568,7 +10585,7 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 
 		nft_ctx_update(&ctx, trans);
 
-		nf_tables_commit_audit_collect(&adl, table, trans->msg_type);
+		nf_tables_commit_audit_collect(&adl, trans, trans->msg_type);
 		switch (trans->msg_type) {
 		case NFT_MSG_NEWTABLE:
 			if (nft_trans_table_update(trans)) {
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH nf-next v4 4/5] netfilter: nf_tables: switch trans_elem to real flex array
  2024-11-07 17:44 [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Florian Westphal
                   ` (2 preceding siblings ...)
  2024-11-07 17:44 ` [PATCH nf-next v4 3/5] netfilter: nf_tables: preemptive fix for audit selftest failure Florian Westphal
@ 2024-11-07 17:44 ` Florian Westphal
  2024-11-13 10:15   ` Pablo Neira Ayuso
  2024-11-07 17:44 ` [PATCH nf-next v4 5/5] netfilter: nf_tables: allocate element update information dynamically Florian Westphal
  2024-11-12 18:42 ` [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Pablo Neira Ayuso
  5 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2024-11-07 17:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

When queueing a set element add or removal operation to the transaction
log, check if the previous operation already asks for a the identical
operation on the same set.

If so, store the element reference in the preceding operation.
This significantlty reduces memory consumption when many set add/delete
operations appear in a single transaction.

Example: 10k elements require 937kb of memory (10k allocations from
kmalloc-96 slab).

Assuming we can compact 4 elements in the same set, 468 kbytes
are needed (64 bytes for base struct, nft_trans_elemn, 32 bytes
for nft_trans_one_elem structure, so 2500 allocations from kmalloc-192
slab).

For large batch updates we can compact up to 62 elements
into one single nft_trans_elem structure (~65% mem reduction):
(64 bytes for base struct, nft_trans_elem, 32 byte for nft_trans_one_elem
 struct).

We can halve size of nft_trans_one_elem struct by moving
timeout/expire/update_flags into a dynamically allocated structure,
this allows to store 124 elements in a 2k slab nft_trans_elem struct.
This is done in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_tables_api.c | 71 +++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index bdf5ba21c76d..e96e538fe2eb 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -25,6 +25,7 @@
 
 #define NFT_MODULE_AUTOLOAD_LIMIT (MODULE_NAME_LEN - sizeof("nft-expr-255-"))
 #define NFT_SET_MAX_ANONLEN 16
+#define NFT_MAX_SET_NELEMS ((2048 - sizeof(struct nft_trans_elem)) / sizeof(struct nft_trans_one_elem))
 
 unsigned int nf_tables_net_id __read_mostly;
 
@@ -391,6 +392,69 @@ static void nf_tables_unregister_hook(struct net *net,
 	return __nf_tables_unregister_hook(net, table, chain, false);
 }
 
+static bool nft_trans_collapse_set_elem_allowed(const struct nft_trans_elem *a, const struct nft_trans_elem *b)
+{
+	return a->set == b->set && a->bound == b->bound && a->nelems < NFT_MAX_SET_NELEMS;
+}
+
+static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
+					struct nft_trans_elem *tail,
+					struct nft_trans_elem *trans,
+					gfp_t gfp)
+{
+	unsigned int nelems, old_nelems = tail->nelems;
+	struct nft_trans_elem *new_trans;
+
+	if (!nft_trans_collapse_set_elem_allowed(tail, trans))
+		return false;
+
+	if (WARN_ON_ONCE(trans->nelems != 1))
+		return false;
+
+	if (check_add_overflow(old_nelems, trans->nelems, &nelems))
+		return false;
+
+	/* krealloc might free tail which invalidates list pointers */
+	list_del_init(&tail->nft_trans.list);
+
+	new_trans = krealloc(tail, struct_size(tail, elems, nelems), gfp);
+	if (!new_trans) {
+		list_add_tail(&tail->nft_trans.list, &nft_net->commit_list);
+		return false;
+	}
+
+	INIT_LIST_HEAD(&new_trans->nft_trans.list);
+	new_trans->nelems = nelems;
+	new_trans->elems[old_nelems] = trans->elems[0];
+	list_add_tail(&new_trans->nft_trans.list, &nft_net->commit_list);
+
+	return true;
+}
+
+static bool nft_trans_try_collapse(struct nftables_pernet *nft_net,
+				   struct nft_trans *trans, gfp_t gfp)
+{
+	struct nft_trans *tail;
+
+	if (list_empty(&nft_net->commit_list))
+		return false;
+
+	tail = list_last_entry(&nft_net->commit_list, struct nft_trans, list);
+
+	if (tail->msg_type != trans->msg_type)
+		return false;
+
+	switch (trans->msg_type) {
+	case NFT_MSG_NEWSETELEM:
+	case NFT_MSG_DELSETELEM:
+		return nft_trans_collapse_set_elem(nft_net,
+						   nft_trans_container_elem(tail),
+						   nft_trans_container_elem(trans), gfp);
+	}
+
+	return false;
+}
+
 static void nft_trans_commit_list_add_tail(struct net *net, struct nft_trans *trans)
 {
 	struct nftables_pernet *nft_net = nft_pernet(net);
@@ -424,11 +488,18 @@ static void nft_trans_commit_list_add_tail(struct net *net, struct nft_trans *tr
 static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans,
 					   gfp_t gfp)
 {
+	struct nftables_pernet *nft_net = nft_pernet(net);
+
 	WARN_ON_ONCE(trans->msg_type != NFT_MSG_NEWSETELEM &&
 		     trans->msg_type != NFT_MSG_DELSETELEM);
 
 	might_alloc(gfp);
 
+	if (nft_trans_try_collapse(nft_net, trans, gfp)) {
+		kfree(trans);
+		return;
+	}
+
 	nft_trans_commit_list_add_tail(net, trans);
 }
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH nf-next v4 5/5] netfilter: nf_tables: allocate element update information dynamically
  2024-11-07 17:44 [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Florian Westphal
                   ` (3 preceding siblings ...)
  2024-11-07 17:44 ` [PATCH nf-next v4 4/5] netfilter: nf_tables: switch trans_elem to real flex array Florian Westphal
@ 2024-11-07 17:44 ` Florian Westphal
  2024-11-12 18:42 ` [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Pablo Neira Ayuso
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2024-11-07 17:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Move the timeout/expire/flag members from nft_trans_one_elem struct into
a dybamically allocated structure, only needed when timeout update was
requested.

This halves size of nft_trans_one_elem struct and allows to compact up to
124 elements in one transaction container rather than 62.

This halves memory requirements for a large flush or insert transaction,
where ->update remains NULL.

Care has to be taken to release the extra data in all spots, including
abort path.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables.h | 10 ++++--
 net/netfilter/nf_tables_api.c     | 57 +++++++++++++++++++------------
 2 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 14ccf1133a4c..4afa64c81304 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1755,11 +1755,15 @@ enum nft_trans_elem_flags {
 	NFT_TRANS_UPD_EXPIRATION	= (1 << 1),
 };
 
-struct nft_trans_one_elem {
-	struct nft_elem_priv		*priv;
+struct nft_elem_update {
 	u64				timeout;
 	u64				expiration;
-	u8				update_flags;
+	u8				flags;
+};
+
+struct nft_trans_one_elem {
+	struct nft_elem_priv		*priv;
+	struct nft_elem_update		*update;
 };
 
 struct nft_trans_elem {
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index e96e538fe2eb..95dc814a9059 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -6686,7 +6686,8 @@ static void nft_trans_set_elem_destroy(const struct nft_ctx *ctx, struct nft_tra
 	int i;
 
 	for (i = 0; i < te->nelems; i++) {
-		if (te->elems[i].update_flags)
+		/* skip update request, see nft_trans_elems_new_abort() */
+		if (!te->elems[i].priv)
 			continue;
 
 		__nft_set_elem_destroy(ctx, te->set, te->elems[i].priv, true);
@@ -6877,12 +6878,13 @@ static void nft_trans_elem_update(const struct nft_set *set,
 				  const struct nft_trans_one_elem *elem)
 {
 	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
+	const struct nft_elem_update *update = elem->update;
 
-	if (elem->update_flags & NFT_TRANS_UPD_TIMEOUT)
-		WRITE_ONCE(nft_set_ext_timeout(ext)->timeout, elem->timeout);
+	if (update->flags & NFT_TRANS_UPD_TIMEOUT)
+		WRITE_ONCE(nft_set_ext_timeout(ext)->timeout, update->timeout);
 
-	if (elem->update_flags & NFT_TRANS_UPD_EXPIRATION)
-		WRITE_ONCE(nft_set_ext_timeout(ext)->expiration, get_jiffies_64() + elem->expiration);
+	if (update->flags & NFT_TRANS_UPD_EXPIRATION)
+		WRITE_ONCE(nft_set_ext_timeout(ext)->expiration, get_jiffies_64() + update->expiration);
 }
 
 static void nft_trans_elems_add(const struct nft_ctx *ctx,
@@ -6891,15 +6893,16 @@ static void nft_trans_elems_add(const struct nft_ctx *ctx,
 	int i;
 
 	for (i = 0; i < te->nelems; i++) {
-		const struct nft_trans_one_elem *elem = &te->elems[i];
+		struct nft_trans_one_elem *elem = &te->elems[i];
 
-		if (elem->update_flags)
+		if (elem->update)
 			nft_trans_elem_update(te->set, elem);
 		else
 			nft_setelem_activate(ctx->net, te->set, elem->priv);
 
 		nf_tables_setelem_notify(ctx, te->set, elem->priv,
 					 NFT_MSG_NEWSETELEM);
+		kfree(elem->update);
 	}
 }
 
@@ -6991,6 +6994,8 @@ static void nft_trans_elems_remove(const struct nft_ctx *ctx,
 	int i;
 
 	for (i = 0; i < te->nelems; i++) {
+		WARN_ON_ONCE(te->elems[i].update);
+
 		nf_tables_setelem_notify(ctx, te->set,
 					 te->elems[i].priv,
 					 te->nft_trans.msg_type);
@@ -7039,7 +7044,6 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	struct nft_data_desc desc;
 	enum nft_registers dreg;
 	struct nft_trans *trans;
-	u8 update_flags;
 	u64 expiration;
 	u64 timeout;
 	int err, i;
@@ -7354,26 +7358,32 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 			else if (!(nlmsg_flags & NLM_F_EXCL)) {
 				err = 0;
 				if (nft_set_ext_exists(ext2, NFT_SET_EXT_TIMEOUT)) {
-					struct nft_trans_one_elem *update;
-
-					update = &nft_trans_container_elem(trans)->elems[0];
+					struct nft_elem_update update = { };
 
-					update_flags = 0;
 					if (timeout != nft_set_ext_timeout(ext2)->timeout) {
-						update->timeout = timeout;
+						update.timeout = timeout;
 						if (expiration == 0)
 							expiration = timeout;
 
-						update_flags |= NFT_TRANS_UPD_TIMEOUT;
+						update.flags |= NFT_TRANS_UPD_TIMEOUT;
 					}
 					if (expiration) {
-						update->expiration = expiration;
-						update_flags |= NFT_TRANS_UPD_EXPIRATION;
+						update.expiration = expiration;
+						update.flags |= NFT_TRANS_UPD_EXPIRATION;
 					}
 
-					if (update_flags) {
-						update->priv = elem_priv;
-						update->update_flags = update_flags;
+					if (update.flags) {
+						struct nft_trans_one_elem *ue;
+
+						ue = &nft_trans_container_elem(trans)->elems[0];
+
+						ue->update = kmemdup(&update, sizeof(update), GFP_KERNEL);
+						if (!ue->update) {
+							err = -ENOMEM;
+							goto err_element_clash;
+						}
+
+						ue->priv = elem_priv;
 						nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
 						goto err_elem_free;
 					}
@@ -7541,14 +7551,19 @@ void nft_setelem_data_deactivate(const struct net *net,
  * Returns true if set had been added to (i.e., elements need to be removed again).
  */
 static bool nft_trans_elems_new_abort(const struct nft_ctx *ctx,
-				      const struct nft_trans_elem *te)
+				      struct nft_trans_elem *te)
 {
 	bool removed = false;
 	int i;
 
 	for (i = 0; i < te->nelems; i++) {
-		if (te->elems[i].update_flags)
+		if (te->elems[i].update) {
+			kfree(te->elems[i].update);
+			te->elems[i].update = NULL;
+			/* Update request, so do not release this element */
+			te->elems[i].priv = NULL;
 			continue;
+		}
 
 		if (!te->set->ops->abort || nft_setelem_is_catchall(te->set, te->elems[i].priv))
 			nft_setelem_remove(ctx->net, te->set, te->elems[i].priv);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size
  2024-11-07 17:44 [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Florian Westphal
                   ` (4 preceding siblings ...)
  2024-11-07 17:44 ` [PATCH nf-next v4 5/5] netfilter: nf_tables: allocate element update information dynamically Florian Westphal
@ 2024-11-12 18:42 ` Pablo Neira Ayuso
  2024-11-12 20:44   ` Florian Westphal
  5 siblings, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2024-11-12 18:42 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Hi Florian,

This series looks good to me.

Regarding 3/5, I don't see any fix or anything silly in this.

>nftables audit log format unfortunately leaks an implementation detail, the
>transaction log size, to userspace:
>
>    table=t1 family=2 entries=4 op=nft_register_set
>                      ~~~~~~~~~
>
>This 'entries' key is the number of transactions that will be applied.

To my understanding, entries= is the number of entries that are either
added or updated in this transaction.

Before this patch, there was a 1:1 mapping between transaction and
elements, now this is not the case anymore.

If entries= exposes only the number of transactions, then this becomes
useless to userspace?

In iptables, it shows the number of entries in the table after the
update.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size
  2024-11-12 18:42 ` [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Pablo Neira Ayuso
@ 2024-11-12 20:44   ` Florian Westphal
  2024-11-13 10:19     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2024-11-12 20:44 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >nftables audit log format unfortunately leaks an implementation detail, the
> >transaction log size, to userspace:
> >
> >    table=t1 family=2 entries=4 op=nft_register_set
> >                      ~~~~~~~~~
> >
> >This 'entries' key is the number of transactions that will be applied.
> 
> To my understanding, entries= is the number of entries that are either
> added or updated in this transaction.
> 
> Before this patch, there was a 1:1 mapping between transaction and
> elements, now this is not the case anymore.
> 
> If entries= exposes only the number of transactions, then this becomes
> useless to userspace?

Hmm, I would need to know what this is supposed to be.
Its not going to be the same in either case,
iptables-legacy -A ... vs iptables-nft -A won't result in same
entries due to the whole-table-replace paradigm and introduction
of "update" mechanism also changes entries count.

I think its fine now, but please feel free to rewrite the commit
message if you think its needed.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf-next v4 4/5] netfilter: nf_tables: switch trans_elem to real flex array
  2024-11-07 17:44 ` [PATCH nf-next v4 4/5] netfilter: nf_tables: switch trans_elem to real flex array Florian Westphal
@ 2024-11-13 10:15   ` Pablo Neira Ayuso
  2024-11-13 11:04     ` Florian Westphal
  0 siblings, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2024-11-13 10:15 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Hi Florian,

I'm making another pass on this series, a few thing I would like to
ask, see below.

On Thu, Nov 07, 2024 at 06:44:08PM +0100, Florian Westphal wrote:
> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> index bdf5ba21c76d..e96e538fe2eb 100644
> --- a/net/netfilter/nf_tables_api.c
> +++ b/net/netfilter/nf_tables_api.c
> @@ -25,6 +25,7 @@
>  
>  #define NFT_MODULE_AUTOLOAD_LIMIT (MODULE_NAME_LEN - sizeof("nft-expr-255-"))
>  #define NFT_SET_MAX_ANONLEN 16
> +#define NFT_MAX_SET_NELEMS ((2048 - sizeof(struct nft_trans_elem)) / sizeof(struct nft_trans_one_elem))

This NFT_MAX_SET_NELEMS is to stay in a specific kmalloc-X?

What is the logic behind this NFT_MAX_SET_NELEMS?

>  unsigned int nf_tables_net_id __read_mostly;
>  
> @@ -391,6 +392,69 @@ static void nf_tables_unregister_hook(struct net *net,
>  	return __nf_tables_unregister_hook(net, table, chain, false);
>  }
>  
> +static bool nft_trans_collapse_set_elem_allowed(const struct nft_trans_elem *a, const struct nft_trans_elem *b)
> +{
> +	return a->set == b->set && a->bound == b->bound && a->nelems < NFT_MAX_SET_NELEMS;

I think this a->bound == b->bound check defensive.

This code is collapsing only two consecutive transactions, the one at
the tail (where nelems > 1) and the new transaction (where nelems ==
1).

bound state should only change in case there is a NEWRULE transaction
in between.

I am trying to find a error scenario where a->bound == b->bound
evaluates false. I considered the following:

   newelem -> newrule -> newelem

where newrule has these expressions:

   lookup -> error

in this case, newrule error path is exercised:

   nft_rule_expr_deactivate(&ctx, rule, NFT_TRANS_PREPARE_ERROR);

this calls nf_tables_deactivate_set() that calls
nft_set_trans_unbind(), then a->bound is restored to false. Rule is
released and no transaction is added.

Because if this succeeds:

   newelem -> newrule -> newelem

then no element collapsing can happen, because we only collapse what
is at the tail.

TLDR; Check does not harm, but it looks unlikely to happen to me.

one more comment below.

> +}
> +
> +static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
> +					struct nft_trans_elem *tail,
> +					struct nft_trans_elem *trans,
> +					gfp_t gfp)
> +{
> +	unsigned int nelems, old_nelems = tail->nelems;
> +	struct nft_trans_elem *new_trans;
> +
> +	if (!nft_trans_collapse_set_elem_allowed(tail, trans))
> +		return false;
> +
> +	if (WARN_ON_ONCE(trans->nelems != 1))
> +		return false;
> +
> +	if (check_add_overflow(old_nelems, trans->nelems, &nelems))
> +		return false;
> +
> +	/* krealloc might free tail which invalidates list pointers */
> +	list_del_init(&tail->nft_trans.list);
> +
> +	new_trans = krealloc(tail, struct_size(tail, elems, nelems), gfp);
> +	if (!new_trans) {
> +		list_add_tail(&tail->nft_trans.list, &nft_net->commit_list);
> +		return false;
> +	}
> +
> +	INIT_LIST_HEAD(&new_trans->nft_trans.list);

This initialization is also defensive, this element is added via
list_add_tail().

> +	new_trans->nelems = nelems;
> +	new_trans->elems[old_nelems] = trans->elems[0];
> +	list_add_tail(&new_trans->nft_trans.list, &nft_net->commit_list);
> +
> +	return true;
> +}

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size
  2024-11-12 20:44   ` Florian Westphal
@ 2024-11-13 10:19     ` Pablo Neira Ayuso
  0 siblings, 0 replies; 12+ messages in thread
From: Pablo Neira Ayuso @ 2024-11-13 10:19 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Tue, Nov 12, 2024 at 09:44:36PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > >nftables audit log format unfortunately leaks an implementation detail, the
> > >transaction log size, to userspace:
> > >
> > >    table=t1 family=2 entries=4 op=nft_register_set
> > >                      ~~~~~~~~~
> > >
> > >This 'entries' key is the number of transactions that will be applied.
> > 
> > To my understanding, entries= is the number of entries that are either
> > added or updated in this transaction.
> > 
> > Before this patch, there was a 1:1 mapping between transaction and
> > elements, now this is not the case anymore.
> > 
> > If entries= exposes only the number of transactions, then this becomes
> > useless to userspace?
> 
> Hmm, I would need to know what this is supposed to be.
> Its not going to be the same in either case,
> iptables-legacy -A ... vs iptables-nft -A won't result in same
> entries due to the whole-table-replace paradigm and introduction
> of "update" mechanism also changes entries count.

Right, there is a change between -legacy and -nft regarding audit.

> I think its fine now, but please feel free to rewrite the commit
> message if you think its needed.

Thanks, I will make an edit.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf-next v4 4/5] netfilter: nf_tables: switch trans_elem to real flex array
  2024-11-13 10:15   ` Pablo Neira Ayuso
@ 2024-11-13 11:04     ` Florian Westphal
  2024-11-13 11:11       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2024-11-13 11:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> I'm making another pass on this series, a few thing I would like to
> ask, see below.
> 
> On Thu, Nov 07, 2024 at 06:44:08PM +0100, Florian Westphal wrote:
> > diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> > index bdf5ba21c76d..e96e538fe2eb 100644
> > --- a/net/netfilter/nf_tables_api.c
> > +++ b/net/netfilter/nf_tables_api.c
> > @@ -25,6 +25,7 @@
> >  
> >  #define NFT_MODULE_AUTOLOAD_LIMIT (MODULE_NAME_LEN - sizeof("nft-expr-255-"))
> >  #define NFT_SET_MAX_ANONLEN 16
> > +#define NFT_MAX_SET_NELEMS ((2048 - sizeof(struct nft_trans_elem)) / sizeof(struct nft_trans_one_elem))
> 
> This NFT_MAX_SET_NELEMS is to stay in a specific kmalloc-X?
> 
> What is the logic behind this NFT_MAX_SET_NELEMS?

I want to avoid making huge kmalloc requests, plus avoid huge krealloc
overhead.

I think that kmalloc-2048 slab is a good fit.
I can add a comment, or increase to kmalloc-4096 but I'd prefer to
not go over that, since kmalloc allocations > 1 page are more prone
to allocation failure.

> >  unsigned int nf_tables_net_id __read_mostly;
> >  
> > @@ -391,6 +392,69 @@ static void nf_tables_unregister_hook(struct net *net,
> >  	return __nf_tables_unregister_hook(net, table, chain, false);
> >  }
> >  
> > +static bool nft_trans_collapse_set_elem_allowed(const struct nft_trans_elem *a, const struct nft_trans_elem *b)
> > +{
> > +	return a->set == b->set && a->bound == b->bound && a->nelems < NFT_MAX_SET_NELEMS;
> 
> I think this a->bound == b->bound check defensive.
> 
> This code is collapsing only two consecutive transactions, the one at
> the tail (where nelems > 1) and the new transaction (where nelems ==
> 1).

Yes.

> bound state should only change in case there is a NEWRULE transaction
> in between.

Yes.

> I am trying to find a error scenario where a->bound == b->bound
> evaluates false. I considered the following:
> 
>    newelem -> newrule -> newelem
> 
> where newrule has these expressions:
> 
>    lookup -> error
> 
> in this case, newrule error path is exercised:
> 
>    nft_rule_expr_deactivate(&ctx, rule, NFT_TRANS_PREPARE_ERROR);
> 
> this calls nf_tables_deactivate_set() that calls
> nft_set_trans_unbind(), then a->bound is restored to false. Rule is
> released and no transaction is added.
> 
> Because if this succeeds:
> 
>    newelem -> newrule -> newelem
> 
> then no element collapsing can happen, because we only collapse what
> is at the tail.
> 
> TLDR; Check does not harm, but it looks unlikely to happen to me.

Yes, its defensive check.  I could add a comment.
The WARN_ON_ONCE for trans->nelems != 1 exists for same reason.

> > +}
> > +
> > +static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
> > +					struct nft_trans_elem *tail,
> > +					struct nft_trans_elem *trans,
> > +					gfp_t gfp)
> > +{
> > +	unsigned int nelems, old_nelems = tail->nelems;
> > +	struct nft_trans_elem *new_trans;
> > +
> > +	if (!nft_trans_collapse_set_elem_allowed(tail, trans))
> > +		return false;
> > +
> > +	if (WARN_ON_ONCE(trans->nelems != 1))
> > +		return false;
> > +
> > +	if (check_add_overflow(old_nelems, trans->nelems, &nelems))
> > +		return false;
> > +
> > +	/* krealloc might free tail which invalidates list pointers */
> > +	list_del_init(&tail->nft_trans.list);
> > +
> > +	new_trans = krealloc(tail, struct_size(tail, elems, nelems), gfp);
> > +	if (!new_trans) {
> > +		list_add_tail(&tail->nft_trans.list, &nft_net->commit_list);
> > +		return false;
> > +	}
> > +
> > +	INIT_LIST_HEAD(&new_trans->nft_trans.list);
> 
> This initialization is also defensive, this element is added via
> list_add_tail().

Yes, the first arg to list_add(_tail) can live without initialisation.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf-next v4 4/5] netfilter: nf_tables: switch trans_elem to real flex array
  2024-11-13 11:04     ` Florian Westphal
@ 2024-11-13 11:11       ` Pablo Neira Ayuso
  0 siblings, 0 replies; 12+ messages in thread
From: Pablo Neira Ayuso @ 2024-11-13 11:11 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Wed, Nov 13, 2024 at 12:04:05PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > I'm making another pass on this series, a few thing I would like to
> > ask, see below.
> > 
> > On Thu, Nov 07, 2024 at 06:44:08PM +0100, Florian Westphal wrote:
> > > diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> > > index bdf5ba21c76d..e96e538fe2eb 100644
> > > --- a/net/netfilter/nf_tables_api.c
> > > +++ b/net/netfilter/nf_tables_api.c
> > > @@ -25,6 +25,7 @@
> > >  
> > >  #define NFT_MODULE_AUTOLOAD_LIMIT (MODULE_NAME_LEN - sizeof("nft-expr-255-"))
> > >  #define NFT_SET_MAX_ANONLEN 16
> > > +#define NFT_MAX_SET_NELEMS ((2048 - sizeof(struct nft_trans_elem)) / sizeof(struct nft_trans_one_elem))
> > 
> > This NFT_MAX_SET_NELEMS is to stay in a specific kmalloc-X?
> > 
> > What is the logic behind this NFT_MAX_SET_NELEMS?
> 
> I want to avoid making huge kmalloc requests, plus avoid huge krealloc
> overhead.
> 
> I think that kmalloc-2048 slab is a good fit.
> I can add a comment, or increase to kmalloc-4096 but I'd prefer to
> not go over that, since kmalloc allocations > 1 page are more prone
> to allocation failure.

Makes sense as it is now, thanks for explaining.

> > >  unsigned int nf_tables_net_id __read_mostly;
> > >  
> > > @@ -391,6 +392,69 @@ static void nf_tables_unregister_hook(struct net *net,
> > >  	return __nf_tables_unregister_hook(net, table, chain, false);
> > >  }
> > >  
> > > +static bool nft_trans_collapse_set_elem_allowed(const struct nft_trans_elem *a, const struct nft_trans_elem *b)
> > > +{
> > > +	return a->set == b->set && a->bound == b->bound && a->nelems < NFT_MAX_SET_NELEMS;
> > 
> > I think this a->bound == b->bound check defensive.
> > 
> > This code is collapsing only two consecutive transactions, the one at
> > the tail (where nelems > 1) and the new transaction (where nelems ==
> > 1).
> 
> Yes.
> 
> > bound state should only change in case there is a NEWRULE transaction
> > in between.
> 
> Yes.
> 
> > I am trying to find a error scenario where a->bound == b->bound
> > evaluates false. I considered the following:
> > 
> >    newelem -> newrule -> newelem
> > 
> > where newrule has these expressions:
> > 
> >    lookup -> error
> > 
> > in this case, newrule error path is exercised:
> > 
> >    nft_rule_expr_deactivate(&ctx, rule, NFT_TRANS_PREPARE_ERROR);
> > 
> > this calls nf_tables_deactivate_set() that calls
> > nft_set_trans_unbind(), then a->bound is restored to false. Rule is
> > released and no transaction is added.
> > 
> > Because if this succeeds:
> > 
> >    newelem -> newrule -> newelem
> > 
> > then no element collapsing can happen, because we only collapse what
> > is at the tail.
> > 
> > TLDR; Check does not harm, but it looks unlikely to happen to me.
> 
> Yes, its defensive check.  I could add a comment.

Please, do it so we don't forget about this subtle detail.

> The WARN_ON_ONCE for trans->nelems != 1 exists for same reason.

Right.

> > > +}
> > > +
> > > +static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
> > > +					struct nft_trans_elem *tail,
> > > +					struct nft_trans_elem *trans,
> > > +					gfp_t gfp)
> > > +{
> > > +	unsigned int nelems, old_nelems = tail->nelems;
> > > +	struct nft_trans_elem *new_trans;
> > > +
> > > +	if (!nft_trans_collapse_set_elem_allowed(tail, trans))
> > > +		return false;
> > > +
> > > +	if (WARN_ON_ONCE(trans->nelems != 1))
> > > +		return false;
> > > +
> > > +	if (check_add_overflow(old_nelems, trans->nelems, &nelems))
> > > +		return false;
> > > +
> > > +	/* krealloc might free tail which invalidates list pointers */
> > > +	list_del_init(&tail->nft_trans.list);
> > > +
> > > +	new_trans = krealloc(tail, struct_size(tail, elems, nelems), gfp);
> > > +	if (!new_trans) {
> > > +		list_add_tail(&tail->nft_trans.list, &nft_net->commit_list);
> > > +		return false;
> > > +	}
> > > +
> > > +	INIT_LIST_HEAD(&new_trans->nft_trans.list);
> > 
> > This initialization is also defensive, this element is added via
> > list_add_tail().
> 
> Yes, the first arg to list_add(_tail) can live without initialisation.

Let's remove it then.

Would you submit a new revision with all these little nitpicks?

Then you also have a chance to edit your explaination on the audit
aspect of this series.

If you are busy with other stuff I can take a look here, just let me know.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-11-13 11:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-07 17:44 [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Florian Westphal
2024-11-07 17:44 ` [PATCH nf-next v4 1/5] netfilter: nf_tables: add nft_trans_commit_list_add_elem helper Florian Westphal
2024-11-07 17:44 ` [PATCH nf-next v4 2/5] netfilter: nf_tables: prepare for multiple elements in nft_trans_elem structure Florian Westphal
2024-11-07 17:44 ` [PATCH nf-next v4 3/5] netfilter: nf_tables: preemptive fix for audit selftest failure Florian Westphal
2024-11-07 17:44 ` [PATCH nf-next v4 4/5] netfilter: nf_tables: switch trans_elem to real flex array Florian Westphal
2024-11-13 10:15   ` Pablo Neira Ayuso
2024-11-13 11:04     ` Florian Westphal
2024-11-13 11:11       ` Pablo Neira Ayuso
2024-11-07 17:44 ` [PATCH nf-next v4 5/5] netfilter: nf_tables: allocate element update information dynamically Florian Westphal
2024-11-12 18:42 ` [PATCH nf-next v4 0/5] netfilter: nf_tables: reduce set element transaction size Pablo Neira Ayuso
2024-11-12 20:44   ` Florian Westphal
2024-11-13 10:19     ` Pablo Neira Ayuso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox