* [PATCH AUTOSEL 6.17] netfilter: nf_tables: all transaction allocations can now sleep
[not found] <20251025160905.3857885-1-sashal@kernel.org>
@ 2025-10-25 15:57 ` Sasha Levin
2025-10-25 16:01 ` [PATCH AUTOSEL 6.17-5.15] netfilter: nf_reject: don't reply to icmp error messages Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2025-10-25 15:57 UTC (permalink / raw)
To: patches, stable
Cc: Florian Westphal, Sasha Levin, pablo, kadlec, netfilter-devel,
coreteam
From: Florian Westphal <fw@strlen.de>
[ Upstream commit 3d95a2e016abab29ccb6f384576b2038e544a5a8 ]
Now that nft_setelem_flush is no longer called with the rcu read lock
held or with softirqs disabled, it can use GFP_KERNEL too.
This is the last atomic allocation of transaction elements, so remove
all gfp_t arguments and the wrapper function.
This makes attempts to delete large sets much more reliable; before,
this was prone to transient memory allocation failures.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Explanation
What changed
- Switched all nf_tables transaction element allocations to use
sleepable allocations and removed gfp plumbing.
- `nft_trans_alloc()` now always uses `kzalloc(..., GFP_KERNEL)`;
the parameterized `nft_trans_alloc_gfp()` wrapper (which allowed
`GFP_ATOMIC`) is removed: net/netfilter/nf_tables_api.c:154.
- Set-element transaction collapsing uses `krealloc(..., GFP_KERNEL)`:
net/netfilter/nf_tables_api.c:457.
- Removed gfp_t parameters from commit-list helpers and collapse
logic; `nft_trans_commit_list_add_elem()` no longer takes/propagates
gfp and simply collapses or enqueues:
net/netfilter/nf_tables_api.c:535.
- Bulk element flush paths allocate transactions and enqueue them
without any atomic gfp flags:
- `nft_setelem_flush()` allocates with `nft_trans_alloc(...
GFP_KERNEL)` and enqueues via `nft_trans_commit_list_add_elem()`:
net/netfilter/nf_tables_api.c:7872,
net/netfilter/nf_tables_api.c:7893.
- Catchall flush similarly enqueues with the new helper:
net/netfilter/nf_tables_api.c:7906,
net/netfilter/nf_tables_api.c:7912.
Why this is safe now (sleepable paths are guaranteed)
- The iter callback used for set-element flushing (`.fn =
nft_setelem_flush`) runs in a context where sleeping is allowed during
UPDATE walks:
- Hash backend creates a snapshot under RCU and then iterates that
snapshot with `commit_mutex` held so the callback “can sleep”
(explicitly documented): net/netfilter/nft_set_hash.c:324–332,
372–383.
- Rbtree backend asserts `commit_mutex` for UPDATE walks and calls
`iter->fn` under that mutex (no BH/RCU read section):
net/netfilter/nft_set_rbtree.c:609–627.
- Bitmap backend traverses with RCU but protected by `commit_mutex`
(writer lock allowed for traversal) and calls `iter->fn` under that
protection (sleepable): net/netfilter/nft_set_bitmap.c:230–241.
- The bulk delete entry point (`nf_tables_delsetelem`) sets up UPDATE
walk (`.type = NFT_ITER_UPDATE`) and uses `nft_set_flush()` which
wires `nft_setelem_flush` as the callback, ensuring it executes in the
above sleepable contexts: net/netfilter/nf_tables_api.c:7940–7955.
- Transaction element additions/deletions are performed from netlink
processing paths (process context), not hardirq/softirq handlers, so
allocating with `GFP_KERNEL` is appropriate in all call sites shown by
the new helpers: net/netfilter/nf_tables_api.c:7568–7597, 7847–7895,
7906–7912.
User-visible impact (bug fix)
- Deleting large sets previously used `GFP_ATOMIC` along the
flush/commit-item path, making allocations prone to transient failures
under memory pressure. By switching to `GFP_KERNEL` and permitting
sleeping, large set deletions become substantially more reliable,
aligning with the commit message intent.
- The collapsing path (`nft_trans_collapse_set_elem`) that may
`krealloc` the transaction to coalesce elements now does so with
`GFP_KERNEL`, reducing failure rates:
net/netfilter/nf_tables_api.c:457–468.
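As a userspace sketch, the collapse-or-enqueue pattern above can be illustrated with plain realloc() standing in for krealloc(). The struct and function names (`elem_trans`, `collapse`) are illustrative only, not the kernel API; the key property mirrored here is that on allocation failure the original tail transaction stays valid and the caller falls back to enqueueing a separate transaction:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for struct nft_trans_elem: a growable
 * transaction batching element pointers in a flexible array. */
struct elem_trans {
	size_t nelems;
	void *elems[];
};

/* Try to fold n extra elements into `tail`, mirroring the
 * krealloc()-based nft_trans_collapse_set_elem(): on allocation
 * failure the old allocation is untouched, so the caller can keep
 * `tail` on the commit list and enqueue a new transaction instead. */
static struct elem_trans *collapse(struct elem_trans *tail,
				   void **extra, size_t n)
{
	size_t total = tail->nelems + n;
	struct elem_trans *grown;

	grown = realloc(tail, sizeof(*grown) + total * sizeof(void *));
	if (!grown)
		return NULL;	/* caller keeps using the old tail */

	memcpy(&grown->elems[grown->nelems], extra, n * sizeof(void *));
	grown->nelems = total;
	return grown;
}
```

Because realloc()/krealloc() may move the allocation, the kernel code must delist the tail first ("krealloc might free tail which invalidates list pointers" in the hunk above) and re-add whichever pointer survives.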
Scope and risk
- Scope: confined to nf_tables internal transaction allocation/path; no
UAPI changes, no semantic changes to ruleset behavior.
- Architectural changes: none; this is a cleanup following prior design
changes that made iter callbacks sleepable.
- Side effects: allocations may now sleep. This is intended and correct
given the current walk/flush call paths hold `commit_mutex` or operate
on snapshot lists designed for sleepable callbacks, as shown above.
- Regression risk: low in trees that already have the reworked set
walk/flush semantics (snapshot under RCU or commit_mutex-protected
UPDATE walks). If a target stable tree does not include those enabling
changes, backporting this patch alone would be unsafe because it could
sleep in contexts that used to run under RCU read lock or with
softirqs disabled. In such cases, this commit should be backported
together with the prerequisite walk/flush changes (e.g., the
hash/rbtree/bitmap UPDATE-walk designs that explicitly allow
sleeping).
Conclusion
- This is a contained reliability fix that removes the last atomic
allocation along nf_tables transaction element paths and is consistent
with the current, sleepable UPDATE-walk design. It reduces transient
ENOMEM failures when deleting large sets with minimal risk, provided
the prerequisite sleepable-walk changes are present in the target
stable series. Recommended for stable backport with that dependency
consideration.
net/netfilter/nf_tables_api.c | 47 ++++++++++++++---------------------
1 file changed, 19 insertions(+), 28 deletions(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index c3c73411c40c4..eed434e0a9702 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -151,12 +151,12 @@ static void nft_ctx_init(struct nft_ctx *ctx,
bitmap_zero(ctx->reg_inited, NFT_REG32_NUM);
}
-static struct nft_trans *nft_trans_alloc_gfp(const struct nft_ctx *ctx,
- int msg_type, u32 size, gfp_t gfp)
+static struct nft_trans *nft_trans_alloc(const struct nft_ctx *ctx,
+ int msg_type, u32 size)
{
struct nft_trans *trans;
- trans = kzalloc(size, gfp);
+ trans = kzalloc(size, GFP_KERNEL);
if (trans == NULL)
return NULL;
@@ -172,12 +172,6 @@ static struct nft_trans *nft_trans_alloc_gfp(const struct nft_ctx *ctx,
return trans;
}
-static struct nft_trans *nft_trans_alloc(const struct nft_ctx *ctx,
- int msg_type, u32 size)
-{
- return nft_trans_alloc_gfp(ctx, msg_type, size, GFP_KERNEL);
-}
-
static struct nft_trans_binding *nft_trans_get_binding(struct nft_trans *trans)
{
switch (trans->msg_type) {
@@ -442,8 +436,7 @@ static bool nft_trans_collapse_set_elem_allowed(const struct nft_trans_elem *a,
static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
struct nft_trans_elem *tail,
- struct nft_trans_elem *trans,
- gfp_t gfp)
+ struct nft_trans_elem *trans)
{
unsigned int nelems, old_nelems = tail->nelems;
struct nft_trans_elem *new_trans;
@@ -466,9 +459,11 @@ static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
/* krealloc might free tail which invalidates list pointers */
list_del_init(&tail->nft_trans.list);
- new_trans = krealloc(tail, struct_size(tail, elems, nelems), gfp);
+ new_trans = krealloc(tail, struct_size(tail, elems, nelems),
+ GFP_KERNEL);
if (!new_trans) {
- list_add_tail(&tail->nft_trans.list, &nft_net->commit_list);
+ list_add_tail(&tail->nft_trans.list,
+ &nft_net->commit_list);
return false;
}
@@ -484,7 +479,7 @@ static bool nft_trans_collapse_set_elem(struct nftables_pernet *nft_net,
}
static bool nft_trans_try_collapse(struct nftables_pernet *nft_net,
- struct nft_trans *trans, gfp_t gfp)
+ struct nft_trans *trans)
{
struct nft_trans *tail;
@@ -501,7 +496,7 @@ static bool nft_trans_try_collapse(struct nftables_pernet *nft_net,
case NFT_MSG_DELSETELEM:
return nft_trans_collapse_set_elem(nft_net,
nft_trans_container_elem(tail),
- nft_trans_container_elem(trans), gfp);
+ nft_trans_container_elem(trans));
}
return false;
@@ -537,17 +532,14 @@ static void nft_trans_commit_list_add_tail(struct net *net, struct nft_trans *tr
}
}
-static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans,
- gfp_t gfp)
+static void nft_trans_commit_list_add_elem(struct net *net, struct nft_trans *trans)
{
struct nftables_pernet *nft_net = nft_pernet(net);
WARN_ON_ONCE(trans->msg_type != NFT_MSG_NEWSETELEM &&
trans->msg_type != NFT_MSG_DELSETELEM);
- might_alloc(gfp);
-
- if (nft_trans_try_collapse(nft_net, trans, gfp)) {
+ if (nft_trans_try_collapse(nft_net, trans)) {
kfree(trans);
return;
}
@@ -7573,7 +7565,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
}
ue->priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
goto err_elem_free;
}
}
@@ -7597,7 +7589,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
}
nft_trans_container_elem(trans)->elems[0].priv = elem.priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
err_set_full:
@@ -7863,7 +7855,7 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
nft_setelem_data_deactivate(ctx->net, set, elem.priv);
nft_trans_container_elem(trans)->elems[0].priv = elem.priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
fail_ops:
@@ -7888,9 +7880,8 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
if (!nft_set_elem_active(ext, iter->genmask))
return 0;
- trans = nft_trans_alloc_gfp(ctx, NFT_MSG_DELSETELEM,
- struct_size_t(struct nft_trans_elem, elems, 1),
- GFP_ATOMIC);
+ trans = nft_trans_alloc(ctx, NFT_MSG_DELSETELEM,
+ struct_size_t(struct nft_trans_elem, elems, 1));
if (!trans)
return -ENOMEM;
@@ -7901,7 +7892,7 @@ static int nft_setelem_flush(const struct nft_ctx *ctx,
nft_trans_elem_set(trans) = set;
nft_trans_container_elem(trans)->nelems = 1;
nft_trans_container_elem(trans)->elems[0].priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_ATOMIC);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
}
@@ -7918,7 +7909,7 @@ static int __nft_set_catchall_flush(const struct nft_ctx *ctx,
nft_setelem_data_deactivate(ctx->net, set, elem_priv);
nft_trans_container_elem(trans)->elems[0].priv = elem_priv;
- nft_trans_commit_list_add_elem(ctx->net, trans, GFP_KERNEL);
+ nft_trans_commit_list_add_elem(ctx->net, trans);
return 0;
}
--
2.51.0
* [PATCH AUTOSEL 6.17-5.15] netfilter: nf_reject: don't reply to icmp error messages
[not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] netfilter: nf_tables: all transaction allocations can now sleep Sasha Levin
@ 2025-10-25 16:01 ` Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2025-10-25 16:01 UTC (permalink / raw)
To: patches, stable
Cc: Florian Westphal, Sasha Levin, pablo, kadlec, netfilter-devel,
coreteam
From: Florian Westphal <fw@strlen.de>
[ Upstream commit db99b2f2b3e2cd8227ac9990ca4a8a31a1e95e56 ]
tcp reject code won't reply to a tcp reset.
But the icmp reject 'netdev' family versions will reply to icmp
dst-unreach errors, unlike icmp_send() and icmp6_send() which are used
by the inet family implementation (and internally by the REJECT target).
Check the icmp(6) type and do not respond if it's an unreachable error.
Without this, something like 'ip protocol icmp reject', when used
in a netdev chain attached to 'lo', causes a packet loop.
Same for two hosts that both use such a rule: each error packet
will be replied to.
Such a situation persists until the (bogus) rule is amended to
ratelimit or check the icmp type before the reject statement.
As the inet versions don't do this, make the netdev ones follow along.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- What it fixes
- Prevents ICMP error storms/loops when REJECT is used in
netdev/bridge chains. Example from the commit: an “ip protocol icmp
reject” rule on `lo` will reply to the ICMP error it just generated,
causing a loop. Same problem occurs between two hosts with such
rules (each error elicits another error).
- Aligns netdev/bridge behavior with inet-family REJECT, which already
does not reply to ICMP errors (per RFC guidance).
- IPv4 change (small, contained)
- Adds ICMP type check helper: only treat the packet as ICMP
unreachable if its type is `ICMP_DEST_UNREACH`.
- net/ipv4/netfilter/nf_reject_ipv4.c:83
- Suppresses generating an ICMP unreachable in response to an ICMP
unreachable:
- Early return when input is ICMP unreachable:
net/ipv4/netfilter/nf_reject_ipv4.c:124
- The rest of the function remains unchanged (length limit 536,
checksums, header build), so behavior is identical except for
skipping replies to ICMP unreachable.
- Symmetry with TCP RST path already present (no RST in response to
RST): net/ipv4/netfilter/nf_reject_ipv4.c:179
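For illustration, the same type check can be reproduced in userspace against a raw packet buffer. This is a hedged sketch, not the kernel code: glibc's netinet headers stand in for the kernel ones, a flat buffer plus bounds checks replaces skb_header_pointer(), and the helper name `is_icmp_unreach` is made up for the example:

```c
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace sketch of nf_skb_is_icmp_unreach(): given a buffer
 * holding an IPv4 packet, report whether it carries an ICMP
 * destination-unreachable error. */
static bool is_icmp_unreach(const uint8_t *pkt, size_t len)
{
	struct iphdr ip;
	size_t thoff;

	if (len < sizeof(ip))
		return false;
	memcpy(&ip, pkt, sizeof(ip));

	if (ip.protocol != IPPROTO_ICMP)
		return false;

	/* ihl is the IPv4 header length in 32-bit words */
	thoff = (size_t)ip.ihl * 4;
	if (len < thoff + offsetof(struct icmphdr, type) + 1)
		return false;

	return pkt[thoff + offsetof(struct icmphdr, type)] ==
	       ICMP_DEST_UNREACH;
}
```

A REJECT implementation would consult such a predicate before building the unreachable reply, returning early (as the hunk below does) when it holds.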
- IPv6 change (small, contained)
- Adds ICMPv6 type check helper using `ipv6_skip_exthdr()` and
`skb_header_pointer()`; only treat as ICMPv6 unreachable if
`icmp6_type` is `ICMPV6_DEST_UNREACH`:
- net/ipv6/netfilter/nf_reject_ipv6.c:107
- Suppresses generating an ICMPv6 unreachable in response to an ICMPv6
unreachable:
- Early return when input is ICMPv6 unreachable:
net/ipv6/netfilter/nf_reject_ipv6.c:146
- Rest of path (length cap ≈ minimum IPv6 MTU, checksum, header build)
is unchanged.
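The IPv6 side can be sketched the same way, with one important simplification: ipv6_skip_exthdr() has no userspace equivalent, so this illustrative helper (name `is_icmp6_unreach` is invented here) only handles the common case of ICMPv6 directly after the fixed 40-byte IPv6 header, whereas the kernel helper also walks extension headers and rejects fragments:

```c
#include <netinet/icmp6.h>
#include <netinet/ip6.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified userspace sketch of nf_skb_is_icmp6_unreach():
 * assumes no IPv6 extension headers between the fixed header
 * and the ICMPv6 header. Note glibc spells the type constant
 * ICMP6_DST_UNREACH, vs. the kernel's ICMPV6_DEST_UNREACH. */
static bool is_icmp6_unreach(const uint8_t *pkt, size_t len)
{
	struct ip6_hdr ip6;

	if (len < sizeof(ip6))
		return false;
	memcpy(&ip6, pkt, sizeof(ip6));

	if (ip6.ip6_nxt != IPPROTO_ICMPV6)
		return false;

	if (len < sizeof(ip6) + offsetof(struct icmp6_hdr, icmp6_type) + 1)
		return false;

	return pkt[sizeof(ip6) + offsetof(struct icmp6_hdr, icmp6_type)] ==
	       ICMP6_DST_UNREACH;
}
```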
- Scope and impact
- Affects only the nf_reject helpers used by netdev/bridge REJECT
expressions:
- net/netfilter/nft_reject_netdev.c calls into
`nf_reject_skb_v4_unreach()` and `nf_reject_skb_v6_unreach()`
(e.g., net/netfilter/nft_reject_netdev.c:48,
net/netfilter/nft_reject_netdev.c:77).
- net/bridge/netfilter/nft_reject_bridge.c likewise uses the same
helpers (e.g., net/bridge/netfilter/nft_reject_bridge.c:68,
net/bridge/netfilter/nft_reject_bridge.c:101).
- Inet-family REJECT is unchanged and already uses stack helpers that
refuse to reply to ICMP errors:
- IPv4 inet path uses `icmp_send()`
(net/ipv4/netfilter/nf_reject_ipv4.c:346), which already avoids
generating errors in response to errors.
- IPv6 inet path uses `icmpv6_send()`
(net/ipv6/netfilter/nf_reject_ipv6.c:446) with similar behavior.
- No ABI or architectural changes; only introduces small static
helpers and early returns. The behavior change is to refrain from
sending an error in response to an error, which is RFC-compliant and
reduces risk of loops and traffic amplification.
- Risk assessment for stable
- Minimal regression risk: change is narrowly targeted and only
suppresses replies for ICMP/ICMPv6 unreachable error packets.
- Fixes a real user-facing bug (loops on `lo`, cross-host error
storms) without adding features.
- Matches existing inet behavior, improving consistency across
families.
Given the clear bug fix, small and contained change, alignment with inet
behavior and RFC guidance, and low regression risk, this is a good
candidate for backporting to stable trees.
net/ipv4/netfilter/nf_reject_ipv4.c | 25 ++++++++++++++++++++++++
net/ipv6/netfilter/nf_reject_ipv6.c | 30 +++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 0d3cb2ba6fc84..a7a3439fe7800 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -71,6 +71,27 @@ struct sk_buff *nf_reject_skb_v4_tcp_reset(struct net *net,
}
EXPORT_SYMBOL_GPL(nf_reject_skb_v4_tcp_reset);
+static bool nf_skb_is_icmp_unreach(const struct sk_buff *skb)
+{
+ const struct iphdr *iph = ip_hdr(skb);
+ u8 *tp, _type;
+ int thoff;
+
+ if (iph->protocol != IPPROTO_ICMP)
+ return false;
+
+ thoff = skb_network_offset(skb) + sizeof(*iph);
+
+ tp = skb_header_pointer(skb,
+ thoff + offsetof(struct icmphdr, type),
+ sizeof(_type), &_type);
+
+ if (!tp)
+ return false;
+
+ return *tp == ICMP_DEST_UNREACH;
+}
+
struct sk_buff *nf_reject_skb_v4_unreach(struct net *net,
struct sk_buff *oldskb,
const struct net_device *dev,
@@ -91,6 +112,10 @@ struct sk_buff *nf_reject_skb_v4_unreach(struct net *net,
if (ip_hdr(oldskb)->frag_off & htons(IP_OFFSET))
return NULL;
+ /* don't reply to ICMP_DEST_UNREACH with ICMP_DEST_UNREACH. */
+ if (nf_skb_is_icmp_unreach(oldskb))
+ return NULL;
+
/* RFC says return as much as we can without exceeding 576 bytes. */
len = min_t(unsigned int, 536, oldskb->len);
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index cb2d38e80de9a..3c56e94e6943b 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -91,6 +91,32 @@ struct sk_buff *nf_reject_skb_v6_tcp_reset(struct net *net,
}
EXPORT_SYMBOL_GPL(nf_reject_skb_v6_tcp_reset);
+static bool nf_skb_is_icmp6_unreach(const struct sk_buff *skb)
+{
+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+ u8 proto = ip6h->nexthdr;
+ u8 _type, *tp;
+ int thoff;
+ __be16 fo;
+
+ thoff = ipv6_skip_exthdr(skb, ((u8 *)(ip6h + 1) - skb->data), &proto, &fo);
+
+ if (thoff < 0 || thoff >= skb->len || fo != 0)
+ return false;
+
+ if (proto != IPPROTO_ICMPV6)
+ return false;
+
+ tp = skb_header_pointer(skb,
+ thoff + offsetof(struct icmp6hdr, icmp6_type),
+ sizeof(_type), &_type);
+
+ if (!tp)
+ return false;
+
+ return *tp == ICMPV6_DEST_UNREACH;
+}
+
struct sk_buff *nf_reject_skb_v6_unreach(struct net *net,
struct sk_buff *oldskb,
const struct net_device *dev,
@@ -104,6 +130,10 @@ struct sk_buff *nf_reject_skb_v6_unreach(struct net *net,
if (!nf_reject_ip6hdr_validate(oldskb))
return NULL;
+ /* Don't reply to ICMPV6_DEST_UNREACH with ICMPV6_DEST_UNREACH */
+ if (nf_skb_is_icmp6_unreach(oldskb))
+ return NULL;
+
/* Include "As much of invoking packet as possible without the ICMPv6
* packet exceeding the minimum IPv6 MTU" in the ICMP payload.
*/
--
2.51.0