* [RFC nf 1/2] netfilter: nfnetlink_queue: restrict writes to network header
2026-05-27 12:11 [RFC nf 0/2] netfilter: add restrictions/validations for packet rewrites Florian Westphal
@ 2026-05-27 12:11 ` Florian Westphal
2026-05-27 12:11 ` [RFC nf 2/2] netfilter: nftables: restrict linklayer and network header writes Florian Westphal
2026-06-03 22:33 ` [RFC nf 0/2] netfilter: add restrictions/validations for packet rewrites Pablo Neira Ayuso
2 siblings, 0 replies; 7+ messages in thread
From: Florian Westphal @ 2026-05-27 12:11 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
nfnetlink_queue doesn't allow selective replacements of some part of the
payload, only complete replacement.
If the new data is shorter, skb is trimmed, otherwise expanded.
Add minimal validation of the new ip/ipv6 header.
Check total len matches skb length. Disallow
ip option modifications after prerouting.
IPv6 extension headers are also disabled.
IP options and exthdrs could be allowed later after validation pass or
ip option recompile.
Transport header is not checked.
Bridge gets no validation other than size isn't below ethernet header
size.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nfnetlink_queue.c | 103 +++++++++++++++++++++++++++++++-
1 file changed, 102 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 60ab88d45096..ae4257384e8e 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -1136,14 +1136,115 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
return err;
}
+static bool nfqnl_validate_ipopts(const struct iphdr *iph_new,
+ const struct nf_queue_entry *e)
+{
+ const struct iphdr *iph_orig = ip_hdr(e->skb);
+ unsigned int ihl = iph_new->ihl * 4;
+
+ if (iph_new->ihl != iph_orig->ihl)
+ return false;
+ if (ihl == sizeof(*iph_orig))
+ return true;
+
+ return memcmp(iph_new + 1, ip_hdr(e->skb) + 1, ihl - sizeof(*iph_orig)) == 0;
+}
+
+static bool nfqnl_validate_ip4(const struct iphdr *iph, unsigned int data_len,
+ const struct nf_queue_entry *e)
+{
+ unsigned int ihl;
+
+ if (data_len < sizeof(*iph))
+ return false;
+
+ ihl = iph->ihl * 4u;
+ if (ihl < sizeof(*iph) || data_len < ihl)
+ return false;
+
+ if (iph->version != 4 ||
+ iph->frag_off != ip_hdr(e->skb)->frag_off ||
+ ntohs(iph->tot_len) != data_len)
+ return false;
+
+ /* ip parses (validates) options after PRE_ROUTING hook.
+ * Do not allow later ip option modifications.
+ */
+ if (e->state.hook != NF_INET_PRE_ROUTING &&
+ !nfqnl_validate_ipopts(iph, e))
+ return false;
+
+ return true;
+}
+
+static bool nfqnl_validate_ip6(const struct ipv6hdr *ip6, unsigned int data_len)
+{
+ if (data_len < sizeof(*ip6))
+ return false;
+
+ /* netlink attribute size is limited to 2**16 - sizeof netlink header,
+ * so we cannot support jumbograms.
+ */
+ if (ntohs(ip6->payload_len) != data_len - sizeof(*ip6))
+ return false;
+
+ if (ip6->version != 6 ||
+ ipv6_ext_hdr(ip6->nexthdr))
+ return false;
+
+ return true;
+}
+
+static bool nfqnl_validate_arp(unsigned int data_len)
+{
+ const unsigned int minsz = 8 + 2 * ETH_ALEN + 2 * sizeof(__be32);
+
+ /* don't allow truncation below min size */
+ return data_len >= minsz;
+}
+
+static bool nfqnl_validate_br(const struct ethhdr *eth, unsigned int data_len)
+{
+#ifdef CONFIG_NETFILTER_FAMILY_BRIDGE
+ /* disallow truncation below ethernet header size */
+ if (data_len < sizeof(*eth))
+ return false;
+
+ return true;
+#else
+ return false;
+#endif
+}
+
+static bool nfqnl_validate_write(const void *data, unsigned int data_len,
+ const struct nf_queue_entry *e)
+{
+ switch (e->state.pf) {
+ case NFPROTO_ARP:
+ return nfqnl_validate_arp(data_len);
+ case NFPROTO_IPV4:
+ return nfqnl_validate_ip4(data, data_len, e);
+ case NFPROTO_IPV6:
+ return nfqnl_validate_ip6(data, data_len) &&
+ !(IP6CB(e->skb)->flags & IP6SKB_JUMBOGRAM);
+ case NFPROTO_BRIDGE:
+ return nfqnl_validate_br(data, data_len);
+ }
+
+ return false;
+}
+
static int
-nfqnl_mangle(void *data, unsigned int data_len, struct nf_queue_entry *e, int diff)
+nfqnl_mangle(const void *data, unsigned int data_len, struct nf_queue_entry *e, int diff)
{
struct sk_buff *nskb;
if (e->state.net->user_ns != &init_user_ns)
return -EPERM;
+ if (!nfqnl_validate_write(data, data_len, e))
+ return -EINVAL;
+
if (diff < 0) {
unsigned int min_len = skb_transport_offset(e->skb);
--
2.53.0
^ permalink raw reply related [flat|nested] 7+ messages in thread* [RFC nf 2/2] netfilter: nftables: restrict linklayer and network header writes
2026-05-27 12:11 [RFC nf 0/2] netfilter: add restrictions/validations for packet rewrites Florian Westphal
2026-05-27 12:11 ` [RFC nf 1/2] netfilter: nfnetlink_queue: restrict writes to network header Florian Westphal
@ 2026-05-27 12:11 ` Florian Westphal
2026-06-03 22:33 ` [RFC nf 0/2] netfilter: add restrictions/validations for packet rewrites Pablo Neira Ayuso
2 siblings, 0 replies; 7+ messages in thread
From: Florian Westphal @ 2026-05-27 12:11 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
Don't permit arbitrary writes to linklayer and network header data.
Several spots in network stack trust header validation performed in
ipv4/ipv6 before PRE_ROUTING hook.
For linklayer, allow writes for netdev ingress. For other hooks, only
allow link layer writes that do not spill into network header.
For network header, check the offset/length combinations:
- changing dscp requires store at offset 0 for checsum fixups, so
make sure ip version + length field isn't altered.
- ip6 dscp starts directly after the version field, so make sure it
remains 6.
Several of these checks could already be done at rule insertion time.
Risk is that this might cause ruleset load failures for existing
rulesets. With this change such writes are silently skipped and packet
passes unchanged.
Transport and inner header bases are not checked / restricted.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/netfilter/nft_payload.c | 166 ++++++++++++++++++++++++++++++++++++
1 file changed, 166 insertions(+)
diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c
index 484a5490832e..0afe8a7ac5ed 100644
--- a/net/netfilter/nft_payload.c
+++ b/net/netfilter/nft_payload.c
@@ -824,6 +824,168 @@ nft_payload_set_vlan(const u32 *src, struct sk_buff *skb, u16 offset, u8 len,
return true;
}
+/* Ingress is very early, before l3 protocol handlers.
+ * There should be no in-tree code that trusts l3/l4 headers
+ * between ingress and NF_INET_PRE_ROUTING hooks.
+ */
+static bool nft_in_ingress(const struct nf_hook_state *s)
+{
+ return s->pf == NFPROTO_NETDEV && s->hook == NF_NETDEV_INGRESS;
+}
+
+static bool nft_nh_write_ok_ip4(const struct nft_pktinfo *pkt,
+ const struct nft_payload_set *priv,
+ const u32 *src)
+{
+ unsigned int offset = priv->offset + skb_network_offset(pkt->skb);
+ const u8 *new_octets = (const u8 *)src;
+ u8 old_octet;
+
+ switch (priv->offset) {
+ case 0: /* csum fixups does expand dscp/tos store to 2 bytes.
+ * make sure ihl/version remain unchanged.
+ */
+ if (skb_copy_bits(pkt->skb, offset, &old_octet, sizeof(old_octet)))
+ return false;
+
+ return priv->len == 2 &&
+ *new_octets == old_octet;
+ case offsetof(struct iphdr, tos):
+ return priv->len == 1;
+ case offsetof(struct iphdr, id):
+ return priv->len == 2;
+ case offsetof(struct iphdr, ttl):
+ if (priv->len == 1)
+ return true;
+
+ if (priv->len != 2)
+ return false;
+
+ /* same, csum fixup does expand ttl store to two bytes.
+ * check protocol is not altered.
+ */
+ if (skb_copy_bits(pkt->skb, offset + 1, &old_octet, sizeof(old_octet)))
+ return false;
+
+ return new_octets[0] &&
+ new_octets[1] == old_octet;
+ case offsetof(struct iphdr, check):
+ return priv->len <= 2 + 4 + 4;
+ case offsetof(struct iphdr, saddr):
+ return priv->len <= 4 + 4;
+ case offsetof(struct iphdr, daddr):
+ return priv->len <= 4;
+ }
+
+ return false;
+}
+
+static bool nft_nh_write_ok_ip6(const struct nft_pktinfo *pkt,
+ const struct nft_payload_set *priv,
+ const u32 *src)
+{
+ const struct ipv6hdr *ih = (const void *)src;
+
+ switch (priv->offset) {
+ case 0: /* store to dscp must not alter ip6 version */
+ return priv->len <= 4 && ih->version == 6;
+ case 2:
+ return priv->len <= 2;
+ case offsetof(struct ipv6hdr, hop_limit):
+ return priv->len == 1;
+ case offsetof(struct ipv6hdr, saddr):
+ return priv->len <= 16 + 16;
+ case offsetof(struct ipv6hdr, daddr):
+ return priv->len <= 16;
+ }
+
+ return false;
+}
+
+static bool nft_nh_write_ok_arp(const struct nft_payload_set *priv)
+{
+ /* Variable size for standard ethernet arp */
+ const unsigned int eth_ip = 2 * (ETH_ALEN + 4);
+ unsigned int offset = priv->offset;
+
+ switch (offset) {
+ case offsetof(struct arphdr, ar_op):
+ return priv->len == 2;
+ default:
+ break;
+ }
+
+ /* permit writes post fixed arp header size */
+ return offset >= sizeof(struct arphdr) && priv->len <= eth_ip;
+}
+
+static bool nft_nh_write_ok_netdev(const struct nft_pktinfo *pkt,
+ const struct nft_payload_set *priv,
+ const u32 *src)
+{
+#ifdef CONFIG_NF_TABLES_NETDEV
+ switch (pkt->skb->protocol) {
+ case htons(ETH_P_ARP):
+ return nft_nh_write_ok_arp(priv);
+ case htons(ETH_P_IP):
+ return nft_nh_write_ok_ip4(pkt, priv, src);
+ case htons(ETH_P_IPV6):
+ return nft_nh_write_ok_ip6(pkt, priv, src);
+ default:
+ break;
+ }
+#endif
+ return false;
+}
+
+static bool nft_nh_write_ok_bridge(const struct nft_pktinfo *pkt,
+ const struct nft_payload_set *priv,
+ const u32 *src)
+{
+#if IS_ENABLED(CONFIG_NF_TABLES_BRIDGE)
+ switch (pkt->ethertype) {
+ case htons(ETH_P_ARP):
+ return nft_nh_write_ok_arp(priv);
+ case htons(ETH_P_IP):
+ return nft_nh_write_ok_ip4(pkt, priv, src);
+ case htons(ETH_P_IPV6):
+ return nft_nh_write_ok_ip6(pkt, priv, src);
+ }
+#endif
+ return false;
+}
+
+static bool nft_nh_write_ok(const struct nft_pktinfo *pkt,
+ const struct nft_payload_set *priv,
+ const u32 *src)
+{
+ switch (pkt->state->pf) {
+ case NFPROTO_ARP:
+ return nft_nh_write_ok_arp(priv);
+ case NFPROTO_BRIDGE:
+ return nft_nh_write_ok_bridge(pkt, priv, src);
+ case NFPROTO_IPV4:
+ return nft_nh_write_ok_ip4(pkt, priv, src);
+ case NFPROTO_IPV6:
+ return nft_nh_write_ok_ip6(pkt, priv, src);
+ case NFPROTO_NETDEV:
+ if (pkt->state->hook == NF_NETDEV_INGRESS)
+ return true;
+ return nft_nh_write_ok_netdev(pkt, priv, src);
+ }
+
+ return false;
+}
+
+/* check linklayer modifications don't spill into network header. */
+static bool nft_ll_write_ok(const struct nft_pktinfo *pkt, int offset)
+{
+ if (nft_in_ingress(pkt->state))
+ return true;
+
+ return offset <= skb_network_offset(pkt->skb);
+}
+
static void nft_payload_set_eval(const struct nft_expr *expr,
struct nft_regs *regs,
const struct nft_pktinfo *pkt)
@@ -851,8 +1013,12 @@ static void nft_payload_set_eval(const struct nft_expr *expr,
}
offset = skb_mac_header(skb) - skb->data - vlan_hlen;
+ if (!nft_ll_write_ok(pkt, priv->len + priv->offset + offset))
+ goto err;
break;
case NFT_PAYLOAD_NETWORK_HEADER:
+ if (!nft_nh_write_ok(pkt, priv, src))
+ goto err;
offset = skb_network_offset(skb);
break;
case NFT_PAYLOAD_TRANSPORT_HEADER:
--
2.53.0
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [RFC nf 0/2] netfilter: add restrictions/validations for packet rewrites
2026-05-27 12:11 [RFC nf 0/2] netfilter: add restrictions/validations for packet rewrites Florian Westphal
2026-05-27 12:11 ` [RFC nf 1/2] netfilter: nfnetlink_queue: restrict writes to network header Florian Westphal
2026-05-27 12:11 ` [RFC nf 2/2] netfilter: nftables: restrict linklayer and network header writes Florian Westphal
@ 2026-06-03 22:33 ` Pablo Neira Ayuso
2026-06-03 22:42 ` Pablo Neira Ayuso
2 siblings, 1 reply; 7+ messages in thread
From: Pablo Neira Ayuso @ 2026-06-03 22:33 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
Hi Florian,
On Wed, May 27, 2026 at 02:11:42PM +0200, Florian Westphal wrote:
> This is a followup to the recent patch that disabled packet manipulation
> via nfqueue or nft_payload in user namespaces.
>
> This adds additional *restrictions*.
> For nfqueue, do minimal header checks in case userspace provides payload
> replacement data.
>
> For nft_payload, restrict the offset/length combinations.
>
> Several of these checks could be done at rule insertion time (i.e.
> control plane).
> Risk is that this may cause ruleset load failures for existing rulesets.
> With this change such writes are silently skipped and packet passes
> unchanged.
>
> Restriction is added for link and network bases only.
>
> Open questions:
> - target tree: nf or nf-next?
>
> - should there be an immediate followup ('patch 3') that reverts
> the userns restrictions again?
>
> - should nft_payload reject those requests it can validate there from
> the control plane?
Ideally, better from control plane.
If anyone is breaking with ilegals field, they should come here to
explain? Data plane validation might look safer ... but it will just
drop packets and it will take a bit more time to the user to debug.
But your approach is more conservative, it just leave the packet
untouch, so it is basically ignoring the invalid mangling.
Your call.
Maybe I would like to see the commit description a bit more expanded
to describe what it is intentionally allowed on top of what you
already describe.
> I would propose to target nf-next for now and leave the userns
> restrictions in place to see what relevant use-cases exist.
OK, let's do that.
Thanks.
> Florian Westphal (2):
> netfilter: nfnetlink_queue: restrict writes to network header
> netfilter: nftables: restrict linklayer and network header writes
>
> net/netfilter/nfnetlink_queue.c | 103 +++++++++++++++++++-
> net/netfilter/nft_payload.c | 166 ++++++++++++++++++++++++++++++++
> 2 files changed, 268 insertions(+), 1 deletion(-)
>
> --
> 2.53.0
>
>
^ permalink raw reply [flat|nested] 7+ messages in thread