public inbox for netdev@vger.kernel.org
* [PATCH net-next,RFC 0/8] netfilter: flowtable bulking
@ 2026-03-17 11:29 Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 1/8] netfilter: flowtable: Add basic bulking infrastructure for early ingress hook Pablo Neira Ayuso
                   ` (9 more replies)
  0 siblings, 10 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Hi,
 
Back in 2018 [1], a new fast forwarding path combining the flowtable and
GRO/GSO was proposed. However, "GRO is specialized to optimize the
non-forwarding case", so it was considered "counter-intuitive to base a
fast forwarding path on top of it".
 
Then, Steffen Klassert proposed the idea of adding a new engine for the
flowtable that operates on the skb list provided after the NAPI cycle.
The idea is to process this skb list to create bulks grouped by
ethertype, output device, next hop and tos/dscp, then add a specialized
xmit path that can deal with these skb bulks. Note that GRO needs to be
disabled so that this new forwarding engine obtains the list of skbs
that results from the NAPI cycle.
 
Before grouping skbs into bulks, a flowtable lookup checks whether this
flow is already in the flowtable; otherwise, the packet follows the
slow path. If the flowtable lookup returns an entry, the packet follows
the fast path: the TTL is decremented, and the corresponding NAT
mangling and layer 2/3 tunnel encapsulation (layer 2: vlan/pppoe,
layer 3: ipip) are performed on the packet.
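The TTL handling on the fast path can be illustrated standalone: decrement the TTL and patch the IPv4 header checksum incrementally, in the spirit of the kernel's ip_decrease_ttl(), instead of recomputing it. The raw-byte offsets follow RFC 791; the helper names here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement checksum over a header, used here only to verify
 * the incremental update below. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i += 2)
		sum += (uint32_t)hdr[i] << 8 | hdr[i + 1];
	while (sum >> 16)
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Decrement the TTL of a raw IPv4 header (network byte order) and
 * patch the checksum incrementally: the TTL is the high byte of the
 * ttl/protocol 16-bit word, so ttl-- means adding 0x0100 to the
 * checksum, with end-around carry. Assumes the caller already
 * verified ttl > 1. */
static void ip_decrease_ttl_raw(uint8_t *hdr)
{
	uint32_t check = (uint32_t)hdr[10] << 8 | hdr[11]; /* csum at offset 10 */

	check += 0x0100;
	check += check >> 16;		/* fold the end-around carry */
	hdr[10] = (check >> 8) & 0xFF;
	hdr[11] = check & 0xFF;
	hdr[8]--;			/* ttl at offset 8 */
}
```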
 
The fast forwarding path is enabled through explicit user policy, so the
user needs to request this behaviour from the control plane. The
following example shows how to place flows in the new fast forwarding
path from the forward chain:

 table x {
        flowtable f {
                hook early_ingress priority 0; devices = { eth0, eth1 }
        }
 
        chain y {
                type filter hook forward priority 0;
                ip protocol tcp flow offload @f counter
        }
 }
 
 
The example above sets up a fastpath for TCP flows that are placed in
the flowtable 'f', which is hooked at the new early_ingress hook. The
initial TCP packets that match this rule from the standard forwarding
path create an entry in the flowtable.
 
Note that tcpdump only shows these packets in the tx path, since the
new early_ingress hook runs before the ingress tap.

The patch series contains 8 patches:

- #1 and #2 add the basic RX flowtable bulking infrastructure for
  IPv4 and IPv6.
- #3 adds the early_ingress netfilter hook.
- #4 adds a helper function to prepare the netfilter chain for
  the early_ingress hook.
- #5 adds the early_ingress filter chain.
- #6 and #7 add helper functions to reuse TX path codebase.
- #8 adds the custom TX path for listified skbs and updates
  the flowtable bulking to use it.

= Benchmark numbers =

The testbed consists of 4 hosts in the following topology:
 
 | sunset |-----| west |====| east |----| sunrise |
 
And this hardware:
 
* Supermicro H13SSW Motherboard
* AMD EPYC 9135 16-Core Processor (a.k.a. Bergamo, or Zen 5)
* NIC: Mellanox MT28800 ConnectX-5 Ex (100 Gbps NIC)
* NIC: Broadcom BCM57508 NetXtreme-E (only on sunrise, 100 Gbps NIC)
 
With 128-byte packets:
 
* From ~2 Mpps (baseline) to ~4 Mpps with 1 flow.
* From ~10.6 Mpps (baseline) to ~15.7 Mpps with 10 flows.
 
Antony Antony collected performance numbers and wrote a report
describing the benchmarking [2]. The report also includes numbers for
the IPsec support, which is not part of this series.

Comments welcome, thanks.

Pablo Neira Ayuso (8):
  netfilter: flowtable: Add basic bulking infrastructure for early ingress hook
  netfilter: flowtable: Add IPv6 bulking infrastructure for early ingress hook
  netfilter: nf_tables: add flowtable early_ingress support
  netfilter: nf_tables: add nft_set_pktinfo_ingress()
  netfilter: nf_tables: add early ingress chain
  net: add dev_dst_drop() helper function
  net: add dev_noqueue_xmit_list() helper function
  net: add dev_queue_xmit_list() and use it

 include/linux/netdevice.h             |   2 +
 include/net/netfilter/nf_flow_table.h |  13 +-
 net/core/dev.c                        | 297 ++++++++++++++++----
 net/netfilter/nf_flow_table_inet.c    |  81 ++++++
 net/netfilter/nf_flow_table_ip.c      | 384 ++++++++++++++++++++++++++
 net/netfilter/nf_tables_api.c         |  12 +-
 net/netfilter/nft_chain_filter.c      | 164 +++++++++--
 7 files changed, 872 insertions(+), 81 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH net-next,RFC 1/8] netfilter: flowtable: Add basic bulking infrastructure for early ingress hook
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
@ 2026-03-17 11:29 ` Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 2/8] netfilter: flowtable: Add IPv6 " Pablo Neira Ayuso
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Add support for registering an early_ingress hook for the flowtable to
deal with the skb list. Initially, split this list into bulks according
to ethertype, output device, next hop and tos.

Then, send each skb bulk through the neighbour layer. The xmit path is
not yet listified, i.e. the bulk is split into individual skbs that are
sent down the xmit path, one by one, at this stage.

This only implements the flowtable RX bulking. The TX side comes as a
follow up patch in this series.
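The "split the bulk into individual skbs" step can be sketched with a plain singly linked list; this standalone mock (node type and xmit function are hypothetical stand-ins, not kernel API) mirrors the unlink-then-transmit loop this patch uses before the TX path is listified:

```c
#include <assert.h>
#include <stddef.h>

/* Standalone mock of an skb chained via ->next; in the patch the bulk
 * head keeps its followers linked through skb->next before xmit. */
struct mock_skb {
	struct mock_skb *next;
	int id;
};

static int xmit_count;	/* stands in for dev_queue_xmit() side effects */

static void mock_xmit(struct mock_skb *skb)
{
	(void)skb;
	xmit_count++;
}

/* Detach each skb from the chain and transmit it individually, as the
 * not-yet-listified xmit path described above does. */
static void xmit_chain_one_by_one(struct mock_skb *skb)
{
	while (skb) {
		struct mock_skb *next = skb->next;

		skb->next = NULL;	/* unlink before handing it down */
		mock_xmit(skb);
		skb = next;
	}
}
```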

Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_flow_table.h |  11 +-
 net/netfilter/nf_flow_table_inet.c    |  79 ++++++++++
 net/netfilter/nf_flow_table_ip.c      | 209 ++++++++++++++++++++++++++
 3 files changed, 298 insertions(+), 1 deletion(-)

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index b09c11c048d5..ee98da9edc1b 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -18,6 +18,13 @@ struct nf_flow_rule;
 struct flow_offload;
 enum flow_offload_tuple_dir;
 
+struct nft_bulk_cb {
+	struct sk_buff *last;
+	struct flow_offload_tuple_rhash *tuplehash;
+};
+
+#define NFT_BULK_CB(skb) ((struct nft_bulk_cb *)(skb)->cb)
+
 struct nf_flow_key {
 	struct flow_dissector_key_meta			meta;
 	struct flow_dissector_key_control		control;
@@ -65,6 +72,7 @@ struct nf_flowtable_type {
 	void				(*get)(struct nf_flowtable *ft);
 	void				(*put)(struct nf_flowtable *ft);
 	nf_hookfn			*hook;
+	nf_hookfn			*hook_list;
 	struct module			*owner;
 };
 
@@ -77,7 +85,6 @@ struct nf_flowtable {
 	unsigned int			flags;		/* readonly in datapath */
 	int				priority;	/* control path (padding hole) */
 	struct rhashtable		rhashtable;	/* datapath, read-mostly members come first */
-
 	struct list_head		list;		/* slowpath parts */
 	const struct nf_flowtable_type	*type;
 	struct delayed_work		gc_work;
@@ -339,6 +346,8 @@ unsigned int nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 				     const struct nf_hook_state *state);
 unsigned int nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 				       const struct nf_hook_state *state);
+void __nf_flow_offload_ip_hook_list(void *priv, struct list_head *head,
+				    const struct nf_hook_state *state);
 
 #if (IS_BUILTIN(CONFIG_NF_FLOW_TABLE) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) || \
     (IS_MODULE(CONFIG_NF_FLOW_TABLE) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES))
diff --git a/net/netfilter/nf_flow_table_inet.c b/net/netfilter/nf_flow_table_inet.c
index b0f199171932..d0e7860c9d08 100644
--- a/net/netfilter/nf_flow_table_inet.c
+++ b/net/netfilter/nf_flow_table_inet.c
@@ -42,6 +42,82 @@ nf_flow_offload_inet_hook(void *priv, struct sk_buff *skb,
 	return NF_ACCEPT;
 }
 
+static unsigned int
+__nf_flow_offload_hook_list(void *priv, struct sk_buff *unused,
+			    const struct nf_hook_state *state, u32 flags)
+{
+	struct list_head *skb_list = state->skb_list;
+	struct sk_buff *skb, *next;
+	struct vlan_ethhdr *veth;
+	LIST_HEAD(skb_ipv4_list);
+	LIST_HEAD(skb_ipv6_list);
+	__be16 proto;
+
+	list_for_each_entry_safe(skb, next, skb_list, list) {
+		skb_reset_network_header(skb);
+		if (!skb_transport_header_was_set(skb))
+			skb_reset_transport_header(skb);
+		skb_reset_mac_len(skb);
+
+		switch (skb->protocol) {
+		case htons(ETH_P_8021Q):
+			veth = (struct vlan_ethhdr *)skb_mac_header(skb);
+			proto = veth->h_vlan_encapsulated_proto;
+			break;
+		case htons(ETH_P_PPP_SES):
+			nf_flow_pppoe_proto(skb, &proto);
+			break;
+		default:
+			proto = skb->protocol;
+			break;
+		}
+
+		switch (proto) {
+		case htons(ETH_P_IP):
+			list_move_tail(&skb->list, &skb_ipv4_list);
+			break;
+		case htons(ETH_P_IPV6):
+			list_move_tail(&skb->list, &skb_ipv6_list);
+			break;
+		}
+	}
+
+	if (flags & (1 << NFPROTO_IPV4) && !list_empty(&skb_ipv4_list))
+		__nf_flow_offload_ip_hook_list(priv, &skb_ipv4_list, state);
+
+	list_splice_tail(&skb_ipv4_list, skb_list);
+	list_splice_tail(&skb_ipv6_list, skb_list);
+
+	if (!list_empty(skb_list))
+		return NF_ACCEPT;
+
+	return NF_STOLEN;
+}
+
+static unsigned int
+nf_flow_offload_ip_hook_list(void *priv, struct sk_buff *unused,
+			     const struct nf_hook_state *state)
+{
+	return __nf_flow_offload_hook_list(priv, unused, state,
+					   1 << NFPROTO_IPV4);
+}
+
+static unsigned int
+nf_flow_offload_ipv6_hook_list(void *priv, struct sk_buff *unused,
+				 const struct nf_hook_state *state)
+{
+	return __nf_flow_offload_hook_list(priv, unused, state,
+					   1 << NFPROTO_IPV6);
+}
+
+static unsigned int
+nf_flow_offload_inet_hook_list(void *priv, struct sk_buff *unused,
+			       const struct nf_hook_state *state)
+{
+	return __nf_flow_offload_hook_list(priv, unused, state,
+					   (1 << NFPROTO_IPV4) | (1 << NFPROTO_IPV6));
+}
+
 static int nf_flow_rule_route_inet(struct net *net,
 				   struct flow_offload *flow,
 				   enum flow_offload_tuple_dir dir,
@@ -72,6 +148,7 @@ static struct nf_flowtable_type flowtable_inet = {
 	.action		= nf_flow_rule_route_inet,
 	.free		= nf_flow_table_free,
 	.hook		= nf_flow_offload_inet_hook,
+	.hook_list	= nf_flow_offload_inet_hook_list,
 	.owner		= THIS_MODULE,
 };
 
@@ -82,6 +159,7 @@ static struct nf_flowtable_type flowtable_ipv4 = {
 	.action		= nf_flow_rule_route_ipv4,
 	.free		= nf_flow_table_free,
 	.hook		= nf_flow_offload_ip_hook,
+	.hook_list	= nf_flow_offload_ip_hook_list,
 	.owner		= THIS_MODULE,
 };
 
@@ -92,6 +170,7 @@ static struct nf_flowtable_type flowtable_ipv6 = {
 	.action		= nf_flow_rule_route_ipv6,
 	.free		= nf_flow_table_free,
 	.hook		= nf_flow_offload_ipv6_hook,
+	.hook_list	= nf_flow_offload_ipv6_hook_list,
 	.owner		= THIS_MODULE,
 };
 
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 3fdb10d9bf7f..41f4768ce715 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -752,6 +752,215 @@ static int nf_flow_encap_push(struct sk_buff *skb,
 	return 0;
 }
 
+static void nft_flow_v4_push_hdrs_list(struct net *net, struct sk_buff *first,
+				       struct flow_offload_tuple *other_tuple,
+				       __be32 *ip_daddr)
+{
+	struct sk_buff *skb, *nskb;
+
+	skb_list_walk_safe(first, skb, nskb) {
+		if (nf_flow_tunnel_v4_push(net, skb, other_tuple, ip_daddr) < 0) {
+			skb_mark_not_on_list(skb);
+			kfree_skb(skb);
+			continue;
+		}
+		if (nf_flow_encap_push(skb, other_tuple) < 0) {
+			skb_mark_not_on_list(skb);
+			kfree_skb(skb);
+			continue;
+		}
+	}
+}
+
+static void nft_bulk_receive(struct list_head *head, struct sk_buff *skb)
+{
+	const struct iphdr *iph;
+	struct dst_entry *dst;
+	struct xfrm_state *x;
+	struct sk_buff *p;
+	struct rtable *rt;
+	__be32 daddr;
+	int proto;
+	__u8 tos;
+
+	iph = ip_hdr(skb);
+	dst = skb_dst(skb);
+	BUG_ON(!dst);
+
+	rt = (struct rtable *)dst;
+	daddr = rt_nexthop(rt, iph->daddr);
+	x = dst_xfrm(dst);
+	proto = iph->protocol;
+	tos = iph->tos;
+
+	list_for_each_entry(p, head, list) {
+		struct dst_entry *dst2;
+		struct rtable *rt2;
+		struct iphdr *iph2;
+		__be32 daddr2;
+
+		if (p->protocol != htons(ETH_P_IP))
+			continue;
+
+		dst2 = skb_dst(p);
+		rt2 = (struct rtable *)dst2;
+		if (dst->dev != dst2->dev)
+			continue;
+
+		iph2 = ip_hdr(p);
+		daddr2 = rt_nexthop(rt2, iph2->daddr);
+		if (daddr != daddr2)
+			continue;
+
+		if (tos != iph2->tos)
+			continue;
+
+		if (x != dst_xfrm(dst2))
+			continue;
+
+		goto found;
+	}
+
+	goto out;
+
+found:
+	if (NFT_BULK_CB(p)->last == p)
+		skb_shinfo(p)->frag_list = skb;
+	else
+		NFT_BULK_CB(p)->last->next = skb;
+
+	NFT_BULK_CB(p)->last = skb;
+
+	return;
+out:
+	/* First skb */
+	NFT_BULK_CB(skb)->last = skb;
+	list_add_tail(&skb->list, head);
+	skb->priority = rt_tos2priority(iph->tos);
+
+	return;
+}
+
+static void nf_flow_neigh_xmit_list(struct sk_buff *skb, struct net_device *outdev, const void *daddr)
+{
+	struct sk_buff *iter = skb->next;
+	int hlen;
+
+	skb->dev = outdev;
+	hlen = dev_hard_header(skb, outdev, ntohs(skb->protocol), daddr, NULL, skb->len);
+	if (hlen < 0) {
+		kfree_skb_list(skb);
+		return;
+	}
+
+	skb_reset_mac_header(skb);
+
+	while (iter) {
+		iter->dev = outdev;
+		skb_push(iter, hlen);
+		skb_copy_to_linear_data(iter, skb->data, hlen);
+		skb_reset_mac_header(iter);
+		iter = iter->next;
+	}
+
+	iter = skb;
+	while (iter) {
+		struct sk_buff *next;
+
+		next = iter->next;
+		iter->next = NULL;
+		dev_queue_xmit(iter);
+		iter = next;
+	}
+}
+
+void __nf_flow_offload_ip_hook_list(void *priv, struct list_head *head,
+				    const struct nf_hook_state *state)
+{
+	struct flow_offload_tuple_rhash *tuplehash;
+	struct nf_flowtable *flow_table = priv;
+	struct flow_offload_tuple *other_tuple;
+	enum flow_offload_tuple_dir dir;
+	struct nf_flowtable_ctx ctx = {
+		.in	= state->in,
+	};
+	struct flow_offload *flow;
+	struct sk_buff *skb, *n;
+	struct neighbour *neigh;
+	LIST_HEAD(bulk_head);
+	LIST_HEAD(bulk_list);
+	LIST_HEAD(acc_list);
+	struct rtable *rt;
+	__be32 ip_daddr;
+	int ret;
+
+	list_for_each_entry_safe(skb, n, head, list) {
+		skb_list_del_init(skb);
+
+		ctx.hdrsize = 0;
+		ctx.offset = 0;
+
+		tuplehash = nf_flow_offload_lookup(&ctx, flow_table, skb);
+		if (!tuplehash) {
+			list_add_tail(&skb->list, &acc_list);
+			continue;
+		}
+
+		ret = nf_flow_offload_forward(&ctx, flow_table, tuplehash, skb);
+		if (ret < 0) {
+			kfree_skb(skb);
+			continue;
+		} else if (ret == 0) {
+			list_add_tail(&skb->list, &acc_list);
+			continue;
+		}
+
+		skb_dst_set_noref(skb, tuplehash->tuple.dst_cache);
+		memset(skb->cb, 0, sizeof(struct nft_bulk_cb));
+		NFT_BULK_CB(skb)->tuplehash = tuplehash;
+
+		list_add_tail(&skb->list, &bulk_list);
+	}
+
+	list_splice_init(&acc_list, head);
+
+	list_for_each_entry_safe(skb, n, &bulk_list, list) {
+		skb_list_del_init(skb);
+		nft_bulk_receive(&bulk_head, skb);
+	}
+
+	list_for_each_entry_safe(skb, n, &bulk_head, list) {
+
+		list_del_init(&skb->list);
+
+		skb->next = skb_shinfo(skb)->frag_list;
+		skb_shinfo(skb)->frag_list = NULL;
+
+		tuplehash = NFT_BULK_CB(skb)->tuplehash;
+		skb_dst_set_noref(skb, tuplehash->tuple.dst_cache);
+		rt = (struct rtable *)skb_dst(skb);
+
+		dir = tuplehash->tuple.dir;
+		flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
+		other_tuple = &flow->tuplehash[!dir].tuple;
+		ip_daddr = other_tuple->src_v4.s_addr;
+
+		if (other_tuple->tun_num || other_tuple->encap_num)
+			nft_flow_v4_push_hdrs_list(state->net, skb, other_tuple, &ip_daddr);
+
+		neigh = ip_neigh_gw4(rt->dst.dev, rt_nexthop(rt, ip_daddr));
+		if (IS_ERR(neigh)) {
+			kfree_skb_list(skb);
+			continue;
+		}
+
+		nf_flow_neigh_xmit_list(skb, rt->dst.dev, neigh->ha);
+	}
+
+	BUG_ON(!list_empty(&bulk_head));
+}
+EXPORT_SYMBOL_GPL(__nf_flow_offload_ip_hook_list);
+
 unsigned int
 nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 			const struct nf_hook_state *state)
-- 
2.47.3



* [PATCH net-next,RFC 2/8] netfilter: flowtable: Add IPv6 bulking infrastructure for early ingress hook
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 1/8] netfilter: flowtable: Add basic bulking infrastructure for early ingress hook Pablo Neira Ayuso
@ 2026-03-17 11:29 ` Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 3/8] netfilter: nf_tables: add flowtable early_ingress support Pablo Neira Ayuso
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Extend the bulking infrastructure to support IPv6. Split the skb list
into bulks according to ethertype, output device and next hop. Then,
send each bulk through the neighbour layer.

This only implements the flowtable RX bulking. The TX side comes as a
follow up patch in this series.
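The IPv6 grouping key differs from IPv4 in that the next hop is a 128-bit address and there is no tos comparison in this series. A standalone sketch of the next-hop match (the address type here is a hypothetical stand-in; the kernel uses struct in6_addr and ipv6_addr_equal()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit address; the kernel uses struct in6_addr and
 * ipv6_addr_equal() for this comparison. */
struct addr6 {
	uint8_t s6_addr[16];
};

/* IPv6 bulking requires the resolved next hop to match byte-for-byte;
 * note the IPv6 key has no tos component in this series. */
static bool nexthop6_match(const struct addr6 *a, const struct addr6 *b)
{
	return memcmp(a->s6_addr, b->s6_addr, sizeof(a->s6_addr)) == 0;
}
```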

Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_flow_table.h |   2 +
 net/netfilter/nf_flow_table_inet.c    |   2 +
 net/netfilter/nf_flow_table_ip.c      | 173 ++++++++++++++++++++++++++
 3 files changed, 177 insertions(+)

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index ee98da9edc1b..3d41c739f634 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -348,6 +348,8 @@ unsigned int nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 				       const struct nf_hook_state *state);
 void __nf_flow_offload_ip_hook_list(void *priv, struct list_head *head,
 				    const struct nf_hook_state *state);
+void __nf_flow_offload_ipv6_hook_list(void *priv, struct list_head *head,
+				      const struct nf_hook_state *state);
 
 #if (IS_BUILTIN(CONFIG_NF_FLOW_TABLE) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) || \
     (IS_MODULE(CONFIG_NF_FLOW_TABLE) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES))
diff --git a/net/netfilter/nf_flow_table_inet.c b/net/netfilter/nf_flow_table_inet.c
index d0e7860c9d08..6efcb26c4523 100644
--- a/net/netfilter/nf_flow_table_inet.c
+++ b/net/netfilter/nf_flow_table_inet.c
@@ -84,6 +84,8 @@ __nf_flow_offload_hook_list(void *priv, struct sk_buff *unused,
 
 	if (flags & (1 << NFPROTO_IPV4) && !list_empty(&skb_ipv4_list))
 		__nf_flow_offload_ip_hook_list(priv, &skb_ipv4_list, state);
+	if (flags & (1 << NFPROTO_IPV6) && !list_empty(&skb_ipv6_list))
+		__nf_flow_offload_ipv6_hook_list(priv, &skb_ipv6_list, state);
 
 	list_splice_tail(&skb_ipv4_list, skb_list);
 	list_splice_tail(&skb_ipv6_list, skb_list);
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 41f4768ce715..98b5d5e022c8 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -1363,3 +1363,176 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	return nf_flow_queue_xmit(state->net, skb, &xmit);
 }
 EXPORT_SYMBOL_GPL(nf_flow_offload_ipv6_hook);
+
+static void nft_flow_v6_push_hdrs_list(struct net *net, struct sk_buff *first,
+				       struct flow_offload_tuple *other_tuple,
+				       struct in6_addr **ip6_daddr, int encap_limit)
+{
+	struct sk_buff *skb, *nskb;
+
+	skb_list_walk_safe(first, skb, nskb) {
+		if (nf_flow_tunnel_v6_push(net, skb, other_tuple, ip6_daddr, encap_limit) < 0) {
+			skb_mark_not_on_list(skb);
+			kfree_skb(skb);
+			continue;
+		}
+		if (nf_flow_encap_push(skb, other_tuple) < 0) {
+			skb_mark_not_on_list(skb);
+			kfree_skb(skb);
+			continue;
+		}
+	}
+}
+
+static void nft_bulk_ipv6_receive(struct list_head *head, struct sk_buff *skb)
+{
+	const struct in6_addr *daddr;
+	const struct ipv6hdr *ip6h;
+	struct dst_entry *dst;
+	struct xfrm_state *x;
+	struct rt6_info *rt;
+	struct sk_buff *p;
+	int proto;
+
+	ip6h = ipv6_hdr(skb);
+	dst = skb_dst(skb);
+	BUG_ON(!dst);
+
+	rt = (struct rt6_info *)dst;
+	daddr = rt6_nexthop(rt, &ip6h->daddr);
+	x = dst_xfrm(dst);
+	proto = ip6h->nexthdr;
+
+	list_for_each_entry(p, head, list) {
+		const struct in6_addr *daddr2;
+		struct dst_entry *dst2;
+		struct ipv6hdr *ip6h2;
+		struct rt6_info *rt2;
+
+		if (p->protocol != htons(ETH_P_IPV6))
+			continue;
+
+		dst2 = skb_dst(p);
+		rt2 = (struct rt6_info *)dst2;
+		if (dst->dev != dst2->dev)
+			continue;
+
+		ip6h2 = ipv6_hdr(p);
+		daddr2 = rt6_nexthop(rt2, &ip6h2->daddr);
+		if (!ipv6_addr_equal(daddr, daddr2))
+			continue;
+
+		if (x != dst_xfrm(dst2))
+			continue;
+
+		goto found;
+	}
+
+	goto out;
+
+found:
+	if (NFT_BULK_CB(p)->last == p)
+		skb_shinfo(p)->frag_list = skb;
+	else
+		NFT_BULK_CB(p)->last->next = skb;
+
+	NFT_BULK_CB(p)->last = skb;
+
+	return;
+out:
+	/* First skb */
+	NFT_BULK_CB(skb)->last = skb;
+	list_add_tail(&skb->list, head);
+
+	return;
+
+}
+
+void __nf_flow_offload_ipv6_hook_list(void *priv, struct list_head *head,
+				      const struct nf_hook_state *state)
+{
+	struct flow_offload_tuple_rhash *tuplehash;
+	struct nf_flowtable *flow_table = priv;
+	struct flow_offload_tuple *other_tuple;
+	enum flow_offload_tuple_dir dir;
+	struct nf_flowtable_ctx ctx = {
+		.in	= state->in,
+	};
+	struct in6_addr *ip6_daddr;
+	struct flow_offload *flow;
+	struct sk_buff *skb, *n;
+	struct neighbour *neigh;
+	LIST_HEAD(bulk_head);
+	LIST_HEAD(bulk_list);
+	LIST_HEAD(acc_list);
+	struct rt6_info *rt;
+	int ret;
+
+	list_for_each_entry_safe(skb, n, head, list) {
+		skb_list_del_init(skb);
+
+		ctx.hdrsize = 0;
+		ctx.offset = 0;
+
+		tuplehash = nf_flow_offload_ipv6_lookup(&ctx, flow_table, skb);
+		if (!tuplehash) {
+			list_add_tail(&skb->list, &acc_list);
+			continue;
+		}
+
+		ret = nf_flow_offload_ipv6_forward(&ctx, flow_table, tuplehash, skb,
+						   IPV6_DEFAULT_TNL_ENCAP_LIMIT);
+		if (ret < 0) {
+			kfree_skb(skb);
+			continue;
+		} else if (ret == 0) {
+			list_add_tail(&skb->list, &acc_list);
+			continue;
+		}
+
+		skb_dst_set_noref(skb, tuplehash->tuple.dst_cache);
+		memset(skb->cb, 0, sizeof(struct nft_bulk_cb));
+		NFT_BULK_CB(skb)->tuplehash = tuplehash;
+
+		list_add_tail(&skb->list, &bulk_list);
+	}
+
+	list_splice_init(&acc_list, head);
+
+	list_for_each_entry_safe(skb, n, &bulk_list, list) {
+		skb_list_del_init(skb);
+		nft_bulk_ipv6_receive(&bulk_head, skb);
+	}
+
+	list_for_each_entry_safe(skb, n, &bulk_head, list) {
+
+		list_del_init(&skb->list);
+
+		skb->next = skb_shinfo(skb)->frag_list;
+		skb_shinfo(skb)->frag_list = NULL;
+
+		tuplehash = NFT_BULK_CB(skb)->tuplehash;
+		skb_dst_set_noref(skb, tuplehash->tuple.dst_cache);
+		rt = (struct rt6_info *)skb_dst(skb);
+
+		dir = tuplehash->tuple.dir;
+		flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
+		other_tuple = &flow->tuplehash[!dir].tuple;
+		ip6_daddr = &other_tuple->src_v6;
+
+		if (other_tuple->tun_num || other_tuple->encap_num)
+			nft_flow_v6_push_hdrs_list(state->net, skb, other_tuple, &ip6_daddr,
+						   IPV6_DEFAULT_TNL_ENCAP_LIMIT);
+
+		neigh = ip_neigh_gw6(rt->dst.dev, rt6_nexthop(rt, ip6_daddr));
+		if (IS_ERR(neigh)) {
+			kfree_skb_list(skb);
+			continue;
+		}
+
+		nf_flow_neigh_xmit_list(skb, rt->dst.dev, neigh->ha);
+	}
+
+	BUG_ON(!list_empty(&bulk_head));
+}
+EXPORT_SYMBOL_GPL(__nf_flow_offload_ipv6_hook_list);
-- 
2.47.3



* [PATCH net-next,RFC 3/8] netfilter: nf_tables: add flowtable early_ingress support
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 1/8] netfilter: flowtable: Add basic bulking infrastructure for early ingress hook Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 2/8] netfilter: flowtable: Add IPv6 " Pablo Neira Ayuso
@ 2026-03-17 11:29 ` Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 4/8] netfilter: nf_tables: add nft_set_pktinfo_ingress() Pablo Neira Ayuso
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Update the control plane to allow creating a flowtable at the
early_ingress hook.

Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 1ed034a47bd0..66fadf4c6e3e 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -8969,7 +8969,8 @@ static int nft_flowtable_parse_hook(const struct nft_ctx *ctx,
 		}
 
 		hooknum = ntohl(nla_get_be32(tb[NFTA_FLOWTABLE_HOOK_NUM]));
-		if (hooknum != NF_NETDEV_INGRESS)
+		if (hooknum != NF_NETDEV_INGRESS &&
+		    hooknum != NF_NETDEV_EARLY_INGRESS)
 			return -EOPNOTSUPP;
 
 		priority = ntohl(nla_get_be32(tb[NFTA_FLOWTABLE_HOOK_PRIORITY]));
@@ -9008,7 +9009,14 @@ static int nft_flowtable_parse_hook(const struct nft_ctx *ctx,
 			ops->hooknum		= flowtable_hook->num;
 			ops->priority		= flowtable_hook->priority;
 			ops->priv		= &flowtable->data;
-			ops->hook		= flowtable->data.type->hook;
+			switch (ops->hooknum) {
+			case NF_NETDEV_INGRESS:
+				ops->hook	= flowtable->data.type->hook;
+				break;
+			case NF_NETDEV_EARLY_INGRESS:
+				ops->hook	= flowtable->data.type->hook_list;
+				break;
+			}
 			ops->hook_ops_type	= NF_HOOK_OP_NFT_FT;
 		}
 	}
-- 
2.47.3



* [PATCH net-next,RFC 4/8] netfilter: nf_tables: add nft_set_pktinfo_ingress()
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
                   ` (2 preceding siblings ...)
  2026-03-17 11:29 ` [PATCH net-next,RFC 3/8] netfilter: nf_tables: add flowtable early_ingress support Pablo Neira Ayuso
@ 2026-03-17 11:29 ` Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 5/8] netfilter: nf_tables: add early ingress chain Pablo Neira Ayuso
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Add a helper function to prepare for early ingress filtering support.

No functional changes are intended; this is a preparation patch.
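The refactored helper folds the per-protocol setup into a tri-state return value; as a sketch, the caller's mapping back to verdicts (the constant names below are hypothetical stand-ins for the kernel's NF_DROP/NF_ACCEPT) looks like:

```c
#include <assert.h>

/* Hypothetical verdict values standing in for the kernel's NF_DROP
 * and NF_ACCEPT; the real values live in uapi/linux/netfilter.h. */
enum verdict { VERDICT_DROP, VERDICT_ACCEPT, VERDICT_RUN_CHAIN };

/* Map the tri-state result of the refactored pktinfo helper to what
 * the single-skb hook does: -1 -> drop, 1 (non-IP ethertype) ->
 * accept as-is, 0 -> continue into the chain evaluation. */
static enum verdict map_pktinfo_ret(int ret)
{
	switch (ret) {
	case -1:
		return VERDICT_DROP;
	case 1:
		return VERDICT_ACCEPT;
	default:
		return VERDICT_RUN_CHAIN;
	}
}
```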

Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_chain_filter.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/nft_chain_filter.c b/net/netfilter/nft_chain_filter.c
index b16185e9a6dd..47a612bdd03e 100644
--- a/net/netfilter/nft_chain_filter.c
+++ b/net/netfilter/nft_chain_filter.c
@@ -161,32 +161,50 @@ static unsigned int nft_do_chain_inet(void *priv, struct sk_buff *skb,
 	return nft_do_chain(&pkt, priv);
 }
 
-static unsigned int nft_do_chain_inet_ingress(void *priv, struct sk_buff *skb,
-					      const struct nf_hook_state *state)
+static int nft_set_pktinfo_ingress(struct nft_pktinfo *pkt,
+				   struct sk_buff *skb,
+				   struct nf_hook_state *ingress_state)
 {
-	struct nf_hook_state ingress_state = *state;
-	struct nft_pktinfo pkt;
-
 	switch (skb->protocol) {
 	case htons(ETH_P_IP):
 		/* Original hook is NFPROTO_NETDEV and NF_NETDEV_INGRESS. */
-		ingress_state.pf = NFPROTO_IPV4;
-		ingress_state.hook = NF_INET_INGRESS;
-		nft_set_pktinfo(&pkt, skb, &ingress_state);
+		ingress_state->pf = NFPROTO_IPV4;
+		ingress_state->hook = NF_INET_INGRESS;
+		nft_set_pktinfo(pkt, skb, ingress_state);
 
-		if (nft_set_pktinfo_ipv4_ingress(&pkt) < 0)
-			return NF_DROP;
+		if (nft_set_pktinfo_ipv4_ingress(pkt) < 0)
+			return -1;
 		break;
 	case htons(ETH_P_IPV6):
-		ingress_state.pf = NFPROTO_IPV6;
-		ingress_state.hook = NF_INET_INGRESS;
-		nft_set_pktinfo(&pkt, skb, &ingress_state);
+		ingress_state->pf = NFPROTO_IPV6;
+		ingress_state->hook = NF_INET_INGRESS;
+		nft_set_pktinfo(pkt, skb, ingress_state);
 
-		if (nft_set_pktinfo_ipv6_ingress(&pkt) < 0)
-			return NF_DROP;
+		if (nft_set_pktinfo_ipv6_ingress(pkt) < 0)
+			return -1;
 		break;
 	default:
+		return 1;
+	}
+
+	return 0;
+}
+
+static unsigned int nft_do_chain_inet_ingress(void *priv, struct sk_buff *skb,
+					      const struct nf_hook_state *state)
+{
+	struct nf_hook_state ingress_state = *state;
+	struct nft_pktinfo pkt;
+	int ret;
+
+	ret = nft_set_pktinfo_ingress(&pkt, skb, &ingress_state);
+	switch (ret) {
+	case -1:
+		return NF_DROP;
+	case 1:
 		return NF_ACCEPT;
+	default:
+		break;
 	}
 
 	return nft_do_chain(&pkt, priv);
-- 
2.47.3



* [PATCH net-next,RFC 5/8] netfilter: nf_tables: add early ingress chain
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
                   ` (3 preceding siblings ...)
  2026-03-17 11:29 ` [PATCH net-next,RFC 4/8] netfilter: nf_tables: add nft_set_pktinfo_ingress() Pablo Neira Ayuso
@ 2026-03-17 11:29 ` Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 6/8] net: add dev_dst_drop() helper function Pablo Neira Ayuso
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Add a new filter chain to filter packets at the early_ingress hook.

This is the second user of this new hook, after the flowtable.
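The accept/drop bookkeeping of a list-based chain can be sketched in isolation: apply a per-packet verdict, keep the survivors, and report a "stolen" result when nothing is left to pass on. Everything below is a mock stand-in for the patch's list handling, not netfilter API:

```c
#include <assert.h>
#include <stddef.h>

enum { MOCK_ACCEPT, MOCK_DROP };
enum { HOOK_ACCEPT, HOOK_STOLEN };

/* Partition a batch by per-packet verdict: survivors are compacted to
 * the front (the patch splices them back onto state->skb_list);
 * return HOOK_STOLEN when nothing survives, mirroring the chain
 * returning NF_STOLEN for a fully consumed list. */
static int filter_batch(int *verdicts, size_t *len)
{
	size_t kept = 0;

	for (size_t i = 0; i < *len; i++) {
		if (verdicts[i] == MOCK_ACCEPT)
			verdicts[kept++] = verdicts[i];
		/* dropped packets would be kfree_skb()'d here */
	}

	*len = kept;
	return kept ? HOOK_ACCEPT : HOOK_STOLEN;
}
```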

Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_chain_filter.c | 116 ++++++++++++++++++++++++++++++-
 1 file changed, 114 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nft_chain_filter.c b/net/netfilter/nft_chain_filter.c
index 47a612bdd03e..3467f7b7bd38 100644
--- a/net/netfilter/nft_chain_filter.c
+++ b/net/netfilter/nft_chain_filter.c
@@ -210,17 +210,75 @@ static unsigned int nft_do_chain_inet_ingress(void *priv, struct sk_buff *skb,
 	return nft_do_chain(&pkt, priv);
 }
 
+static unsigned int
+nft_do_chain_inet_early_ingress(void *priv, struct sk_buff *unused,
+				const struct nf_hook_state *state)
+{
+	struct nf_hook_state ingress_state = *state;
+	struct sk_buff *skb, *nskb;
+	struct nft_pktinfo pkt;
+	LIST_HEAD(accept_list);
+	int ret;
+
+	list_for_each_entry_safe(skb, nskb, state->skb_list, list) {
+		skb_list_del_init(skb);
+
+		skb_reset_network_header(skb);
+		if (!skb_transport_header_was_set(skb))
+			skb_reset_transport_header(skb);
+		skb_reset_mac_len(skb);
+
+		ret = nft_set_pktinfo_ingress(&pkt, skb, &ingress_state);
+		switch (ret) {
+		case 1:
+			list_add_tail(&skb->list, &accept_list);
+			continue;
+		case 0:
+			break;
+		case -1:
+			kfree_skb(skb);
+			continue;
+		default:
+			break;
+		}
+
+		ret = nft_do_chain(&pkt, priv);
+		switch (ret) {
+		case NF_ACCEPT:
+			list_add_tail(&skb->list, &accept_list);
+			break;
+		default:
+			WARN_ON_ONCE(1);
+			fallthrough;
+		case NF_DROP:
+			kfree_skb(skb);
+			break;
+		}
+	}
+
+	WARN_ON_ONCE(!list_empty(state->skb_list));
+
+	list_splice(&accept_list, state->skb_list);
+
+	if (list_empty(state->skb_list))
+		return NF_STOLEN;
+
+	return NF_ACCEPT;
+}
+
 static const struct nft_chain_type nft_chain_filter_inet = {
 	.name		= "filter",
 	.type		= NFT_CHAIN_T_DEFAULT,
 	.family		= NFPROTO_INET,
-	.hook_mask	= (1 << NF_INET_INGRESS) |
+	.hook_mask	= (1 << NF_INET_EARLY_INGRESS) |
+			  (1 << NF_INET_INGRESS) |
 			  (1 << NF_INET_LOCAL_IN) |
 			  (1 << NF_INET_LOCAL_OUT) |
 			  (1 << NF_INET_FORWARD) |
 			  (1 << NF_INET_PRE_ROUTING) |
 			  (1 << NF_INET_POST_ROUTING),
 	.hooks		= {
+		[NF_INET_EARLY_INGRESS]	= nft_do_chain_inet_early_ingress,
 		[NF_INET_INGRESS]	= nft_do_chain_inet_ingress,
 		[NF_INET_LOCAL_IN]	= nft_do_chain_inet,
 		[NF_INET_LOCAL_OUT]	= nft_do_chain_inet,
@@ -324,15 +382,69 @@ static unsigned int nft_do_chain_netdev(void *priv, struct sk_buff *skb,
 	return nft_do_chain(&pkt, priv);
 }
 
+static unsigned int
+nft_do_chain_netdev_early_ingress(void *priv, struct sk_buff *unused,
+				  const struct nf_hook_state *state)
+{
+	struct nf_hook_state ingress_state = *state;
+	struct sk_buff *skb, *nskb;
+	struct nft_pktinfo pkt;
+	LIST_HEAD(accept_list);
+	int ret;
+
+	list_for_each_entry_safe(skb, nskb, state->skb_list, list) {
+		skb_list_del_init(skb);
+
+		skb_reset_network_header(skb);
+		if (!skb_transport_header_was_set(skb))
+			skb_reset_transport_header(skb);
+		skb_reset_mac_len(skb);
+
+		ret = nft_set_pktinfo_ingress(&pkt, skb, &ingress_state);
+		switch (ret) {
+		case 1:
+		case -1:
+			nft_set_pktinfo(&pkt, skb, &ingress_state);
+			break;
+		default:
+			break;
+		}
+
+		ret = nft_do_chain(&pkt, priv);
+		switch (ret) {
+		case NF_ACCEPT:
+			list_add_tail(&skb->list, &accept_list);
+			break;
+		default:
+			WARN_ON_ONCE(1);
+			fallthrough;
+		case NF_DROP:
+			kfree_skb(skb);
+			break;
+		}
+	}
+
+	WARN_ON_ONCE(!list_empty(state->skb_list));
+
+	list_splice(&accept_list, state->skb_list);
+
+	if (list_empty(state->skb_list))
+		return NF_STOLEN;
+
+	return NF_ACCEPT;
+}
+
 static const struct nft_chain_type nft_chain_filter_netdev = {
 	.name		= "filter",
 	.type		= NFT_CHAIN_T_DEFAULT,
 	.family		= NFPROTO_NETDEV,
 	.hook_mask	= (1 << NF_NETDEV_INGRESS) |
-			  (1 << NF_NETDEV_EGRESS),
+			  (1 << NF_NETDEV_EGRESS) |
+			  (1 << NF_NETDEV_EARLY_INGRESS),
 	.hooks		= {
 		[NF_NETDEV_INGRESS]	= nft_do_chain_netdev,
 		[NF_NETDEV_EGRESS]	= nft_do_chain_netdev,
+		[NF_NETDEV_EARLY_INGRESS] = nft_do_chain_netdev_early_ingress,
 	},
 };
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next,RFC 6/8] net: add dev_dst_drop() helper function
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
                   ` (4 preceding siblings ...)
  2026-03-17 11:29 ` [PATCH net-next,RFC 5/8] netfilter: nf_tables: add early ingress chain Pablo Neira Ayuso
@ 2026-03-17 11:29 ` Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 7/8] net: add dev_noqueue_xmit_list() " Pablo Neira Ayuso
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Prepare to reuse this function from the listified tx path.

No functional changes are intended, this is a preparation patch.

Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/core/dev.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 476ee88440a6..5c339416ae5d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4719,6 +4719,17 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
 	return netdev_get_tx_queue(dev, queue_index);
 }
 
+/* If device/qdisc don't need skb->dst, release it right now while
+ * it's hot in this cpu cache.
+ */
+static inline void dev_dst_drop(const struct net_device *dev, struct sk_buff *skb)
+{
+	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
+		skb_dst_drop(skb);
+	else
+		skb_dst_force(skb);
+}
+
 /**
  * __dev_queue_xmit() - transmit a buffer
  * @skb:	buffer to transmit
@@ -4784,13 +4795,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 			txq = netdev_tx_queue_mapping(dev, skb);
 	}
 #endif
-	/* If device/qdisc don't need skb->dst, release it right now while
-	 * its hot in this cpu cache.
-	 */
-	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
-		skb_dst_drop(skb);
-	else
-		skb_dst_force(skb);
+	dev_dst_drop(dev, skb);
 
 	if (!txq)
 		txq = netdev_core_pick_tx(dev, skb, sb_dev);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next,RFC 7/8] net: add dev_noqueue_xmit_list() helper function
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
                   ` (5 preceding siblings ...)
  2026-03-17 11:29 ` [PATCH net-next,RFC 6/8] net: add dev_dst_drop() helper function Pablo Neira Ayuso
@ 2026-03-17 11:29 ` Pablo Neira Ayuso
  2026-03-17 11:29 ` [PATCH net-next,RFC 8/8] net: add dev_queue_xmit_list() and use it Pablo Neira Ayuso
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

This new helper function wraps the "device has no queue" case. It can be
reused from the listified skb tx path, since the no-queue path already
supports skb lists.

Note that this replaces validate_xmit_skb() with validate_xmit_skb_list()
in the new helper function.

An alternative to this patch, reusing a smaller fraction of the code,
would be to wrap only this common block in a function:

+                       HARD_TX_LOCK(dev, txq, cpu);
+
+                       if (!netif_xmit_stopped(txq)) {
+                               dev_xmit_recursion_inc();
+                               skb = dev_hard_start_xmit(skb, dev, txq, &rc);
+                               dev_xmit_recursion_dec();
+                               if (dev_xmit_complete(rc)) {
+                                       HARD_TX_UNLOCK(dev, txq);
+                                       goto out;
+                               }
+                       }
+                       HARD_TX_UNLOCK(dev, txq);

Note: This comment was in the original patch from Steffen:

       /* FIXME: For each skb!!! */
       dev_core_stats_tx_dropped_inc(dev);
       kfree_skb_list(skb);

That is, the stats currently account only one packet drop even when a
whole list is freed.

Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/core/dev.c | 121 +++++++++++++++++++++++++++----------------------
 1 file changed, 67 insertions(+), 54 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 5c339416ae5d..8f5bef5a715c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4730,6 +4730,68 @@ static inline void dev_dst_drop(const struct net_device *dev, struct sk_buff *sk
 		skb_dst_force(skb);
 }
 
+/* The device has no queue. Common case for software devices:
+ * loopback, all the sorts of tunnels...
+ *
+ * Really, it is unlikely that netif_tx_lock protection is necessary
+ * here.  (f.e. loopback and IP tunnels are clean ignoring statistics
+ * counters.)
+ * However, it is possible, that they rely on protection
+ * made by us here.
+ *
+ * Check this and shot the lock. It is not prone from deadlocks.
+ * Either shot noqueue qdisc, it is even simpler 8)
+ */
+static inline int dev_noqueue_xmit_list(struct sk_buff *skb,
+					struct net_device *dev,
+					struct netdev_queue *txq)
+{
+	bool again = false;
+	int rc = -ENOMEM;
+
+	if (dev->flags & IFF_UP) {
+		int cpu = smp_processor_id(); /* ok because BHs are off */
+
+		/* Other cpus might concurrently change txq->xmit_lock_owner
+		 * to -1 or to their cpu id, but not to our id.
+		 */
+		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
+			if (dev_xmit_recursion())
+				goto recursion_alert;
+
+			skb = validate_xmit_skb_list(skb, dev, &again);
+			if (!skb)
+				goto out;
+
+			HARD_TX_LOCK(dev, txq, cpu);
+
+			if (!netif_xmit_stopped(txq)) {
+				dev_xmit_recursion_inc();
+				skb = dev_hard_start_xmit(skb, dev, txq, &rc);
+				dev_xmit_recursion_dec();
+				if (dev_xmit_complete(rc)) {
+					HARD_TX_UNLOCK(dev, txq);
+					goto out;
+				}
+			}
+			HARD_TX_UNLOCK(dev, txq);
+			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
+					     dev->name);
+		} else {
+			/* Recursion is detected! It is possible,
+			 * unfortunately
+			 */
+recursion_alert:
+			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
+					     dev->name);
+		}
+	}
+
+	rc = -ENETDOWN;
+out:
+	return rc;
+}
+
 /**
  * __dev_queue_xmit() - transmit a buffer
  * @skb:	buffer to transmit
@@ -4757,7 +4819,6 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	struct netdev_queue *txq = NULL;
 	struct Qdisc *q;
 	int rc = -ENOMEM;
-	bool again = false;
 
 	skb_reset_mac_header(skb);
 	skb_assert_len(skb);
@@ -4808,61 +4869,13 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 		goto out;
 	}
 
-	/* The device has no queue. Common case for software devices:
-	 * loopback, all the sorts of tunnels...
-
-	 * Really, it is unlikely that netif_tx_lock protection is necessary
-	 * here.  (f.e. loopback and IP tunnels are clean ignoring statistics
-	 * counters.)
-	 * However, it is possible, that they rely on protection
-	 * made by us here.
-
-	 * Check this and shot the lock. It is not prone from deadlocks.
-	 *Either shot noqueue qdisc, it is even simpler 8)
-	 */
-	if (dev->flags & IFF_UP) {
-		int cpu = smp_processor_id(); /* ok because BHs are off */
-
-		/* Other cpus might concurrently change txq->xmit_lock_owner
-		 * to -1 or to their cpu id, but not to our id.
-		 */
-		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
-			if (dev_xmit_recursion())
-				goto recursion_alert;
-
-			skb = validate_xmit_skb(skb, dev, &again);
-			if (!skb)
-				goto out;
-
-			HARD_TX_LOCK(dev, txq, cpu);
-
-			if (!netif_xmit_stopped(txq)) {
-				dev_xmit_recursion_inc();
-				skb = dev_hard_start_xmit(skb, dev, txq, &rc);
-				dev_xmit_recursion_dec();
-				if (dev_xmit_complete(rc)) {
-					HARD_TX_UNLOCK(dev, txq);
-					goto out;
-				}
-			}
-			HARD_TX_UNLOCK(dev, txq);
-			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
-					     dev->name);
-		} else {
-			/* Recursion is detected! It is possible,
-			 * unfortunately
-			 */
-recursion_alert:
-			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
-					     dev->name);
-		}
-	}
-
-	rc = -ENETDOWN;
+	rc = dev_noqueue_xmit_list(skb, dev, txq);
 	rcu_read_unlock_bh();
 
-	dev_core_stats_tx_dropped_inc(dev);
-	kfree_skb_list(skb);
+	if (rc < 0) {
+		dev_core_stats_tx_dropped_inc(dev);
+		kfree_skb_list(skb);
+	}
 	return rc;
 out:
 	rcu_read_unlock_bh();
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next,RFC 8/8] net: add dev_queue_xmit_list() and use it
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
                   ` (6 preceding siblings ...)
  2026-03-17 11:29 ` [PATCH net-next,RFC 7/8] net: add dev_noqueue_xmit_list() " Pablo Neira Ayuso
@ 2026-03-17 11:29 ` Pablo Neira Ayuso
  2026-03-17 11:39 ` [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
  2026-03-19  6:15 ` Qingfang Deng
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:29 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Add listified skb tx path and use it to implement the flowtable TX
datapath. Use the dev_dst_drop() and dev_noqueue_xmit_list() helper
functions to build dev_queue_xmit_list().

100dfa74cad9 ("net: dev_queue_xmit() llist adoption") requires reversing
the skb list and then splicing it onto the last pending skbuff for
transmission.

A few notes:

- I removed:

       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
               return -1;

This is only possible if skb->sk is set; if my assumption is not correct,
this can be checked from the flowtable path.

Keeping dev_queue_xmit_list() small is convenient, to focus only on what
can really be sped up, so let's return -1 if either:

- the qdisc is not empty

OR

- the qdisc is not work-conserving (TCQ_F_CAN_BYPASS is not set)

Then, the flowtable falls back to calling dev_queue_xmit() for each
single skbuff.

Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netdevice.h        |   2 +
 net/core/dev.c                   | 157 +++++++++++++++++++++++++++++++
 net/netfilter/nf_flow_table_ip.c |  18 ++--
 3 files changed, 169 insertions(+), 8 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c0174aa1037f..34747e9b85d2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3401,6 +3401,8 @@ static inline int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
 	return ret;
 }
 
+int dev_queue_xmit_list(struct sk_buff *skb);
+
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
 void unregister_netdevice_many(struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index 8f5bef5a715c..8f114f5af537 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4920,6 +4920,163 @@ int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
 }
 EXPORT_SYMBOL(__dev_direct_xmit);
 
+static int dev_queue_xmit_skb_list(struct sk_buff *skb, struct Qdisc *q,
+				   struct net_device *dev,
+				   struct netdev_queue *txq)
+{
+	struct sk_buff *next, *to_free = NULL, *to_free2 = NULL;
+	spinlock_t *root_lock = qdisc_lock(q);
+	struct llist_node *ll_list, *first_n;
+	unsigned long defer_count = 0;
+	int rc = -1;
+
+	tcf_set_drop_reason(skb, SKB_DROP_REASON_QDISC_DROP);
+
+	if (q->flags & TCQ_F_NOLOCK) {
+		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
+		    qdisc_run_begin(q)) {
+			/* Retest nolock_qdisc_is_empty() within the protection
+			 * of q->seqlock to protect from racing with requeuing.
+			 */
+			if (unlikely(!nolock_qdisc_is_empty(q))) {
+				to_free2 = qdisc_run_end(q);
+				goto free_skbs;
+			}
+
+			if (sch_direct_xmit(skb, q, dev, txq, NULL, false) &&
+			    !nolock_qdisc_is_empty(q))
+				__qdisc_run(q);
+
+			to_free2 = qdisc_run_end(q);
+			rc = NET_XMIT_SUCCESS;
+			goto free_skbs;
+		}
+	}
+
+	/* Transform skb list to llist in reverse order to splice this batch
+	 * into the defer_list. The next field of the skb chain and the
+	 * llist node share the same memory layout.
+	 */
+	ll_list = llist_reverse_order(&skb->ll_node);
+
+	/* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
+	 * In the try_cmpxchg() loop, we want to increment q->defer_count
+	 * at most once to limit the number of skbs in defer_list.
+	 * We perform the defer_count increment only if the list is not empty,
+	 * because some arches have slow atomic_long_inc_return().
+	 */
+	first_n = READ_ONCE(q->defer_list.first);
+	do {
+		if (first_n && !defer_count) {
+			defer_count = atomic_long_inc_return(&q->defer_count);
+			if (unlikely(defer_count > READ_ONCE(net_hotdata.qdisc_max_burst))) {
+				kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_BURST_DROP);
+				return NET_XMIT_DROP;
+			}
+		}
+		/* Splice using last skb in the reverse list. */
+		skb->ll_node.next = first_n;
+	} while (!try_cmpxchg(&q->defer_list.first, &first_n, ll_list));
+
+	/* If defer_list was not empty, we know the cpu which queued
+	 * the first skb will process the whole list for us.
+	 */
+	if (first_n)
+		return NET_XMIT_SUCCESS;
+
+	spin_lock(root_lock);
+
+	ll_list = llist_del_all(&q->defer_list);
+	/* There is a small race because we clear defer_count not atomically
+	 * with the prior llist_del_all(). This means defer_list could grow
+	 * over qdisc_max_burst.
+	 */
+	atomic_long_set(&q->defer_count, 0);
+
+	ll_list = llist_reverse_order(ll_list);
+
+	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+		llist_for_each_entry_safe(skb, next, ll_list, ll_node)
+			__qdisc_drop(skb, &to_free);
+		rc = NET_XMIT_DROP;
+		goto unlock;
+	}
+
+	if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
+	    !llist_next(ll_list) && qdisc_run_begin(q)) {
+		/*
+		 * This is a work-conserving queue; there are no old skbs
+		 * waiting to be sent out; and the qdisc is not running -
+		 * xmit the skb directly.
+		 */
+		DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
+							  struct sk_buff,
+							  ll_node));
+		qdisc_bstats_update(q, skb);
+		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
+			__qdisc_run(q);
+		to_free2 = qdisc_run_end(q);
+		rc = NET_XMIT_SUCCESS;
+	}
+unlock:
+	spin_unlock(root_lock);
+
+free_skbs:
+	tcf_kfree_skb_list(to_free);
+	tcf_kfree_skb_list(to_free2);
+	return rc;
+}
+
+int dev_queue_xmit_list(struct sk_buff *skb)
+{
+	struct net_device *dev = skb->dev;
+	struct netdev_queue *txq;
+	struct sk_buff *iter;
+	struct Qdisc *q;
+	int rc;
+
+	/* Disable soft irqs for various locks below. Also
+	 * stops preemption for RCU.
+	 */
+	rcu_read_lock_bh();
+
+	/* Intentionally, no egress hooks here. This is called from the ingress
+	 * path, which should have already classified packets before calling
+	 * this function.
+	 */
+
+	txq = netdev_tx_queue_mapping(dev, skb);
+	if (!txq)
+		txq = netdev_core_pick_tx(dev, skb, NULL);
+
+	q = rcu_dereference_bh(txq->qdisc);
+
+	iter = skb;
+	while (iter) {
+		dev_dst_drop(dev, iter);
+		skb_copy_queue_mapping(iter, skb);
+		iter = iter->next;
+	}
+
+	if (q->enqueue) {
+		rc = dev_queue_xmit_skb_list(skb, q, dev, txq);
+		goto out;
+	}
+
+	rc = dev_noqueue_xmit_list(skb, dev, txq);
+	rcu_read_unlock_bh();
+
+	if (rc < 0) {
+		dev_core_stats_tx_dropped_inc(dev);
+		kfree_skb_list(skb);
+	}
+	return rc;
+out:
+	rcu_read_unlock_bh();
+	return rc;
+}
+EXPORT_SYMBOL(dev_queue_xmit_list);
+
 /*************************************************************************
  *			Receiver routines
  *************************************************************************/
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 98b5d5e022c8..3d2d02be0f0d 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -863,14 +863,16 @@ static void nf_flow_neigh_xmit_list(struct sk_buff *skb, struct net_device *outd
 		iter = iter->next;
 	}
 
-	iter = skb;
-	while (iter) {
-		struct sk_buff *next;
-
-		next = iter->next;
-		iter->next = NULL;
-		dev_queue_xmit(iter);
-		iter = next;
+	if (dev_queue_xmit_list(skb) == -1) {
+		iter = skb;
+		while (iter) {
+			struct sk_buff *next;
+
+			next = iter->next;
+			iter->next = NULL;
+			dev_queue_xmit(iter);
+			iter = next;
+		}
 	}
 }
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next,RFC 0/8] netfilter: flowtable bulking
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
                   ` (7 preceding siblings ...)
  2026-03-17 11:29 ` [PATCH net-next,RFC 8/8] net: add dev_queue_xmit_list() and use it Pablo Neira Ayuso
@ 2026-03-17 11:39 ` Pablo Neira Ayuso
  2026-03-19  6:15 ` Qingfang Deng
  9 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2026-03-17 11:39 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony

Missing links:

[1] https://lore.kernel.org/netdev/20180614141947.3580-1-pablo@netfilter.org/
[2] https://linux-ipsec.org/2025-linux-kernel-flowtable-bulk-forwarding-and-xfrm-pcpu-forwarding-testing-results.html

On Tue, Mar 17, 2026 at 12:29:09PM +0100, Pablo Neira Ayuso wrote:
> Hi,
>  
> Back in 2018 [1], a new fast forwarding combining the flowtable and
> GRO/GSO was proposed, however, "GRO is specialized to optimize the
> non-forwarding case", so it was considered "counter-intuitive to base a
> fast forwarding path on top of it".
>  
> Then, Steffen Klassert proposed the idea of adding a new engine for the
> flowtable that operates on the skb list that is provided after the NAPI
> cycle. The idea is to process this skb list to create bulks grouped by
> the ethertype, output device, next hop and tos/dscp. Then, add a
> specialized xmit path that can deal with these skb bulks. Note that GRO
> needs to be disabled so this new forwarding engine obtains the list of
> skbs that resulted from the NAPI cycle.
>  
> Before grouping skbs in bulks, there is a flowtable lookup to check if
> this flow is already in the flowtable, otherwise, the packet follows
> slow path. In case the flowtable lookup returns an entry, then this
> packet follows fast path: the ttl is decremented, the corresponding NAT
> mangling on the packet and layer 2/3 tunnel encapsulation (layer 2:
> vlan/pppoe, layer 3: ipip) are performed.
>  
> The fast forwarding path is enabled through explicit user policy, so the
> user needs to request this behaviour from control plane, the following
> example shows how to place flows in the new fast forwarding path from
> the forward chain:
> 
>  table x {
>         flowtable f {
>                 hook early_ingress priority 0; devices = { eth0, eth1 }
>         }
>  
>         chain y {
>                 type filter hook forward priority 0;
>                 ip protocol tcp flow offload @f counter
>         }
>  }
>  
>  
> The example above sets up a fastpath for TCP flows that are placed in
> the flowtable 'f', this flowtable is hooked at the new early_ingress
> hook.  The initial TCP packets that match this rule from the standard
> forwarding path create an entry in the flowtable.
>  
> Note that tcpdump only shows the packets in the tx path, since this
> new early_ingress hook happens before the ingress tap.
> 
> The patch series contains 8 patches:
> 
> - #1 and #2 adds the basic RX flowtable bulking infrastructure for
>   IPv4 and IPv6.
> - #3 adds the early_ingress netfilter hook.
> - #4 adds a helper function to prepare for the netfilter chain for
>   the early_ingress hook.
> - #5 adds the early_ingress filter chain.
> - #6 and #7 add helper functions to reuse TX path codebase.
> - #8 adds the custom TX path for listified skbs and updates
>   the flowtable bulking to use it.
> 
> = Benchmark numbers =
> 
> Using the following testbed with 4 hosts with this topology:
>  
>  | sunset |-----| west |====| east |----| sunrise |
>  
> And this hardware:
>  
> * Supermicro H13SSW Motherboard
> * AMD EPYC 9135 16-Core Processor (a.k.a. Bergamo, or Zen 5)
> * NIC: Mellanox MT28800 ConnectX-5 Ex (100Gbps NIC)
> * NIC: Broadcom BCM57508 NetXtreme-E (only on sunrise, 100Gbps NIC)
>  
> With 128 byte packets:
>  
> * From ~2 Mpps (baseline) to ~4 Mpps with 1 flow.
> * From ~10.6 Mpps (baseline) to ~15.7 Mpps with 10 flows.
>  
> Antony Antony collected performance numbers and made a report describing
> this the benchmarking[2]. This report includes numbers from the IPsec
> support which is not included in this series.
>
> Comments welcome, thanks.
> 
> Pablo Neira Ayuso (8):
>   netfilter: flowtable: Add basic bulking infrastructure for early ingress hook
>   netfilter: flowtable: Add IPv6 bulking infrastructure for early ingress hook
>   netfilter: nf_tables: add flowtable early_ingress support
>   netfilter: nf_tables: add nft_set_pktinfo_ingress()
>   netfilter: nf_tables: add early ingress chain
>   net: add dev_dst_drop() helper function
>   net: add dev_noqueue_xmit_list() helper function
>   net: add dev_queue_xmit_list() and use it
> 
>  include/linux/netdevice.h             |   2 +
>  include/net/netfilter/nf_flow_table.h |  13 +-
>  net/core/dev.c                        | 297 ++++++++++++++++----
>  net/netfilter/nf_flow_table_inet.c    |  81 ++++++
>  net/netfilter/nf_flow_table_ip.c      | 384 ++++++++++++++++++++++++++
>  net/netfilter/nf_tables_api.c         |  12 +-
>  net/netfilter/nft_chain_filter.c      | 164 +++++++++--
>  7 files changed, 872 insertions(+), 81 deletions(-)
> 
> -- 
> 2.47.3
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next,RFC 0/8] netfilter: flowtable bulking
  2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
                   ` (8 preceding siblings ...)
  2026-03-17 11:39 ` [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
@ 2026-03-19  6:15 ` Qingfang Deng
  2026-03-19 11:28   ` Steffen Klassert
  9 siblings, 1 reply; 16+ messages in thread
From: Qingfang Deng @ 2026-03-19  6:15 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, davem, netdev, kuba, pabeni, edumazet, fw, horms,
	steffen.klassert, antony.antony, Felix Fietkau

Hi Pablo,

On Tue, 17 Mar 2026 12:29:09 +0100, Pablo Neira Ayuso wrote:
> Hi,
>  
> Back in 2018 [1], a new fast forwarding combining the flowtable and
> GRO/GSO was proposed, however, "GRO is specialized to optimize the
> non-forwarding case", so it was considered "counter-intuitive to base a
> fast forwarding path on top of it".
>  
> Then, Steffen Klassert proposed the idea of adding a new engine for the
> flowtable that operates on the skb list that is provided after the NAPI
> cycle. The idea is to process this skb list to create bulks grouped by
> the ethertype, output device, next hop and tos/dscp. Then, add a
> specialized xmit path that can deal with these skb bulks. Note that GRO
> needs to be disabled so this new forwarding engine obtains the list of
> skbs that resulted from the NAPI cycle.

+Cc: Felix Fietkau

How does this compare to fraglist GRO with the original flowtable?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next,RFC 0/8] netfilter: flowtable bulking
  2026-03-19  6:15 ` Qingfang Deng
@ 2026-03-19 11:28   ` Steffen Klassert
  2026-03-19 12:18     ` Felix Fietkau
  0 siblings, 1 reply; 16+ messages in thread
From: Steffen Klassert @ 2026-03-19 11:28 UTC (permalink / raw)
  To: Qingfang Deng
  Cc: Pablo Neira Ayuso, netfilter-devel, davem, netdev, kuba, pabeni,
	edumazet, fw, horms, antony.antony, Felix Fietkau

On Thu, Mar 19, 2026 at 02:15:17PM +0800, Qingfang Deng wrote:
> Hi Pablo,
> 
> On Tue, 17 Mar 2026 12:29:09 +0100, Pablo Neira Ayuso wrote:
> > Hi,
> >  
> > Back in 2018 [1], a new fast forwarding combining the flowtable and
> > GRO/GSO was proposed, however, "GRO is specialized to optimize the
> > non-forwarding case", so it was considered "counter-intuitive to base a
> > fast forwarding path on top of it".
> >  
> > Then, Steffen Klassert proposed the idea of adding a new engine for the
> > flowtable that operates on the skb list that is provided after the NAPI
> > cycle. The idea is to process this skb list to create bulks grouped by
> > the ethertype, output device, next hop and tos/dscp. Then, add a
> > specialized xmit path that can deal with these skb bulks. Note that GRO
> > needs to be disabled so this new forwarding engine obtains the list of
> > skbs that resulted from the NAPI cycle.
> 
> +Cc: Felix Fietkau
> 
> How does this compare to fraglist GRO with the original flowtable?

GRO can only aggregate packets of the same L4 flow. This can
aggregate all packets that are treated the same by the
forwarding path. Packets need to have the same output device
and next hop, but can be from different L3 and L4 flows.

Packet forwarders usually receive many different flows.
GRO might not even kick in if there are not at least
two packets from the same flow on a napi cycle.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next,RFC 0/8] netfilter: flowtable bulking
  2026-03-19 11:28   ` Steffen Klassert
@ 2026-03-19 12:18     ` Felix Fietkau
  2026-03-20  6:49       ` Steffen Klassert
  0 siblings, 1 reply; 16+ messages in thread
From: Felix Fietkau @ 2026-03-19 12:18 UTC (permalink / raw)
  To: Steffen Klassert, Qingfang Deng
  Cc: Pablo Neira Ayuso, netfilter-devel, davem, netdev, kuba, pabeni,
	edumazet, fw, horms, antony.antony

On 19.03.26 12:28, Steffen Klassert wrote:
> On Thu, Mar 19, 2026 at 02:15:17PM +0800, Qingfang Deng wrote:
>> Hi Pablo,
>> 
>> On Tue, 17 Mar 2026 12:29:09 +0100, Pablo Neira Ayuso wrote:
>> > Hi,
>> >  
>> > Back in 2018 [1], a new fast forwarding combining the flowtable and
>> > GRO/GSO was proposed, however, "GRO is specialized to optimize the
>> > non-forwarding case", so it was considered "counter-intuitive to base a
>> > fast forwarding path on top of it".
>> >  
>> > Then, Steffen Klassert proposed the idea of adding a new engine for the
>> > flowtable that operates on the skb list that is provided after the NAPI
>> > cycle. The idea is to process this skb list to create bulks grouped by
>> > the ethertype, output device, next hop and tos/dscp. Then, add a
>> > specialized xmit path that can deal with these skb bulks. Note that GRO
>> > needs to be disabled so this new forwarding engine obtains the list of
>> > skbs that resulted from the NAPI cycle.
>> 
>> +Cc: Felix Fietkau
>> 
>> How does this compare to fraglist GRO with the original flowtable?
> 
> GRO can only aggregate packets of the same L4 flow. This can
> aggregate all packets that are treated the same by the
> forwarding path. Packets need to have the same output device
> and next hop, but can be from different L3 and L4 flows.
> 
> Packet forwarders usually receive many different flows.
> GRO might not even kick in if there are not at least
> two packets from the same flow on a napi cycle.

Interesting approach! Do you think it might be possible to combine this 
with GRO by bulking together GRO-combined frames from different flows?

I think it would be unfortunate if you have to choose between decent 
forwarding throughput and decent local rx throughput.

- Felix

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next,RFC 0/8] netfilter: flowtable bulking
  2026-03-19 12:18     ` Felix Fietkau
@ 2026-03-20  6:49       ` Steffen Klassert
  2026-03-20  8:50         ` Felix Fietkau
  0 siblings, 1 reply; 16+ messages in thread
From: Steffen Klassert @ 2026-03-20  6:49 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: Qingfang Deng, Pablo Neira Ayuso, netfilter-devel, davem, netdev,
	kuba, pabeni, edumazet, fw, horms, antony.antony

On Thu, Mar 19, 2026 at 01:18:19PM +0100, Felix Fietkau wrote:
> On 19.03.26 12:28, Steffen Klassert wrote:
> > On Thu, Mar 19, 2026 at 02:15:17PM +0800, Qingfang Deng wrote:
> > > Hi Pablo,
> > > 
> > > On Tue, 17 Mar 2026 12:29:09 +0100, Pablo Neira Ayuso wrote:
> > > > Hi,
> > > > 
> > > > Back in 2018 [1], a new fast forwarding combining the flowtable and
> > > > GRO/GSO was proposed, however, "GRO is specialized to optimize the
> > > > non-forwarding case", so it was considered "counter-intuitive to base a
> > > > fast forwarding path on top of it".
> > > > 
> > > > Then, Steffen Klassert proposed the idea of adding a new engine for the
> > > > flowtable that operates on the skb list that is provided after the NAPI
> > > > cycle. The idea is to process this skb list to create bulks grouped by
> > > > the ethertype, output device, next hop and tos/dscp. Then, add a
> > > > specialized xmit path that can deal with these skb bulks. Note that GRO
> > > > needs to be disabled so this new forwarding engine obtains the list of
> > > > skbs that resulted from the NAPI cycle.
> > > 
> > > +Cc: Felix Fietkau
> > > 
> > > How does this compare to fraglist GRO with the original flowtable?
> > 
> > GRO can only aggregate packets of the same L4 flow. This can
> > aggregate all packets that are treated the same by the
> > forwarding path. Packets need to have the same output device
> > and next hop, but can be from different L3 and L4 flows.
> > 
> > Packet forwarders usually receive many different flows.
> > GRO might not even kick in if there are not at least
> > two packets from the same flow on a napi cycle.
> 
> Interesting approach! Do you think it might be possible to combine this with
> GRO by bulking together GRO-combined frames from different flows?

This depends on how the GRO packets are crafted. If the packets are
built just by adding skb page frags, then yes. If the fraglist pointer
is used to chain packets, then no (our approach uses the fraglist
pointer as well). So combining these would require some changes to the
GRO layer.

> I think it would be unfortunate if you have to choose between decent
> forwarding throughput and decent local rx throughput.

It would be nice if we could tune for both cases, but sometimes a
decision is needed about which use case to tune for.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next,RFC 0/8] netfilter: flowtable bulking
  2026-03-20  6:49       ` Steffen Klassert
@ 2026-03-20  8:50         ` Felix Fietkau
  2026-03-20  9:00           ` Steffen Klassert
  0 siblings, 1 reply; 16+ messages in thread
From: Felix Fietkau @ 2026-03-20  8:50 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Qingfang Deng, Pablo Neira Ayuso, netfilter-devel, davem, netdev,
	kuba, pabeni, edumazet, fw, horms, antony.antony

On 20.03.26 07:49, Steffen Klassert wrote:
> On Thu, Mar 19, 2026 at 01:18:19PM +0100, Felix Fietkau wrote:
>> On 19.03.26 12:28, Steffen Klassert wrote:
>> > On Thu, Mar 19, 2026 at 02:15:17PM +0800, Qingfang Deng wrote:
>> > > Hi Pablo,
>> > > 
>> > > On Tue, 17 Mar 2026 12:29:09 +0100, Pablo Neira Ayuso wrote:
>> > > > Hi,
>> > > > 
>> > > > Back in 2018 [1], a new fast forwarding combining the flowtable and
>> > > > GRO/GSO was proposed, however, "GRO is specialized to optimize the
>> > > > non-forwarding case", so it was considered "counter-intuitive to base a
>> > > > fast forwarding path on top of it".
>> > > > 
>> > > > Then, Steffen Klassert proposed the idea of adding a new engine for the
>> > > > flowtable that operates on the skb list that is provided after the NAPI
>> > > > cycle. The idea is to process this skb list to create bulks grouped by
>> > > > the ethertype, output device, next hop and tos/dscp. Then, add a
>> > > > specialized xmit path that can deal with these skb bulks. Note that GRO
>> > > > needs to be disabled so this new forwarding engine obtains the list of
>> > > > skbs that resulted from the NAPI cycle.
>> > > 
>> > > +Cc: Felix Fietkau
>> > > 
>> > > How does this compare to fraglist GRO with the original flowtable?
>> > 
>> > GRO can only aggregate packets of the same L4 flow. This can
>> > aggregate all packets that are treated the same by the
>> > forwarding path. Packets need to have the same output device
>> > and next hop, but can be from different L3 and L4 flows.
>> > 
>> > Packet forwarders usually receive many different flows.
>> > GRO might not even kick in if there are not at least
>> > two packets from the same flow on a napi cycle.
>> 
>> Interesting approach! Do you think it might be possible to combine this with
>> GRO by bulking together GRO-combined frames from different flows?
> 
> This depends on how the GRO packets are crafted. If the packets are
> built just by adding skb page frags, then yes. If the fraglist
> pointer is used to chain packets, then no (our approach uses the
> fraglist pointer as well). So combining these would require some
> changes to the GRO layer.

On OpenWrt we use fraglist GRO by default for both TCP and UDP.
Maybe the bulking code could take fraglist GRO packets and simply 
convert them to its internal form of bulking batches.
That way the GRO layer doesn't have to be changed, and it shouldn't add 
too much complexity to the bulking code either.

- Felix

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next,RFC 0/8] netfilter: flowtable bulking
  2026-03-20  8:50         ` Felix Fietkau
@ 2026-03-20  9:00           ` Steffen Klassert
  0 siblings, 0 replies; 16+ messages in thread
From: Steffen Klassert @ 2026-03-20  9:00 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: Qingfang Deng, Pablo Neira Ayuso, netfilter-devel, davem, netdev,
	kuba, pabeni, edumazet, fw, horms, antony.antony

On Fri, Mar 20, 2026 at 09:50:31AM +0100, Felix Fietkau wrote:
> On 20.03.26 07:49, Steffen Klassert wrote:
> > On Thu, Mar 19, 2026 at 01:18:19PM +0100, Felix Fietkau wrote:
> > > On 19.03.26 12:28, Steffen Klassert wrote:
> > > > On Thu, Mar 19, 2026 at 02:15:17PM +0800, Qingfang Deng wrote:
> > > > > Hi Pablo,
> > > > > 
> > > > > On Tue, 17 Mar 2026 12:29:09 +0100, Pablo Neira Ayuso wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > Back in 2018 [1], a new fast forwarding combining the flowtable and
> > > > > > GRO/GSO was proposed, however, "GRO is specialized to optimize the
> > > > > > non-forwarding case", so it was considered "counter-intuitive to base a
> > > > > > fast forwarding path on top of it".
> > > > > > 
> > > > > > Then, Steffen Klassert proposed the idea of adding a new engine for the
> > > > > > flowtable that operates on the skb list that is provided after the NAPI
> > > > > > cycle. The idea is to process this skb list to create bulks grouped by
> > > > > > the ethertype, output device, next hop and tos/dscp. Then, add a
> > > > > > specialized xmit path that can deal with these skb bulks. Note that GRO
> > > > > > needs to be disabled so this new forwarding engine obtains the list of
> > > > > > skbs that resulted from the NAPI cycle.
> > > > > 
> > > > > +Cc: Felix Fietkau
> > > > > 
> > > > > How does this compare to fraglist GRO with the original flowtable?
> > > > 
> > > > GRO can only aggregate packets of the same L4 flow. This can
> > > > aggregate all packets that are treated the same by the
> > > > forwarding path. Packets need to have the same output device
> > > > and next hop, but can be from different L3 and L4 flows.
> > > > 
> > > > Packet forwarders usually receive many different flows.
> > > > GRO might not even kick in if there are not at least
> > > > two packets from the same flow on a napi cycle.
> > > 
> > > Interesting approach! Do you think it might be possible to combine this with
> > > GRO by bulking together GRO-combined frames from different flows?
> > 
> > This depends on how the GRO packets are crafted. If the packets are
> > built just by adding skb page frags, then yes. If the fraglist
> > pointer is used to chain packets, then no (our approach uses the
> > fraglist pointer as well). So combining these would require some
> > changes to the GRO layer.
> 
> On OpenWrt we use fraglist GRO by default for both TCP and UDP.
> Maybe the bulking code could take fraglist GRO packets and simply convert
> them to its internal form of bulking batches.

This is possible, but it would run the bulking logic twice
in the forwarding path.

The problem is standard GRO. It also uses the fraglist pointer
once MAX_SKB_FRAGS is reached. Converting this back is complicated;
that would not be a forwarding fastpath anymore.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-03-20  9:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-17 11:29 [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
2026-03-17 11:29 ` [PATCH net-next,RFC 1/8] netfilter: flowtable: Add basic bulking infrastructure for early ingress hook Pablo Neira Ayuso
2026-03-17 11:29 ` [PATCH net-next,RFC 2/8] netfilter: flowtable: Add IPv6 " Pablo Neira Ayuso
2026-03-17 11:29 ` [PATCH net-next,RFC 3/8] netfilter: nf_tables: add flowtable early_ingress support Pablo Neira Ayuso
2026-03-17 11:29 ` [PATCH net-next,RFC 4/8] netfilter: nf_tables: add nft_set_pktinfo_ingress() Pablo Neira Ayuso
2026-03-17 11:29 ` [PATCH net-next,RFC 5/8] netfilter: nf_tables: add early ingress chain Pablo Neira Ayuso
2026-03-17 11:29 ` [PATCH net-next,RFC 6/8] net: add dev_dst_drop() helper function Pablo Neira Ayuso
2026-03-17 11:29 ` [PATCH net-next,RFC 7/8] net: add dev_noqueue_xmit_list() " Pablo Neira Ayuso
2026-03-17 11:29 ` [PATCH net-next,RFC 8/8] net: add dev_queue_xmit_list() and use it Pablo Neira Ayuso
2026-03-17 11:39 ` [PATCH net-next,RFC 0/8] netfilter: flowtable bulking Pablo Neira Ayuso
2026-03-19  6:15 ` Qingfang Deng
2026-03-19 11:28   ` Steffen Klassert
2026-03-19 12:18     ` Felix Fietkau
2026-03-20  6:49       ` Steffen Klassert
2026-03-20  8:50         ` Felix Fietkau
2026-03-20  9:00           ` Steffen Klassert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox