From: Pablo Neira Ayuso <pablo@netfilter.org>
To: Lorenzo Bianconi <lorenzo@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>,
	David Ahern <dsahern@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Jozsef Kadlecsik <kadlec@netfilter.org>,
	Shuah Khan <shuah@kernel.org>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	Florian Westphal <fw@strlen.de>,
	netdev@vger.kernel.org, netfilter-devel@vger.kernel.org,
	coreteam@netfilter.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH nf-next v6 1/2] net: netfilter: Add IPIP flowtable SW acceleration
Date: Tue, 9 Sep 2025 23:31:08 +0200
Message-ID: <aMCcnO4rJdDIdx3m@calendula>
In-Reply-To: <20250818-nf-flowtable-ipip-v6-1-eda90442739c@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5441 bytes --]

On Mon, Aug 18, 2025 at 11:07:33AM +0200, Lorenzo Bianconi wrote:
> Introduce SW acceleration for IPIP tunnels in the netfilter flowtable
> infrastructure.
> IPIP SW acceleration can be tested running the following scenario where
> the traffic is forwarded between two NICs (eth0 and eth1) and an IPIP
> tunnel is used to access a remote site (using eth1 as the underlay device):
> 
> ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (192.168.100.2)
> 
> $ip addr show
> 6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
>     link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.0.2/24 scope global eth0
>        valid_lft forever preferred_lft forever
> 7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
>     link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.1.1/24 scope global eth1
>        valid_lft forever preferred_lft forever
> 8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
>     link/ipip 192.168.1.1 peer 192.168.1.2
>     inet 192.168.100.1/24 scope global tun0
>        valid_lft forever preferred_lft forever
> 
> $ip route show
> default via 192.168.100.2 dev tun0
> 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.2
> 192.168.1.0/24 dev eth1 proto kernel scope link src 192.168.1.1
> 192.168.100.0/24 dev tun0 proto kernel scope link src 192.168.100.1
> 
> $nft list ruleset
> table inet filter {
>         flowtable ft {
>                 hook ingress priority filter
>                 devices = { eth0, eth1 }
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 meta l4proto { tcp, udp } flow add @ft
>         }
> }
> 
> Reproducing the scenario described above using veths I got the following
> results:
> - TCP stream transmitted into the IPIP tunnel:
>   - net-next:				~41Gbps
>   - net-next + IPIP flowtable support:	~40Gbps

I found this patch in one of my trees (see attachment); it explores
tunnel integration in the tx path. There have been similar patches
floating around the mailing list for layer 2 encapsulation (e.g. PPPoE
and VLAN); IIRC, the PPPoE ones claimed to accelerate tx.

Another aspect of this series is that I think it would be good to
explore integration of other layer 3 tunnel protocols, rather than
following an incremental approach.

More comments below.

> - TCP stream received from the IPIP tunnel:
>   - net-next:				~35Gbps
>   - net-next + IPIP flowtable support:	~49Gbps
> 
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/linux/netdevice.h        |  1 +
>  net/ipv4/ipip.c                  | 28 ++++++++++++++++++++
>  net/netfilter/nf_flow_table_ip.c | 56 ++++++++++++++++++++++++++++++++++++++--
>  net/netfilter/nft_flow_offload.c |  1 +
>  4 files changed, 84 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f3a3b761abfb1b883a970b04634c1ef3e7ee5407..0527a4e3d1fd512b564e47311f6ce3957b66298f 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -874,6 +874,7 @@ enum net_device_path_type {
>  	DEV_PATH_PPPOE,
>  	DEV_PATH_DSA,
>  	DEV_PATH_MTK_WDMA,
> +	DEV_PATH_IPENCAP,
>  };
>  
>  struct net_device_path {
> diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
> index 3e03af073a1ccc3d7597a998a515b6cfdded40b5..b7a3311bd061c341987380b5872caa8990d02e63 100644
> --- a/net/ipv4/ipip.c
> +++ b/net/ipv4/ipip.c
> @@ -353,6 +353,33 @@ ipip_tunnel_ctl(struct net_device *dev, struct ip_tunnel_parm_kern *p, int cmd)
>  	return ip_tunnel_ctl(dev, p, cmd);
>  }
>  
> +static int ipip_fill_forward_path(struct net_device_path_ctx *ctx,
> +				  struct net_device_path *path)
> +{
> +	struct ip_tunnel *tunnel = netdev_priv(ctx->dev);
> +	const struct iphdr *tiph = &tunnel->parms.iph;
> +	struct rtable *rt;
> +
> +	rt = ip_route_output(dev_net(ctx->dev), tiph->daddr, 0, 0, 0,
> +			     RT_SCOPE_UNIVERSE);
> +	if (IS_ERR(rt))
> +		return PTR_ERR(rt);
> +
> +	path->type = DEV_PATH_IPENCAP;
> +	path->dev = ctx->dev;
> +	path->encap.proto = htons(ETH_P_IP);
> +	/* Use the hash of outer header IP src and dst addresses as
> +	 * encapsulation ID. This must be kept in sync with
> +	 * nf_flow_tuple_encap().
> +	 */
> +	path->encap.id = __ipv4_addr_hash(tiph->saddr, ntohl(tiph->daddr));

This hash approach sounds reasonable, but I feel a bit uncomfortable
with the idea that the flowtable _entirely_ bypasses the existing
firewall policy while the hash does not provide a perfect match. The
idea is that only the initial packets of a flow go through the policy;
once the flow is added to the flowtable, firewall policy validation is
circumvented.
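
To make the bypass concrete, here is a minimal userspace sketch of the
slow path / fast path split. It is a toy model, not the kernel
implementation: every name in it (toy_flow_key, toy_flowtable,
toy_forward) is hypothetical, and it assumes a flow is offloaded right
after its first accepted packet, which simplifies what the real
flowtable does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model for illustration; none of these names exist
 * in the kernel.
 */
#define TOY_FT_SIZE 8

struct toy_flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t l4proto;
};

struct toy_flowtable {
	struct toy_flow_key keys[TOY_FT_SIZE];
	size_t used;
	unsigned int policy_evals;	/* slow-path policy evaluations */
};

/* Compare field by field so struct padding never takes part in the match. */
static bool toy_key_equal(const struct toy_flow_key *a,
			  const struct toy_flow_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport &&
	       a->l4proto == b->l4proto;
}

/* Forward one packet. policy_accept stands in for the verdict the
 * forward chain would give at this moment. Returns true if the packet
 * is forwarded.
 */
static bool toy_forward(struct toy_flowtable *ft,
			const struct toy_flow_key *key, bool policy_accept)
{
	size_t i;

	assert(ft && key);

	/* Fast path: a flowtable hit forwards without consulting the policy. */
	for (i = 0; i < ft->used; i++)
		if (toy_key_equal(&ft->keys[i], key))
			return true;

	/* Slow path: the first packet of a flow is evaluated by the policy. */
	ft->policy_evals++;
	if (!policy_accept)
		return false;

	/* Accepted: offload the flow so later packets take the fast path. */
	if (ft->used < TOY_FT_SIZE)
		ft->keys[ft->used++] = *key;
	return true;
}
```

In this model, once toy_forward() has accepted a key, calling it again
with the same key and policy_accept == false still forwards the packet
and leaves policy_evals unchanged. That is why the precision of the
lookup key matters: whatever the key fails to distinguish is never
shown to the policy again.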

Achieving a perfect match means more memory consumption, because the
two IP addresses would have to be stored in the tuple, where each encap
entry currently holds only:

        struct {
                u16                     id;
                __be16                  proto;
        } encap[NF_FLOW_TABLE_ENCAP_MAX];

And possibly more information will need to be stored for other
layer 3 tunnel protocols.

While this hash trick looks like an interesting approach, I am
ambivalent.
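
The trade-off can be sketched in userspace C. This is an illustration
under stated assumptions, not kernel code: toy_encap_id() is a
hypothetical 16-bit fold and is NOT the kernel's __ipv4_addr_hash(),
and the two structs are hypothetical stand-ins for a hashed and an
exact-match encap entry. The point holds for any 16-bit ID: with more
than 65536 possible outer address pairs, at least two pairs must share
an ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-bit ID; NOT the kernel's __ipv4_addr_hash(). */
static uint16_t toy_encap_id(uint32_t saddr, uint32_t daddr)
{
	uint32_t h = (saddr ^ daddr) * 0x9E3779B1u;

	return (uint16_t)(h >> 16);
}

/* Hashed entry: same shape as the { id, proto } pair, 4 bytes. */
struct toy_encap_hashed {
	uint16_t id;
	uint16_t proto;
};

/* Exact-match entry (assumed layout): at least 10 bytes for IPv4,
 * typically 12 with padding, and more for IPv6 or L4 ports.
 */
struct toy_encap_exact {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t proto;
};

/* Find two distinct outer destinations that share an ID for a fixed
 * outer source. 65537 candidates map onto 65536 possible IDs, so the
 * pigeonhole principle guarantees a hit before the loop ends.
 */
static bool toy_find_collision(uint32_t saddr, uint32_t *d1, uint32_t *d2)
{
	static uint32_t first_seen[65536];
	static bool seen[65536];
	uint32_t d;

	assert(d1 && d2);
	memset(seen, 0, sizeof(seen));

	for (d = 0; d <= 65536; d++) {
		uint16_t id = toy_encap_id(saddr, d);

		if (seen[id]) {
			*d1 = first_seen[id];
			*d2 = d;
			return true;
		}
		seen[id] = true;
		first_seen[id] = d;
	}
	return false;	/* not reached: see the pigeonhole argument above */
}
```

A lookup keyed on toy_encap_hashed cannot tell the two destinations
returned by toy_find_collision() apart, while toy_encap_exact can, at
the cost of a larger entry per encapsulation level.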

And one nitpick (typo) below...

> +	ctx->dev = rt->dst.dev;
> +	ip_rt_put(rt);
> +
> +	return 0;
> +}
> +

[...]
> +static void nf_flow_ip4_ecanp_pop(struct sk_buff *skb)

                          _encap_pop ?

[-- Attachment #2: ipip-tx.patch --]
[-- Type: text/x-diff, Size: 5865 bytes --]

commit 4c635431740ecaa011c732bce954086266f07218
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Wed Jul 6 12:52:02 2022 +0200

    netfilter: flowtable: tunnel tx support

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index d21da5b57eeb..d4ecb57a8bfc 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -139,6 +139,27 @@ struct flow_offload_tuple {
 		struct {
 			struct dst_entry *dst_cache;
 			u32		dst_cookie;
+			u8		tunnel_num;
+			struct {
+				u8	l3proto;
+				u8	l4proto;
+				u8	tos;
+				u8	ttl;
+				__be16	df;
+
+				union {
+					struct in_addr		src_v4;
+					struct in6_addr		src_v6;
+				};
+				union {
+					struct in_addr		dst_v4;
+					struct in6_addr		dst_v6;
+				};
+				struct {
+					__be16			src_port;
+					__be16			dst_port;
+				};
+			} tunnel;
 		};
 		struct {
 			u32		ifidx;
@@ -223,6 +244,17 @@ struct nf_flow_route {
 			u32			hw_ifindex;
 			u8			h_source[ETH_ALEN];
 			u8			h_dest[ETH_ALEN];
+
+			int			num_tunnels;
+			struct {
+				int		ifindex;
+				u8		l3proto;
+				u8		l4proto;
+				struct {
+					__be32	saddr;
+					__be32	daddr;
+				} ip;
+			} tun;
 		} out;
 		enum flow_offload_xmit_type	xmit_type;
 	} tuple[FLOW_OFFLOAD_DIR_MAX];
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index ab7df5c54eba..9244168c8cc8 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -177,6 +177,24 @@ static int flow_offload_fill_route(struct flow_offload *flow,
 		flow_tuple->tun.inner = flow->inner_tuple;
 	}
 
+	if (route->tuple[dir].out.num_tunnels) {
+		flow_tuple->tunnel_num++;
+
+		switch (route->tuple[dir].out.tun.l3proto) {
+		case NFPROTO_IPV4:
+			flow_tuple->tunnel.src_v4.s_addr = route->tuple[dir].out.tun.ip.saddr;
+			flow_tuple->tunnel.dst_v4.s_addr = route->tuple[dir].out.tun.ip.daddr;
+			break;
+		case NFPROTO_IPV6:
+			break;
+		}
+
+		flow_tuple->tunnel.l3proto = route->tuple[dir].out.tun.l3proto;
+		flow_tuple->tunnel.l4proto = route->tuple[dir].out.tun.l4proto;
+		flow_tuple->tunnel.src_port = 0;
+		flow_tuple->tunnel.dst_port = 0;
+	}
+
 	return 0;
 }
 
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index c1156d4ce865..1b96309210b8 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -349,6 +349,58 @@ static unsigned int nf_flow_queue_xmit(struct net *net, struct sk_buff *skb,
 	return NF_STOLEN;
 }
 
+/* extract from ip_tunnel_xmit(). */
+static unsigned int nf_flow_tunnel_add(struct net *net, struct sk_buff *skb,
+				       struct flow_offload *flow, int dir,
+				       const struct rtable *rt,
+				       struct iphdr *inner_iph)
+{
+	u32 headroom = sizeof(struct iphdr);
+	struct iphdr *iph;
+	u8 tos, ttl;
+	__be16 df;
+
+	if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP4))
+		return -1;
+
+	skb_set_inner_ipproto(skb, IPPROTO_IPIP);
+
+	headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len;
+
+	if (skb_cow_head(skb, headroom))
+		return -1;
+
+	skb_scrub_packet(skb, true);
+	skb_clear_hash_if_not_l4(skb);
+	memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
+
+	/* Push down and install the IP header. */
+	skb_push(skb, sizeof(struct iphdr));
+	skb_reset_network_header(skb);
+
+	df = flow->tuple[dir]->tunnel.df;
+	tos = ip_tunnel_ecn_encap(flow->tuple[dir]->tunnel.tos, inner_iph, skb);
+	ttl = flow->tuple[dir]->tunnel.ttl;
+	if (ttl == 0)
+		ttl = inner_iph->ttl;
+
+	iph = ip_hdr(skb);
+
+	iph->version    =       4;
+	iph->ihl        =       sizeof(struct iphdr) >> 2;
+	iph->frag_off   =       ip_mtu_locked(&rt->dst) ? 0 : df;
+	iph->protocol   =       flow->tuple[dir]->tunnel.l4proto;
+	iph->tos        =       tos;
+	iph->daddr      =       flow->tuple[dir]->tunnel.dst_v4.s_addr;
+	iph->saddr      =	flow->tuple[dir]->tunnel.src_v4.s_addr;
+	iph->ttl        =       ttl;
+	iph->tot_len	=	htons(skb->len);
+	__ip_select_ident(net, iph, skb_shinfo(skb)->gso_segs ?: 1);
+	ip_send_check(iph);
+
+	return 0;
+}
+
 unsigned int
 nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 			const struct nf_hook_state *state)
@@ -430,9 +482,19 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	switch (flow->tuple[dir]->xmit_type) {
 	case FLOW_OFFLOAD_XMIT_NEIGH:
 		rt = (struct rtable *)flow->tuple[dir]->dst_cache;
+		if (flow->tuple[dir]->tunnel_num) {
+			ret = nf_flow_tunnel_add(state->net, skb, flow, dir, rt, iph);
+			if (ret < 0) {
+				ret = NF_DROP;
+				flow_offload_teardown(flow);
+				break;
+			}
+			nexthop = rt_nexthop(rt, flow->tuple[dir]->tunnel.dst_v4.s_addr);
+		} else {
+			nexthop = rt_nexthop(rt, flow->tuple[!dir]->src_v4.s_addr);
+		}
 		outdev = rt->dst.dev;
 		skb->dev = outdev;
-		nexthop = rt_nexthop(rt, flow->tuple[!dir]->src_v4.s_addr);
 		skb_dst_set_noref(skb, &rt->dst);
 		neigh_xmit(NEIGH_ARP_TABLE, outdev, &nexthop, skb);
 		ret = NF_STOLEN;
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
index ea403b95326c..1d672310ac6a 100644
--- a/net/netfilter/nft_flow_offload.c
+++ b/net/netfilter/nft_flow_offload.c
@@ -159,7 +159,13 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
 			route->tuple[!dir].in.tun.ip.saddr = path->tun.ip.daddr;
 			route->tuple[!dir].in.tun.ip.daddr = path->tun.ip.saddr;
 			route->tuple[!dir].in.tun.l4proto = path->tun.l4proto;
-			dst_release(path->tun.dst);
+
+			route->tuple[dir].out.num_tunnels++;
+			route->tuple[dir].out.tun.l3proto = path->tun.l3proto;
+			route->tuple[dir].out.tun.ip.saddr = path->tun.ip.saddr;
+			route->tuple[dir].out.tun.ip.daddr = path->tun.ip.daddr;
+			route->tuple[dir].out.tun.l4proto = path->tun.l4proto;
+			route->tuple[dir].dst = path->tun.dst;
 			break;
 		default:
 			info->indev = NULL;
