Re: [PATCH nf-next v6 1/2] net: netfilter: Add IPIP flowtable SW acceleration


From: Lorenzo Bianconi <lorenzo@kernel.org>
To: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "David S. Miller" <davem@davemloft.net>,
	David Ahern <dsahern@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Jozsef Kadlecsik <kadlec@netfilter.org>,
	Shuah Khan <shuah@kernel.org>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	Florian Westphal <fw@strlen.de>,
	netdev@vger.kernel.org, netfilter-devel@vger.kernel.org,
	coreteam@netfilter.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH nf-next v6 1/2] net: netfilter: Add IPIP flowtable SW acceleration
Date: Tue, 21 Oct 2025 19:46:50 +0200
Message-ID: <aPfHCgx5hjNAkiE1@lore-desk>
In-Reply-To: <aMCcnO4rJdDIdx3m@calendula>

> On Mon, Aug 18, 2025 at 11:07:33AM +0200, Lorenzo Bianconi wrote:

Hi Pablo,

sorry for the long delay.

[...]

> 
> I found this patch in one of my trees (see attachment) to explore
> tunnel integration of the tx path. There have been similar patches
> floating on the mailing list for layer 2 encapsulation (e.g. pppoe and
> vlan); IIRC the pppoe ones claimed to accelerate tx.

ack, thx. I will look into it for v7.

> 
> Another aspect of this series is that I think it would be good to
> explore integration of other layer 3 tunnel protocols, rather than
> following an incremental approach.

ack.

> 
> More comments below.
> 
> > - TCP stream received from the IPIP tunnel:
> >   - net-next:				~35Gbps
> >   - net-next + IPIP flowtable support:	~49Gbps
> > 

[...]

> > +	path->encap.id = __ipv4_addr_hash(tiph->saddr, ntohl(tiph->daddr));
> 
> This hash approach sounds reasonable, but I feel a bit uncomfortable
> with the idea that the flowtable bypasses _entirely_ the existing
> firewall policy and that this does not provide a perfect match. The
> idea is that only the initial packets of a flow go through the policy;
> once the flow is added to the flowtable, such firewall policy
> validation is circumvented.

ack, I will implement a perfect match for tuple lookup in v7.
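
To illustrate the concern, here is a minimal userspace sketch of why a
16-bit id derived from two 32-bit addresses cannot be a perfect match.
toy_encap_id() is a hypothetical stand-in and is NOT the kernel's
__ipv4_addr_hash(); struct tun_key and find_collision() are likewise
made up for illustration. The pigeonhole argument holds for any hash
folded into 16 bits, so an exact comparison of both addresses is needed
to tell colliding tunnels apart.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel hash; NOT __ipv4_addr_hash(). */
static uint16_t toy_encap_id(uint32_t saddr, uint32_t daddr)
{
	uint32_t h = saddr ^ (daddr * 2654435761u);

	h ^= h >> 16;
	return (uint16_t)h;	/* only 16 bits survive, as in encap.id */
}

/* Hypothetical exact-match key: both outer addresses are kept. */
struct tun_key {
	uint32_t saddr;
	uint32_t daddr;
};

static int tun_key_equal(const struct tun_key *a, const struct tun_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr;
}

/*
 * Search nearby source addresses for one that yields the same 16-bit id
 * as @k. Returns 1 and fills @out when a colliding key is found.
 */
static int find_collision(const struct tun_key *k, struct tun_key *out)
{
	uint16_t id = toy_encap_id(k->saddr, k->daddr);
	uint32_t i;

	for (i = 1; i <= 0x20000; i++) {
		uint32_t s = k->saddr + i;

		if (toy_encap_id(s, k->daddr) == id) {
			out->saddr = s;
			out->daddr = k->daddr;
			return 1;
		}
	}
	return 0;
}
```

Calling find_collision() on any key returns a second key with the same
id, while tun_key_equal() still reports the two keys as different.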

> 
> To achieve a perfect match, this means more memory consumption to
> store the two IPs in the tuple.
> 
>         struct {
>                 u16                     id;
>                 __be16                  proto;
>         } encap[NF_FLOW_TABLE_ENCAP_MAX];
> 
> And possibly more information will need to be stored for other
> layer 3 tunnel protocols.
> 
> While this hash trick looks like an interesting approach, I am
> ambivalent.
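
A rough sketch of the memory cost mentioned above, for the IPv4 case
only. struct encap_exact_v4 is a hypothetical layout, not a proposal
from this thread, and ENCAP_MAX assumes NF_FLOW_TABLE_ENCAP_MAX is 2.
With natural alignment the exact-match entry is 12 bytes against 4 for
the hashed one, so the tuple would grow by 16 bytes; IPv6 addresses
would cost more.

```c
#include <stdint.h>

#define ENCAP_MAX 2	/* assumed value of NF_FLOW_TABLE_ENCAP_MAX */

/* Layout quoted above: 16-bit id plus protocol. */
struct encap_hashed {
	uint16_t id;
	uint16_t proto;
};

/* Hypothetical exact-match layout for an IPv4 outer header. */
struct encap_exact_v4 {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t id;
	uint16_t proto;
};

/* Extra bytes per tuple when every encap slot stores both addresses. */
static unsigned int encap_extra_bytes(void)
{
	return ENCAP_MAX * (sizeof(struct encap_exact_v4) -
			    sizeof(struct encap_hashed));
}
```
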
> 
> And one nitpick (typo) below...

ack, I will fix it in v7.

Regards,
Lorenzo

> 
> > +	ctx->dev = rt->dst.dev;
> > +	ip_rt_put(rt);
> > +
> > +	return 0;
> > +}
> > +
> 
> [...]
> > +static void nf_flow_ip4_ecanp_pop(struct sk_buff *skb)
> 
>                           _encap_pop ?

> commit 4c635431740ecaa011c732bce954086266f07218
> Author: Pablo Neira Ayuso <pablo@netfilter.org>
> Date:   Wed Jul 6 12:52:02 2022 +0200
> 
>     netfilter: flowtable: tunnel tx support
> 
> diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
> index d21da5b57eeb..d4ecb57a8bfc 100644
> --- a/include/net/netfilter/nf_flow_table.h
> +++ b/include/net/netfilter/nf_flow_table.h
> @@ -139,6 +139,27 @@ struct flow_offload_tuple {
>  		struct {
>  			struct dst_entry *dst_cache;
>  			u32		dst_cookie;
> +			u8		tunnel_num;
> +			struct {
> +				u8	l3proto;
> +				u8	l4proto;
> +				u8	tos;
> +				u8	ttl;
> +				__be16	df;
> +
> +				union {
> +					struct in_addr		src_v4;
> +					struct in6_addr		src_v6;
> +				};
> +				union {
> +					struct in_addr		dst_v4;
> +					struct in6_addr		dst_v6;
> +				};
> +				struct {
> +					__be16			src_port;
> +					__be16			dst_port;
> +				};
> +			} tunnel;
>  		};
>  		struct {
>  			u32		ifidx;
> @@ -223,6 +244,17 @@ struct nf_flow_route {
>  			u32			hw_ifindex;
>  			u8			h_source[ETH_ALEN];
>  			u8			h_dest[ETH_ALEN];
> +
> +			int			num_tunnels;
> +			struct {
> +				int		ifindex;
> +				u8		l3proto;
> +				u8		l4proto;
> +				struct {
> +					__be32	saddr;
> +					__be32	daddr;
> +				} ip;
> +			} tun;
>  		} out;
>  		enum flow_offload_xmit_type	xmit_type;
>  	} tuple[FLOW_OFFLOAD_DIR_MAX];
> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> index ab7df5c54eba..9244168c8cc8 100644
> --- a/net/netfilter/nf_flow_table_core.c
> +++ b/net/netfilter/nf_flow_table_core.c
> @@ -177,6 +177,24 @@ static int flow_offload_fill_route(struct flow_offload *flow,
>  		flow_tuple->tun.inner = flow->inner_tuple;
>  	}
>  
> +	if (route->tuple[dir].out.num_tunnels) {
> +		flow_tuple->tunnel_num++;
> +
> +		switch (route->tuple[dir].out.tun.l3proto) {
> +		case NFPROTO_IPV4:
> +			flow_tuple->tunnel.src_v4.s_addr = route->tuple[dir].out.tun.ip.saddr;
> +			flow_tuple->tunnel.dst_v4.s_addr = route->tuple[dir].out.tun.ip.daddr;
> +			break;
> +		case NFPROTO_IPV6:
> +			break;
> +		}
> +
> +		flow_tuple->tunnel.l3proto = route->tuple[dir].out.tun.l3proto;
> +		flow_tuple->tunnel.l4proto = route->tuple[dir].out.tun.l4proto;
> +		flow_tuple->tunnel.src_port = 0;
> +		flow_tuple->tunnel.dst_port = 0;
> +	}
> +
>  	return 0;
>  }
>  
> diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
> index c1156d4ce865..1b96309210b8 100644
> --- a/net/netfilter/nf_flow_table_ip.c
> +++ b/net/netfilter/nf_flow_table_ip.c
> @@ -349,6 +349,58 @@ static unsigned int nf_flow_queue_xmit(struct net *net, struct sk_buff *skb,
>  	return NF_STOLEN;
>  }
>  
> +/* extract from ip_tunnel_xmit(). */
> +static unsigned int nf_flow_tunnel_add(struct net *net, struct sk_buff *skb,
> +				       struct flow_offload *flow, int dir,
> +				       const struct rtable *rt,
> +				       struct iphdr *inner_iph)
> +{
> +	u32 headroom = sizeof(struct iphdr);
> +	struct iphdr *iph;
> +	u8 tos, ttl;
> +	__be16 df;
> +
> +	if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP4))
> +		return -1;
> +
> +	skb_set_inner_ipproto(skb, IPPROTO_IPIP);
> +
> +	headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len;
> +
> +	if (skb_cow_head(skb, headroom))
> +		return -1;
> +
> +	skb_scrub_packet(skb, true);
> +	skb_clear_hash_if_not_l4(skb);
> +	memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
> +
> +	/* Push down and install the IP header. */
> +	skb_push(skb, sizeof(struct iphdr));
> +	skb_reset_network_header(skb);
> +
> +	df = flow->tuple[dir]->tunnel.df;
> +	tos = ip_tunnel_ecn_encap(flow->tuple[dir]->tunnel.tos, inner_iph, skb);
> +	ttl = flow->tuple[dir]->tunnel.ttl;
> +	if (ttl == 0)
> +		ttl = inner_iph->ttl;
> +
> +	iph = ip_hdr(skb);
> +
> +	iph->version    =       4;
> +	iph->ihl        =       sizeof(struct iphdr) >> 2;
> +	iph->frag_off   =       ip_mtu_locked(&rt->dst) ? 0 : df;
> +	iph->protocol   =       flow->tuple[dir]->tunnel.l4proto;
> +	iph->tos        =       tos;
> +	iph->daddr      =       flow->tuple[dir]->tunnel.dst_v4.s_addr;
> +	iph->saddr      =	flow->tuple[dir]->tunnel.src_v4.s_addr;
> +	iph->ttl        =       ttl;
> +	iph->tot_len	=	htons(skb->len);
> +	__ip_select_ident(net, iph, skb_shinfo(skb)->gso_segs ?: 1);
> +	ip_send_check(iph);
> +
> +	return 0;
> +}
> +
>  unsigned int
>  nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
>  			const struct nf_hook_state *state)
> @@ -430,9 +482,19 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
>  	switch (flow->tuple[dir]->xmit_type) {
>  	case FLOW_OFFLOAD_XMIT_NEIGH:
>  		rt = (struct rtable *)flow->tuple[dir]->dst_cache;
> +		if (flow->tuple[dir]->tunnel_num) {
> +			ret = nf_flow_tunnel_add(state->net, skb, flow, dir, rt, iph);
> +			if (ret < 0) {
> +				ret = NF_DROP;
> +				flow_offload_teardown(flow);
> +				break;
> +			}
> +			nexthop = rt_nexthop(rt, flow->tuple[dir]->tunnel.dst_v4.s_addr);
> +		} else {
> +			nexthop = rt_nexthop(rt, flow->tuple[!dir]->src_v4.s_addr);
> +		}
>  		outdev = rt->dst.dev;
>  		skb->dev = outdev;
> -		nexthop = rt_nexthop(rt, flow->tuple[!dir]->src_v4.s_addr);
>  		skb_dst_set_noref(skb, &rt->dst);
>  		neigh_xmit(NEIGH_ARP_TABLE, outdev, &nexthop, skb);
>  		ret = NF_STOLEN;
> diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
> index ea403b95326c..1d672310ac6a 100644
> --- a/net/netfilter/nft_flow_offload.c
> +++ b/net/netfilter/nft_flow_offload.c
> @@ -159,7 +159,13 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
>  			route->tuple[!dir].in.tun.ip.saddr = path->tun.ip.daddr;
>  			route->tuple[!dir].in.tun.ip.daddr = path->tun.ip.saddr;
>  			route->tuple[!dir].in.tun.l4proto = path->tun.l4proto;
> -			dst_release(path->tun.dst);
> +
> +			route->tuple[dir].out.num_tunnels++;
> +			route->tuple[dir].out.tun.l3proto = path->tun.l3proto;
> +			route->tuple[dir].out.tun.ip.saddr = path->tun.ip.saddr;
> +			route->tuple[dir].out.tun.ip.daddr = path->tun.ip.daddr;
> +			route->tuple[dir].out.tun.l4proto = path->tun.l4proto;
> +			route->tuple[dir].dst = path->tun.dst;
>  			break;
>  		default:
>  			info->indev = NULL;
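
For readers less familiar with what nf_flow_tunnel_add() in the patch
above does to the packet, here is a minimal userspace sketch of building
the outer IPv4 header for IPIP and computing its checksum. It is not
kernel code: struct ipv4_hdr, ipv4_checksum() and ipip_build_outer() are
hypothetical names, the IP id is left at zero (the kernel picks one via
__ip_select_ident()), and ECN handling is omitted.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROTO_IPIP 4	/* IANA protocol number for IPv4-in-IPv4 */

/* Minimal IPv4 header without options; multi-byte fields are big endian. */
struct ipv4_hdr {
	uint8_t  ver_ihl;
	uint8_t  tos;
	uint16_t tot_len;
	uint16_t id;
	uint16_t frag_off;
	uint8_t  ttl;
	uint8_t  protocol;
	uint16_t check;
	uint32_t saddr;
	uint32_t daddr;
};

/* RFC 1071 one's complement checksum over an even-length header. */
static uint16_t ipv4_checksum(const void *hdr, size_t len)
{
	const uint8_t *p = hdr;
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return htons((uint16_t)~sum);
}

/* Fill the outer header that is pushed in front of the inner packet. */
static void ipip_build_outer(struct ipv4_hdr *iph,
			     uint32_t saddr_be, uint32_t daddr_be,
			     uint8_t tos, uint8_t ttl, uint16_t df_be,
			     uint16_t inner_len)
{
	memset(iph, 0, sizeof(*iph));
	iph->ver_ihl  = (uint8_t)((4 << 4) | (sizeof(*iph) >> 2));
	iph->tos      = tos;
	iph->tot_len  = htons((uint16_t)(sizeof(*iph) + inner_len));
	iph->frag_off = df_be;
	iph->ttl      = ttl;
	iph->protocol = PROTO_IPIP;
	iph->saddr    = saddr_be;
	iph->daddr    = daddr_be;
	/* check is still zero here, as the algorithm requires */
	iph->check    = ipv4_checksum(iph, sizeof(*iph));
}
```

Running ipv4_checksum() again over a header filled this way yields zero,
which is the usual receiver-side validity check.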



Thread overview:
2025-08-18  9:07 [PATCH nf-next v6 0/2] Add IPIP flowtable SW acceleratio Lorenzo Bianconi
2025-08-18  9:07 ` [PATCH nf-next v6 1/2] net: netfilter: Add IPIP flowtable SW acceleration Lorenzo Bianconi
2025-09-09 21:31   ` Pablo Neira Ayuso
2025-10-21 17:46     ` Lorenzo Bianconi [this message]
2025-08-18  9:07 ` [PATCH nf-next v6 2/2] selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest Lorenzo Bianconi
2025-09-05 21:09 ` [PATCH nf-next v6 0/2] Add IPIP flowtable SW acceleratio Lorenzo Bianconi
