* [PATCH net-next 00/17] BIG TCP for UDP tunnels
@ 2025-09-23 13:47 Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers Maxim Mikityanskiy
` (17 more replies)
0 siblings, 18 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
This series adds support for BIG TCP IPv4/IPv6 workloads for vxlan
and geneve. It consists of two parts:
01-11: Remove hop-by-hop header for BIG TCP IPv6 to align with BIG TCP IPv4
12-17: Fix up things that prevent BIG TCP from working with tunnels.
There are a few places that make assumptions about skb->len being
smaller than 64k and/or that store it in 16-bit fields, trimming the
length. The first step to enable BIG TCP with VXLAN and GENEVE tunnels
is to patch those places to handle bigger lengths properly (patches
12-17). This is enough to make IPv4 in IPv4 work with BIG TCP, but when
either the outer or the inner protocol is IPv6, the current BIG TCP code
inserts a hop-by-hop extension header that stores the actual 32-bit
length of the packet. This additional hop-by-hop header turns out to be
problematic for encapsulated cases, because:
1. The drivers don't strip it, and they'd all need to know the structure
of each tunnel protocol in order to strip it correctly.
2. Even if (1) is implemented, it would be an additional performance
penalty per aggregated packet.
3. The skb_gso_validate_network_len check is skipped in
ip6_finish_output_gso when IP6SKB_FAKEJUMBO is set, but it seems that it
would make sense to do the actual validation, just taking into account
the length of the HBH header. When support for tunnels is added, it
becomes trickier, because there may be one or two HBH headers, depending
on whether it's IPv6 in IPv6 or not.
At the same time, having an HBH header to store the 32-bit length is not
strictly necessary, as BIG TCP IPv4 doesn't do anything like this and
just restores the length from skb->len. The same thing can be done for
BIG TCP IPv6 (patches 01-11). Removing HBH from BIG TCP allows
simplifying the implementation significantly and aligning it with BIG
TCP IPv4.
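To make this concrete, here is a minimal userspace sketch (not kernel
code) of the recovery logic; it mirrors the ipv6_payload_len() helper
introduced in patch 01, with 40 standing for sizeof(struct ipv6hdr) and
pkt_len standing in for skb->len minus the network offset (the GSO
checks of the real helper are omitted here):

    #include <stdint.h>
    #include <stdio.h>
    #include <arpa/inet.h>

    /* Recover the real IPv6 payload length when the 16-bit header
     * field is 0 (BIG TCP).
     */
    static uint32_t payload_len(uint16_t payload_len_be, uint32_t pkt_len)
    {
            uint32_t len = ntohs(payload_len_be);

            return len ? len : pkt_len - 40;
    }

    int main(void)
    {
            printf("%u\n", payload_len(htons(1400), 1440)); /* 1400 */
            printf("%u\n", payload_len(0, 100040));         /* 100000 */
            return 0;
    }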
A trivial tcpdump PR for IPv6 is pending here [0]. While the tcpdump
committers seem to be actively contributing code to the repository,
community PRs appear to have been stuck for a long time. We checked
with Xin Long with regard to BIG TCP IPv4, and it turned out that back
then only GUESS_TSO was added to the Fedora distro spec file CFLAGS
definition. In any case, we have Cc'ed Guy Harris et al. (tcpdump
maintainer/committer) here just in case, to see if he could help out
with unblocking [0].
Thanks all!
[0] https://github.com/the-tcpdump-group/tcpdump/pull/1329
Daniel Borkmann (1):
geneve: Enable BIG TCP packets
Maxim Mikityanskiy (16):
net/ipv6: Introduce payload_len helpers
net/ipv6: Drop HBH for BIG TCP on TX side
net/ipv6: Drop HBH for BIG TCP on RX side
net/ipv6: Remove jumbo_remove step from TX path
net/mlx5e: Remove jumbo_remove step from TX path
net/mlx4: Remove jumbo_remove step from TX path
ice: Remove jumbo_remove step from TX path
bnxt_en: Remove jumbo_remove step from TX path
gve: Remove jumbo_remove step from TX path
net: mana: Remove jumbo_remove step from TX path
net/ipv6: Remove HBH helpers
net: Enable BIG TCP with partial GSO
udp: Support gro_ipv4_max_size > 65536
udp: Validate UDP length in udp_gro_receive
udp: Set length in UDP header to 0 for big GSO packets
vxlan: Enable BIG TCP packets
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 21 -----
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 3 -
drivers/net/ethernet/intel/ice/ice_txrx.c | 3 -
drivers/net/ethernet/mellanox/mlx4/en_tx.c | 42 ++--------
.../net/ethernet/mellanox/mlx5/core/en_tx.c | 75 +++---------------
drivers/net/ethernet/microsoft/mana/mana_en.c | 3 -
drivers/net/geneve.c | 2 +
drivers/net/vxlan/vxlan_core.c | 2 +
include/linux/ipv6.h | 21 ++++-
include/net/ipv6.h | 79 -------------------
include/net/netfilter/nf_tables_ipv6.h | 4 +-
net/bridge/br_netfilter_ipv6.c | 2 +-
net/bridge/netfilter/nf_conntrack_bridge.c | 4 +-
net/core/dev.c | 6 +-
net/core/gro.c | 2 -
net/core/skbuff.c | 10 +--
net/ipv4/udp.c | 5 +-
net/ipv4/udp_offload.c | 12 ++-
net/ipv4/udp_tunnel_core.c | 2 +-
net/ipv6/ip6_input.c | 2 +-
net/ipv6/ip6_offload.c | 36 +--------
net/ipv6/ip6_output.c | 20 +----
net/ipv6/ip6_udp_tunnel.c | 2 +-
net/ipv6/output_core.c | 7 +-
net/netfilter/ipvs/ip_vs_xmit.c | 2 +-
net/netfilter/nf_conntrack_ovs.c | 2 +-
net/netfilter/nf_log_syslog.c | 2 +-
net/sched/sch_cake.c | 2 +-
28 files changed, 84 insertions(+), 289 deletions(-)
--
2.50.1
* [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-25 13:51 ` Paolo Abeni
2025-09-25 18:23 ` Stanislav Fomichev
2025-09-23 13:47 ` [PATCH net-next 02/17] net/ipv6: Drop HBH for BIG TCP on TX side Maxim Mikityanskiy
` (16 subsequent siblings)
17 siblings, 2 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
The next commits will transition away from using the hop-by-hop
extension header to encode the packet length for BIG TCP. Add wrappers
around ip6->payload_len that return the actual value if it's non-zero
and calculate it from skb->len if payload_len is set to zero, plus a
symmetrical setter.
The new helpers are used wherever the surrounding code supports the
hop-by-hop jumbo header for BIG TCP IPv6, or where the corresponding
IPv4 code uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h).
No behavioral change in this commit.
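For illustration, a typical call-site conversion looks like this
(sketch; the actual conversions are in the diff below):

    /* before: reads 0 for BIG TCP packets */
    pkt_len = ntohs(ip6h->payload_len);

    /* after: falls back to a length derived from skb->len */
    pkt_len = ipv6_payload_len(pkt->skb, ip6h);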
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
include/linux/ipv6.h | 20 ++++++++++++++++++++
include/net/ipv6.h | 2 --
include/net/netfilter/nf_tables_ipv6.h | 4 ++--
net/bridge/br_netfilter_ipv6.c | 2 +-
net/bridge/netfilter/nf_conntrack_bridge.c | 4 ++--
net/ipv6/ip6_input.c | 2 +-
net/ipv6/ip6_offload.c | 7 +++----
net/ipv6/output_core.c | 7 +------
net/netfilter/ipvs/ip_vs_xmit.c | 2 +-
net/netfilter/nf_conntrack_ovs.c | 2 +-
net/netfilter/nf_log_syslog.c | 2 +-
net/sched/sch_cake.c | 2 +-
12 files changed, 34 insertions(+), 22 deletions(-)
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 43b7bb828738..44c4b791eceb 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -126,6 +126,26 @@ static inline unsigned int ipv6_transport_len(const struct sk_buff *skb)
skb_network_header_len(skb);
}
+static inline unsigned int ipv6_payload_len(const struct sk_buff *skb, const struct ipv6hdr *ip6)
+{
+ u32 len = ntohs(ip6->payload_len);
+
+ return (len || !skb_is_gso(skb) || !skb_is_gso_tcp(skb)) ?
+ len : skb->len - skb_network_offset(skb) - sizeof(struct ipv6hdr);
+}
+
+static inline unsigned int skb_ipv6_payload_len(const struct sk_buff *skb)
+{
+ return ipv6_payload_len(skb, ipv6_hdr(skb));
+}
+
+#define IPV6_MAXPLEN 65535
+
+static inline void ipv6_set_payload_len(struct ipv6hdr *ip6, unsigned int len)
+{
+ ip6->payload_len = len <= IPV6_MAXPLEN ? htons(len) : 0;
+}
+
/*
This structure contains results of exthdrs parsing
as offsets from skb->nh.
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 2ccdf85f34f1..38b332f3028e 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -25,8 +25,6 @@ struct ip_tunnel_info;
#define SIN6_LEN_RFC2133 24
-#define IPV6_MAXPLEN 65535
-
/*
* NextHeader field of IPv6 header
*/
diff --git a/include/net/netfilter/nf_tables_ipv6.h b/include/net/netfilter/nf_tables_ipv6.h
index a0633eeaec97..c53ac00bb974 100644
--- a/include/net/netfilter/nf_tables_ipv6.h
+++ b/include/net/netfilter/nf_tables_ipv6.h
@@ -42,7 +42,7 @@ static inline int __nft_set_pktinfo_ipv6_validate(struct nft_pktinfo *pkt)
if (ip6h->version != 6)
return -1;
- pkt_len = ntohs(ip6h->payload_len);
+ pkt_len = ipv6_payload_len(pkt->skb, ip6h);
skb_len = pkt->skb->len - skb_network_offset(pkt->skb);
if (pkt_len + sizeof(*ip6h) > skb_len)
return -1;
@@ -86,7 +86,7 @@ static inline int nft_set_pktinfo_ipv6_ingress(struct nft_pktinfo *pkt)
if (ip6h->version != 6)
goto inhdr_error;
- pkt_len = ntohs(ip6h->payload_len);
+ pkt_len = ipv6_payload_len(pkt->skb, ip6h);
if (pkt_len + sizeof(*ip6h) > pkt->skb->len) {
idev = __in6_dev_get(nft_in(pkt));
__IP6_INC_STATS(nft_net(pkt), idev, IPSTATS_MIB_INTRUNCATEDPKTS);
diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index e0421eaa3abc..76ce70b4e7f3 100644
--- a/net/bridge/br_netfilter_ipv6.c
+++ b/net/bridge/br_netfilter_ipv6.c
@@ -58,7 +58,7 @@ int br_validate_ipv6(struct net *net, struct sk_buff *skb)
if (hdr->version != 6)
goto inhdr_error;
- pkt_len = ntohs(hdr->payload_len);
+ pkt_len = ipv6_payload_len(skb, hdr);
if (hdr->nexthdr == NEXTHDR_HOP && nf_ip6_check_hbh_len(skb, &pkt_len))
goto drop;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index 6482de4d8750..e3fd414906a0 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -230,7 +230,7 @@ static int nf_ct_br_ipv6_check(const struct sk_buff *skb)
if (hdr->version != 6)
return -1;
- len = ntohs(hdr->payload_len) + sizeof(struct ipv6hdr) + nhoff;
+ len = ipv6_payload_len(skb, hdr) + sizeof(struct ipv6hdr) + nhoff;
if (skb->len < len)
return -1;
@@ -270,7 +270,7 @@ static unsigned int nf_ct_bridge_pre(void *priv, struct sk_buff *skb,
if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
return NF_ACCEPT;
- len = sizeof(struct ipv6hdr) + ntohs(ipv6_hdr(skb)->payload_len);
+ len = sizeof(struct ipv6hdr) + skb_ipv6_payload_len(skb);
if (pskb_trim_rcsum(skb, len))
return NF_ACCEPT;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 168ec07e31cc..2bcb981c91aa 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -262,7 +262,7 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
skb->transport_header = skb->network_header + sizeof(*hdr);
IP6CB(skb)->nhoff = offsetof(struct ipv6hdr, nexthdr);
- pkt_len = ntohs(hdr->payload_len);
+ pkt_len = ipv6_payload_len(skb, hdr);
/* pkt_len may be zero if Jumbo payload option is present */
if (pkt_len || hdr->nexthdr != NEXTHDR_HOP) {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index fce91183797a..6762ce7909c8 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -372,12 +372,11 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
hop_jumbo->jumbo_payload_len = htonl(payload_len + hoplen);
iph->nexthdr = NEXTHDR_HOP;
- iph->payload_len = 0;
- } else {
- iph = (struct ipv6hdr *)(skb->data + nhoff);
- iph->payload_len = htons(payload_len);
}
+ iph = (struct ipv6hdr *)(skb->data + nhoff);
+ ipv6_set_payload_len(iph, payload_len);
+
nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
if (WARN_ON(!ops || !ops->callbacks.gro_complete))
goto out;
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 1c9b283a4132..cba1684a3f30 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -125,12 +125,7 @@ EXPORT_SYMBOL(ip6_dst_hoplimit);
int __ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
- int len;
-
- len = skb->len - sizeof(struct ipv6hdr);
- if (len > IPV6_MAXPLEN)
- len = 0;
- ipv6_hdr(skb)->payload_len = htons(len);
+ ipv6_set_payload_len(ipv6_hdr(skb), skb->len - sizeof(struct ipv6hdr));
IP6CB(skb)->nhoff = offsetof(struct ipv6hdr, nexthdr);
/* if egress device is enslaved to an L3 master device pass the
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 95af252b2939..50501f3764f4 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -947,7 +947,7 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
*next_protocol = IPPROTO_IPV6;
if (payload_len)
*payload_len =
- ntohs(old_ipv6h->payload_len) +
+ ipv6_payload_len(skb, old_ipv6h) +
sizeof(*old_ipv6h);
old_dsfield = ipv6_get_dsfield(old_ipv6h);
*ttl = old_ipv6h->hop_limit;
diff --git a/net/netfilter/nf_conntrack_ovs.c b/net/netfilter/nf_conntrack_ovs.c
index 068e9489e1c2..a6988eeb1579 100644
--- a/net/netfilter/nf_conntrack_ovs.c
+++ b/net/netfilter/nf_conntrack_ovs.c
@@ -121,7 +121,7 @@ int nf_ct_skb_network_trim(struct sk_buff *skb, int family)
len = skb_ip_totlen(skb);
break;
case NFPROTO_IPV6:
- len = ntohs(ipv6_hdr(skb)->payload_len);
+ len = skb_ipv6_payload_len(skb);
if (ipv6_hdr(skb)->nexthdr == NEXTHDR_HOP) {
int err = nf_ip6_check_hbh_len(skb, &len);
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index 86d5fc5d28e3..41503847d9d7 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -561,7 +561,7 @@ dump_ipv6_packet(struct net *net, struct nf_log_buf *m,
/* Max length: 44 "LEN=65535 TC=255 HOPLIMIT=255 FLOWLBL=FFFFF " */
nf_log_buf_add(m, "LEN=%zu TC=%u HOPLIMIT=%u FLOWLBL=%u ",
- ntohs(ih->payload_len) + sizeof(struct ipv6hdr),
+ ipv6_payload_len(skb, ih) + sizeof(struct ipv6hdr),
(ntohl(*(__be32 *)ih) & 0x0ff00000) >> 20,
ih->hop_limit,
(ntohl(*(__be32 *)ih) & 0x000fffff));
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 32bacfc314c2..205f11c298ac 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1266,7 +1266,7 @@ static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
ipv6_addr_cmp(&ipv6h_check->daddr, &ipv6h->daddr))
continue;
- seglen = ntohs(ipv6h_check->payload_len);
+ seglen = ipv6_payload_len(skb, ipv6h_check);
} else {
WARN_ON(1); /* shouldn't happen */
continue;
--
2.50.1
* [PATCH net-next 02/17] net/ipv6: Drop HBH for BIG TCP on TX side
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 03/17] net/ipv6: Drop HBH for BIG TCP on RX side Maxim Mikityanskiy
` (15 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
BIG TCP IPv6 inserts a hop-by-hop extension header to indicate the real
IPv6 payload length when it doesn't fit into the 16-bit field in the
IPv6 header itself. While it helps tools parse the packet, it also
requires every driver that supports TSO and BIG TCP to remove this
8-byte extension header. It might not sound that bad until we try to
apply it to tunneled traffic. Currently, the drivers don't attempt to
strip HBH if skb->encapsulation == 1. Moreover, trying to do so would
require dissecting different tunnel protocols and making corresponding
adjustments on a case-by-case basis, which would slow down the fastpath
(potentially also requiring adjusting checksums in outer headers).
At the same time, BIG TCP IPv4 doesn't insert any extra headers and just
calculates the payload length from skb->len, significantly simplifying
implementing BIG TCP for tunnels.
Stop inserting HBH when building BIG TCP GSO SKBs.
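For reference, this is the RFC 2675 jumbo option that ip6_xmit() used
to push between the IPv6 and TCP headers (the struct itself goes away
in patch 11):

    struct hop_jumbo_hdr {
            u8      nexthdr;
            u8      hdrlen;
            u8      tlv_type;               /* IPV6_TLV_JUMBO, 0xC2 */
            u8      tlv_len;                /* 4 */
            __be32  jumbo_payload_len;
    };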
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
include/linux/ipv6.h | 1 -
net/ipv6/ip6_output.c | 20 +++-----------------
2 files changed, 3 insertions(+), 18 deletions(-)
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 44c4b791eceb..116219ce2c3b 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -175,7 +175,6 @@ struct inet6_skb_parm {
#define IP6SKB_L3SLAVE 64
#define IP6SKB_JUMBOGRAM 128
#define IP6SKB_SEG6 256
-#define IP6SKB_FAKEJUMBO 512
#define IP6SKB_MULTIPATH 1024
#define IP6SKB_MCROUTE 2048
};
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index f904739e99b9..ed1b8e62ef61 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -179,8 +179,7 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk,
static int ip6_finish_output_gso(struct net *net, struct sock *sk,
struct sk_buff *skb, unsigned int mtu)
{
- if (!(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
- !skb_gso_validate_network_len(skb, mtu))
+ if (!skb_gso_validate_network_len(skb, mtu))
return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
return ip6_finish_output2(net, sk, skb);
@@ -273,8 +272,6 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
struct in6_addr *first_hop = &fl6->daddr;
struct dst_entry *dst = skb_dst(skb);
struct inet6_dev *idev = ip6_dst_idev(dst);
- struct hop_jumbo_hdr *hop_jumbo;
- int hoplen = sizeof(*hop_jumbo);
struct net *net = sock_net(sk);
unsigned int head_room;
struct net_device *dev;
@@ -287,7 +284,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
rcu_read_lock();
dev = dst_dev_rcu(dst);
- head_room = sizeof(struct ipv6hdr) + hoplen + LL_RESERVED_SPACE(dev);
+ head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
if (opt)
head_room += opt->opt_nflen + opt->opt_flen;
@@ -312,19 +309,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
&fl6->saddr);
}
- if (unlikely(seg_len > IPV6_MAXPLEN)) {
- hop_jumbo = skb_push(skb, hoplen);
-
- hop_jumbo->nexthdr = proto;
- hop_jumbo->hdrlen = 0;
- hop_jumbo->tlv_type = IPV6_TLV_JUMBO;
- hop_jumbo->tlv_len = 4;
- hop_jumbo->jumbo_payload_len = htonl(seg_len + hoplen);
-
- proto = IPPROTO_HOPOPTS;
+ if (unlikely(seg_len > IPV6_MAXPLEN))
seg_len = 0;
- IP6CB(skb)->flags |= IP6SKB_FAKEJUMBO;
- }
skb_push(skb, sizeof(struct ipv6hdr));
skb_reset_network_header(skb);
--
2.50.1
* [PATCH net-next 03/17] net/ipv6: Drop HBH for BIG TCP on RX side
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 02/17] net/ipv6: Drop HBH for BIG TCP on TX side Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 04/17] net/ipv6: Remove jumbo_remove step from TX path Maxim Mikityanskiy
` (14 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Complementary to the previous commit, stop inserting HBH when building
BIG TCP GRO SKBs.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
net/core/gro.c | 2 --
net/ipv6/ip6_offload.c | 28 +---------------------------
2 files changed, 1 insertion(+), 29 deletions(-)
diff --git a/net/core/gro.c b/net/core/gro.c
index 5ba4504cfd28..3ca3855bedec 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -115,8 +115,6 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
if (unlikely(p->len + len >= GRO_LEGACY_MAX_SIZE)) {
if (NAPI_GRO_CB(skb)->proto != IPPROTO_TCP ||
- (p->protocol == htons(ETH_P_IPV6) &&
- skb_headroom(p) < sizeof(struct hop_jumbo_hdr)) ||
p->encapsulation)
return -E2BIG;
}
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 6762ce7909c8..e5861089cc80 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -342,40 +342,14 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
const struct net_offload *ops;
struct ipv6hdr *iph;
int err = -ENOSYS;
- u32 payload_len;
if (skb->encapsulation) {
skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
skb_set_inner_network_header(skb, nhoff);
}
- payload_len = skb->len - nhoff - sizeof(*iph);
- if (unlikely(payload_len > IPV6_MAXPLEN)) {
- struct hop_jumbo_hdr *hop_jumbo;
- int hoplen = sizeof(*hop_jumbo);
-
- /* Move network header left */
- memmove(skb_mac_header(skb) - hoplen, skb_mac_header(skb),
- skb->transport_header - skb->mac_header);
- skb->data -= hoplen;
- skb->len += hoplen;
- skb->mac_header -= hoplen;
- skb->network_header -= hoplen;
- iph = (struct ipv6hdr *)(skb->data + nhoff);
- hop_jumbo = (struct hop_jumbo_hdr *)(iph + 1);
-
- /* Build hop-by-hop options */
- hop_jumbo->nexthdr = iph->nexthdr;
- hop_jumbo->hdrlen = 0;
- hop_jumbo->tlv_type = IPV6_TLV_JUMBO;
- hop_jumbo->tlv_len = 4;
- hop_jumbo->jumbo_payload_len = htonl(payload_len + hoplen);
-
- iph->nexthdr = NEXTHDR_HOP;
- }
-
iph = (struct ipv6hdr *)(skb->data + nhoff);
- ipv6_set_payload_len(iph, payload_len);
+ ipv6_set_payload_len(iph, skb->len - nhoff - sizeof(*iph));
nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
if (WARN_ON(!ops || !ops->callbacks.gro_complete))
--
2.50.1
* [PATCH net-next 04/17] net/ipv6: Remove jumbo_remove step from TX path
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (2 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 03/17] net/ipv6: Drop HBH for BIG TCP on RX side Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 05/17] net/mlx5e: " Maxim Mikityanskiy
` (13 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
the unnecessary steps from the GSO TX path that used to check for and
remove HBH.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
net/core/dev.c | 6 ++----
net/ipv6/ip6_offload.c | 5 +----
2 files changed, 3 insertions(+), 8 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index fc4993526ead..c7a4ea33d46d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3787,8 +3787,7 @@ static netdev_features_t gso_features_check(const struct sk_buff *skb,
(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4 &&
vlan_get_protocol(skb) == htons(ETH_P_IPV6))) &&
skb_transport_header_was_set(skb) &&
- skb_network_header_len(skb) != sizeof(struct ipv6hdr) &&
- !ipv6_has_hopopt_jumbo(skb))
+ skb_network_header_len(skb) != sizeof(struct ipv6hdr))
features &= ~(NETIF_F_IPV6_CSUM | NETIF_F_TSO6 | NETIF_F_GSO_UDP_L4);
return features;
@@ -3891,8 +3890,7 @@ int skb_csum_hwoffload_help(struct sk_buff *skb,
if (features & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM)) {
if (vlan_get_protocol(skb) == htons(ETH_P_IPV6) &&
- skb_network_header_len(skb) != sizeof(struct ipv6hdr) &&
- !ipv6_has_hopopt_jumbo(skb))
+ skb_network_header_len(skb) != sizeof(struct ipv6hdr))
goto sw_checksum;
switch (skb->csum_offset) {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index e5861089cc80..3252a9c2ad58 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -110,7 +110,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
struct sk_buff *segs = ERR_PTR(-EINVAL);
struct ipv6hdr *ipv6h;
const struct net_offload *ops;
- int proto, err;
+ int proto;
struct frag_hdr *fptr;
unsigned int payload_len;
u8 *prevhdr;
@@ -120,9 +120,6 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
bool gso_partial;
skb_reset_network_header(skb);
- err = ipv6_hopopt_jumbo_remove(skb);
- if (err)
- return ERR_PTR(err);
nhoff = skb_network_header(skb) - skb_mac_header(skb);
if (unlikely(!pskb_may_pull(skb, sizeof(*ipv6h))))
goto out;
--
2.50.1
* [PATCH net-next 05/17] net/mlx5e: Remove jumbo_remove step from TX path
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (3 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 04/17] net/ipv6: Remove jumbo_remove step from TX path Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 06/17] net/mlx4: " Maxim Mikityanskiy
` (12 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
the unnecessary steps from the mlx5e and mlx5i TX paths that used to
check for and remove HBH.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
.../net/ethernet/mellanox/mlx5/core/en_tx.c | 75 +++----------------
1 file changed, 12 insertions(+), 63 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index b7227afcb51d..0b15e141567e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -152,12 +152,11 @@ mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
* to inline later in the transmit descriptor
*/
static inline u16
-mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb, int *hopbyhop)
+mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
{
struct mlx5e_sq_stats *stats = sq->stats;
u16 ihs;
- *hopbyhop = 0;
if (skb->encapsulation) {
if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
ihs = skb_inner_transport_offset(skb) +
@@ -167,17 +166,12 @@ mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb, int *hopbyhop)
stats->tso_inner_packets++;
stats->tso_inner_bytes += skb->len - ihs;
} else {
- if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
+ if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
ihs = skb_transport_offset(skb) + sizeof(struct udphdr);
- } else {
+ else
ihs = skb_tcp_all_headers(skb);
- if (ipv6_has_hopopt_jumbo(skb)) {
- *hopbyhop = sizeof(struct hop_jumbo_hdr);
- ihs -= sizeof(struct hop_jumbo_hdr);
- }
- }
stats->tso_packets++;
- stats->tso_bytes += skb->len - ihs - *hopbyhop;
+ stats->tso_bytes += skb->len - ihs;
}
return ihs;
@@ -239,7 +233,6 @@ struct mlx5e_tx_attr {
__be16 mss;
u16 insz;
u8 opcode;
- u8 hopbyhop;
};
struct mlx5e_tx_wqe_attr {
@@ -275,16 +268,14 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
struct mlx5e_sq_stats *stats = sq->stats;
if (skb_is_gso(skb)) {
- int hopbyhop;
- u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
+ u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
*attr = (struct mlx5e_tx_attr) {
.opcode = MLX5_OPCODE_LSO,
.mss = cpu_to_be16(skb_shinfo(skb)->gso_size),
.ihs = ihs,
.num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
- .headlen = skb_headlen(skb) - ihs - hopbyhop,
- .hopbyhop = hopbyhop,
+ .headlen = skb_headlen(skb) - ihs,
};
stats->packets += skb_shinfo(skb)->gso_segs;
@@ -439,7 +430,6 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
struct mlx5_wqe_data_seg *dseg;
struct mlx5e_tx_wqe_info *wi;
u16 ihs = attr->ihs;
- struct ipv6hdr *h6;
struct mlx5e_sq_stats *stats = sq->stats;
int num_dma;
@@ -456,28 +446,7 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
if (ihs) {
u8 *start = eseg->inline_hdr.start;
- if (unlikely(attr->hopbyhop)) {
- /* remove the HBH header.
- * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
- */
- if (skb_vlan_tag_present(skb)) {
- mlx5e_insert_vlan(start, skb, ETH_HLEN + sizeof(*h6));
- ihs += VLAN_HLEN;
- h6 = (struct ipv6hdr *)(start + sizeof(struct vlan_ethhdr));
- } else {
- unsafe_memcpy(start, skb->data,
- ETH_HLEN + sizeof(*h6),
- MLX5_UNSAFE_MEMCPY_DISCLAIMER);
- h6 = (struct ipv6hdr *)(start + ETH_HLEN);
- }
- h6->nexthdr = IPPROTO_TCP;
- /* Copy the TCP header after the IPv6 one */
- memcpy(h6 + 1,
- skb->data + ETH_HLEN + sizeof(*h6) +
- sizeof(struct hop_jumbo_hdr),
- tcp_hdrlen(skb));
- /* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
- } else if (skb_vlan_tag_present(skb)) {
+ if (skb_vlan_tag_present(skb)) {
mlx5e_insert_vlan(start, skb, ihs);
ihs += VLAN_HLEN;
stats->added_vlan_packets++;
@@ -491,7 +460,7 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
}
dseg += wqe_attr->ds_cnt_ids;
- num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs + attr->hopbyhop,
+ num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
attr->headlen, dseg);
if (unlikely(num_dma < 0))
goto err_drop;
@@ -1014,34 +983,14 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
eseg->mss = attr.mss;
if (attr.ihs) {
- if (unlikely(attr.hopbyhop)) {
- struct ipv6hdr *h6;
-
- /* remove the HBH header.
- * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
- */
- unsafe_memcpy(eseg->inline_hdr.start, skb->data,
- ETH_HLEN + sizeof(*h6),
- MLX5_UNSAFE_MEMCPY_DISCLAIMER);
- h6 = (struct ipv6hdr *)((char *)eseg->inline_hdr.start + ETH_HLEN);
- h6->nexthdr = IPPROTO_TCP;
- /* Copy the TCP header after the IPv6 one */
- unsafe_memcpy(h6 + 1,
- skb->data + ETH_HLEN + sizeof(*h6) +
- sizeof(struct hop_jumbo_hdr),
- tcp_hdrlen(skb),
- MLX5_UNSAFE_MEMCPY_DISCLAIMER);
- /* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
- } else {
- unsafe_memcpy(eseg->inline_hdr.start, skb->data,
- attr.ihs,
- MLX5_UNSAFE_MEMCPY_DISCLAIMER);
- }
+ unsafe_memcpy(eseg->inline_hdr.start, skb->data,
+ attr.ihs,
+ MLX5_UNSAFE_MEMCPY_DISCLAIMER);
eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
dseg += wqe_attr.ds_cnt_inl;
}
- num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs + attr.hopbyhop,
+ num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
attr.headlen, dseg);
if (unlikely(num_dma < 0))
goto err_drop;
--
2.50.1
* [PATCH net-next 06/17] net/mlx4: Remove jumbo_remove step from TX path
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (4 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 05/17] net/mlx5e: " Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 07/17] ice: " Maxim Mikityanskiy
` (11 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
the unnecessary steps from the mlx4 TX path that used to check for and
remove HBH.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
drivers/net/ethernet/mellanox/mlx4/en_tx.c | 42 +++++-----------------
1 file changed, 8 insertions(+), 34 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 87f35bcbeff8..c5d564e5a581 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -636,28 +636,20 @@ static int get_real_size(const struct sk_buff *skb,
struct net_device *dev,
int *lso_header_size,
bool *inline_ok,
- void **pfrag,
- int *hopbyhop)
+ void **pfrag)
{
struct mlx4_en_priv *priv = netdev_priv(dev);
int real_size;
if (shinfo->gso_size) {
*inline_ok = false;
- *hopbyhop = 0;
if (skb->encapsulation) {
*lso_header_size = skb_inner_tcp_all_headers(skb);
} else {
- /* Detects large IPV6 TCP packets and prepares for removal of
- * HBH header that has been pushed by ip6_xmit(),
- * mainly so that tcpdump can dissect them.
- */
- if (ipv6_has_hopopt_jumbo(skb))
- *hopbyhop = sizeof(struct hop_jumbo_hdr);
*lso_header_size = skb_tcp_all_headers(skb);
}
real_size = CTRL_SIZE + shinfo->nr_frags * DS_SIZE +
- ALIGN(*lso_header_size - *hopbyhop + 4, DS_SIZE);
+ ALIGN(*lso_header_size + 4, DS_SIZE);
if (unlikely(*lso_header_size != skb_headlen(skb))) {
/* We add a segment for the skb linear buffer only if
* it contains data */
@@ -884,7 +876,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
int desc_size;
int real_size;
u32 index, bf_index;
- struct ipv6hdr *h6;
__be32 op_own;
int lso_header_size;
void *fragptr = NULL;
@@ -893,7 +884,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
bool stop_queue;
bool inline_ok;
u8 data_offset;
- int hopbyhop;
bool bf_ok;
tx_ind = skb_get_queue_mapping(skb);
@@ -903,7 +893,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
goto tx_drop;
real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
- &inline_ok, &fragptr, &hopbyhop);
+ &inline_ok, &fragptr);
if (unlikely(!real_size))
goto tx_drop_count;
@@ -956,7 +946,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
data = &tx_desc->data;
data_offset = offsetof(struct mlx4_en_tx_desc, data);
} else {
- int lso_align = ALIGN(lso_header_size - hopbyhop + 4, DS_SIZE);
+ int lso_align = ALIGN(lso_header_size + 4, DS_SIZE);
data = (void *)&tx_desc->lso + lso_align;
data_offset = offsetof(struct mlx4_en_tx_desc, lso) + lso_align;
@@ -1021,31 +1011,15 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
((ring->prod & ring->size) ?
cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
- lso_header_size -= hopbyhop;
/* Fill in the LSO prefix */
tx_desc->lso.mss_hdr_size = cpu_to_be32(
shinfo->gso_size << 16 | lso_header_size);
+ /* Copy headers;
+ * note that we already verified that it is linear
+ */
+ memcpy(tx_desc->lso.header, skb->data, lso_header_size);
- if (unlikely(hopbyhop)) {
- /* remove the HBH header.
- * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
- */
- memcpy(tx_desc->lso.header, skb->data, ETH_HLEN + sizeof(*h6));
- h6 = (struct ipv6hdr *)((char *)tx_desc->lso.header + ETH_HLEN);
- h6->nexthdr = IPPROTO_TCP;
- /* Copy the TCP header after the IPv6 one */
- memcpy(h6 + 1,
- skb->data + ETH_HLEN + sizeof(*h6) +
- sizeof(struct hop_jumbo_hdr),
- tcp_hdrlen(skb));
- /* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
- } else {
- /* Copy headers;
- * note that we already verified that it is linear
- */
- memcpy(tx_desc->lso.header, skb->data, lso_header_size);
- }
ring->tso_packets++;
i = shinfo->gso_segs;
--
2.50.1
* [PATCH net-next 07/17] ice: Remove jumbo_remove step from TX path
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (5 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 06/17] net/mlx4: " Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 08/17] bnxt_en: " Maxim Mikityanskiy
` (10 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
the unnecessary steps from the ice TX path that used to check for and
remove HBH.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
drivers/net/ethernet/intel/ice/ice_txrx.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 17cda5e2f6a4..5db84ef36fd6 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -2427,9 +2427,6 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring)
ice_trace(xmit_frame_ring, tx_ring, skb);
- if (unlikely(ipv6_hopopt_jumbo_remove(skb)))
- goto out_drop;
-
count = ice_xmit_desc_count(skb);
if (ice_chk_linearize(skb, count)) {
if (__skb_linearize(skb))
--
2.50.1
* [PATCH net-next 08/17] bnxt_en: Remove jumbo_remove step from TX path
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (6 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 07/17] ice: " Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 09/17] gve: " Maxim Mikityanskiy
` (9 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
the unnecessary steps from the bnxt_en TX path that used to check for
and remove HBH.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 21 ---------------------
1 file changed, 21 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d59612d1e176..b3c31282b002 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -517,9 +517,6 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
return NETDEV_TX_BUSY;
}
- if (unlikely(ipv6_hopopt_jumbo_remove(skb)))
- goto tx_free;
-
length = skb->len;
len = skb_headlen(skb);
last_frag = skb_shinfo(skb)->nr_frags;
@@ -13792,7 +13789,6 @@ static bool bnxt_exthdr_check(struct bnxt *bp, struct sk_buff *skb, int nw_off,
u8 **nextp)
{
struct ipv6hdr *ip6h = (struct ipv6hdr *)(skb->data + nw_off);
- struct hop_jumbo_hdr *jhdr;
int hdr_count = 0;
u8 *nexthdr;
int start;
@@ -13821,24 +13817,7 @@ static bool bnxt_exthdr_check(struct bnxt *bp, struct sk_buff *skb, int nw_off,
if (hdrlen > 64)
return false;
- /* The ext header may be a hop-by-hop header inserted for
- * big TCP purposes. This will be removed before sending
- * from NIC, so do not count it.
- */
- if (*nexthdr == NEXTHDR_HOP) {
- if (likely(skb->len <= GRO_LEGACY_MAX_SIZE))
- goto increment_hdr;
-
- jhdr = (struct hop_jumbo_hdr *)hp;
- if (jhdr->tlv_type != IPV6_TLV_JUMBO || jhdr->hdrlen != 0 ||
- jhdr->nexthdr != IPPROTO_TCP)
- goto increment_hdr;
-
- goto next_hdr;
- }
-increment_hdr:
hdr_count++;
-next_hdr:
nexthdr = &hp->nexthdr;
start += hdrlen;
}
--
2.50.1
* [PATCH net-next 09/17] gve: Remove jumbo_remove step from TX path
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (7 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 08/17] bnxt_en: " Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 10/17] net: mana: " Maxim Mikityanskiy
` (8 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
the unnecessary steps from the gve TX path that used to check for and
remove HBH.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_tx_dqo.c b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
index 6f1d515673d2..984a918433f1 100644
--- a/drivers/net/ethernet/google/gve/gve_tx_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
@@ -963,9 +963,6 @@ static int gve_try_tx_skb(struct gve_priv *priv, struct gve_tx_ring *tx,
int num_buffer_descs;
int total_num_descs;
- if (skb_is_gso(skb) && unlikely(ipv6_hopopt_jumbo_remove(skb)))
- goto drop;
-
if (tx->dqo.qpl) {
/* We do not need to verify the number of buffers used per
* packet or per segment in case of TSO as with 2K size buffers
--
2.50.1
* [PATCH net-next 10/17] net: mana: Remove jumbo_remove step from TX path
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (8 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 09/17] gve: " Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 11/17] net/ipv6: Remove HBH helpers Maxim Mikityanskiy
` (7 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
the unnecessary steps from the mana TX path that used to check for and
remove HBH.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 0142fd98392c..5a4eb2bfedff 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -281,9 +281,6 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
if (skb_cow_head(skb, MANA_HEADROOM))
goto tx_drop_count;
- if (unlikely(ipv6_hopopt_jumbo_remove(skb)))
- goto tx_drop_count;
-
txq = &apc->tx_qp[txq_idx].txq;
gdma_sq = txq->gdma_sq;
cq = &apc->tx_qp[txq_idx].tx_cq;
--
2.50.1
* [PATCH net-next 11/17] net/ipv6: Remove HBH helpers
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (9 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 10/17] net: mana: " Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 12/17] net: Enable BIG TCP with partial GSO Maxim Mikityanskiy
` (6 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Now that the HBH jumbo helpers are not used by any driver or GSO, remove
them altogether.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
include/net/ipv6.h | 77 ----------------------------------------------
1 file changed, 77 deletions(-)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 38b332f3028e..da42a5e5216f 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -149,17 +149,6 @@ struct frag_hdr {
__be32 identification;
};
-/*
- * Jumbo payload option, as described in RFC 2675 2.
- */
-struct hop_jumbo_hdr {
- u8 nexthdr;
- u8 hdrlen;
- u8 tlv_type; /* IPV6_TLV_JUMBO, 0xC2 */
- u8 tlv_len; /* 4 */
- __be32 jumbo_payload_len;
-};
-
#define IP6_MF 0x0001
#define IP6_OFFSET 0xFFF8
@@ -462,72 +451,6 @@ bool ipv6_opt_accepted(const struct sock *sk, const struct sk_buff *skb,
struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
struct ipv6_txoptions *opt);
-/* This helper is specialized for BIG TCP needs.
- * It assumes the hop_jumbo_hdr will immediately follow the IPV6 header.
- * It assumes headers are already in skb->head.
- * Returns: 0, or IPPROTO_TCP if a BIG TCP packet is there.
- */
-static inline int ipv6_has_hopopt_jumbo(const struct sk_buff *skb)
-{
- const struct hop_jumbo_hdr *jhdr;
- const struct ipv6hdr *nhdr;
-
- if (likely(skb->len <= GRO_LEGACY_MAX_SIZE))
- return 0;
-
- if (skb->protocol != htons(ETH_P_IPV6))
- return 0;
-
- if (skb_network_offset(skb) +
- sizeof(struct ipv6hdr) +
- sizeof(struct hop_jumbo_hdr) > skb_headlen(skb))
- return 0;
-
- nhdr = ipv6_hdr(skb);
-
- if (nhdr->nexthdr != NEXTHDR_HOP)
- return 0;
-
- jhdr = (const struct hop_jumbo_hdr *) (nhdr + 1);
- if (jhdr->tlv_type != IPV6_TLV_JUMBO || jhdr->hdrlen != 0 ||
- jhdr->nexthdr != IPPROTO_TCP)
- return 0;
- return jhdr->nexthdr;
-}
-
-/* Return 0 if HBH header is successfully removed
- * Or if HBH removal is unnecessary (packet is not big TCP)
- * Return error to indicate dropping the packet
- */
-static inline int ipv6_hopopt_jumbo_remove(struct sk_buff *skb)
-{
- const int hophdr_len = sizeof(struct hop_jumbo_hdr);
- int nexthdr = ipv6_has_hopopt_jumbo(skb);
- struct ipv6hdr *h6;
-
- if (!nexthdr)
- return 0;
-
- if (skb_cow_head(skb, 0))
- return -1;
-
- /* Remove the HBH header.
- * Layout: [Ethernet header][IPv6 header][HBH][L4 Header]
- */
- memmove(skb_mac_header(skb) + hophdr_len, skb_mac_header(skb),
- skb_network_header(skb) - skb_mac_header(skb) +
- sizeof(struct ipv6hdr));
-
- __skb_pull(skb, hophdr_len);
- skb->network_header += hophdr_len;
- skb->mac_header += hophdr_len;
-
- h6 = ipv6_hdr(skb);
- h6->nexthdr = nexthdr;
-
- return 0;
-}
-
static inline bool ipv6_accept_ra(const struct inet6_dev *idev)
{
s32 accept_ra = READ_ONCE(idev->cnf.accept_ra);
--
2.50.1
* [PATCH net-next 12/17] net: Enable BIG TCP with partial GSO
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (10 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 11/17] net/ipv6: Remove HBH helpers Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 13/17] udp: Support gro_ipv4_max_size > 65536 Maxim Mikityanskiy
` (5 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
skb_segment is called for partial GSO, when netif_needs_gso returns true
in validate_xmit_skb. Partial GSO is needed, for example, when
segmentation of tunneled traffic is offloaded to a NIC that only
supports inner checksum offload.
Currently, skb_segment clamps the segment length to 65534 bytes, because
gso_size == 65535 is the special value GSO_BY_FRAGS, and we don't want
to accidentally assign mss = 65535, as it would fall into the
GSO_BY_FRAGS check further down in the function.
This implementation, however, artificially blocks len > 65534, which is
possible since the introduction of BIG TCP. To allow bigger lengths and
avoid resegmentation of BIG TCP packets, record the gso_by_frags flag at
the beginning and stop using a special value of mss for this purpose
after mss has been modified.
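A standalone sketch of the sentinel collision that caching the flag
avoids (values chosen for illustration; not kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    #define GSO_BY_FRAGS 65535  /* same sentinel value as the kernel's */

    int main(void)
    {
            unsigned int mss = 13107;           /* example gso_size */
            unsigned int len = 5 * mss;         /* 65535: a legal length */
            bool gso_by_frags = (mss == GSO_BY_FRAGS); /* cached: false */
            unsigned int partial_segs = len / mss;

            if (partial_segs > 1)
                    mss *= partial_segs;        /* mss is now 65535 */

            /* Comparing mss against the sentinel now gives a false
             * positive; the cached flag does not.
             */
            printf("mss == GSO_BY_FRAGS: %d, cached flag: %d\n",
                   mss == GSO_BY_FRAGS, (int)gso_by_frags);
            return 0;
    }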
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
net/core/skbuff.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d331e607edfb..2ebacf5fa09a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4699,6 +4699,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
struct sk_buff *tail = NULL;
struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
unsigned int mss = skb_shinfo(head_skb)->gso_size;
+ bool gso_by_frags = mss == GSO_BY_FRAGS;
unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
unsigned int offset = doffset;
unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
@@ -4714,7 +4715,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
int nfrags, pos;
if ((skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY) &&
- mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb)) {
+ !gso_by_frags && mss != skb_headlen(head_skb)) {
struct sk_buff *check_skb;
for (check_skb = list_skb; check_skb; check_skb = check_skb->next) {
@@ -4742,7 +4743,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
sg = !!(features & NETIF_F_SG);
csum = !!can_checksum_protocol(features, proto);
- if (sg && csum && (mss != GSO_BY_FRAGS)) {
+ if (sg && csum && !gso_by_frags) {
if (!(features & NETIF_F_GSO_PARTIAL)) {
struct sk_buff *iter;
unsigned int frag_len;
@@ -4776,9 +4777,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
/* GSO partial only requires that we trim off any excess that
* doesn't fit into an MSS sized block, so take care of that
* now.
- * Cap len to not accidentally hit GSO_BY_FRAGS.
*/
- partial_segs = min(len, GSO_BY_FRAGS - 1) / mss;
+ partial_segs = len / mss;
if (partial_segs > 1)
mss *= partial_segs;
else
@@ -4802,7 +4802,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
int hsize;
int size;
- if (unlikely(mss == GSO_BY_FRAGS)) {
+ if (unlikely(gso_by_frags)) {
len = list_skb->len;
} else {
len = head_skb->len - offset;
--
2.50.1
* [PATCH net-next 13/17] udp: Support gro_ipv4_max_size > 65536
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (11 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 12/17] net: Enable BIG TCP with partial GSO Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-25 14:15 ` Paolo Abeni
2025-09-23 13:47 ` [PATCH net-next 14/17] udp: Validate UDP length in udp_gro_receive Maxim Mikityanskiy
` (4 subsequent siblings)
17 siblings, 1 reply; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
Currently, gro_max_size and gro_ipv4_max_size can be set to values
bigger than 65536, and GRO will happily aggregate UDP to the configured
size (for example, with TCP traffic in VXLAN tunnels). However,
udp_gro_complete uses the 16-bit length field in the UDP header to store
the length of the aggregated packet, which leads to packet truncation
later in __udp4_lib_rcv.
Fix this by storing 0 in the UDP length field and by restoring the real
length from skb->len in __udp4_lib_rcv.
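A minimal userspace sketch of the truncation (the UDP length field is
only 16 bits wide):

    #include <stdint.h>
    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void)
    {
            uint32_t skb_len = 100000;        /* aggregated GRO length */
            uint16_t uh_len = htons(skb_len); /* silently truncated */

            printf("stored: %u, real: %u\n", ntohs(uh_len), skb_len);
            /* prints "stored: 34464, real: 100000" */
            return 0;
    }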
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
net/ipv4/udp.c | 5 ++++-
net/ipv4/udp_offload.c | 7 +++++--
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0c40426628eb..0ac03f5596ac 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2643,7 +2643,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
{
struct sock *sk = NULL;
struct udphdr *uh;
- unsigned short ulen;
+ unsigned int ulen;
struct rtable *rt = skb_rtable(skb);
__be32 saddr, daddr;
struct net *net = dev_net(skb->dev);
@@ -2667,6 +2667,9 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
goto short_packet;
if (proto == IPPROTO_UDP) {
+ if (!ulen)
+ ulen = skb->len;
+
/* UDP validates ulen. */
if (ulen < sizeof(*uh) || pskb_trim_rcsum(skb, ulen))
goto short_packet;
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index b1f3fd302e9d..1e7ed7718d7b 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -924,12 +924,15 @@ static int udp_gro_complete_segment(struct sk_buff *skb)
int udp_gro_complete(struct sk_buff *skb, int nhoff,
udp_lookup_t lookup)
{
- __be16 newlen = htons(skb->len - nhoff);
+ unsigned int newlen = skb->len - nhoff;
struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);
struct sock *sk;
int err;
- uh->len = newlen;
+ if (newlen <= GRO_LEGACY_MAX_SIZE)
+ uh->len = htons(newlen);
+ else
+ uh->len = 0;
sk = INDIRECT_CALL_INET(lookup, udp6_lib_lookup_skb,
udp4_lib_lookup_skb, skb, uh->source, uh->dest);
--
2.50.1
* [PATCH net-next 14/17] udp: Validate UDP length in udp_gro_receive
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (12 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 13/17] udp: Support gro_ipv4_max_size > 65536 Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-25 14:10 ` Paolo Abeni
2025-09-23 13:47 ` [PATCH net-next 15/17] udp: Set length in UDP header to 0 for big GSO packets Maxim Mikityanskiy
` (3 subsequent siblings)
17 siblings, 1 reply; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
In the previous commit we started using uh->len = 0 as a marker of a GRO
packet bigger than 65536 bytes. To prevent abuse by maliciously crafted
packets, check the length in the UDP header in udp_gro_receive. Note
that a similar check is present in udp_gro_receive_segment, but not in
the UDP socket gro_receive flow.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
net/ipv4/udp_offload.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 1e7ed7718d7b..fd86f76fda2c 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -788,6 +788,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct sk_buff *p;
struct udphdr *uh2;
unsigned int off = skb_gro_offset(skb);
+ unsigned int ulen;
int flush = 1;
/* We can do L4 aggregation only if the packet can't land in a tunnel
@@ -820,6 +821,10 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
!NAPI_GRO_CB(skb)->csum_valid))
goto out;
+ ulen = ntohs(uh->len);
+ if (ulen <= sizeof(*uh) || ulen != skb_gro_len(skb))
+ goto out;
+
/* mark that this skb passed once through the tunnel gro layer */
NAPI_GRO_CB(skb)->encap_mark = 1;
--
2.50.1
* [PATCH net-next 15/17] udp: Set length in UDP header to 0 for big GSO packets
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (13 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 14/17] udp: Validate UDP length in udp_gro_receive Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-25 14:18 ` Paolo Abeni
2025-09-23 13:47 ` [PATCH net-next 16/17] vxlan: Enable BIG TCP packets Maxim Mikityanskiy
` (2 subsequent siblings)
17 siblings, 1 reply; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
From: Maxim Mikityanskiy <maxim@isovalent.com>
skb->len may be bigger than 65535 in UDP-based tunnels that have BIG TCP
enabled. For GSO packets that large, set the length in the UDP header to
0, so that tcpdump can print such packets properly (treating them as
RFC 2675 jumbograms). Later in the pipeline, __udp_gso_segment will set
uh->len to the size of the individual segmented packets.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
---
net/ipv4/udp_tunnel_core.c | 2 +-
net/ipv6/ip6_udp_tunnel.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/udp_tunnel_core.c b/net/ipv4/udp_tunnel_core.c
index 54386e06a813..98faddb7b4bf 100644
--- a/net/ipv4/udp_tunnel_core.c
+++ b/net/ipv4/udp_tunnel_core.c
@@ -184,7 +184,7 @@ void udp_tunnel_xmit_skb(struct rtable *rt, struct sock *sk, struct sk_buff *skb
uh->dest = dst_port;
uh->source = src_port;
- uh->len = htons(skb->len);
+ uh->len = skb->len <= GRO_LEGACY_MAX_SIZE ? htons(skb->len) : 0;
memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index 0ff547a4bff7..0fb85f490f8c 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -93,7 +93,7 @@ void udp_tunnel6_xmit_skb(struct dst_entry *dst, struct sock *sk,
uh->dest = dst_port;
uh->source = src_port;
- uh->len = htons(skb->len);
+ uh->len = skb->len <= GRO_LEGACY_MAX_SIZE ? htons(skb->len) : 0;
skb_dst_set(skb, dst);
--
2.50.1
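For illustration, a dissector that sees uh->len == 0 can recover the
datagram length from the IP layer, in the spirit of RFC 2675; a
userspace sketch (not tcpdump's actual code):

#include <stdint.h>
#include <arpa/inet.h>		/* ntohs */
#include <netinet/udp.h>	/* struct udphdr */

static uint32_t udp_print_len(uint32_t ip_payload_len,
			      const struct udphdr *uh)
{
	uint16_t len = ntohs(uh->len);

	/* 0 means the datagram is longer than 16 bits can express;
	 * fall back to the length derived from the IP header.
	 */
	return len ? len : ip_payload_len;
}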
* [PATCH net-next 16/17] vxlan: Enable BIG TCP packets
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (14 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 15/17] udp: Set length in UDP header to 0 for big GSO packets Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-23 13:47 ` [PATCH net-next 17/17] geneve: " Maxim Mikityanskiy
2025-09-25 14:26 ` [PATCH net-next 00/17] BIG TCP for UDP tunnels Paolo Abeni
17 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
From: Maxim Mikityanskiy <maxim@isovalent.com>
In Cilium we support BIG TCP, but so far it has only been enabled for
direct routing use-cases. A lot of users rely on Cilium with
vxlan/geneve tunneling though, and the underlying kernel infra for
tunneling has not supported BIG TCP up to this point.
Now that it does, bump tso_max_size for vxlan netdevs up to GSO_MAX_SIZE
to allow the admin to use BIG TCP with vxlan tunnels.
BIG TCP on vxlan disabled:
Standard MTU:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    30.00    34440.00
8k MTU:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262144  32768  32768    30.00    55684.26
BIG TCP on vxlan enabled:
Standard MTU:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    30.00    39564.78
8k MTU:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262144  32768  32768    30.00    61466.47
When tunnel offloads are not enabled/exposed and we need to rely fully
on SW-based segmentation on transmit (e.g. in the case of Azure), the
more aggressive batching also has a visible effect. The example below
was run on the same setup as the benchmarks above, but with HW support
disabled:
# ethtool -k enp10s0f0np0 | grep udp
tx-udp_tnl-segmentation: off
tx-udp_tnl-csum-segmentation: off
tx-udp-segmentation: off
rx-udp_tunnel-port-offload: off
rx-udp-gro-forwarding: off
Before:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    60.00    21820.82
After:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    60.00    29390.78
Example receive side:
swapper 0 [002] 4712.645070: net:netif_receive_skb: dev=enp10s0f0np0 skbaddr=0xffff8f3b086e0200 len=129542
ffffffff8cfe3aaa __netif_receive_skb_core.constprop.0+0x6ca ([kernel.kallsyms])
ffffffff8cfe47dd __netif_receive_skb_list_core+0xed ([kernel.kallsyms])
ffffffff8cfe4e52 netif_receive_skb_list_internal+0x1d2 ([kernel.kallsyms])
ffffffff8d0210d8 gro_complete.constprop.0+0x108 ([kernel.kallsyms])
ffffffff8d021724 dev_gro_receive+0x4e4 ([kernel.kallsyms])
ffffffff8d021a99 gro_receive_skb+0x89 ([kernel.kallsyms])
ffffffffc06edb71 mlx5e_handle_rx_cqe_mpwrq+0x131 ([kernel.kallsyms])
ffffffffc06ee38a mlx5e_poll_rx_cq+0x9a ([kernel.kallsyms])
ffffffffc06ef2c7 mlx5e_napi_poll+0x107 ([kernel.kallsyms])
ffffffff8cfe586d __napi_poll+0x2d ([kernel.kallsyms])
ffffffff8cfe5f8d net_rx_action+0x20d ([kernel.kallsyms])
ffffffff8c35d252 handle_softirqs+0xe2 ([kernel.kallsyms])
ffffffff8c35d556 __irq_exit_rcu+0xd6 ([kernel.kallsyms])
ffffffff8c35d81e irq_exit_rcu+0xe ([kernel.kallsyms])
ffffffff8d2602b8 common_interrupt+0x98 ([kernel.kallsyms])
ffffffff8c000da7 asm_common_interrupt+0x27 ([kernel.kallsyms])
ffffffff8d2645c5 cpuidle_enter_state+0xd5 ([kernel.kallsyms])
ffffffff8cf6358e cpuidle_enter+0x2e ([kernel.kallsyms])
ffffffff8c3ba932 call_cpuidle+0x22 ([kernel.kallsyms])
ffffffff8c3bfb5e do_idle+0x1ce ([kernel.kallsyms])
ffffffff8c3bfd79 cpu_startup_entry+0x29 ([kernel.kallsyms])
ffffffff8c30a6c2 start_secondary+0x112 ([kernel.kallsyms])
ffffffff8c2c142d common_startup_64+0x13e ([kernel.kallsyms])
Example transmit side:
swapper 0 [005] 4768.021375: net:net_dev_xmit: dev=enp10s0f0np0 skbaddr=0xffff8af32ebe1200 len=129556 rc=0
ffffffffa75e19c3 dev_hard_start_xmit+0x173 ([kernel.kallsyms])
ffffffffa7653823 sch_direct_xmit+0x143 ([kernel.kallsyms])
ffffffffa75e2780 __dev_queue_xmit+0xc70 ([kernel.kallsyms])
ffffffffa76a1205 ip_finish_output2+0x265 ([kernel.kallsyms])
ffffffffa76a1577 __ip_finish_output+0x87 ([kernel.kallsyms])
ffffffffa76a165b ip_finish_output+0x2b ([kernel.kallsyms])
ffffffffa76a179e ip_output+0x5e ([kernel.kallsyms])
ffffffffa76a19d5 ip_local_out+0x35 ([kernel.kallsyms])
ffffffffa770d0e5 iptunnel_xmit+0x185 ([kernel.kallsyms])
ffffffffc179634e nf_nat_used_tuple_new.cold+0x1129 ([kernel.kallsyms])
ffffffffc17a7301 vxlan_xmit_one+0xc21 ([kernel.kallsyms])
ffffffffc17a80a2 vxlan_xmit+0x4a2 ([kernel.kallsyms])
ffffffffa75e18af dev_hard_start_xmit+0x5f ([kernel.kallsyms])
ffffffffa75e1d3f __dev_queue_xmit+0x22f ([kernel.kallsyms])
ffffffffa76a1205 ip_finish_output2+0x265 ([kernel.kallsyms])
ffffffffa76a1577 __ip_finish_output+0x87 ([kernel.kallsyms])
ffffffffa76a165b ip_finish_output+0x2b ([kernel.kallsyms])
ffffffffa76a179e ip_output+0x5e ([kernel.kallsyms])
ffffffffa76a1de2 __ip_queue_xmit+0x1b2 ([kernel.kallsyms])
ffffffffa76a2135 ip_queue_xmit+0x15 ([kernel.kallsyms])
ffffffffa76c70a2 __tcp_transmit_skb+0x522 ([kernel.kallsyms])
ffffffffa76c931a tcp_write_xmit+0x65a ([kernel.kallsyms])
ffffffffa76cb42e tcp_tsq_write+0x5e ([kernel.kallsyms])
ffffffffa76cb7ef tcp_tasklet_func+0x10f ([kernel.kallsyms])
ffffffffa695d9f7 tasklet_action_common+0x107 ([kernel.kallsyms])
ffffffffa695db99 tasklet_action+0x29 ([kernel.kallsyms])
ffffffffa695d252 handle_softirqs+0xe2 ([kernel.kallsyms])
ffffffffa695d556 __irq_exit_rcu+0xd6 ([kernel.kallsyms])
ffffffffa695d81e irq_exit_rcu+0xe ([kernel.kallsyms])
ffffffffa78602b8 common_interrupt+0x98 ([kernel.kallsyms])
ffffffffa6600da7 asm_common_interrupt+0x27 ([kernel.kallsyms])
ffffffffa78645c5 cpuidle_enter_state+0xd5 ([kernel.kallsyms])
ffffffffa756358e cpuidle_enter+0x2e ([kernel.kallsyms])
ffffffffa69ba932 call_cpuidle+0x22 ([kernel.kallsyms])
ffffffffa69bfb5e do_idle+0x1ce ([kernel.kallsyms])
ffffffffa69bfd79 cpu_startup_entry+0x29 ([kernel.kallsyms])
ffffffffa690a6c2 start_secondary+0x112 ([kernel.kallsyms])
ffffffffa68c142d common_startup_64+0x13e ([kernel.kallsyms])
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
---
drivers/net/vxlan/vxlan_core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index a5c55e7e4d79..a443adde8848 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -3341,6 +3341,8 @@ static void vxlan_setup(struct net_device *dev)
dev->hw_features |= NETIF_F_RXCSUM;
dev->hw_features |= NETIF_F_GSO_SOFTWARE;
netif_keep_dst(dev);
+ netif_set_tso_max_size(dev, GSO_MAX_SIZE);
+
dev->priv_flags |= IFF_NO_QUEUE;
dev->change_proto_down = true;
dev->lltx = true;
--
2.50.1
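For reference, BIG TCP remains opt-in per device even with this change;
on a setup like the one benchmarked above, enabling it would look
roughly like the following (device name and sizes are illustrative):

# ip link set dev vxlan0 gso_max_size 131072 gro_max_size 131072
# ip link set dev vxlan0 gso_ipv4_max_size 131072 gro_ipv4_max_size 131072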
* [PATCH net-next 17/17] geneve: Enable BIG TCP packets
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (15 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 16/17] vxlan: Enable BIG TCP packets Maxim Mikityanskiy
@ 2025-09-23 13:47 ` Maxim Mikityanskiy
2025-09-25 14:24 ` Paolo Abeni
2025-09-25 14:26 ` [PATCH net-next 00/17] BIG TCP for UDP tunnels Paolo Abeni
17 siblings, 1 reply; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-23 13:47 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
From: Maxim Mikityanskiy <maxim@isovalent.com>
From: Daniel Borkmann <daniel@iogearbox.net>
In Cilium we support BIG TCP, but so far it has only been enabled for
direct routing use-cases. A lot of users rely on Cilium with
vxlan/geneve tunneling though, and the underlying kernel infra for
tunneling has not supported BIG TCP up to this point.
Now that it does, bump tso_max_size for geneve netdevs up to GSO_MAX_SIZE
to allow the admin to use BIG TCP with geneve tunnels.
BIG TCP on geneve disabled:
Standard MTU:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    30.00    37391.34
8k MTU:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262144  32768  32768    60.00    58030.19
BIG TCP on geneve enabled:
Standard MTU:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    30.00    40891.57
8k MTU:
# netperf -H 10.1.0.2 -t TCP_STREAM -l60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.0.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262144  32768  32768    60.00    61458.39
Example receive side:
swapper 0 [008] 3682.509996: net:netif_receive_skb: dev=geneve0 skbaddr=0xffff8f3b0a781800 len=129492
ffffffff8cfe3aaa __netif_receive_skb_core.constprop.0+0x6ca ([kernel.kallsyms])
ffffffff8cfe47dd __netif_receive_skb_list_core+0xed ([kernel.kallsyms])
ffffffff8cfe4e52 netif_receive_skb_list_internal+0x1d2 ([kernel.kallsyms])
ffffffff8cfe573c napi_complete_done+0x7c ([kernel.kallsyms])
ffffffff8d046c23 gro_cell_poll+0x83 ([kernel.kallsyms])
ffffffff8cfe586d __napi_poll+0x2d ([kernel.kallsyms])
ffffffff8cfe5f8d net_rx_action+0x20d ([kernel.kallsyms])
ffffffff8c35d252 handle_softirqs+0xe2 ([kernel.kallsyms])
ffffffff8c35d556 __irq_exit_rcu+0xd6 ([kernel.kallsyms])
ffffffff8c35d81e irq_exit_rcu+0xe ([kernel.kallsyms])
ffffffff8d2602b8 common_interrupt+0x98 ([kernel.kallsyms])
ffffffff8c000da7 asm_common_interrupt+0x27 ([kernel.kallsyms])
ffffffff8d2645c5 cpuidle_enter_state+0xd5 ([kernel.kallsyms])
ffffffff8cf6358e cpuidle_enter+0x2e ([kernel.kallsyms])
ffffffff8c3ba932 call_cpuidle+0x22 ([kernel.kallsyms])
ffffffff8c3bfb5e do_idle+0x1ce ([kernel.kallsyms])
ffffffff8c3bfd79 cpu_startup_entry+0x29 ([kernel.kallsyms])
ffffffff8c30a6c2 start_secondary+0x112 ([kernel.kallsyms])
ffffffff8c2c142d common_startup_64+0x13e ([kernel.kallsyms])
Example transmit side:
swapper 0 [002] 3403.688687: net:net_dev_xmit: dev=enp10s0f0np0 skbaddr=0xffff8af31d104ae8 len=129556 rc=0
ffffffffa75e19c3 dev_hard_start_xmit+0x173 ([kernel.kallsyms])
ffffffffa7653823 sch_direct_xmit+0x143 ([kernel.kallsyms])
ffffffffa75e2780 __dev_queue_xmit+0xc70 ([kernel.kallsyms])
ffffffffa76a1205 ip_finish_output2+0x265 ([kernel.kallsyms])
ffffffffa76a1577 __ip_finish_output+0x87 ([kernel.kallsyms])
ffffffffa76a165b ip_finish_output+0x2b ([kernel.kallsyms])
ffffffffa76a179e ip_output+0x5e ([kernel.kallsyms])
ffffffffa76a19d5 ip_local_out+0x35 ([kernel.kallsyms])
ffffffffa770d0e5 iptunnel_xmit+0x185 ([kernel.kallsyms])
ffffffffc179634e nf_nat_used_tuple_new.cold+0x1129 ([kernel.kallsyms])
ffffffffc179d3e0 geneve_xmit+0x920 ([kernel.kallsyms])
ffffffffa75e18af dev_hard_start_xmit+0x5f ([kernel.kallsyms])
ffffffffa75e1d3f __dev_queue_xmit+0x22f ([kernel.kallsyms])
ffffffffa76a1205 ip_finish_output2+0x265 ([kernel.kallsyms])
ffffffffa76a1577 __ip_finish_output+0x87 ([kernel.kallsyms])
ffffffffa76a165b ip_finish_output+0x2b ([kernel.kallsyms])
ffffffffa76a179e ip_output+0x5e ([kernel.kallsyms])
ffffffffa76a1de2 __ip_queue_xmit+0x1b2 ([kernel.kallsyms])
ffffffffa76a2135 ip_queue_xmit+0x15 ([kernel.kallsyms])
ffffffffa76c70a2 __tcp_transmit_skb+0x522 ([kernel.kallsyms])
ffffffffa76c931a tcp_write_xmit+0x65a ([kernel.kallsyms])
ffffffffa76ca3b9 __tcp_push_pending_frames+0x39 ([kernel.kallsyms])
ffffffffa76c1fb6 tcp_rcv_established+0x276 ([kernel.kallsyms])
ffffffffa76d3957 tcp_v4_do_rcv+0x157 ([kernel.kallsyms])
ffffffffa76d6053 tcp_v4_rcv+0x1243 ([kernel.kallsyms])
ffffffffa769b8ea ip_protocol_deliver_rcu+0x2a ([kernel.kallsyms])
ffffffffa769bab7 ip_local_deliver_finish+0x77 ([kernel.kallsyms])
ffffffffa769bb4d ip_local_deliver+0x6d ([kernel.kallsyms])
ffffffffa769abe7 ip_sublist_rcv_finish+0x37 ([kernel.kallsyms])
ffffffffa769b713 ip_sublist_rcv+0x173 ([kernel.kallsyms])
ffffffffa769bde2 ip_list_rcv+0x102 ([kernel.kallsyms])
ffffffffa75e4868 __netif_receive_skb_list_core+0x178 ([kernel.kallsyms])
ffffffffa75e4e52 netif_receive_skb_list_internal+0x1d2 ([kernel.kallsyms])
ffffffffa75e573c napi_complete_done+0x7c ([kernel.kallsyms])
ffffffffa7646c23 gro_cell_poll+0x83 ([kernel.kallsyms])
ffffffffa75e586d __napi_poll+0x2d ([kernel.kallsyms])
ffffffffa75e5f8d net_rx_action+0x20d ([kernel.kallsyms])
ffffffffa695d252 handle_softirqs+0xe2 ([kernel.kallsyms])
ffffffffa695d556 __irq_exit_rcu+0xd6 ([kernel.kallsyms])
ffffffffa695d81e irq_exit_rcu+0xe ([kernel.kallsyms])
ffffffffa78602b8 common_interrupt+0x98 ([kernel.kallsyms])
ffffffffa6600da7 asm_common_interrupt+0x27 ([kernel.kallsyms])
ffffffffa78645c5 cpuidle_enter_state+0xd5 ([kernel.kallsyms])
ffffffffa756358e cpuidle_enter+0x2e ([kernel.kallsyms])
ffffffffa69ba932 call_cpuidle+0x22 ([kernel.kallsyms])
ffffffffa69bfb5e do_idle+0x1ce ([kernel.kallsyms])
ffffffffa69bfd79 cpu_startup_entry+0x29 ([kernel.kallsyms])
ffffffffa690a6c2 start_secondary+0x112 ([kernel.kallsyms])
ffffffffa68c142d common_startup_64+0x13e ([kernel.kallsyms])
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
---
drivers/net/geneve.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 77b0c3d52041..374798abed7c 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1225,6 +1225,8 @@ static void geneve_setup(struct net_device *dev)
dev->max_mtu = IP_MAX_MTU - GENEVE_BASE_HLEN - dev->hard_header_len;
netif_keep_dst(dev);
+ netif_set_tso_max_size(dev, GSO_MAX_SIZE);
+
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE;
dev->lltx = true;
--
2.50.1
* Re: [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers
2025-09-23 13:47 ` [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers Maxim Mikityanskiy
@ 2025-09-25 13:51 ` Paolo Abeni
2025-09-25 14:01 ` Maxim Mikityanskiy
2025-09-25 18:23 ` Stanislav Fomichev
1 sibling, 1 reply; 27+ messages in thread
From: Paolo Abeni @ 2025-09-25 13:51 UTC (permalink / raw)
To: Maxim Mikityanskiy, Daniel Borkmann, David S. Miller,
Eric Dumazet, Jakub Kicinski, Willem de Bruijn, David Ahern,
Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
On 9/23/25 3:47 PM, Maxim Mikityanskiy wrote:
> From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> From: Maxim Mikityanskiy <maxim@isovalent.com>
Only a single 'From:' tag is needed. This applies to all the patches in
this series.
/P
* Re: [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers
2025-09-25 13:51 ` Paolo Abeni
@ 2025-09-25 14:01 ` Maxim Mikityanskiy
0 siblings, 0 replies; 27+ messages in thread
From: Maxim Mikityanskiy @ 2025-09-25 14:01 UTC (permalink / raw)
To: Paolo Abeni
Cc: Maxim Mikityanskiy, Daniel Borkmann, David S. Miller,
Eric Dumazet, Jakub Kicinski, Willem de Bruijn, David Ahern,
Nikolay Aleksandrov, netdev, tcpdump-workers, Guy Harris,
Michael Richardson, Denis Ovsienko, Xin Long
On Thu, 25 Sept 2025 at 16:51, Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 9/23/25 3:47 PM, Maxim Mikityanskiy wrote:
> > From: Maxim Mikityanskiy <maxim@isovalent.com>
> >
> > From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> Only a single 'From:' tag is needed. This applies to all the patches in
> this series.
ACK, sorry for that =/
I had situations where git send-email didn't add the From tag at all,
so I started using git format-patch --force-in-body-from, and it worked
well for a while, until this time, when both commands added the tag.
I'll make sure to test it next time before sending, if I need to
resubmit.
> /P
>
* Re: [PATCH net-next 14/17] udp: Validate UDP length in udp_gro_receive
2025-09-23 13:47 ` [PATCH net-next 14/17] udp: Validate UDP length in udp_gro_receive Maxim Mikityanskiy
@ 2025-09-25 14:10 ` Paolo Abeni
0 siblings, 0 replies; 27+ messages in thread
From: Paolo Abeni @ 2025-09-25 14:10 UTC (permalink / raw)
To: Maxim Mikityanskiy, Daniel Borkmann, David S. Miller,
Eric Dumazet, Jakub Kicinski, Willem de Bruijn, David Ahern,
Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
On 9/23/25 3:47 PM, Maxim Mikityanskiy wrote:
> From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> In the previous commit we started using uh->len = 0 as a marker of a GRO
> packet bigger than 65536 bytes. To prevent abuse by maliciously crafted
> packets, check the length in the UDP header in udp_gro_receive. Note
> that a similar check is present in udp_gro_receive_segment, but not in
> the UDP socket gro_receive flow.
>
> Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
> ---
> net/ipv4/udp_offload.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 1e7ed7718d7b..fd86f76fda2c 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -788,6 +788,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
> struct sk_buff *p;
> struct udphdr *uh2;
> unsigned int off = skb_gro_offset(skb);
> + unsigned int ulen;
> int flush = 1;
>
> /* We can do L4 aggregation only if the packet can't land in a tunnel
> @@ -820,6 +821,10 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
> !NAPI_GRO_CB(skb)->csum_valid))
> goto out;
>
> + ulen = ntohs(uh->len);
> + if (ulen <= sizeof(*uh) || ulen != skb_gro_len(skb))
> + goto out;
Possibly consolidate both checks into a single, earlier one?
/P
* Re: [PATCH net-next 13/17] udp: Support gro_ipv4_max_size > 65536
2025-09-23 13:47 ` [PATCH net-next 13/17] udp: Support gro_ipv4_max_size > 65536 Maxim Mikityanskiy
@ 2025-09-25 14:15 ` Paolo Abeni
2025-09-25 14:22 ` Paolo Abeni
0 siblings, 1 reply; 27+ messages in thread
From: Paolo Abeni @ 2025-09-25 14:15 UTC (permalink / raw)
To: Maxim Mikityanskiy, Daniel Borkmann, David S. Miller,
Eric Dumazet, Jakub Kicinski, Willem de Bruijn, David Ahern,
Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
On 9/23/25 3:47 PM, Maxim Mikityanskiy wrote:
> From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> Currently, gro_max_size and gro_ipv4_max_size can be set to values
> bigger than 65536, and GRO will happily aggregate UDP to the configured
> size (for example, with TCP traffic in VXLAN tunnels). However,
> udp_gro_complete uses the 16-bit length field in the UDP header to store
> the length of the aggregated packet. It leads to the packet truncation
> later in __udp4_lib_rcv.
>
> Fix this by storing 0 to the UDP length field and by restoring the real
> length from skb->len in __udp4_lib_rcv.
>
> Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
If I read correctly, after this patch plain UDP GRO can start
aggregating packets up to a total len above 64K.
Potentially every point in the RX/TX path can unexpectedly process UDP
GSO packets with uh->len == 0 and skb->len > 64K, which sounds
potentially dangerous. How about adding a helper to access the UDP len
and using it tree-wide (except possibly the H/W NIC RX path)?
You could pinpoint all the relevant locations by changing the udphdr
len field in a local build of yours and looking for allmod build
breakage.
/P
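One possible shape for such an accessor, sketching the suggestion above
(name and placement are hypothetical):

static inline unsigned int udp_skb_len(const struct sk_buff *skb)
{
	unsigned int len = ntohs(udp_hdr(skb)->len);

	/* 0 is the above-64K marker: recover the real length from the
	 * skb instead of trusting the 16-bit header field.
	 */
	return len ? len : skb->len - skb_transport_offset(skb);
}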
* Re: [PATCH net-next 15/17] udp: Set length in UDP header to 0 for big GSO packets
2025-09-23 13:47 ` [PATCH net-next 15/17] udp: Set length in UDP header to 0 for big GSO packets Maxim Mikityanskiy
@ 2025-09-25 14:18 ` Paolo Abeni
0 siblings, 0 replies; 27+ messages in thread
From: Paolo Abeni @ 2025-09-25 14:18 UTC (permalink / raw)
To: Maxim Mikityanskiy, Daniel Borkmann, David S. Miller,
Eric Dumazet, Jakub Kicinski, Willem de Bruijn, David Ahern,
Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
On 9/23/25 3:47 PM, Maxim Mikityanskiy wrote:
> diff --git a/net/ipv4/udp_tunnel_core.c b/net/ipv4/udp_tunnel_core.c
> index 54386e06a813..98faddb7b4bf 100644
> --- a/net/ipv4/udp_tunnel_core.c
> +++ b/net/ipv4/udp_tunnel_core.c
> @@ -184,7 +184,7 @@ void udp_tunnel_xmit_skb(struct rtable *rt, struct sock *sk, struct sk_buff *skb
>
> uh->dest = dst_port;
> uh->source = src_port;
> - uh->len = htons(skb->len);
> + uh->len = skb->len <= GRO_LEGACY_MAX_SIZE ? htons(skb->len) : 0;
You could introduce a 'set' helper, and use it here and in patch 13/17
/P
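A sketch of what such a 'set' helper could look like (the name is
hypothetical); it would replace the open-coded ternary here, in
udp_tunnel6_xmit_skb and in udp_gro_complete:

static inline void udp_hdr_set_len(struct udphdr *uh, unsigned int len)
{
	/* Lengths above GRO_LEGACY_MAX_SIZE don't fit in 16 bits;
	 * encode them as 0, the jumbo marker used by this series.
	 */
	uh->len = len <= GRO_LEGACY_MAX_SIZE ? htons(len) : 0;
}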
* Re: [PATCH net-next 13/17] udp: Support gro_ipv4_max_size > 65536
2025-09-25 14:15 ` Paolo Abeni
@ 2025-09-25 14:22 ` Paolo Abeni
0 siblings, 0 replies; 27+ messages in thread
From: Paolo Abeni @ 2025-09-25 14:22 UTC (permalink / raw)
To: Maxim Mikityanskiy, Daniel Borkmann, David S. Miller,
Eric Dumazet, Jakub Kicinski, Willem de Bruijn, David Ahern,
Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
On 9/25/25 4:15 PM, Paolo Abeni wrote:
> On 9/23/25 3:47 PM, Maxim Mikityanskiy wrote:
>> From: Maxim Mikityanskiy <maxim@isovalent.com>
>>
>> From: Maxim Mikityanskiy <maxim@isovalent.com>
>>
>> Currently, gro_max_size and gro_ipv4_max_size can be set to values
>> bigger than 65536, and GRO will happily aggregate UDP to the configured
>> size (for example, with TCP traffic in VXLAN tunnels). However,
>> udp_gro_complete uses the 16-bit length field in the UDP header to store
>> the length of the aggregated packet. It leads to the packet truncation
>> later in __udp4_lib_rcv.
>>
>> Fix this by storing 0 to the UDP length field and by restoring the real
>> length from skb->len in __udp4_lib_rcv.
>>
>> Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
>
> If I read correctly, after this patch plain UDP GRO can start
> aggregating packets up to a total len above 64K.
Re-reading the patch, I now think the above is not true.
But geneve/vxlan will do that before the end of the series, so the
following should still stand:
> Potentially every point in the RX/TX path can unexpectedly process UDP
> GSO packets with uh->len == 0 and skb->len > 64K, which sounds
> potentially dangerous. How about adding a helper to access the UDP len
> and using it tree-wide (except possibly the H/W NIC RX path)?
>
> You could pinpoint all the relevant locations by changing the udphdr
> len field in a local build of yours and looking for allmod build
> breakage.
* Re: [PATCH net-next 17/17] geneve: Enable BIG TCP packets
2025-09-23 13:47 ` [PATCH net-next 17/17] geneve: " Maxim Mikityanskiy
@ 2025-09-25 14:24 ` Paolo Abeni
0 siblings, 0 replies; 27+ messages in thread
From: Paolo Abeni @ 2025-09-25 14:24 UTC (permalink / raw)
To: Maxim Mikityanskiy, Daniel Borkmann, David S. Miller,
Eric Dumazet, Jakub Kicinski, Willem de Bruijn, David Ahern,
Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
On 9/23/25 3:47 PM, Maxim Mikityanskiy wrote:
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 77b0c3d52041..374798abed7c 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1225,6 +1225,8 @@ static void geneve_setup(struct net_device *dev)
> dev->max_mtu = IP_MAX_MTU - GENEVE_BASE_HLEN - dev->hard_header_len;
>
> netif_keep_dst(dev);
> + netif_set_tso_max_size(dev, GSO_MAX_SIZE);
> +
> dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE;
> dev->lltx = true;
I think it would be nice to extend the big_tcp.sh selftests (or gro.sh,
whichever is easier) to cover this code path for both geneve and vxlan.
Thanks,
Paolo
* Re: [PATCH net-next 00/17] BIG TCP for UDP tunnels
2025-09-23 13:47 [PATCH net-next 00/17] BIG TCP for UDP tunnels Maxim Mikityanskiy
` (16 preceding siblings ...)
2025-09-23 13:47 ` [PATCH net-next 17/17] geneve: " Maxim Mikityanskiy
@ 2025-09-25 14:26 ` Paolo Abeni
17 siblings, 0 replies; 27+ messages in thread
From: Paolo Abeni @ 2025-09-25 14:26 UTC (permalink / raw)
To: Maxim Mikityanskiy, Daniel Borkmann, David S. Miller,
Eric Dumazet, Jakub Kicinski, Willem de Bruijn, David Ahern,
Nikolay Aleksandrov
Cc: netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
On 9/23/25 3:47 PM, Maxim Mikityanskiy wrote:
> From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> This series consists adds support for BIG TCP IPv4/IPv6 workloads for vxlan
> and geneve. It consists of two parts:
>
> 01-11: Remove hop-by-hop header for BIG TCP IPv6 to align with BIG TCP IPv4
> 12-17: Fix up things that prevent BIG TCP from working with tunnels.
What about splitting the series in 2, so that both chunks are below the
formal 15 patches limit and you can more easily add test-cases?
> There are a few places that make assumptions about skb->len being
> smaller than 64k and/or that store it in 16-bit fields, trimming the
> length. The first step to enable BIG TCP with VXLAN and GENEVE tunnels
> is to patch those places to handle bigger lengths properly (patches
> 12-17). This is enough to make IPv4 in IPv4 work with BIG TCP, but when
> either the outer or the inner protocol is IPv6, the current BIG TCP code
> inserts a hop-by-hop extension header that stores the actual 32-bit
> length of the packet. This additional hop-by-hop header turns out to be
> problematic for encapsulated cases, because:
>
> 1. The drivers don't strip it, and they'd all need to know the structure
> of each tunnel protocol in order to strip it correctly.
>
> 2. Even if (1) is implemented, it would be an additional performance
> penalty per aggregated packet.
>
> 3. The skb_gso_validate_network_len check is skipped in
> ip6_finish_output_gso when IP6SKB_FAKEJUMBO is set, but it seems that it
> would make sense to do the actual validation, just taking into account
> the length of the HBH header. When the support for tunnels is added, it
> becomes trickier, because there may be one or two HBH headers, depending
> on whether it's IPv6 in IPv6 or not.
>
> At the same time, having an HBH header to store the 32-bit length is not
> strictly necessary, as BIG TCP IPv4 doesn't do anything like this and
> just restores the length from skb->len. The same thing can be done for
> BIG TCP IPv6 (patches 01-11). Removing HBH from BIG TCP would allow to
> simplify the implementation significantly, and align it with BIG TCP IPv4.
>
> A trivial tcpdump PR for IPv6 is pending here [0]. While the tcpdump
> commiters seem actively contributing code to the repository, it
> appears community PRs are stuck for a long time (?). We checked
> with Xin Long with regards to BIG TCP IPv4, and it turned out only
> GUESS_TSO was added to the Fedora distro spec file CFLAGS definition
> back then. In any case we have Cc'ed Guy Harris et al (tcpdump maintainer/
> committer) here just in case to see if he could help out with unblocking [0].
@tcpdump crew: any feedback on the mentioned PR would be very
appreciated, thanks!
Paolo
* Re: [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers
2025-09-23 13:47 ` [PATCH net-next 01/17] net/ipv6: Introduce payload_len helpers Maxim Mikityanskiy
2025-09-25 13:51 ` Paolo Abeni
@ 2025-09-25 18:23 ` Stanislav Fomichev
1 sibling, 0 replies; 27+ messages in thread
From: Stanislav Fomichev @ 2025-09-25 18:23 UTC (permalink / raw)
To: Maxim Mikityanskiy
Cc: Daniel Borkmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Willem de Bruijn, David Ahern, Nikolay Aleksandrov,
netdev, tcpdump-workers, Guy Harris, Michael Richardson,
Denis Ovsienko, Xin Long, Maxim Mikityanskiy
On 09/23, Maxim Mikityanskiy wrote:
> From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> From: Maxim Mikityanskiy <maxim@isovalent.com>
>
> The next commits will transition away from using the hop-by-hop
> extension header to encode packet length for BIG TCP. Add wrappers
> around ip6->payload_len that return the actual value if it's non-zero,
> and calculate it from skb->len if payload_len is set to zero (and a
> symmetrical setter).
>
> The new helpers are used wherever the surrounding code supports the
> hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code
> uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h).
>
> No behavioral change in this commit.
>
> Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
> ---
> include/linux/ipv6.h | 20 ++++++++++++++++++++
> include/net/ipv6.h | 2 --
> include/net/netfilter/nf_tables_ipv6.h | 4 ++--
> net/bridge/br_netfilter_ipv6.c | 2 +-
> net/bridge/netfilter/nf_conntrack_bridge.c | 4 ++--
> net/ipv6/ip6_input.c | 2 +-
> net/ipv6/ip6_offload.c | 7 +++----
> net/ipv6/output_core.c | 7 +------
> net/netfilter/ipvs/ip_vs_xmit.c | 2 +-
> net/netfilter/nf_conntrack_ovs.c | 2 +-
> net/netfilter/nf_log_syslog.c | 2 +-
> net/sched/sch_cake.c | 2 +-
> 12 files changed, 34 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index 43b7bb828738..44c4b791eceb 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -126,6 +126,26 @@ static inline unsigned int ipv6_transport_len(const struct sk_buff *skb)
> skb_network_header_len(skb);
> }
>
> +static inline unsigned int ipv6_payload_len(const struct sk_buff *skb, const struct ipv6hdr *ip6)
> +{
> + u32 len = ntohs(ip6->payload_len);
> +
> + return (len || !skb_is_gso(skb) || !skb_is_gso_tcp(skb)) ?
> + len : skb->len - skb_network_offset(skb) - sizeof(struct ipv6hdr);
Any reason not to return skb->len - skb_network_offset(skb) - sizeof(struct ipv6hdr)
here unconditionally? Will it not work in some cases?