* [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default @ 2016-02-19 19:26 Alexander Duyck 2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck ` (3 more replies) 0 siblings, 4 replies; 47+ messages in thread From: Alexander Duyck @ 2016-02-19 19:26 UTC (permalink / raw) To: netdev, davem, alexander.duyck This patch series makes it so that we enable the outer Tx checksum for IPv4 tunnels by default. This makes the behavior consistent with how we were handling this for IPv6. In addition I have updated the internal flags for these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should match up will with the ZERO_CSUM6_TX flag which was already in use for IPv6. For most network devices this should be a net gain in terms of performance as having the outer header checksum present allows for devices to report CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order to determine if the inner header checksum is valid. Below is some data I collected with ixgbe with an X540 that demonstrates this. I located two PFs connected back to back in two different name spaces and then setup a pair of tunnels on each, one with checksum enabled and one without. Recv Send Send Utilization Socket Socket Message Elapsed Send Size Size Size Time Throughput local bytes bytes bytes secs. 10^6bits/s % S noudpcsum: 87380 16384 16384 30.00 8898.67 12.80 udpcsum: 87380 16384 16384 30.00 9088.47 5.69 The one spot where this may cause a performance regression is if the environment contains devices that can parse the inner headers and a device supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In the case of such a device we have to fall back to using GSO to segment the tunnel instead of TSO and as a result we may take a performance hit as seen below with i40e. Recv Send Send Utilization Socket Socket Message Elapsed Send Size Size Size Time Throughput local bytes bytes bytes secs. 10^6bits/s % S noudpcsum: 87380 16384 16384 30.00 9085.21 3.32 udpcsum: 87380 16384 16384 30.00 9089.23 5.54 In addition it will be necessary to update iproute2 so that we don't provide the checksum attribute unless specified. This way on older kernels which don't have local checksum offload we will default to disabling the outer checksum, and on newer kernels that have LCO we can default to enabling it. I also haven't investigated the effect this will have on OVS. However I suspect the impact should be minimal as the worst case scenario should be that Tx checksumming will become enabled by default which should be consistent with the existing behavior for IPv6. --- Alexander Duyck (2): GENEVE: Support outer IPv4 Tx checksums by default VXLAN: Support outer IPv4 Tx checksums by default drivers/net/geneve.c | 16 ++++++++-------- drivers/net/vxlan.c | 19 +++++++++---------- include/net/vxlan.h | 2 +- 3 files changed, 18 insertions(+), 19 deletions(-) ^ permalink raw reply [flat|nested] 47+ messages in thread
* [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums by default 2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck @ 2016-02-19 19:26 ` Alexander Duyck 2016-02-19 20:28 ` Tom Herbert 2016-02-19 19:26 ` [net-next PATCH 2/2] VXLAN: " Alexander Duyck ` (2 subsequent siblings) 3 siblings, 1 reply; 47+ messages in thread From: Alexander Duyck @ 2016-02-19 19:26 UTC (permalink / raw) To: netdev, davem, alexander.duyck This change makes it so that if UDP CSUM is not specified we will default to enabling it. The main motivation behind this is the fact that with the use of outer checksum we can greatly improve the performance for GENEVE tunnels on hardware that doesn't know how to parse them. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> --- drivers/net/geneve.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 1b5deaf7e3c8..2acaf2b209cd 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -76,7 +76,7 @@ struct geneve_dev { }; /* Geneve device flags */ -#define GENEVE_F_UDP_CSUM BIT(0) +#define GENEVE_F_UDP_ZERO_CSUM_TX BIT(0) #define GENEVE_F_UDP_ZERO_CSUM6_TX BIT(1) #define GENEVE_F_UDP_ZERO_CSUM6_RX BIT(2) @@ -703,7 +703,7 @@ static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb, struct genevehdr *gnvh; int min_headroom; int err; - bool udp_sum = !!(flags & GENEVE_F_UDP_CSUM); + bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM_TX); skb_scrub_packet(skb, xnet); @@ -944,9 +944,9 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, opts = ip_tunnel_info_opts(info); if (key->tun_flags & TUNNEL_CSUM) - flags |= GENEVE_F_UDP_CSUM; + flags &= ~GENEVE_F_UDP_ZERO_CSUM_TX; else - flags &= ~GENEVE_F_UDP_CSUM; + flags |= GENEVE_F_UDP_ZERO_CSUM_TX; err = geneve_build_skb(rt, skb, key->tun_flags, vni, info->options_len, opts, flags, xnet); @@ -972,7 +972,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, fl4.saddr, fl4.daddr, tos, ttl, df, sport, geneve->dst_port, !net_eq(geneve->net, dev_net(geneve->dev)), - !(flags & GENEVE_F_UDP_CSUM)); + !!(flags & GENEVE_F_UDP_ZERO_CSUM_TX)); return NETDEV_TX_OK; @@ -1412,8 +1412,8 @@ static int geneve_newlink(struct net *net, struct net_device *dev, metadata = true; if (data[IFLA_GENEVE_UDP_CSUM] && - nla_get_u8(data[IFLA_GENEVE_UDP_CSUM])) - flags |= GENEVE_F_UDP_CSUM; + !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM])) + flags |= GENEVE_F_UDP_ZERO_CSUM_TX; if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] && nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX])) @@ -1483,7 +1483,7 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev) } if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM, - !!(geneve->flags & GENEVE_F_UDP_CSUM)) || + !(geneve->flags & GENEVE_F_UDP_ZERO_CSUM_TX)) || nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX, !!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_TX)) || nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX, ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums by default 2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck @ 2016-02-19 20:28 ` Tom Herbert 0 siblings, 0 replies; 47+ messages in thread From: Tom Herbert @ 2016-02-19 20:28 UTC (permalink / raw) To: Alexander Duyck Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: > This change makes it so that if UDP CSUM is not specified we will default > to enabling it. The main motivation behind this is the fact that with the > use of outer checksum we can greatly improve the performance for GENEVE > tunnels on hardware that doesn't know how to parse them. > > Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> > --- > drivers/net/geneve.c | 16 ++++++++-------- > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c > index 1b5deaf7e3c8..2acaf2b209cd 100644 > --- a/drivers/net/geneve.c > +++ b/drivers/net/geneve.c > @@ -76,7 +76,7 @@ struct geneve_dev { > }; > > /* Geneve device flags */ > -#define GENEVE_F_UDP_CSUM BIT(0) > +#define GENEVE_F_UDP_ZERO_CSUM_TX BIT(0) > #define GENEVE_F_UDP_ZERO_CSUM6_TX BIT(1) > #define GENEVE_F_UDP_ZERO_CSUM6_RX BIT(2) > > @@ -703,7 +703,7 @@ static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb, > struct genevehdr *gnvh; > int min_headroom; > int err; > - bool udp_sum = !!(flags & GENEVE_F_UDP_CSUM); > + bool udp_sum = !(flags & GENEVE_F_UDP_ZERO_CSUM_TX); > > skb_scrub_packet(skb, xnet); > > @@ -944,9 +944,9 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, > opts = ip_tunnel_info_opts(info); > > if (key->tun_flags & TUNNEL_CSUM) > - flags |= GENEVE_F_UDP_CSUM; > + flags &= ~GENEVE_F_UDP_ZERO_CSUM_TX; > else > - flags &= ~GENEVE_F_UDP_CSUM; > + flags |= GENEVE_F_UDP_ZERO_CSUM_TX; > > err = geneve_build_skb(rt, skb, key->tun_flags, vni, > info->options_len, opts, flags, xnet); > @@ -972,7 +972,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, > udp_tunnel_xmit_skb(rt, gs4->sock->sk, skb, fl4.saddr, fl4.daddr, > tos, ttl, df, sport, geneve->dst_port, > !net_eq(geneve->net, dev_net(geneve->dev)), > - !(flags & GENEVE_F_UDP_CSUM)); > + !!(flags & GENEVE_F_UDP_ZERO_CSUM_TX)); > > return NETDEV_TX_OK; > > @@ -1412,8 +1412,8 @@ static int geneve_newlink(struct net *net, struct net_device *dev, > metadata = true; > > if (data[IFLA_GENEVE_UDP_CSUM] && > - nla_get_u8(data[IFLA_GENEVE_UDP_CSUM])) > - flags |= GENEVE_F_UDP_CSUM; > + !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM])) > + flags |= GENEVE_F_UDP_ZERO_CSUM_TX; > > if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] && > nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX])) > @@ -1483,7 +1483,7 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev) > } > > if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM, > - !!(geneve->flags & GENEVE_F_UDP_CSUM)) || > + !(geneve->flags & GENEVE_F_UDP_ZERO_CSUM_TX)) || > nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX, > !!(geneve->flags & GENEVE_F_UDP_ZERO_CSUM6_TX)) || > nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX, > ^ permalink raw reply [flat|nested] 47+ messages in thread
* [net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default 2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck 2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck @ 2016-02-19 19:26 ` Alexander Duyck 2016-02-19 20:27 ` Tom Herbert 2016-02-19 21:53 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum " Jesse Gross 2016-02-22 3:06 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default David Miller 3 siblings, 1 reply; 47+ messages in thread From: Alexander Duyck @ 2016-02-19 19:26 UTC (permalink / raw) To: netdev, davem, alexander.duyck This change makes it so that if UDP CSUM is not specified we will default to enabling it. The main motivation behind this is the fact that with the use of outer checksum we can greatly improve the performance for VXLAN tunnels on devices that don't know how to parse tunnel headers. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> --- drivers/net/vxlan.c | 19 +++++++++---------- include/net/vxlan.h | 2 +- 2 files changed, 10 insertions(+), 11 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 766e6114a37f..909f7931c297 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -1957,13 +1957,6 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, goto drop; sk = vxlan->vn4_sock->sock->sk; - if (info) { - if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT) - df = htons(IP_DF); - } else { - udp_sum = !!(flags & VXLAN_F_UDP_CSUM); - } - rt = vxlan_get_route(vxlan, skb, rdst ? rdst->remote_ifindex : 0, tos, dst->sin.sin_addr.s_addr, &saddr, @@ -1997,6 +1990,11 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, return; } + if (!info) + udp_sum = !(flags & VXLAN_F_UDP_ZERO_CSUM_TX); + else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT) + df = htons(IP_DF); + tos = ip_tunnel_ecn_encap(tos, old_iph, skb); ttl = ttl ? : ip4_dst_hoplimit(&rt->dst); err = vxlan_build_skb(skb, &rt->dst, sizeof(struct iphdr), @@ -2920,8 +2918,9 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev, if (data[IFLA_VXLAN_PORT]) conf.dst_port = nla_get_be16(data[IFLA_VXLAN_PORT]); - if (data[IFLA_VXLAN_UDP_CSUM] && nla_get_u8(data[IFLA_VXLAN_UDP_CSUM])) - conf.flags |= VXLAN_F_UDP_CSUM; + if (data[IFLA_VXLAN_UDP_CSUM] && + !nla_get_u8(data[IFLA_VXLAN_UDP_CSUM])) + conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX; if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] && nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX])) @@ -3065,7 +3064,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev) nla_put_u32(skb, IFLA_VXLAN_LIMIT, vxlan->cfg.addrmax) || nla_put_be16(skb, IFLA_VXLAN_PORT, vxlan->cfg.dst_port) || nla_put_u8(skb, IFLA_VXLAN_UDP_CSUM, - !!(vxlan->flags & VXLAN_F_UDP_CSUM)) || + !(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM_TX)) || nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_TX, !!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_TX)) || nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, diff --git a/include/net/vxlan.h b/include/net/vxlan.h index 748083de367a..6eda4ed4d78b 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -197,7 +197,7 @@ struct vxlan_dev { #define VXLAN_F_L2MISS 0x08 #define VXLAN_F_L3MISS 0x10 #define VXLAN_F_IPV6 0x20 -#define VXLAN_F_UDP_CSUM 0x40 +#define VXLAN_F_UDP_ZERO_CSUM_TX 0x40 #define VXLAN_F_UDP_ZERO_CSUM6_TX 0x80 #define VXLAN_F_UDP_ZERO_CSUM6_RX 0x100 #define VXLAN_F_REMCSUM_TX 0x200 ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default 2016-02-19 19:26 ` [net-next PATCH 2/2] VXLAN: " Alexander Duyck @ 2016-02-19 20:27 ` Tom Herbert 2016-02-19 21:36 ` Jesse Gross 0 siblings, 1 reply; 47+ messages in thread From: Tom Herbert @ 2016-02-19 20:27 UTC (permalink / raw) To: Alexander Duyck Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: > This change makes it so that if UDP CSUM is not specified we will default > to enabling it. The main motivation behind this is the fact that with the > use of outer checksum we can greatly improve the performance for VXLAN > tunnels on devices that don't know how to parse tunnel headers. > > Signed-off-by: Alexander Duyck <aduyck@mirantis.com> > --- > drivers/net/vxlan.c | 19 +++++++++---------- > include/net/vxlan.h | 2 +- > 2 files changed, 10 insertions(+), 11 deletions(-) > > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c > index 766e6114a37f..909f7931c297 100644 > --- a/drivers/net/vxlan.c > +++ b/drivers/net/vxlan.c > @@ -1957,13 +1957,6 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, > goto drop; > sk = vxlan->vn4_sock->sock->sk; > > - if (info) { > - if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT) > - df = htons(IP_DF); > - } else { > - udp_sum = !!(flags & VXLAN_F_UDP_CSUM); > - } > - > rt = vxlan_get_route(vxlan, skb, > rdst ? rdst->remote_ifindex : 0, tos, > dst->sin.sin_addr.s_addr, &saddr, > @@ -1997,6 +1990,11 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, > return; > } > > + if (!info) > + udp_sum = !(flags & VXLAN_F_UDP_ZERO_CSUM_TX); > + else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT) > + df = htons(IP_DF); > + > tos = ip_tunnel_ecn_encap(tos, old_iph, skb); > ttl = ttl ? : ip4_dst_hoplimit(&rt->dst); > err = vxlan_build_skb(skb, &rt->dst, sizeof(struct iphdr), > @@ -2920,8 +2918,9 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev, > if (data[IFLA_VXLAN_PORT]) > conf.dst_port = nla_get_be16(data[IFLA_VXLAN_PORT]); > > - if (data[IFLA_VXLAN_UDP_CSUM] && nla_get_u8(data[IFLA_VXLAN_UDP_CSUM])) > - conf.flags |= VXLAN_F_UDP_CSUM; > + if (data[IFLA_VXLAN_UDP_CSUM] && > + !nla_get_u8(data[IFLA_VXLAN_UDP_CSUM])) > + conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX; > > if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] && > nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX])) > @@ -3065,7 +3064,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev) > nla_put_u32(skb, IFLA_VXLAN_LIMIT, vxlan->cfg.addrmax) || > nla_put_be16(skb, IFLA_VXLAN_PORT, vxlan->cfg.dst_port) || > nla_put_u8(skb, IFLA_VXLAN_UDP_CSUM, > - !!(vxlan->flags & VXLAN_F_UDP_CSUM)) || > + !(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM_TX)) || > nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_TX, > !!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_TX)) || > nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, > diff --git a/include/net/vxlan.h b/include/net/vxlan.h > index 748083de367a..6eda4ed4d78b 100644 > --- a/include/net/vxlan.h > +++ b/include/net/vxlan.h > @@ -197,7 +197,7 @@ struct vxlan_dev { > #define VXLAN_F_L2MISS 0x08 > #define VXLAN_F_L3MISS 0x10 > #define VXLAN_F_IPV6 0x20 > -#define VXLAN_F_UDP_CSUM 0x40 > +#define VXLAN_F_UDP_ZERO_CSUM_TX 0x40 > #define VXLAN_F_UDP_ZERO_CSUM6_TX 0x80 > #define VXLAN_F_UDP_ZERO_CSUM6_RX 0x100 > #define VXLAN_F_REMCSUM_TX 0x200 > Acked-by: Tom Herbert <tom@herbertland.com> I would also note RFC7348 specifies: UDP Checksum: It SHOULD be transmitted as zero. ... The RFC doesn't provide any rationale as to why this is a SHOULD (neither is there any discussion as to whether this pertains to IPv6 which has stronger requirements for non-zero UDP checksum). I think there are two possibilities in the intent: 1) The authors assume that computing UDP checksums is a significant performance hit which is dis-proven by this patch 2) They are worried about devices that are unable to compute receive checksums, however this would be addressed by an allowance that devices can ignore non-zero UDP checksums for VXLAN ("When a decapsulating end point receives a packet with a non-zero checksum, it MAY choose to verify the checksum value.") . ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 2/2] VXLAN: Support outer IPv4 Tx checksums by default 2016-02-19 20:27 ` Tom Herbert @ 2016-02-19 21:36 ` Jesse Gross 0 siblings, 0 replies; 47+ messages in thread From: Jesse Gross @ 2016-02-19 21:36 UTC (permalink / raw) To: Tom Herbert Cc: Alexander Duyck, Linux Kernel Network Developers, David S. Miller, Alexander Duyck On Fri, Feb 19, 2016 at 12:27 PM, Tom Herbert <tom@herbertland.com> wrote: > I would also note RFC7348 specifies: > > UDP Checksum: It SHOULD be transmitted as zero. ... > > The RFC doesn't provide any rationale as to why this is a SHOULD > (neither is there any discussion as to whether this pertains to IPv6 > which has stronger requirements for non-zero UDP checksum). I think > there are two possibilities in the intent: 1) The authors assume that > computing UDP checksums is a significant performance hit which is > dis-proven by this patch 2) They are worried about devices that are > unable to compute receive checksums, however this would be addressed > by an allowance that devices can ignore non-zero UDP checksums for > VXLAN ("When a decapsulating end point receives a packet with a > non-zero checksum, it MAY choose to verify the checksum value.") It's #2. All of the performance concerns around checksums and tunneling stem from devices implemented using switching ASICs. In those devices, computing/verifying checksums is so slow (software path) that they are effectively unable to do it. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck 2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck 2016-02-19 19:26 ` [net-next PATCH 2/2] VXLAN: " Alexander Duyck @ 2016-02-19 21:53 ` Jesse Gross 2016-02-19 23:10 ` Alex Duyck 2016-02-22 3:06 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default David Miller 3 siblings, 1 reply; 47+ messages in thread From: Jesse Gross @ 2016-02-19 21:53 UTC (permalink / raw) To: Alexander Duyck Cc: Linux Kernel Network Developers, David Miller, Alexander Duyck On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: > This patch series makes it so that we enable the outer Tx checksum for IPv4 > tunnels by default. This makes the behavior consistent with how we were > handling this for IPv6. In addition I have updated the internal flags for > these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should > match up will with the ZERO_CSUM6_TX flag which was already in use for > IPv6. > > For most network devices this should be a net gain in terms of performance > as having the outer header checksum present allows for devices to report > CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order > to determine if the inner header checksum is valid. > > Below is some data I collected with ixgbe with an X540 that demonstrates > this. I located two PFs connected back to back in two different name > spaces and then setup a pair of tunnels on each, one with checksum enabled > and one without. > > Recv Send Send Utilization > Socket Socket Message Elapsed Send > Size Size Size Time Throughput local > bytes bytes bytes secs. 10^6bits/s % S > > noudpcsum: > 87380 16384 16384 30.00 8898.67 12.80 > udpcsum: > 87380 16384 16384 30.00 9088.47 5.69 > > The one spot where this may cause a performance regression is if the > environment contains devices that can parse the inner headers and a device > supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In > the case of such a device we have to fall back to using GSO to segment the > tunnel instead of TSO and as a result we may take a performance hit as seen > below with i40e. Do you have any numbers from 40G links? Obviously, at 10G the links are basically saturated and while I can see a difference in the utilization rate, I suspect that the change will be much more apparent at higher speeds. I'm concerned about the drop in performance for devices that currently support offloads (almost none of which expose NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that care most about tunnel performance are the ones that already have these NICs and will be the most impacted by the drop. My hope is that we can continue to use TSO on devices that only support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP length field may vary across segments. However, in practice this is the only on the final segment and only in cases where the total length is not a multiple of the MSS. If we could detect cases where those conditions are met, we could continue to use TSO with the UDP checksum field pre-populated. A possible step even further would be to break off the final segment into a separate packet to make things conform if necessary. This would avoid a performance regression and I think make this more palatable to a lot of people. > I also haven't investigated the effect this will have on OVS. However I > suspect the impact should be minimal as the worst case scenario should be > that Tx checksumming will become enabled by default which should be > consistent with the existing behavior for IPv6. I don't think that it should cause any problems. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-19 21:53 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum " Jesse Gross @ 2016-02-19 23:10 ` Alex Duyck 2016-02-20 0:08 ` Jesse Gross 0 siblings, 1 reply; 47+ messages in thread From: Alex Duyck @ 2016-02-19 23:10 UTC (permalink / raw) To: Jesse Gross Cc: Linux Kernel Network Developers, David Miller, Alexander Duyck On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote: > On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: >> This patch series makes it so that we enable the outer Tx checksum for IPv4 >> tunnels by default. This makes the behavior consistent with how we were >> handling this for IPv6. In addition I have updated the internal flags for >> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should >> match up will with the ZERO_CSUM6_TX flag which was already in use for >> IPv6. >> >> For most network devices this should be a net gain in terms of performance >> as having the outer header checksum present allows for devices to report >> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order >> to determine if the inner header checksum is valid. >> >> Below is some data I collected with ixgbe with an X540 that demonstrates >> this. I located two PFs connected back to back in two different name >> spaces and then setup a pair of tunnels on each, one with checksum enabled >> and one without. >> >> Recv Send Send Utilization >> Socket Socket Message Elapsed Send >> Size Size Size Time Throughput local >> bytes bytes bytes secs. 10^6bits/s % S >> >> noudpcsum: >> 87380 16384 16384 30.00 8898.67 12.80 >> udpcsum: >> 87380 16384 16384 30.00 9088.47 5.69 >> >> The one spot where this may cause a performance regression is if the >> environment contains devices that can parse the inner headers and a device >> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In >> the case of such a device we have to fall back to using GSO to segment the >> tunnel instead of TSO and as a result we may take a performance hit as seen >> below with i40e. > > Do you have any numbers from 40G links? Obviously, at 10G the links > are basically saturated and while I can see a difference in the > utilization rate, I suspect that the change will be much more apparent > at higher speeds. Unfortunately I don't have any true 40G links to test with. The closest I can get is to run PF to VF on an i40e. Running that I have seen the numbers go from about 20Gb/s to 15Gb/s with almost all the difference being related to the fact that we are having to allocate/free more skbs and make more trips through the i40e_lan_xmit_frame function resulting in more descriptors. > I'm concerned about the drop in performance for devices that currently > support offloads (almost none of which expose > NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that > care most about tunnel performance are the ones that already have > these NICs and will be the most impacted by the drop. The problem is being able to transmit fast is kind of pointless if the receiving end cannot handle it. We hadn't gotten around to really getting the Rx checksum bits working until the 3.18 kernel which I don't suspect many people are running so at this point messing with the TSO bits isn't really making much of a difference. Then on top of that most devices have certain limitations on how many ports they can handle and such. I know the i40e is supposed to support something like 10 port numbers, but the fm10k and ixgbe are limited to one port as I recall. So this whole thing is already really brittle as it is. My goal with this change is to make the behavior more consistent across the board. > My hope is that we can continue to use TSO on devices that only > support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP > length field may vary across segments. However, in practice this is > the only on the final segment and only in cases where the total length > is not a multiple of the MSS. If we could detect cases where those > conditions are met, we could continue to use TSO with the UDP checksum > field pre-populated. A possible step even further would be to break > off the final segment into a separate packet to make things conform if > necessary. This would avoid a performance regression and I think make > this more palatable to a lot of people. I think Tom and I had discussed this possibility a bit at netconf. The GSO logic is something I planned on looking at over the next several weeks as I suspect there is probably room for improvement there. >> I also haven't investigated the effect this will have on OVS. However I >> suspect the impact should be minimal as the worst case scenario should be >> that Tx checksumming will become enabled by default which should be >> consistent with the existing behavior for IPv6. > > I don't think that it should cause any problems. Good to hear. Do you know if OVS has some way to control the VXLAN configuration so that it could disable Tx checksums? If so that would probably be a good way to address the 40G issues assuming someone is running an environment hat had nothing but NICs that can support the TSO and Rx checksum on inner headers. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-19 23:10 ` Alex Duyck @ 2016-02-20 0:08 ` Jesse Gross 2016-02-20 0:14 ` Tom Herbert 0 siblings, 1 reply; 47+ messages in thread From: Jesse Gross @ 2016-02-20 0:08 UTC (permalink / raw) To: Alex Duyck; +Cc: Linux Kernel Network Developers, David Miller, Alexander Duyck On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote: > On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote: >> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: >>> This patch series makes it so that we enable the outer Tx checksum for IPv4 >>> tunnels by default. This makes the behavior consistent with how we were >>> handling this for IPv6. In addition I have updated the internal flags for >>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should >>> match up will with the ZERO_CSUM6_TX flag which was already in use for >>> IPv6. >>> >>> For most network devices this should be a net gain in terms of performance >>> as having the outer header checksum present allows for devices to report >>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order >>> to determine if the inner header checksum is valid. >>> >>> Below is some data I collected with ixgbe with an X540 that demonstrates >>> this. I located two PFs connected back to back in two different name >>> spaces and then setup a pair of tunnels on each, one with checksum enabled >>> and one without. >>> >>> Recv Send Send Utilization >>> Socket Socket Message Elapsed Send >>> Size Size Size Time Throughput local >>> bytes bytes bytes secs. 10^6bits/s % S >>> >>> noudpcsum: >>> 87380 16384 16384 30.00 8898.67 12.80 >>> udpcsum: >>> 87380 16384 16384 30.00 9088.47 5.69 >>> >>> The one spot where this may cause a performance regression is if the >>> environment contains devices that can parse the inner headers and a device >>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In >>> the case of such a device we have to fall back to using GSO to segment the >>> tunnel instead of TSO and as a result we may take a performance hit as seen >>> below with i40e. >> >> Do you have any numbers from 40G links? Obviously, at 10G the links >> are basically saturated and while I can see a difference in the >> utilization rate, I suspect that the change will be much more apparent >> at higher speeds. > > Unfortunately I don't have any true 40G links to test with. The > closest I can get is to run PF to VF on an i40e. Running that I have > seen the numbers go from about 20Gb/s to 15Gb/s with almost all the > difference being related to the fact that we are having to > allocate/free more skbs and make more trips through the > i40e_lan_xmit_frame function resulting in more descriptors. OK, I guess that is more or less in line with what I would expect off the top my head. There is a reasonably significant drop in the worst case. >> I'm concerned about the drop in performance for devices that currently >> support offloads (almost none of which expose >> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that >> care most about tunnel performance are the ones that already have >> these NICs and will be the most impacted by the drop. > > The problem is being able to transmit fast is kind of pointless if the > receiving end cannot handle it. We hadn't gotten around to really > getting the Rx checksum bits working until the 3.18 kernel which I > don't suspect many people are running so at this point messing with > the TSO bits isn't really making much of a difference. Then on top of > that most devices have certain limitations on how many ports they can > handle and such. I know the i40e is supposed to support something > like 10 port numbers, but the fm10k and ixgbe are limited to one port > as I recall. So this whole thing is already really brittle as it is. > My goal with this change is to make the behavior more consistent > across the board. That's true to some degree but there are certainly plenty of cases where TSO makes a difference - lower CPU usage, transmitting to multiple receivers, people will upgrade their kernels, etc. It's clearly good to make things more consistent but hopefully not by reducing existing performance. :) >> My hope is that we can continue to use TSO on devices that only >> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP >> length field may vary across segments. However, in practice this is >> the only on the final segment and only in cases where the total length >> is not a multiple of the MSS. If we could detect cases where those >> conditions are met, we could continue to use TSO with the UDP checksum >> field pre-populated. A possible step even further would be to break >> off the final segment into a separate packet to make things conform if >> necessary. This would avoid a performance regression and I think make >> this more palatable to a lot of people. > > I think Tom and I had discussed this possibility a bit at netconf. > The GSO logic is something I planned on looking at over the next > several weeks as I suspect there is probably room for improvement > there. That sounds great. >>> I also haven't investigated the effect this will have on OVS. However I >>> suspect the impact should be minimal as the worst case scenario should be >>> that Tx checksumming will become enabled by default which should be >>> consistent with the existing behavior for IPv6. >> >> I don't think that it should cause any problems. > > Good to hear. > > Do you know if OVS has some way to control the VXLAN configuration so > that it could disable Tx checksums? If so that would probably be a > good way to address the 40G issues assuming someone is running an > environment hat had nothing but NICs that can support the TSO and Rx > checksum on inner headers. Yes - OVS can control tx checksums on a per-endpoint basis (actually, rx checksum present requirements as well though it's not exposed to the user at the moment). If you had the information then you could optimize what to use in an environment of, say, hypervisors and hardware switches. However, it's certainly possible that you have a mixed set of NICs such as an encap aware NIC on the transmit side and non-aware on the receive side. In that case, both possible checksum settings penalize somebody: off (lose GRO on receiver), on (lose TSO on sender assuming no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's important to be able to use encap TSO with local checksum to avoid these bad tradeoffs, not to mention being cleaner. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-20 0:08 ` Jesse Gross @ 2016-02-20 0:14 ` Tom Herbert 2016-02-20 2:18 ` Jesse Gross 0 siblings, 1 reply; 47+ messages in thread From: Tom Herbert @ 2016-02-20 0:14 UTC (permalink / raw) To: Jesse Gross Cc: Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <jesse@kernel.org> wrote: > On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote: >> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote: >>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: >>>> This patch series makes it so that we enable the outer Tx checksum for IPv4 >>>> tunnels by default. This makes the behavior consistent with how we were >>>> handling this for IPv6. In addition I have updated the internal flags for >>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should >>>> match up will with the ZERO_CSUM6_TX flag which was already in use for >>>> IPv6. >>>> >>>> For most network devices this should be a net gain in terms of performance >>>> as having the outer header checksum present allows for devices to report >>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order >>>> to determine if the inner header checksum is valid. >>>> >>>> Below is some data I collected with ixgbe with an X540 that demonstrates >>>> this. I located two PFs connected back to back in two different name >>>> spaces and then setup a pair of tunnels on each, one with checksum enabled >>>> and one without. >>>> >>>> Recv Send Send Utilization >>>> Socket Socket Message Elapsed Send >>>> Size Size Size Time Throughput local >>>> bytes bytes bytes secs. 10^6bits/s % S >>>> >>>> noudpcsum: >>>> 87380 16384 16384 30.00 8898.67 12.80 >>>> udpcsum: >>>> 87380 16384 16384 30.00 9088.47 5.69 >>>> >>>> The one spot where this may cause a performance regression is if the >>>> environment contains devices that can parse the inner headers and a device >>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In >>>> the case of such a device we have to fall back to using GSO to segment the >>>> tunnel instead of TSO and as a result we may take a performance hit as seen >>>> below with i40e. >>> >>> Do you have any numbers from 40G links? Obviously, at 10G the links >>> are basically saturated and while I can see a difference in the >>> utilization rate, I suspect that the change will be much more apparent >>> at higher speeds. >> >> Unfortunately I don't have any true 40G links to test with. The >> closest I can get is to run PF to VF on an i40e. Running that I have >> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the >> difference being related to the fact that we are having to >> allocate/free more skbs and make more trips through the >> i40e_lan_xmit_frame function resulting in more descriptors. > > OK, I guess that is more or less in line with what I would expect off > the top my head. There is a reasonably significant drop in the worst > case. > >>> I'm concerned about the drop in performance for devices that currently >>> support offloads (almost none of which expose >>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that >>> care most about tunnel performance are the ones that already have >>> these NICs and will be the most impacted by the drop. >> >> The problem is being able to transmit fast is kind of pointless if the >> receiving end cannot handle it. We hadn't gotten around to really >> getting the Rx checksum bits working until the 3.18 kernel which I >> don't suspect many people are running so at this point messing with >> the TSO bits isn't really making much of a difference. Then on top of >> that most devices have certain limitations on how many ports they can >> handle and such. I know the i40e is supposed to support something >> like 10 port numbers, but the fm10k and ixgbe are limited to one port >> as I recall. So this whole thing is already really brittle as it is. >> My goal with this change is to make the behavior more consistent >> across the board. > > That's true to some degree but there are certainly plenty of cases > where TSO makes a difference - lower CPU usage, transmitting to > multiple receivers, people will upgrade their kernels, etc. It's > clearly good to make things more consistent but hopefully not by > reducing existing performance. :) > >>> My hope is that we can continue to use TSO on devices that only >>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP >>> length field may vary across segments. However, in practice this is >>> the only on the final segment and only in cases where the total length >>> is not a multiple of the MSS. If we could detect cases where those >>> conditions are met, we could continue to use TSO with the UDP checksum >>> field pre-populated. A possible step even further would be to break >>> off the final segment into a separate packet to make things conform if >>> necessary. This would avoid a performance regression and I think make >>> this more palatable to a lot of people. >> >> I think Tom and I had discussed this possibility a bit at netconf. >> The GSO logic is something I planned on looking at over the next >> several weeks as I suspect there is probably room for improvement >> there. > > That sounds great. > >>>> I also haven't investigated the effect this will have on OVS. However I >>>> suspect the impact should be minimal as the worst case scenario should be >>>> that Tx checksumming will become enabled by default which should be >>>> consistent with the existing behavior for IPv6. >>> >>> I don't think that it should cause any problems. >> >> Good to hear. >> >> Do you know if OVS has some way to control the VXLAN configuration so >> that it could disable Tx checksums? If so that would probably be a >> good way to address the 40G issues assuming someone is running an >> environment hat had nothing but NICs that can support the TSO and Rx >> checksum on inner headers. > > Yes - OVS can control tx checksums on a per-endpoint basis (actually, > rx checksum present requirements as well though it's not exposed to > the user at the moment). If you had the information then you could > optimize what to use in an environment of, say, hypervisors and > hardware switches. > > However, it's certainly possible that you have a mixed set of NICs > such as an encap aware NIC on the transmit side and non-aware on the > receive side. In that case, both possible checksum settings penalize > somebody: off (lose GRO on receiver), on (lose TSO on sender assuming > no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's > important to be able to use encap TSO with local checksum to avoid > these bad tradeoffs, not to mention being cleaner. By "local checksum" do you mean LCO? Seems like we should be able to get that to work with NETIF_F_GSO_TUNNEL_CSUM. Tom ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-20 0:14 ` Tom Herbert @ 2016-02-20 2:18 ` Jesse Gross 2016-02-20 19:51 ` Tom Herbert 0 siblings, 1 reply; 47+ messages in thread From: Jesse Gross @ 2016-02-20 2:18 UTC (permalink / raw) To: Tom Herbert Cc: Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert <tom@herbertland.com> wrote: > On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <jesse@kernel.org> wrote: >> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote: >>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote: >>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: >>>>> This patch series makes it so that we enable the outer Tx checksum for IPv4 >>>>> tunnels by default. This makes the behavior consistent with how we were >>>>> handling this for IPv6. In addition I have updated the internal flags for >>>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should >>>>> match up will with the ZERO_CSUM6_TX flag which was already in use for >>>>> IPv6. >>>>> >>>>> For most network devices this should be a net gain in terms of performance >>>>> as having the outer header checksum present allows for devices to report >>>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order >>>>> to determine if the inner header checksum is valid. >>>>> >>>>> Below is some data I collected with ixgbe with an X540 that demonstrates >>>>> this. I located two PFs connected back to back in two different name >>>>> spaces and then setup a pair of tunnels on each, one with checksum enabled >>>>> and one without. >>>>> >>>>> Recv Send Send Utilization >>>>> Socket Socket Message Elapsed Send >>>>> Size Size Size Time Throughput local >>>>> bytes bytes bytes secs. 10^6bits/s % S >>>>> >>>>> noudpcsum: >>>>> 87380 16384 16384 30.00 8898.67 12.80 >>>>> udpcsum: >>>>> 87380 16384 16384 30.00 9088.47 5.69 >>>>> >>>>> The one spot where this may cause a performance regression is if the >>>>> environment contains devices that can parse the inner headers and a device >>>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In >>>>> the case of such a device we have to fall back to using GSO to segment the >>>>> tunnel instead of TSO and as a result we may take a performance hit as seen >>>>> below with i40e. >>>> >>>> Do you have any numbers from 40G links? Obviously, at 10G the links >>>> are basically saturated and while I can see a difference in the >>>> utilization rate, I suspect that the change will be much more apparent >>>> at higher speeds. >>> >>> Unfortunately I don't have any true 40G links to test with. The >>> closest I can get is to run PF to VF on an i40e. Running that I have >>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the >>> difference being related to the fact that we are having to >>> allocate/free more skbs and make more trips through the >>> i40e_lan_xmit_frame function resulting in more descriptors. >> >> OK, I guess that is more or less in line with what I would expect off >> the top my head. There is a reasonably significant drop in the worst >> case. >> >>>> I'm concerned about the drop in performance for devices that currently >>>> support offloads (almost none of which expose >>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that >>>> care most about tunnel performance are the ones that already have >>>> these NICs and will be the most impacted by the drop. >>> >>> The problem is being able to transmit fast is kind of pointless if the >>> receiving end cannot handle it. We hadn't gotten around to really >>> getting the Rx checksum bits working until the 3.18 kernel which I >>> don't suspect many people are running so at this point messing with >>> the TSO bits isn't really making much of a difference. Then on top of >>> that most devices have certain limitations on how many ports they can >>> handle and such. I know the i40e is supposed to support something >>> like 10 port numbers, but the fm10k and ixgbe are limited to one port >>> as I recall. So this whole thing is already really brittle as it is. >>> My goal with this change is to make the behavior more consistent >>> across the board. >> >> That's true to some degree but there are certainly plenty of cases >> where TSO makes a difference - lower CPU usage, transmitting to >> multiple receivers, people will upgrade their kernels, etc. It's >> clearly good to make things more consistent but hopefully not by >> reducing existing performance. :) >> >>>> My hope is that we can continue to use TSO on devices that only >>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP >>>> length field may vary across segments. However, in practice this is >>>> the only on the final segment and only in cases where the total length >>>> is not a multiple of the MSS. If we could detect cases where those >>>> conditions are met, we could continue to use TSO with the UDP checksum >>>> field pre-populated. A possible step even further would be to break >>>> off the final segment into a separate packet to make things conform if >>>> necessary. This would avoid a performance regression and I think make >>>> this more palatable to a lot of people. >>> >>> I think Tom and I had discussed this possibility a bit at netconf. >>> The GSO logic is something I planned on looking at over the next >>> several weeks as I suspect there is probably room for improvement >>> there. >> >> That sounds great. >> >>>>> I also haven't investigated the effect this will have on OVS. However I >>>>> suspect the impact should be minimal as the worst case scenario should be >>>>> that Tx checksumming will become enabled by default which should be >>>>> consistent with the existing behavior for IPv6. >>>> >>>> I don't think that it should cause any problems. >>> >>> Good to hear. >>> >>> Do you know if OVS has some way to control the VXLAN configuration so >>> that it could disable Tx checksums? If so that would probably be a >>> good way to address the 40G issues assuming someone is running an >>> environment hat had nothing but NICs that can support the TSO and Rx >>> checksum on inner headers. >> >> Yes - OVS can control tx checksums on a per-endpoint basis (actually, >> rx checksum present requirements as well though it's not exposed to >> the user at the moment). If you had the information then you could >> optimize what to use in an environment of, say, hypervisors and >> hardware switches. >> >> However, it's certainly possible that you have a mixed set of NICs >> such as an encap aware NIC on the transmit side and non-aware on the >> receive side. In that case, both possible checksum settings penalize >> somebody: off (lose GRO on receiver), on (lose TSO on sender assuming >> no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's >> important to be able to use encap TSO with local checksum to avoid >> these bad tradeoffs, not to mention being cleaner. > > By "local checksum" do you mean LCO? Yes, that's what I meant. > Seems like we should be able to > get that to work with NETIF_F_GSO_TUNNEL_CSUM. I assume you mean NETIF_F_GSO_TUNNEL (no _CSUM)? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-20 2:18 ` Jesse Gross @ 2016-02-20 19:51 ` Tom Herbert 2016-02-23 3:31 ` Jesse Gross 2016-03-11 19:20 ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree 0 siblings, 2 replies; 47+ messages in thread From: Tom Herbert @ 2016-02-20 19:51 UTC (permalink / raw) To: Jesse Gross Cc: Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On Fri, Feb 19, 2016 at 6:18 PM, Jesse Gross <jesse@kernel.org> wrote: > On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert <tom@herbertland.com> wrote: >> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <jesse@kernel.org> wrote: >>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote: >>>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote: >>>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: >>>>>> This patch series makes it so that we enable the outer Tx checksum for IPv4 >>>>>> tunnels by default. This makes the behavior consistent with how we were >>>>>> handling this for IPv6. In addition I have updated the internal flags for >>>>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should >>>>>> match up will with the ZERO_CSUM6_TX flag which was already in use for >>>>>> IPv6. >>>>>> >>>>>> For most network devices this should be a net gain in terms of performance >>>>>> as having the outer header checksum present allows for devices to report >>>>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order >>>>>> to determine if the inner header checksum is valid. >>>>>> >>>>>> Below is some data I collected with ixgbe with an X540 that demonstrates >>>>>> this. I located two PFs connected back to back in two different name >>>>>> spaces and then setup a pair of tunnels on each, one with checksum enabled >>>>>> and one without. >>>>>> >>>>>> Recv Send Send Utilization >>>>>> Socket Socket Message Elapsed Send >>>>>> Size Size Size Time Throughput local >>>>>> bytes bytes bytes secs. 10^6bits/s % S >>>>>> >>>>>> noudpcsum: >>>>>> 87380 16384 16384 30.00 8898.67 12.80 >>>>>> udpcsum: >>>>>> 87380 16384 16384 30.00 9088.47 5.69 >>>>>> >>>>>> The one spot where this may cause a performance regression is if the >>>>>> environment contains devices that can parse the inner headers and a device >>>>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In >>>>>> the case of such a device we have to fall back to using GSO to segment the >>>>>> tunnel instead of TSO and as a result we may take a performance hit as seen >>>>>> below with i40e. >>>>> >>>>> Do you have any numbers from 40G links? Obviously, at 10G the links >>>>> are basically saturated and while I can see a difference in the >>>>> utilization rate, I suspect that the change will be much more apparent >>>>> at higher speeds. >>>> >>>> Unfortunately I don't have any true 40G links to test with. The >>>> closest I can get is to run PF to VF on an i40e. Running that I have >>>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the >>>> difference being related to the fact that we are having to >>>> allocate/free more skbs and make more trips through the >>>> i40e_lan_xmit_frame function resulting in more descriptors. >>> >>> OK, I guess that is more or less in line with what I would expect off >>> the top my head. There is a reasonably significant drop in the worst >>> case. >>> >>>>> I'm concerned about the drop in performance for devices that currently >>>>> support offloads (almost none of which expose >>>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that >>>>> care most about tunnel performance are the ones that already have >>>>> these NICs and will be the most impacted by the drop. >>>> >>>> The problem is being able to transmit fast is kind of pointless if the >>>> receiving end cannot handle it. We hadn't gotten around to really >>>> getting the Rx checksum bits working until the 3.18 kernel which I >>>> don't suspect many people are running so at this point messing with >>>> the TSO bits isn't really making much of a difference. Then on top of >>>> that most devices have certain limitations on how many ports they can >>>> handle and such. I know the i40e is supposed to support something >>>> like 10 port numbers, but the fm10k and ixgbe are limited to one port >>>> as I recall. So this whole thing is already really brittle as it is. >>>> My goal with this change is to make the behavior more consistent >>>> across the board. >>> >>> That's true to some degree but there are certainly plenty of cases >>> where TSO makes a difference - lower CPU usage, transmitting to >>> multiple receivers, people will upgrade their kernels, etc. It's >>> clearly good to make things more consistent but hopefully not by >>> reducing existing performance. :) >>> >>>>> My hope is that we can continue to use TSO on devices that only >>>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP >>>>> length field may vary across segments. However, in practice this is >>>>> the only on the final segment and only in cases where the total length >>>>> is not a multiple of the MSS. If we could detect cases where those >>>>> conditions are met, we could continue to use TSO with the UDP checksum >>>>> field pre-populated. A possible step even further would be to break >>>>> off the final segment into a separate packet to make things conform if >>>>> necessary. This would avoid a performance regression and I think make >>>>> this more palatable to a lot of people. >>>> >>>> I think Tom and I had discussed this possibility a bit at netconf. >>>> The GSO logic is something I planned on looking at over the next >>>> several weeks as I suspect there is probably room for improvement >>>> there. >>> >>> That sounds great. >>> >>>>>> I also haven't investigated the effect this will have on OVS. However I >>>>>> suspect the impact should be minimal as the worst case scenario should be >>>>>> that Tx checksumming will become enabled by default which should be >>>>>> consistent with the existing behavior for IPv6. >>>>> >>>>> I don't think that it should cause any problems. >>>> >>>> Good to hear. >>>> >>>> Do you know if OVS has some way to control the VXLAN configuration so >>>> that it could disable Tx checksums? If so that would probably be a >>>> good way to address the 40G issues assuming someone is running an >>>> environment hat had nothing but NICs that can support the TSO and Rx >>>> checksum on inner headers. >>> >>> Yes - OVS can control tx checksums on a per-endpoint basis (actually, >>> rx checksum present requirements as well though it's not exposed to >>> the user at the moment). If you had the information then you could >>> optimize what to use in an environment of, say, hypervisors and >>> hardware switches. >>> >>> However, it's certainly possible that you have a mixed set of NICs >>> such as an encap aware NIC on the transmit side and non-aware on the >>> receive side. In that case, both possible checksum settings penalize >>> somebody: off (lose GRO on receiver), on (lose TSO on sender assuming >>> no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's >>> important to be able to use encap TSO with local checksum to avoid >>> these bad tradeoffs, not to mention being cleaner. >> >> By "local checksum" do you mean LCO? > > Yes, that's what I meant. > Right. To use LCO with TSO we would need to ensure that all packets are the same size so that the UDP length field and thus checksum are constant for all created segments. But this property this would also make any payload lengths in headers constant for all packets so that the only fields that need be set per generated packet would be the TCP sequence number and checksum. This simplifying assumption could be used to make a very protocol-generic GSO/TSO (up to the transport header)! Conceptually, a device would just need to know the start of the packet, the offset of the transport header, and the size of each segment. Any bits from the start of the packet to the beginning of the transport header are just copied to each segment, so any combination of encapsulation/network protocols is supported as long as they are constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for needing TSO support). If we are able to do this then GSO could be a lot simpler and more extensible. We should be able to eliminate all the GSO flags for GRE, IPIP, SIT, UDP, checksum variants, shouldn't need to distinguish between TCPv4 and TCPv6, and wouldn't need to disallow nested encapsulations. The inner headers in the skbuf might also be removed. GSO for SCTP or FCOE still needs a little thought, we'd need to consider the possibility of needing both a CRC and checksum in a single packet. Tom >> Seems like we should be able to >> get that to work with NETIF_F_GSO_TUNNEL_CSUM. > > I assume you mean NETIF_F_GSO_TUNNEL (no _CSUM)? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-20 19:51 ` Tom Herbert @ 2016-02-23 3:31 ` Jesse Gross 2016-02-23 15:18 ` Edward Cree 2016-03-11 19:20 ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree 1 sibling, 1 reply; 47+ messages in thread From: Jesse Gross @ 2016-02-23 3:31 UTC (permalink / raw) To: Tom Herbert Cc: Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On Sat, Feb 20, 2016 at 11:51 AM, Tom Herbert <tom@herbertland.com> wrote: > On Fri, Feb 19, 2016 at 6:18 PM, Jesse Gross <jesse@kernel.org> wrote: >> On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert <tom@herbertland.com> wrote: >>> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <jesse@kernel.org> wrote: >>>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@mirantis.com> wrote: >>>>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@kernel.org> wrote: >>>>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@mirantis.com> wrote: >>>>>>> This patch series makes it so that we enable the outer Tx checksum for IPv4 >>>>>>> tunnels by default. This makes the behavior consistent with how we were >>>>>>> handling this for IPv6. In addition I have updated the internal flags for >>>>>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should >>>>>>> match up will with the ZERO_CSUM6_TX flag which was already in use for >>>>>>> IPv6. >>>>>>> >>>>>>> For most network devices this should be a net gain in terms of performance >>>>>>> as having the outer header checksum present allows for devices to report >>>>>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order >>>>>>> to determine if the inner header checksum is valid. >>>>>>> >>>>>>> Below is some data I collected with ixgbe with an X540 that demonstrates >>>>>>> this. I located two PFs connected back to back in two different name >>>>>>> spaces and then setup a pair of tunnels on each, one with checksum enabled >>>>>>> and one without. >>>>>>> >>>>>>> Recv Send Send Utilization >>>>>>> Socket Socket Message Elapsed Send >>>>>>> Size Size Size Time Throughput local >>>>>>> bytes bytes bytes secs. 10^6bits/s % S >>>>>>> >>>>>>> noudpcsum: >>>>>>> 87380 16384 16384 30.00 8898.67 12.80 >>>>>>> udpcsum: >>>>>>> 87380 16384 16384 30.00 9088.47 5.69 >>>>>>> >>>>>>> The one spot where this may cause a performance regression is if the >>>>>>> environment contains devices that can parse the inner headers and a device >>>>>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In >>>>>>> the case of such a device we have to fall back to using GSO to segment the >>>>>>> tunnel instead of TSO and as a result we may take a performance hit as seen >>>>>>> below with i40e. >>>>>> >>>>>> Do you have any numbers from 40G links? Obviously, at 10G the links >>>>>> are basically saturated and while I can see a difference in the >>>>>> utilization rate, I suspect that the change will be much more apparent >>>>>> at higher speeds. >>>>> >>>>> Unfortunately I don't have any true 40G links to test with. The >>>>> closest I can get is to run PF to VF on an i40e. Running that I have >>>>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the >>>>> difference being related to the fact that we are having to >>>>> allocate/free more skbs and make more trips through the >>>>> i40e_lan_xmit_frame function resulting in more descriptors. >>>> >>>> OK, I guess that is more or less in line with what I would expect off >>>> the top my head. There is a reasonably significant drop in the worst >>>> case. >>>> >>>>>> I'm concerned about the drop in performance for devices that currently >>>>>> support offloads (almost none of which expose >>>>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that >>>>>> care most about tunnel performance are the ones that already have >>>>>> these NICs and will be the most impacted by the drop. >>>>> >>>>> The problem is being able to transmit fast is kind of pointless if the >>>>> receiving end cannot handle it. We hadn't gotten around to really >>>>> getting the Rx checksum bits working until the 3.18 kernel which I >>>>> don't suspect many people are running so at this point messing with >>>>> the TSO bits isn't really making much of a difference. Then on top of >>>>> that most devices have certain limitations on how many ports they can >>>>> handle and such. I know the i40e is supposed to support something >>>>> like 10 port numbers, but the fm10k and ixgbe are limited to one port >>>>> as I recall. So this whole thing is already really brittle as it is. >>>>> My goal with this change is to make the behavior more consistent >>>>> across the board. >>>> >>>> That's true to some degree but there are certainly plenty of cases >>>> where TSO makes a difference - lower CPU usage, transmitting to >>>> multiple receivers, people will upgrade their kernels, etc. It's >>>> clearly good to make things more consistent but hopefully not by >>>> reducing existing performance. :) >>>> >>>>>> My hope is that we can continue to use TSO on devices that only >>>>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP >>>>>> length field may vary across segments. However, in practice this is >>>>>> the only on the final segment and only in cases where the total length >>>>>> is not a multiple of the MSS. If we could detect cases where those >>>>>> conditions are met, we could continue to use TSO with the UDP checksum >>>>>> field pre-populated. A possible step even further would be to break >>>>>> off the final segment into a separate packet to make things conform if >>>>>> necessary. This would avoid a performance regression and I think make >>>>>> this more palatable to a lot of people. >>>>> >>>>> I think Tom and I had discussed this possibility a bit at netconf. >>>>> The GSO logic is something I planned on looking at over the next >>>>> several weeks as I suspect there is probably room for improvement >>>>> there. >>>> >>>> That sounds great. >>>> >>>>>>> I also haven't investigated the effect this will have on OVS. However I >>>>>>> suspect the impact should be minimal as the worst case scenario should be >>>>>>> that Tx checksumming will become enabled by default which should be >>>>>>> consistent with the existing behavior for IPv6. >>>>>> >>>>>> I don't think that it should cause any problems. >>>>> >>>>> Good to hear. >>>>> >>>>> Do you know if OVS has some way to control the VXLAN configuration so >>>>> that it could disable Tx checksums? If so that would probably be a >>>>> good way to address the 40G issues assuming someone is running an >>>>> environment hat had nothing but NICs that can support the TSO and Rx >>>>> checksum on inner headers. >>>> >>>> Yes - OVS can control tx checksums on a per-endpoint basis (actually, >>>> rx checksum present requirements as well though it's not exposed to >>>> the user at the moment). If you had the information then you could >>>> optimize what to use in an environment of, say, hypervisors and >>>> hardware switches. >>>> >>>> However, it's certainly possible that you have a mixed set of NICs >>>> such as an encap aware NIC on the transmit side and non-aware on the >>>> receive side. In that case, both possible checksum settings penalize >>>> somebody: off (lose GRO on receiver), on (lose TSO on sender assuming >>>> no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's >>>> important to be able to use encap TSO with local checksum to avoid >>>> these bad tradeoffs, not to mention being cleaner. >>> >>> By "local checksum" do you mean LCO? >> >> Yes, that's what I meant. >> > Right. To use LCO with TSO we would need to ensure that all packets > are the same size so that the UDP length field and thus checksum are > constant for all created segments. But this property this would also > make any payload lengths in headers constant for all packets so that > the only fields that need be set per generated packet would be the TCP > sequence number and checksum. This simplifying assumption could be > used to make a very protocol-generic GSO/TSO (up to the transport > header)! > > Conceptually, a device would just need to know the start of the > packet, the offset of the transport header, and the size of each > segment. Any bits from the start of the packet to the beginning of the > transport header are just copied to each segment, so any combination > of encapsulation/network protocols is supported as long as they are > constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for > needing TSO support). > > If we are able to do this then GSO could be a lot simpler and more > extensible. We should be able to eliminate all the GSO flags for GRE, > IPIP, SIT, UDP, checksum variants, shouldn't need to distinguish > between TCPv4 and TCPv6, and wouldn't need to disallow nested > encapsulations. The inner headers in the skbuf might also be removed. > GSO for SCTP or FCOE still needs a little thought, we'd need to > consider the possibility of needing both a CRC and checksum in a > single packet. Yes, I think this is definitely a good direction to go in general. At that point, the main distinguishing feature of TSO support in NICs would basically be the depth into the packet that the card is capable of manipulating the L4 header. I assume that the NICs that do encapsulation offloads would be able to handle the same depth when they are not doing encapsulation and I know that some NICs (such as ixgbe) don't explicitly support TSO with encapsulation but can handle headers deeper in the packet than they current expose. The only issue that I see is that making TSO completely unaware of outer headers will likely cause performance regressions in some cases. Imagine if we have an incoming TCP stream with incrementing IP IDs that we aggregate through GRO and forward. Today's TSO would be able to recreate the stream by incrementing the ID as new segments are created. However, if the outgoing NIC is truly only dealing with the L4 header then it wouldn't be able to do this. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 3:31 ` Jesse Gross @ 2016-02-23 15:18 ` Edward Cree 2016-02-23 16:47 ` Tom Herbert 0 siblings, 1 reply; 47+ messages in thread From: Edward Cree @ 2016-02-23 15:18 UTC (permalink / raw) To: Jesse Gross, Tom Herbert Cc: Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On 23/02/16 03:31, Jesse Gross wrote: > The only issue that I see is that making TSO completely unaware of > outer headers will likely cause performance regressions in some cases. > Imagine if we have an incoming TCP stream with incrementing IP IDs > that we aggregate through GRO and forward. Today's TSO would be able > to recreate the stream by incrementing the ID as new segments are > created. However, if the outgoing NIC is truly only dealing with the > L4 header then it wouldn't be able to do this. Perhaps TSO should force setting the DF bit, so that the IP ID can be ignored. After all, if your network is going to cause fragmentation and reassembly, your performance will probably be bad enough that TSO won't help you much. (And TCP usually wants DF anyway so it can do PMTUD.) Arguably, as soon as we perform GRO on traffic to be forwarded, we've already violated the end-to-end principle (there are always imaginable situations in which a different packet stream comes out than went in), so it doesn't really matter if we go on to change the network layer parameters in this way - it's not really the same IP datagram any more so it's OK for its identification to change. And of course this problem doesn't present itself for IPv6 :) ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 15:18 ` Edward Cree @ 2016-02-23 16:47 ` Tom Herbert 2016-02-23 17:20 ` Rick Jones ` (2 more replies) 0 siblings, 3 replies; 47+ messages in thread From: Tom Herbert @ 2016-02-23 16:47 UTC (permalink / raw) To: Edward Cree Cc: Jesse Gross, Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree <ecree@solarflare.com> wrote: > On 23/02/16 03:31, Jesse Gross wrote: >> The only issue that I see is that making TSO completely unaware of >> outer headers will likely cause performance regressions in some cases. >> Imagine if we have an incoming TCP stream with incrementing IP IDs >> that we aggregate through GRO and forward. Today's TSO would be able >> to recreate the stream by incrementing the ID as new segments are >> created. However, if the outgoing NIC is truly only dealing with the >> L4 header then it wouldn't be able to do this. > Perhaps TSO should force setting the DF bit, so that the IP ID can be > ignored. After all, if your network is going to cause fragmentation and > reassembly, your performance will probably be bad enough that TSO won't > help you much. (And TCP usually wants DF anyway so it can do PMTUD.) > Arguably, as soon as we perform GRO on traffic to be forwarded, we've > already violated the end-to-end principle (there are always imaginable > situations in which a different packet stream comes out than went in), > so it doesn't really matter if we go on to change the network layer > parameters in this way - it's not really the same IP datagram any more > so it's OK for its identification to change. > And of course this problem doesn't present itself for IPv6 :) Right, GRO should probably not coalesce packets with non-zero IP identifiers due to the loss of information. Besides that, RFC6848 says the IP identifier should only be set for fragmentation anyway so there shouldn't be any issue and really no need for HW TSO (or LRO) to support that. We need to do increment IP identifier in UFO, but I only see one device (neterion) that advertises NETIF_F_UFO-- honestly, removing that feature might be another good simplification! Tom > -- > -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 16:47 ` Tom Herbert @ 2016-02-23 17:20 ` Rick Jones 2016-02-23 17:38 ` Edward Cree 2016-02-23 17:31 ` Jesse Gross 2016-02-25 20:14 ` David Miller 2 siblings, 1 reply; 47+ messages in thread From: Rick Jones @ 2016-02-23 17:20 UTC (permalink / raw) To: Tom Herbert, Edward Cree Cc: Jesse Gross, Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On 02/23/2016 08:47 AM, Tom Herbert wrote: > Right, GRO should probably not coalesce packets with non-zero IP > identifiers due to the loss of information. Besides that, RFC6848 says > the IP identifier should only be set for fragmentation anyway so there > shouldn't be any issue and really no need for HW TSO (or LRO) to > support that. You sure that is RFC 6848 "Specifying Civic Address Extensions in the Presence Information Data Format Location Object (PIDF-LO)" ? In whichever RFC that may be, is it a SHOULD or a MUST, and just how many "other" stacks might be setting a non-zero IP ID on fragments with DF set? rick jones > We need to do increment IP identifier in UFO, but I only see one > device (neterion) that advertises NETIF_F_UFO-- honestly, removing > that feature might be another good simplification! > > Tom > >> -- >> -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 17:20 ` Rick Jones @ 2016-02-23 17:38 ` Edward Cree 2016-02-23 18:08 ` David Miller 2016-02-23 18:11 ` Tom Herbert 0 siblings, 2 replies; 47+ messages in thread From: Edward Cree @ 2016-02-23 17:38 UTC (permalink / raw) To: Rick Jones, Tom Herbert Cc: Jesse Gross, Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On 23/02/16 17:20, Rick Jones wrote: > On 02/23/2016 08:47 AM, Tom Herbert wrote: >> Right, GRO should probably not coalesce packets with non-zero IP >> identifiers due to the loss of information. Besides that, RFC6848 says >> the IP identifier should only be set for fragmentation anyway so there >> shouldn't be any issue and really no need for HW TSO (or LRO) to >> support that. > > You sure that is RFC 6848 "Specifying Civic Address Extensions in the Presence Information Data Format Location Object (PIDF-LO)" ? PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field". > In whichever RFC that may be, is it a SHOULD or a MUST, and just how many "other" stacks might be setting a non-zero IP ID on fragments with DF set? "The IPv4 ID field MUST NOT be used for purposes other than fragmentation and reassembly."(§4.1) "Originating sources MAY set the IPv4 ID field of atomic datagrams to any value."(§4.1) "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of atomic datagrams."(§4.1) Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4). So it's OK to coalesce packets with different identifiers, as long as they have DFset (and aren't fragmented already). Also, the RFC takes pains to point out that it "does not reserve any IPv4 ID values, including 0, as distinguished" (§4.1), so one cannot rely on the ID always being zero. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 17:38 ` Edward Cree @ 2016-02-23 18:08 ` David Miller 2016-02-23 20:20 ` Edward Cree 2016-02-23 18:11 ` Tom Herbert 1 sibling, 1 reply; 47+ messages in thread From: David Miller @ 2016-02-23 18:08 UTC (permalink / raw) To: ecree; +Cc: rick.jones2, tom, jesse, aduyck, netdev, alexander.duyck From: Edward Cree <ecree@solarflare.com> Date: Tue, 23 Feb 2016 17:38:28 +0000 > On 23/02/16 17:20, Rick Jones wrote: >> On 02/23/2016 08:47 AM, Tom Herbert wrote: >>> Right, GRO should probably not coalesce packets with non-zero IP >>> identifiers due to the loss of information. Besides that, RFC6848 says >>> the IP identifier should only be set for fragmentation anyway so there >>> shouldn't be any issue and really no need for HW TSO (or LRO) to >>> support that. >> >> You sure that is RFC 6848 "Specifying Civic Address Extensions in the Presence Information Data Format Location Object (PIDF-LO)" ? > PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field". >> In whichever RFC that may be, is it a SHOULD or a MUST, and just how many "other" stacks might be setting a non-zero IP ID on fragments with DF set? > "The IPv4 ID field MUST NOT be used for purposes other than fragmentation > and reassembly."(§4.1) > "Originating sources MAY set the IPv4 ID field of atomic datagrams to any > value."(§4.1) > "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of > atomic datagrams."(§4.1) > Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4). > > So it's OK to coalesce packets with different identifiers, as long as they > have DFset (and aren't fragmented already). Also, the RFC takes pains to > point out that it "does not reserve any IPv4 ID values, including 0, as > distinguished" (§4.1), so one cannot rely on the ID always being zero. Just a reminder that a very long time ago we tried setting the IP ID field to zero for DF packets, and this broke SLHC because that expects a monotonically increasing IP ID field. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 18:08 ` David Miller @ 2016-02-23 20:20 ` Edward Cree 2016-02-23 23:11 ` David Miller 2016-02-24 0:53 ` Tom Herbert 0 siblings, 2 replies; 47+ messages in thread From: Edward Cree @ 2016-02-23 20:20 UTC (permalink / raw) To: David Miller; +Cc: rick.jones2, tom, jesse, aduyck, netdev, alexander.duyck On 23/02/16 18:08, David Miller wrote: > From: Edward Cree <ecree@solarflare.com> > Date: Tue, 23 Feb 2016 17:38:28 +0000 > >> "The IPv4 ID field MUST NOT be used for purposes other than fragmentation >> and reassembly."(§4.1) >> "Originating sources MAY set the IPv4 ID field of atomic datagrams to any >> value."(§4.1) >> "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of >> atomic datagrams."(§4.1) >> Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4). >> >> So it's OK to coalesce packets with different identifiers, as long as they >> have DFset (and aren't fragmented already). Also, the RFC takes pains to >> point out that it "does not reserve any IPv4 ID values, including 0, as >> distinguished" (§4.1), so one cannot rely on the ID always being zero. > Just a reminder that a very long time ago we tried setting the IP ID > field to zero for DF packets, and this broke SLHC because that expects > a monotonically increasing IP ID field. If I'm reading it right, Linux's SLHC implementation can cope [1] with unchanging IP IDs (or IDs that do something other than increment), which it encodes with the NEW_I bit. RFC1144 supports this, remarking "note that it may be zero or negative" (§3.2.3) of the ID delta. So VJ implementations that can't handle it really are buggy, not just exhibiting a difference of opinion. So it seems like the problem case is where some sort of SLIP gateway lies between a Linux system (say, a webserver sending with TSO) and a buggy client; in that case, even if the client can't be fixed it seems like the gateway could be (either to send NEW_I or just lie about the incoming IP IDs and claim they're incrementing - the latter is entirely safe for a DF packet). [ I really hope we can figure out a way not to change IP IDs, because I think an inverted version of Tom's generic TSO could also work as a generic h/w GRO accelerator. In its simplest form, the hardware just remembers the previous packet, then gives us the length of the common prefix. If that ends four bytes into a TCP header, then the packet is a candidate for GRO; otherwise not. I haven't worked out all the details yet, but it's clear that non-constant IP IDs would cause problems. (If it's only the innermost IP ID that's changing, we can probably still cope, but the performance gain will be less and the implementation could start to get ugly.) ] ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 20:20 ` Edward Cree @ 2016-02-23 23:11 ` David Miller 2016-02-24 0:53 ` Tom Herbert 1 sibling, 0 replies; 47+ messages in thread From: David Miller @ 2016-02-23 23:11 UTC (permalink / raw) To: ecree; +Cc: rick.jones2, tom, jesse, aduyck, netdev, alexander.duyck From: Edward Cree <ecree@solarflare.com> Date: Tue, 23 Feb 2016 20:20:50 +0000 > So VJ implementations that can't handle it really are buggy, not > just exhibiting a difference of opinion. Buggy or not, they exist, and we have to cope with them. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 20:20 ` Edward Cree 2016-02-23 23:11 ` David Miller @ 2016-02-24 0:53 ` Tom Herbert 2016-02-24 17:30 ` Edward Cree 1 sibling, 1 reply; 47+ messages in thread From: Tom Herbert @ 2016-02-24 0:53 UTC (permalink / raw) To: Edward Cree Cc: David Miller, Rick Jones, Jesse Gross, Alex Duyck, Linux Kernel Network Developers, Alexander Duyck > [ I really hope we can figure out a way not to change IP IDs, because I think > an inverted version of Tom's generic TSO could also work as a generic h/w GRO > accelerator. In its simplest form, the hardware just remembers the previous > packet, then gives us the length of the common prefix. If that ends four bytes > into a TCP header, then the packet is a candidate for GRO; otherwise not. I > haven't worked out all the details yet, but it's clear that non-constant IP IDs > would cause problems. (If it's only the innermost IP ID that's changing, we > can probably still cope, but the performance gain will be less and the > implementation could start to get ugly.) ] That's an interesting idea. This should work in IPv6 now and nearly all encapsulation protocols (GRE w/ csum doesn't work this way for instance), the logic to just match to a GRO state would be as simple as comparing packet hashes and then do a memcmp over the headers. For IPv4 maybe do a masked compare-- we still need to validate the header checksum but that is easily checked without checksumming over the whole header by just adding in the diff in the dynamic fields (i.e. something like C_new == C_Old + (IP_ID_new - IP_ID_old) Tom > > -- > -Ed > > [1] http://lxr.free-electrons.com/source/drivers/net/slip/slhc.c#L425 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-24 0:53 ` Tom Herbert @ 2016-02-24 17:30 ` Edward Cree 0 siblings, 0 replies; 47+ messages in thread From: Edward Cree @ 2016-02-24 17:30 UTC (permalink / raw) To: Tom Herbert Cc: David Miller, Rick Jones, Jesse Gross, Alex Duyck, Linux Kernel Network Developers, Alexander Duyck On 24/02/16 00:53, Tom Herbert wrote: > That's an interesting idea. This should work in IPv6 now and nearly > all encapsulation protocols (GRE w/ csum doesn't work this way for > instance) You mean GRE with sequence numbers? csum should be fine, it'sjust the usual LCO setup (i.e. only depends on headers). -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 17:38 ` Edward Cree 2016-02-23 18:08 ` David Miller @ 2016-02-23 18:11 ` Tom Herbert 1 sibling, 0 replies; 47+ messages in thread From: Tom Herbert @ 2016-02-23 18:11 UTC (permalink / raw) To: Edward Cree Cc: Rick Jones, Jesse Gross, Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On Tue, Feb 23, 2016 at 9:38 AM, Edward Cree <ecree@solarflare.com> wrote: > On 23/02/16 17:20, Rick Jones wrote: >> On 02/23/2016 08:47 AM, Tom Herbert wrote: >>> Right, GRO should probably not coalesce packets with non-zero IP >>> identifiers due to the loss of information. Besides that, RFC6848 says >>> the IP identifier should only be set for fragmentation anyway so there >>> shouldn't be any issue and really no need for HW TSO (or LRO) to >>> support that. >> >> You sure that is RFC 6848 "Specifying Civic Address Extensions in the Presence Information Data Format Location Object (PIDF-LO)" ? > PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field". Yes RFC6864. >> In whichever RFC that may be, is it a SHOULD or a MUST, and just how many "other" stacks might be setting a non-zero IP ID on fragments with DF set? > "The IPv4 ID field MUST NOT be used for purposes other than fragmentation > and reassembly."(§4.1) > "Originating sources MAY set the IPv4 ID field of atomic datagrams to any > value."(§4.1) > "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of > atomic datagrams."(§4.1) > Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4). > > So it's OK to coalesce packets with different identifiers, as long as they > have DFset (and aren't fragmented already). Also, the RFC takes pains to > point out that it "does not reserve any IPv4 ID values, including 0, as > distinguished" (§4.1), so one cannot rely on the ID always being zero. Right, receive side is straightforward, just ignore IP IDs. Transmit is more interesting. Operative code is in ip_select_ident_segs. The comment as to why non-zero IDs are sent is: /* This is only to work around buggy Windows95/2000 * VJ compression implementations. If the ID field * does not change, they drop every other packet in * a TCP stream using header compression. */ > -- > -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 16:47 ` Tom Herbert 2016-02-23 17:20 ` Rick Jones @ 2016-02-23 17:31 ` Jesse Gross 2016-02-23 17:42 ` Tom Herbert 2016-02-23 18:24 ` David Miller 2016-02-25 20:14 ` David Miller 2 siblings, 2 replies; 47+ messages in thread From: Jesse Gross @ 2016-02-23 17:31 UTC (permalink / raw) To: Tom Herbert Cc: Edward Cree, Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert <tom@herbertland.com> wrote: > On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree <ecree@solarflare.com> wrote: >> On 23/02/16 03:31, Jesse Gross wrote: >>> The only issue that I see is that making TSO completely unaware of >>> outer headers will likely cause performance regressions in some cases. >>> Imagine if we have an incoming TCP stream with incrementing IP IDs >>> that we aggregate through GRO and forward. Today's TSO would be able >>> to recreate the stream by incrementing the ID as new segments are >>> created. However, if the outgoing NIC is truly only dealing with the >>> L4 header then it wouldn't be able to do this. >> Perhaps TSO should force setting the DF bit, so that the IP ID can be >> ignored. After all, if your network is going to cause fragmentation and >> reassembly, your performance will probably be bad enough that TSO won't >> help you much. (And TCP usually wants DF anyway so it can do PMTUD.) >> Arguably, as soon as we perform GRO on traffic to be forwarded, we've >> already violated the end-to-end principle (there are always imaginable >> situations in which a different packet stream comes out than went in), >> so it doesn't really matter if we go on to change the network layer >> parameters in this way - it's not really the same IP datagram any more >> so it's OK for its identification to change. >> And of course this problem doesn't present itself for IPv6 :) > > Right, GRO should probably not coalesce packets with non-zero IP > identifiers due to the loss of information. Besides that, RFC6848 says > the IP identifier should only be set for fragmentation anyway so there > shouldn't be any issue and really no need for HW TSO (or LRO) to > support that. Most OSs (including Linux with connected TCP sockets) use non-zero IP IDs so requiring this would effectively disable GRO. I think the practical way to go about this is to introduce a new GSO type for L4-only offload. There are some existing types that we could immediately convert and kill off with no impact (such as GRE) and some new protocols that would come for free (such as MPLS) so it would be a net win. Once the infrastructure is in, it will be easier to evaluate what else can be simplified on a case by case basis. (i.e. Even UDP_TUNNEL will have some potential adverse impact from this compared to explicit support since we'd need to break off the last segment from a TSO burst where the size isn't an even multiple of the MSS. I guess the impact is probably small but it would be good to know.) ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 17:31 ` Jesse Gross @ 2016-02-23 17:42 ` Tom Herbert 2016-02-23 18:18 ` Alexander Duyck 2016-02-23 18:26 ` David Miller 2016-02-23 18:24 ` David Miller 1 sibling, 2 replies; 47+ messages in thread From: Tom Herbert @ 2016-02-23 17:42 UTC (permalink / raw) To: Jesse Gross Cc: Edward Cree, Alex Duyck, Linux Kernel Network Developers, David Miller, Alexander Duyck On Tue, Feb 23, 2016 at 9:31 AM, Jesse Gross <jesse@kernel.org> wrote: > On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert <tom@herbertland.com> wrote: >> On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree <ecree@solarflare.com> wrote: >>> On 23/02/16 03:31, Jesse Gross wrote: >>>> The only issue that I see is that making TSO completely unaware of >>>> outer headers will likely cause performance regressions in some cases. >>>> Imagine if we have an incoming TCP stream with incrementing IP IDs >>>> that we aggregate through GRO and forward. Today's TSO would be able >>>> to recreate the stream by incrementing the ID as new segments are >>>> created. However, if the outgoing NIC is truly only dealing with the >>>> L4 header then it wouldn't be able to do this. >>> Perhaps TSO should force setting the DF bit, so that the IP ID can be >>> ignored. After all, if your network is going to cause fragmentation and >>> reassembly, your performance will probably be bad enough that TSO won't >>> help you much. (And TCP usually wants DF anyway so it can do PMTUD.) >>> Arguably, as soon as we perform GRO on traffic to be forwarded, we've >>> already violated the end-to-end principle (there are always imaginable >>> situations in which a different packet stream comes out than went in), >>> so it doesn't really matter if we go on to change the network layer >>> parameters in this way - it's not really the same IP datagram any more >>> so it's OK for its identification to change. >>> And of course this problem doesn't present itself for IPv6 :) >> >> Right, GRO should probably not coalesce packets with non-zero IP >> identifiers due to the loss of information. Besides that, RFC6848 says >> the IP identifier should only be set for fragmentation anyway so there >> shouldn't be any issue and really no need for HW TSO (or LRO) to >> support that. > > Most OSs (including Linux with connected TCP sockets) use non-zero IP > IDs so requiring this would effectively disable GRO. > > I think the practical way to go about this is to introduce a new GSO > type for L4-only offload. There are some existing types that we could > immediately convert and kill off with no impact (such as GRE) and some > new protocols that would come for free (such as MPLS) so it would be a > net win. Once the infrastructure is in, it will be easier to evaluate > what else can be simplified on a case by case basis. (i.e. Even > UDP_TUNNEL will have some potential adverse impact from this compared > to explicit support since we'd need to break off the last segment from > a TSO burst where the size isn't an even multiple of the MSS. I guess > the impact is probably small but it would be good to know.) Why not just fix the stack to conform to RFC6864? As Edward pointed out we lose the actual IP ID's in GRO anyway, so attempts to set them in GSO may be wildly incorrect from the source point of view-- even in that case were probably better off changing the IP identifier to zero (okay since we're already breaking the E2E model anyway :-) ). ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 17:42 ` Tom Herbert @ 2016-02-23 18:18 ` Alexander Duyck 2016-02-23 18:26 ` David Miller 1 sibling, 0 replies; 47+ messages in thread From: Alexander Duyck @ 2016-02-23 18:18 UTC (permalink / raw) To: Tom Herbert Cc: Jesse Gross, Edward Cree, Alex Duyck, Linux Kernel Network Developers, David Miller On Tue, Feb 23, 2016 at 9:42 AM, Tom Herbert <tom@herbertland.com> wrote: > On Tue, Feb 23, 2016 at 9:31 AM, Jesse Gross <jesse@kernel.org> wrote: >> On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert <tom@herbertland.com> wrote: >>> On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree <ecree@solarflare.com> wrote: >>>> On 23/02/16 03:31, Jesse Gross wrote: >>>>> The only issue that I see is that making TSO completely unaware of >>>>> outer headers will likely cause performance regressions in some cases. >>>>> Imagine if we have an incoming TCP stream with incrementing IP IDs >>>>> that we aggregate through GRO and forward. Today's TSO would be able >>>>> to recreate the stream by incrementing the ID as new segments are >>>>> created. However, if the outgoing NIC is truly only dealing with the >>>>> L4 header then it wouldn't be able to do this. >>>> Perhaps TSO should force setting the DF bit, so that the IP ID can be >>>> ignored. After all, if your network is going to cause fragmentation and >>>> reassembly, your performance will probably be bad enough that TSO won't >>>> help you much. (And TCP usually wants DF anyway so it can do PMTUD.) >>>> Arguably, as soon as we perform GRO on traffic to be forwarded, we've >>>> already violated the end-to-end principle (there are always imaginable >>>> situations in which a different packet stream comes out than went in), >>>> so it doesn't really matter if we go on to change the network layer >>>> parameters in this way - it's not really the same IP datagram any more >>>> so it's OK for its identification to change. >>>> And of course this problem doesn't present itself for IPv6 :) >>> >>> Right, GRO should probably not coalesce packets with non-zero IP >>> identifiers due to the loss of information. Besides that, RFC6848 says >>> the IP identifier should only be set for fragmentation anyway so there >>> shouldn't be any issue and really no need for HW TSO (or LRO) to >>> support that. >> >> Most OSs (including Linux with connected TCP sockets) use non-zero IP >> IDs so requiring this would effectively disable GRO. >> >> I think the practical way to go about this is to introduce a new GSO >> type for L4-only offload. There are some existing types that we could >> immediately convert and kill off with no impact (such as GRE) and some >> new protocols that would come for free (such as MPLS) so it would be a >> net win. Once the infrastructure is in, it will be easier to evaluate >> what else can be simplified on a case by case basis. (i.e. Even >> UDP_TUNNEL will have some potential adverse impact from this compared >> to explicit support since we'd need to break off the last segment from >> a TSO burst where the size isn't an even multiple of the MSS. I guess >> the impact is probably small but it would be good to know.) > > Why not just fix the stack to conform to RFC6864? As Edward pointed > out we lose the actual IP ID's in GRO anyway, so attempts to set them > in GSO may be wildly incorrect from the source point of view-- even in > that case were probably better off changing the IP identifier to zero > (okay since we're already breaking the E2E model anyway :-) ). The wording of RFC6864 seems to imply that we can ignore the IP ID field in the case of DF being set and the MF and fragment offset bits being cleared. It states we can use an arbitrary value for the IP ID on "atomic" frames so we can probably just leave it at whatever the initial value is for the frame to be segmented, not need to force it to 0. The problem as I see it is that we will need to update GRO so that it is willing to accept frames with an inner IP ID that is not incrementing for atomic frames before we can really get into the GSO side of the equation. From what I can tell it looks like currently it doesn't honor that and requires IP ID to increment in order to coalesce frames. I will look into trying to setup TSO for these devices like I did NETIF_F_HW_CSUM for the Intel parts. Basically we will leave the outer IP header and the inner transport header to be handled by the hardware, and then we can compute the length and checksum for the UDP header and inner IP header. That way we can deal with things like VLAN tags that need to be inserted before the outer network header while maintaining the IP ID for the outer IP header as well since most devices seem to handle that correctly. - Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 17:42 ` Tom Herbert 2016-02-23 18:18 ` Alexander Duyck @ 2016-02-23 18:26 ` David Miller 2016-02-23 18:32 ` Tom Herbert 1 sibling, 1 reply; 47+ messages in thread From: David Miller @ 2016-02-23 18:26 UTC (permalink / raw) To: tom; +Cc: jesse, ecree, aduyck, netdev, alexander.duyck From: Tom Herbert <tom@herbertland.com> Date: Tue, 23 Feb 2016 09:42:00 -0800 > Why not just fix the stack to conform to RFC6864? As Edward pointed > out we lose the actual IP ID's in GRO anyway, so attempts to set them > in GSO may be wildly incorrect from the source point of view-- even in > that case were probably better off changing the IP identifier to zero > (okay since we're already breaking the E2E model anyway :-) ). Tom, you can't, you'll break TCP header compression schemes which expect a monotonically increasing IP ID value. We tried setting the IP ID to zero for frames with DF set, it broke things. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 18:26 ` David Miller @ 2016-02-23 18:32 ` Tom Herbert 0 siblings, 0 replies; 47+ messages in thread From: Tom Herbert @ 2016-02-23 18:32 UTC (permalink / raw) To: David Miller Cc: Jesse Gross, Edward Cree, Alex Duyck, Linux Kernel Network Developers, Alexander Duyck On Tue, Feb 23, 2016 at 10:26 AM, David Miller <davem@davemloft.net> wrote: > From: Tom Herbert <tom@herbertland.com> > Date: Tue, 23 Feb 2016 09:42:00 -0800 > >> Why not just fix the stack to conform to RFC6864? As Edward pointed >> out we lose the actual IP ID's in GRO anyway, so attempts to set them >> in GSO may be wildly incorrect from the source point of view-- even in >> that case were probably better off changing the IP identifier to zero >> (okay since we're already breaking the E2E model anyway :-) ). > > Tom, you can't, you'll break TCP header compression schemes which > expect a monotonically increasing IP ID value. > > We tried setting the IP ID to zero for frames with DF set, it broke > things. All the more reason to move to IPv6 ;-) ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 17:31 ` Jesse Gross 2016-02-23 17:42 ` Tom Herbert @ 2016-02-23 18:24 ` David Miller 2016-02-24 9:58 ` David Laight 1 sibling, 1 reply; 47+ messages in thread From: David Miller @ 2016-02-23 18:24 UTC (permalink / raw) To: jesse; +Cc: tom, ecree, aduyck, netdev, alexander.duyck From: Jesse Gross <jesse@kernel.org> Date: Tue, 23 Feb 2016 09:31:09 -0800 > Most OSs (including Linux with connected TCP sockets) use non-zero IP > IDs so requiring this would effectively disable GRO. +1 Any OS that wants to work with SLHC, as I mentioned, has to emit monotonically increasing IP ID values in all packets, even those with DF set. ^ permalink raw reply [flat|nested] 47+ messages in thread
* RE: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 18:24 ` David Miller @ 2016-02-24 9:58 ` David Laight 2016-02-24 15:41 ` David Miller 0 siblings, 1 reply; 47+ messages in thread From: David Laight @ 2016-02-24 9:58 UTC (permalink / raw) To: 'David Miller', jesse@kernel.org Cc: tom@herbertland.com, ecree@solarflare.com, aduyck@mirantis.com, netdev@vger.kernel.org, alexander.duyck@gmail.com From: David Miller > Sent: 23 February 2016 18:25 > > From: Jesse Gross <jesse@kernel.org> > Date: Tue, 23 Feb 2016 09:31:09 -0800 > > > Most OSs (including Linux with connected TCP sockets) use non-zero IP > > IDs so requiring this would effectively disable GRO. > > +1 > > Any OS that wants to work with SLHC, as I mentioned, has to emit > monotonically increasing IP ID values in all packets, even those with > DF set. Doesn't that leak a lot of info about the sending system? ISTR one OS deliberately randomising the IP ID values in order to avoid giving out information about the number of packets being sent. David ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-24 9:58 ` David Laight @ 2016-02-24 15:41 ` David Miller 0 siblings, 0 replies; 47+ messages in thread From: David Miller @ 2016-02-24 15:41 UTC (permalink / raw) To: David.Laight; +Cc: jesse, tom, ecree, aduyck, netdev, alexander.duyck From: David Laight <David.Laight@ACULAB.COM> Date: Wed, 24 Feb 2016 09:58:03 +0000 > From: David Miller >> Sent: 23 February 2016 18:25 >> >> From: Jesse Gross <jesse@kernel.org> >> Date: Tue, 23 Feb 2016 09:31:09 -0800 >> >> > Most OSs (including Linux with connected TCP sockets) use non-zero IP >> > IDs so requiring this would effectively disable GRO. >> >> +1 >> >> Any OS that wants to work with SLHC, as I mentioned, has to emit >> monotonically increasing IP ID values in all packets, even those with >> DF set. > > Doesn't that leak a lot of info about the sending system? > ISTR one OS deliberately randomising the IP ID values in order > to avoid giving out information about the number of packets being sent. The ID generater is per-flow, therefore I don't think this is an issue. And if it is an issue, then it exists for fragmented traffic on every machine on the planet. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-23 16:47 ` Tom Herbert 2016-02-23 17:20 ` Rick Jones 2016-02-23 17:31 ` Jesse Gross @ 2016-02-25 20:14 ` David Miller 2 siblings, 0 replies; 47+ messages in thread From: David Miller @ 2016-02-25 20:14 UTC (permalink / raw) To: tom; +Cc: ecree, jesse, aduyck, netdev, alexander.duyck From: Tom Herbert <tom@herbertland.com> Date: Tue, 23 Feb 2016 08:47:30 -0800 > Right, GRO should probably not coalesce packets with non-zero IP > identifiers due to the loss of information. If they are monotonically increasing, which is the only case worth caring about, it absolutely should. Because that can be done without any loss of information. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-02-20 19:51 ` Tom Herbert 2016-02-23 3:31 ` Jesse Gross @ 2016-03-11 19:20 ` Edward Cree 2016-03-11 19:57 ` Tom Herbert 2016-03-11 20:22 ` David Miller 1 sibling, 2 replies; 47+ messages in thread From: Edward Cree @ 2016-03-11 19:20 UTC (permalink / raw) To: Tom Herbert; +Cc: Linux Kernel Network Developers On 20/02/16 19:51, Tom Herbert wrote: > Right. To use LCO with TSO we would need to ensure that all packets > are the same size so that the UDP length field and thus checksum are > constant for all created segments. But this property this would also > make any payload lengths in headers constant for all packets so that > the only fields that need be set per generated packet would be the TCP > sequence number and checksum. This simplifying assumption could be > used to make a very protocol-generic GSO/TSO (up to the transport > header)! > > Conceptually, a device would just need to know the start of the > packet, the offset of the transport header, and the size of each > segment. Any bits from the start of the packet to the beginning of the > transport header are just copied to each segment, so any combination > of encapsulation/network protocols is supported as long as they are > constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for > needing TSO support). Tom, Are you planning to / working on implementing this? If not, I might have a crack at it; I've talked to our firmware guys and (provisionally) we think we can support it in current sfc hardware. Or were there any blocking problems raised in the thread? My understanding of the IP ID issue was that it only matters for the inner frame, because the rest aren't TCP (so hopefully no-one is doing SLHC on them). But I may have missed something. -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 19:20 ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree @ 2016-03-11 19:57 ` Tom Herbert 2016-03-11 19:59 ` Edward Cree 2016-03-11 20:22 ` David Miller 1 sibling, 1 reply; 47+ messages in thread From: Tom Herbert @ 2016-03-11 19:57 UTC (permalink / raw) To: Edward Cree; +Cc: Linux Kernel Network Developers On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote: > On 20/02/16 19:51, Tom Herbert wrote: >> Right. To use LCO with TSO we would need to ensure that all packets >> are the same size so that the UDP length field and thus checksum are >> constant for all created segments. But this property this would also >> make any payload lengths in headers constant for all packets so that >> the only fields that need be set per generated packet would be the TCP >> sequence number and checksum. This simplifying assumption could be >> used to make a very protocol-generic GSO/TSO (up to the transport >> header)! >> >> Conceptually, a device would just need to know the start of the >> packet, the offset of the transport header, and the size of each >> segment. Any bits from the start of the packet to the beginning of the >> transport header are just copied to each segment, so any combination >> of encapsulation/network protocols is supported as long as they are >> constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for >> needing TSO support). > Tom, > > Are you planning to / working on implementing this? If not, I might have a > crack at it; I've talked to our firmware guys and (provisionally) we think > we can support it in current sfc hardware. > Or were there any blocking problems raised in the thread? My understanding > of the IP ID issue was that it only matters for the inner frame, because > the rest aren't TCP (so hopefully no-one is doing SLHC on them). But I may > have missed something. > Right, then the interface would need to just include the offset of the IP ID. But doesn't this break using LCO with GSO though-- i.e. the outer checksum and inner checksum still need to be updated per packet so we need to tell device where outer checksum(s) is. Thanks, Tom > -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 19:57 ` Tom Herbert @ 2016-03-11 19:59 ` Edward Cree 2016-03-11 20:16 ` Tom Herbert 0 siblings, 1 reply; 47+ messages in thread From: Edward Cree @ 2016-03-11 19:59 UTC (permalink / raw) To: Tom Herbert; +Cc: Linux Kernel Network Developers On 11/03/16 19:57, Tom Herbert wrote: > On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote: >> Tom, >> >> Are you planning to / working on implementing this? If not, I might have a >> crack at it; I've talked to our firmware guys and (provisionally) we think >> we can support it in current sfc hardware. >> Or were there any blocking problems raised in the thread? My understanding >> of the IP ID issue was that it only matters for the inner frame, because >> the rest aren't TCP (so hopefully no-one is doing SLHC on them). But I may >> have missed something. >> > Right, then the interface would need to just include the offset of the > IP ID. But doesn't this break using LCO with GSO though-- i.e. the > outer checksum and inner checksum still need to be updated per packet > so we need to tell device where outer checksum(s) is. No, outer checksum shouldn't change: IP ID is protected by inner IP header checksum, which device will edit. No? -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 19:59 ` Edward Cree @ 2016-03-11 20:16 ` Tom Herbert 2016-03-11 20:24 ` Edward Cree 0 siblings, 1 reply; 47+ messages in thread From: Tom Herbert @ 2016-03-11 20:16 UTC (permalink / raw) To: Edward Cree; +Cc: Linux Kernel Network Developers On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree <ecree@solarflare.com> wrote: > On 11/03/16 19:57, Tom Herbert wrote: >> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote: >>> Tom, >>> >>> Are you planning to / working on implementing this? If not, I might have a >>> crack at it; I've talked to our firmware guys and (provisionally) we think >>> we can support it in current sfc hardware. >>> Or were there any blocking problems raised in the thread? My understanding >>> of the IP ID issue was that it only matters for the inner frame, because >>> the rest aren't TCP (so hopefully no-one is doing SLHC on them). But I may >>> have missed something. >>> >> Right, then the interface would need to just include the offset of the >> IP ID. But doesn't this break using LCO with GSO though-- i.e. the >> outer checksum and inner checksum still need to be updated per packet >> so we need to tell device where outer checksum(s) is. > No, outer checksum shouldn't change: IP ID is protected by inner IP header > checksum, which device will edit. No? Right, the interface would probably still need offset to the IPv4 hdr? > > -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 20:16 ` Tom Herbert @ 2016-03-11 20:24 ` Edward Cree 2016-03-11 21:09 ` Alexander Duyck 0 siblings, 1 reply; 47+ messages in thread From: Edward Cree @ 2016-03-11 20:24 UTC (permalink / raw) To: Tom Herbert; +Cc: Linux Kernel Network Developers On 11/03/16 20:16, Tom Herbert wrote: > On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree <ecree@solarflare.com> wrote: >> On 11/03/16 19:57, Tom Herbert wrote: >>> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote: >>>> Tom, >>>> >>>> Are you planning to / working on implementing this? If not, I might have a >>>> crack at it; I've talked to our firmware guys and (provisionally) we think >>>> we can support it in current sfc hardware. >>>> Or were there any blocking problems raised in the thread? My understanding >>>> of the IP ID issue was that it only matters for the inner frame, because >>>> the rest aren't TCP (so hopefully no-one is doing SLHC on them). But I may >>>> have missed something. >>>> >>> Right, then the interface would need to just include the offset of the >>> IP ID. But doesn't this break using LCO with GSO though-- i.e. the >>> outer checksum and inner checksum still need to be updated per packet >>> so we need to tell device where outer checksum(s) is. >> No, outer checksum shouldn't change: IP ID is protected by inner IP header >> checksum, which device will edit. No? > Right, the interface would probably still need offset to the IPv4 hdr? Yes; I'm assuming the interface could just be "offset to inner IP header", and the hardware knows well enough what IP and TCP headers look like that it can figure out the rest (including skipping over options if e.g. ihl>5). So, do you want to try and implement it or shall I? -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 20:24 ` Edward Cree @ 2016-03-11 21:09 ` Alexander Duyck 2016-03-11 21:29 ` Edward Cree 0 siblings, 1 reply; 47+ messages in thread From: Alexander Duyck @ 2016-03-11 21:09 UTC (permalink / raw) To: Edward Cree; +Cc: Tom Herbert, Linux Kernel Network Developers On Fri, Mar 11, 2016 at 12:24 PM, Edward Cree <ecree@solarflare.com> wrote: > On 11/03/16 20:16, Tom Herbert wrote: >> On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree <ecree@solarflare.com> wrote: >>> On 11/03/16 19:57, Tom Herbert wrote: >>>> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree <ecree@solarflare.com> wrote: >>>>> Tom, >>>>> >>>>> Are you planning to / working on implementing this? If not, I might have a >>>>> crack at it; I've talked to our firmware guys and (provisionally) we think >>>>> we can support it in current sfc hardware. >>>>> Or were there any blocking problems raised in the thread? My understanding >>>>> of the IP ID issue was that it only matters for the inner frame, because >>>>> the rest aren't TCP (so hopefully no-one is doing SLHC on them). But I may >>>>> have missed something. >>>>> >>>> Right, then the interface would need to just include the offset of the >>>> IP ID. But doesn't this break using LCO with GSO though-- i.e. the >>>> outer checksum and inner checksum still need to be updated per packet >>>> so we need to tell device where outer checksum(s) is. >>> No, outer checksum shouldn't change: IP ID is protected by inner IP header >>> checksum, which device will edit. No? >> Right, the interface would probably still need offset to the IPv4 hdr? > Yes; I'm assuming the interface could just be "offset to inner IP header", > and the hardware knows well enough what IP and TCP headers look like that > it can figure out the rest (including skipping over options if e.g. ihl>5). > So, do you want to try and implement it or shall I? I've already started looking into this and was waiting for feedback from Dave about the IPv4 ID issue which is looks like he is okay with a static ID value as long as the DF bit is set. The only real issue with the "generic" TSO is that it isn't going to be so generic. We have different devices that will support different levels of stuff. For example the ixgbe drivers will need to treat the outer tunnel header as one giant L2 header. As a result we will need to populate all the fields in the outer header including the outer IP ID, checksum, udp->len, and UDP or GRE checksum if requested. For i40e I think this gets a bit simpler as they already handle the outer IPv4 ID and checksum. I think there we may need to only populate the checksum for it to work out correctly. As such I may look at coming up with a number of functions so that we can mix and match based on what is needed in order to assemble a partially segmented frame. The other issue I am working on at the moment to enable all this is to fix the differents between csum_tcpudp_magic and csum_ipv6_magic in terms of handling packet lengths greater than 65535. Currently we are messing up the checksum in relation to IPv6 since we are using the truncated uh->len value. I'll be submitting some patches later today that will hopefully get that fixed and that in turn should make the rest of the segmentation work easier. - Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 21:09 ` Alexander Duyck @ 2016-03-11 21:29 ` Edward Cree 2016-03-11 22:31 ` Alexander Duyck 0 siblings, 1 reply; 47+ messages in thread From: Edward Cree @ 2016-03-11 21:29 UTC (permalink / raw) To: Alexander Duyck; +Cc: Tom Herbert, Linux Kernel Network Developers On 11/03/16 21:09, Alexander Duyck wrote: > The only real issue with the "generic" TSO is that it isn't going to > be so generic. We have different devices that will support different > levels of stuff. For example the ixgbe drivers will need to treat the > outer tunnel header as one giant L2 header. As a result we will need > to populate all the fields in the outer header including the outer IP > ID, checksum, udp->len, and UDP or GRE checksum if requested. For > i40e I think this gets a bit simpler as they already handle the outer > IPv4 ID and checksum. I think there we may need to only populate the > checksum for it to work out correctly. As such I may look at coming > up with a number of functions so that we can mix and match based on > what is needed in order to assemble a partially segmented frame. AIUI, the point of the design is that we _can_ populate everything, because we're keeping lengths and outer IP ID fixed, so outer checksums stay the same and the outer tunnel header _is_ just one giant L2 header which is bit-for-bit identical for each generated segment. So every devicegets to just be dumb and treat it as opaque. > The other issue I am working on at the moment to enable all this is to > fix the differents between csum_tcpudp_magic and csum_ipv6_magic in > terms of handling packet lengths greater than 65535. Currently we are > messing up the checksum in relation to IPv6 since we are using the > truncated uh->len value. I'll be submitting some patches later today > that will hopefully get that fixed and that in turn should make the > rest of the segmentation work easier. Again, in the superpacket we want to calculate the checksum based on the subsegment length, rather than the length of the superpacket. The idea is to create the header for an MSS-sized segment, then follow it with an inner IP & TCP header, and n*MSS bytes of payload. (This of course produces a superpacket that's not what you'd send over a link with a 64k MTU, unlike how we do it today.) Then hw just chops up the payload, fixes up the inner headers, and glues the "L2" header on each packet. -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 21:29 ` Edward Cree @ 2016-03-11 22:31 ` Alexander Duyck 2016-03-11 22:55 ` Tom Herbert 0 siblings, 1 reply; 47+ messages in thread From: Alexander Duyck @ 2016-03-11 22:31 UTC (permalink / raw) To: Edward Cree; +Cc: Tom Herbert, Linux Kernel Network Developers On Fri, Mar 11, 2016 at 1:29 PM, Edward Cree <ecree@solarflare.com> wrote: > On 11/03/16 21:09, Alexander Duyck wrote: >> The only real issue with the "generic" TSO is that it isn't going to >> be so generic. We have different devices that will support different >> levels of stuff. For example the ixgbe drivers will need to treat the >> outer tunnel header as one giant L2 header. As a result we will need >> to populate all the fields in the outer header including the outer IP >> ID, checksum, udp->len, and UDP or GRE checksum if requested. For >> i40e I think this gets a bit simpler as they already handle the outer >> IPv4 ID and checksum. I think there we may need to only populate the >> checksum for it to work out correctly. As such I may look at coming >> up with a number of functions so that we can mix and match based on >> what is needed in order to assemble a partially segmented frame. > AIUI, the point of the design is that we _can_ populate everything, > because we're keeping lengths and outer IP ID fixed, so outer checksums > stay the same and the outer tunnel header _is_ just one giant L2 header > which is bit-for-bit identical for each generated segment. So every > devicegets to just be dumb and treat it as opaque. This works so long as the device isn't trying to do anything like insert VLAN tags. Then I think we might have an issue since we don't want to confuse the device and have it trying to insert the tag on the inner frame's Ethernet header. I suspect we may have differing levels of "dumb" that we have to deal with. That is all I am saying. By default we could just populate all of the length and checksum fields in the outer header, we would just have to be consistent about what is provided then. In addition there will be the matter of sorting out the IP ID bits. For example some of the i40e parts support tunnel offloads, but not tunnel offloads with checksums enabled. I suspect those parts will end up wanting to handle the outer IP header and UDP length values. As a result there trying to do a "dumb" send may result in us really messing up the IP ID values if we don't take steps to make it a bit smarter. >> The other issue I am working on at the moment to enable all this is to >> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in >> terms of handling packet lengths greater than 65535. Currently we are >> messing up the checksum in relation to IPv6 since we are using the >> truncated uh->len value. I'll be submitting some patches later today >> that will hopefully get that fixed and that in turn should make the >> rest of the segmentation work easier. > Again, in the superpacket we want to calculate the checksum based on the > subsegment length, rather than the length of the superpacket. The idea > is to create the header for an MSS-sized segment, then follow it with an > inner IP & TCP header, and n*MSS bytes of payload. (This of course > produces a superpacket that's not what you'd send over a link with a 64k > MTU, unlike how we do it today.) The question is at what point do we do the chopping. Should we be doing this in the drivers or somewhere higher in the stack like we do for standard GSO segmentation. I would think we would need to add another bit that says we can do GSO with custom outer headers since I can see VLANs being a possible issue otherwise. > Then hw just chops up the payload, fixes up the inner headers, and glues > the "L2" header on each packet. Yea, it sounds really straight forward and easy. It isn't till you start digging into the actual code that it gets a bit hairy. What this effectively is is another form of TSO where each driver will want to do things a little differently. Alot of it has to do with the fact that this is kind of a nasty hack that we are trying to add since many devices won't like the fact that we are lying about the size of our actual L2 header so things like VLAN tag insertion are going to end up blowing back on us. Really my preference in the case of ixgbe would have been to let the hardware update the outer IP header and the inner TCP header and then do the UDP and inner IP header as the static headers. That way we could still theoretically support fragmentation on the outer headers which last I knew is a very real possibility since the DF bit is not set on the outer headers for VXLAN I believe. - Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 22:31 ` Alexander Duyck @ 2016-03-11 22:55 ` Tom Herbert 2016-03-12 5:40 ` Alexander Duyck 0 siblings, 1 reply; 47+ messages in thread From: Tom Herbert @ 2016-03-11 22:55 UTC (permalink / raw) To: Alexander Duyck; +Cc: Edward Cree, Linux Kernel Network Developers On Fri, Mar 11, 2016 at 2:31 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Fri, Mar 11, 2016 at 1:29 PM, Edward Cree <ecree@solarflare.com> wrote: >> On 11/03/16 21:09, Alexander Duyck wrote: >>> The only real issue with the "generic" TSO is that it isn't going to >>> be so generic. We have different devices that will support different >>> levels of stuff. For example the ixgbe drivers will need to treat the >>> outer tunnel header as one giant L2 header. As a result we will need >>> to populate all the fields in the outer header including the outer IP >>> ID, checksum, udp->len, and UDP or GRE checksum if requested. For >>> i40e I think this gets a bit simpler as they already handle the outer >>> IPv4 ID and checksum. I think there we may need to only populate the >>> checksum for it to work out correctly. As such I may look at coming >>> up with a number of functions so that we can mix and match based on >>> what is needed in order to assemble a partially segmented frame. >> AIUI, the point of the design is that we _can_ populate everything, >> because we're keeping lengths and outer IP ID fixed, so outer checksums >> stay the same and the outer tunnel header _is_ just one giant L2 header >> which is bit-for-bit identical for each generated segment. So every >> devicegets to just be dumb and treat it as opaque. > > This works so long as the device isn't trying to do anything like > insert VLAN tags. Then I think we might have an issue since we don't > want to confuse the device and have it trying to insert the tag on the > inner frame's Ethernet header. > In Edward's giant L2 header mode, couldn't VLAN tags just be part of that? > I suspect we may have differing levels of "dumb" that we have to deal > with. That is all I am saying. By default we could just populate all > of the length and checksum fields in the outer header, we would just > have to be consistent about what is provided then. In addition there > will be the matter of sorting out the IP ID bits. For example some of > the i40e parts support tunnel offloads, but not tunnel offloads with > checksums enabled. I suspect those parts will end up wanting to > handle the outer IP header and UDP length values. As a result there > trying to do a "dumb" send may result in us really messing up the IP > ID values if we don't take steps to make it a bit smarter. > >>> The other issue I am working on at the moment to enable all this is to >>> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in >>> terms of handling packet lengths greater than 65535. Currently we are >>> messing up the checksum in relation to IPv6 since we are using the >>> truncated uh->len value. I'll be submitting some patches later today >>> that will hopefully get that fixed and that in turn should make the >>> rest of the segmentation work easier. >> Again, in the superpacket we want to calculate the checksum based on the >> subsegment length, rather than the length of the superpacket. The idea >> is to create the header for an MSS-sized segment, then follow it with an >> inner IP & TCP header, and n*MSS bytes of payload. (This of course >> produces a superpacket that's not what you'd send over a link with a 64k >> MTU, unlike how we do it today.) > > The question is at what point do we do the chopping. Should we be > doing this in the drivers or somewhere higher in the stack like we do > for standard GSO segmentation. I would think we would need to add > another bit that says we can do GSO with custom outer headers since I > can see VLANs being a possible issue otherwise. > >> Then hw just chops up the payload, fixes up the inner headers, and glues >> the "L2" header on each packet. > > Yea, it sounds really straight forward and easy. It isn't till you > start digging into the actual code that it gets a bit hairy. > > What this effectively is is another form of TSO where each driver will > want to do things a little differently. Alot of it has to do with the > fact that this is kind of a nasty hack that we are trying to add since > many devices won't like the fact that we are lying about the size of > our actual L2 header so things like VLAN tag insertion are going to > end up blowing back on us. > Right, the point is that we're trying to get out of the model where every driver/device implements TSO differently, supports ad hoc protocols, etc. Do you see any other common invasive technique that we need to deal with other than VLAN insertion and IP ID? > Really my preference in the case of ixgbe would have been to let the > hardware update the outer IP header and the inner TCP header and then > do the UDP and inner IP header as the static headers. That way we > could still theoretically support fragmentation on the outer headers > which last I knew is a very real possibility since the DF bit is not > set on the outer headers for VXLAN I believe. > > - Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) 2016-03-11 22:55 ` Tom Herbert @ 2016-03-12 5:40 ` Alexander Duyck 2016-03-14 10:26 ` Generic TSO Edward Cree 0 siblings, 1 reply; 47+ messages in thread From: Alexander Duyck @ 2016-03-12 5:40 UTC (permalink / raw) To: Tom Herbert; +Cc: Edward Cree, Linux Kernel Network Developers On Fri, Mar 11, 2016 at 2:55 PM, Tom Herbert <tom@herbertland.com> wrote: > On Fri, Mar 11, 2016 at 2:31 PM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Fri, Mar 11, 2016 at 1:29 PM, Edward Cree <ecree@solarflare.com> wrote: >>> On 11/03/16 21:09, Alexander Duyck wrote: >>>> The only real issue with the "generic" TSO is that it isn't going to >>>> be so generic. We have different devices that will support different >>>> levels of stuff. For example the ixgbe drivers will need to treat the >>>> outer tunnel header as one giant L2 header. As a result we will need >>>> to populate all the fields in the outer header including the outer IP >>>> ID, checksum, udp->len, and UDP or GRE checksum if requested. For >>>> i40e I think this gets a bit simpler as they already handle the outer >>>> IPv4 ID and checksum. I think there we may need to only populate the >>>> checksum for it to work out correctly. As such I may look at coming >>>> up with a number of functions so that we can mix and match based on >>>> what is needed in order to assemble a partially segmented frame. >>> AIUI, the point of the design is that we _can_ populate everything, >>> because we're keeping lengths and outer IP ID fixed, so outer checksums >>> stay the same and the outer tunnel header _is_ just one giant L2 header >>> which is bit-for-bit identical for each generated segment. So every >>> devicegets to just be dumb and treat it as opaque. >> >> This works so long as the device isn't trying to do anything like >> insert VLAN tags. Then I think we might have an issue since we don't >> want to confuse the device and have it trying to insert the tag on the >> inner frame's Ethernet header. >> > In Edward's giant L2 header mode, couldn't VLAN tags just be part of that? The problem is things like VFs which aren't allowed to insert their own tags. Having them try to lie about where the network header actually starts may trigger things like anti-spoof events. >> I suspect we may have differing levels of "dumb" that we have to deal >> with. That is all I am saying. By default we could just populate all >> of the length and checksum fields in the outer header, we would just >> have to be consistent about what is provided then. In addition there >> will be the matter of sorting out the IP ID bits. For example some of >> the i40e parts support tunnel offloads, but not tunnel offloads with >> checksums enabled. I suspect those parts will end up wanting to >> handle the outer IP header and UDP length values. As a result there >> trying to do a "dumb" send may result in us really messing up the IP >> ID values if we don't take steps to make it a bit smarter. >> >>>> The other issue I am working on at the moment to enable all this is to >>>> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in >>>> terms of handling packet lengths greater than 65535. Currently we are >>>> messing up the checksum in relation to IPv6 since we are using the >>>> truncated uh->len value. I'll be submitting some patches later today >>>> that will hopefully get that fixed and that in turn should make the >>>> rest of the segmentation work easier. >>> Again, in the superpacket we want to calculate the checksum based on the >>> subsegment length, rather than the length of the superpacket. The idea >>> is to create the header for an MSS-sized segment, then follow it with an >>> inner IP & TCP header, and n*MSS bytes of payload. (This of course >>> produces a superpacket that's not what you'd send over a link with a 64k >>> MTU, unlike how we do it today.) >> >> The question is at what point do we do the chopping. Should we be >> doing this in the drivers or somewhere higher in the stack like we do >> for standard GSO segmentation. I would think we would need to add >> another bit that says we can do GSO with custom outer headers since I >> can see VLANs being a possible issue otherwise. >> >>> Then hw just chops up the payload, fixes up the inner headers, and glues >>> the "L2" header on each packet. >> >> Yea, it sounds really straight forward and easy. It isn't till you >> start digging into the actual code that it gets a bit hairy. >> >> What this effectively is is another form of TSO where each driver will >> want to do things a little differently. Alot of it has to do with the >> fact that this is kind of a nasty hack that we are trying to add since >> many devices won't like the fact that we are lying about the size of >> our actual L2 header so things like VLAN tag insertion are going to >> end up blowing back on us. >> > Right, the point is that we're trying to get out of the model where > every driver/device implements TSO differently, supports ad hoc > protocols, etc. Do you see any other common invasive technique that we > need to deal with other than VLAN insertion and IP ID? Well that is the thing. Before we can actually start tinkering with the outer header we probably need to make sure we set the DF bit and that it would be honored on the outer headers for IPv4. I don't believe any of the tunnels are currently doing that so repeating the IP ID would be the worst possible scenario until that is resolved since VXLAN tunneled frames can be fragmented while TCP frames cannot so we really shouldn't be repeating IP IDs for the outer headers. - Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO 2016-03-12 5:40 ` Alexander Duyck @ 2016-03-14 10:26 ` Edward Cree 2016-03-14 10:32 ` Edward Cree 0 siblings, 1 reply; 47+ messages in thread From: Edward Cree @ 2016-03-14 10:26 UTC (permalink / raw) To: Alexander Duyck, Tom Herbert; +Cc: Linux Kernel Network Developers On 12/03/16 05:40, Alexander Duyck wrote: > Well that is the thing. Before we can actually start tinkering with > the outer header we probably need to make sure we set the DF bit and > that it would be honored on the outer headers for IPv4. I don't > believe any of the tunnels are currently doing that so repeating the > IP ID would be the worst possible scenario until that is resolved > since VXLAN tunneled frames can be fragmented while TCP frames cannot > so we really shouldn't be repeating IP IDs for the outer headers. So how do we progress with that? I'm presuming it's not as simple as just patching the tunnel drivers to set DF if the inner packet has it, as that could break existing setups. (I've heard that "but they're already broken anyway" is not usually an acceptable argument.) Some sort of configuration option on the tunnel (like we do with udpcsum)? Fortunately, with the design I'm currently planning on, a tunnel driver could just set a flag in the SKB to say "unsafe for generic- TSO", and we'd just send out the first segment normally and fall back to regular software segmentation. -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO 2016-03-14 10:26 ` Generic TSO Edward Cree @ 2016-03-14 10:32 ` Edward Cree 2016-03-14 15:59 ` Alexander Duyck 0 siblings, 1 reply; 47+ messages in thread From: Edward Cree @ 2016-03-14 10:32 UTC (permalink / raw) To: Alexander Duyck, Tom Herbert; +Cc: Linux Kernel Network Developers On 14/03/16 10:26, Edward Cree wrote: > On 12/03/16 05:40, Alexander Duyck wrote: >> Well that is the thing. Before we can actually start tinkering with >> the outer header we probably need to make sure we set the DF bit and >> that it would be honored on the outer headers for IPv4. I don't >> believe any of the tunnels are currently doing that so repeating the >> IP ID would be the worst possible scenario until that is resolved >> since VXLAN tunneled frames can be fragmented while TCP frames cannot >> so we really shouldn't be repeating IP IDs for the outer headers. > So how do we progress with that? I'm presuming it's not as simple as > just patching the tunnel drivers to set DF if the inner packet has it, > as that could break existing setups. (I've heard that "but they're > already broken anyway" is not usually an acceptable argument.) Some > sort of configuration option on the tunnel (like we do with udpcsum)? ...and immediately I find out it already exists. (I guess I should have looked there first!) >From drivers/net/vxlan.c:2001: > else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT) > df = htons(IP_DF); -Ed ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO 2016-03-14 10:32 ` Edward Cree @ 2016-03-14 15:59 ` Alexander Duyck 0 siblings, 0 replies; 47+ messages in thread From: Alexander Duyck @ 2016-03-14 15:59 UTC (permalink / raw) To: Edward Cree; +Cc: Tom Herbert, Linux Kernel Network Developers On Mon, Mar 14, 2016 at 3:32 AM, Edward Cree <ecree@solarflare.com> wrote: > On 14/03/16 10:26, Edward Cree wrote: >> On 12/03/16 05:40, Alexander Duyck wrote: >>> Well that is the thing. Before we can actually start tinkering with >>> the outer header we probably need to make sure we set the DF bit and >>> that it would be honored on the outer headers for IPv4. I don't >>> believe any of the tunnels are currently doing that so repeating the >>> IP ID would be the worst possible scenario until that is resolved >>> since VXLAN tunneled frames can be fragmented while TCP frames cannot >>> so we really shouldn't be repeating IP IDs for the outer headers. >> So how do we progress with that? I'm presuming it's not as simple as >> just patching the tunnel drivers to set DF if the inner packet has it, >> as that could break existing setups. (I've heard that "but they're >> already broken anyway" is not usually an acceptable argument.) Some >> sort of configuration option on the tunnel (like we do with udpcsum)? > ...and immediately I find out it already exists. (I guess I should have > looked there first!) > From drivers/net/vxlan.c:2001: >> else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT) >> df = htons(IP_DF); I'm still not a fan of trying to freeze the outer IP header. I think it should be the one that should have the IP ID increment while the inner IP header be the one that is frozen. Maybe that is where we can differ per device. I would be okay with the outer tunnel headers and inner IP header being frozen on ixgbe which will be needed in order to compute outer UDP checksum anyway. Then we could leave it up to the driver's discretion as to if the outer header has the IP ID that increments or the inner header. - Alex ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Generic TSO 2016-03-11 19:20 ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree 2016-03-11 19:57 ` Tom Herbert @ 2016-03-11 20:22 ` David Miller 1 sibling, 0 replies; 47+ messages in thread From: David Miller @ 2016-03-11 20:22 UTC (permalink / raw) To: ecree; +Cc: tom, netdev From: Edward Cree <ecree@solarflare.com> Date: Fri, 11 Mar 2016 19:20:41 +0000 > Are you planning to / working on implementing this? If not, I might have a > crack at it; I've talked to our firmware guys and (provisionally) we think > we can support it in current sfc hardware. > Or were there any blocking problems raised in the thread? My understanding > of the IP ID issue was that it only matters for the inner frame, because > the rest aren't TCP (so hopefully no-one is doing SLHC on them). But I may > have missed something. I've thought a lot more about the IP ID issue, and I am now starting to learn towards allowing it to be set to zero for "DF" packets. Considering what we gain in return, not working optimally with an ancient SLIP header compression implementation is a small loss. Any other opinions? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default 2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck ` (2 preceding siblings ...) 2016-02-19 21:53 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum " Jesse Gross @ 2016-02-22 3:06 ` David Miller 3 siblings, 0 replies; 47+ messages in thread From: David Miller @ 2016-02-22 3:06 UTC (permalink / raw) To: aduyck; +Cc: netdev, alexander.duyck From: Alexander Duyck <aduyck@mirantis.com> Date: Fri, 19 Feb 2016 11:26:17 -0800 > This patch series makes it so that we enable the outer Tx checksum > for IPv4 tunnels by default. Series applied, thanks Alex. ^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2016-03-14 15:59 UTC | newest] Thread overview: 47+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-02-19 19:26 [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default Alexander Duyck 2016-02-19 19:26 ` [net-next PATCH 1/2] GENEVE: Support outer IPv4 Tx checksums " Alexander Duyck 2016-02-19 20:28 ` Tom Herbert 2016-02-19 19:26 ` [net-next PATCH 2/2] VXLAN: " Alexander Duyck 2016-02-19 20:27 ` Tom Herbert 2016-02-19 21:36 ` Jesse Gross 2016-02-19 21:53 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum " Jesse Gross 2016-02-19 23:10 ` Alex Duyck 2016-02-20 0:08 ` Jesse Gross 2016-02-20 0:14 ` Tom Herbert 2016-02-20 2:18 ` Jesse Gross 2016-02-20 19:51 ` Tom Herbert 2016-02-23 3:31 ` Jesse Gross 2016-02-23 15:18 ` Edward Cree 2016-02-23 16:47 ` Tom Herbert 2016-02-23 17:20 ` Rick Jones 2016-02-23 17:38 ` Edward Cree 2016-02-23 18:08 ` David Miller 2016-02-23 20:20 ` Edward Cree 2016-02-23 23:11 ` David Miller 2016-02-24 0:53 ` Tom Herbert 2016-02-24 17:30 ` Edward Cree 2016-02-23 18:11 ` Tom Herbert 2016-02-23 17:31 ` Jesse Gross 2016-02-23 17:42 ` Tom Herbert 2016-02-23 18:18 ` Alexander Duyck 2016-02-23 18:26 ` David Miller 2016-02-23 18:32 ` Tom Herbert 2016-02-23 18:24 ` David Miller 2016-02-24 9:58 ` David Laight 2016-02-24 15:41 ` David Miller 2016-02-25 20:14 ` David Miller 2016-03-11 19:20 ` Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default) Edward Cree 2016-03-11 19:57 ` Tom Herbert 2016-03-11 19:59 ` Edward Cree 2016-03-11 20:16 ` Tom Herbert 2016-03-11 20:24 ` Edward Cree 2016-03-11 21:09 ` Alexander Duyck 2016-03-11 21:29 ` Edward Cree 2016-03-11 22:31 ` Alexander Duyck 2016-03-11 22:55 ` Tom Herbert 2016-03-12 5:40 ` Alexander Duyck 2016-03-14 10:26 ` Generic TSO Edward Cree 2016-03-14 10:32 ` Edward Cree 2016-03-14 15:59 ` Alexander Duyck 2016-03-11 20:22 ` David Miller 2016-02-22 3:06 ` [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).