* [PATCH net-next] veth: extend features to support tunneling
  2013-10-26  0:52           ` Eric Dumazet
@ 2013-10-26  1:25             ` Eric Dumazet
  2013-10-26  2:22               ` Alexei Starovoitov
  2013-10-28  4:58               ` David Miller
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2013-10-26  1:25 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Miller, Eric Dumazet, stephen, netdev

From: Eric Dumazet <edumazet@google.com>

While investigating a recent vxlan regression, I found veth
was using a zero features set for vxlan tunnels.

We therefore have to segment GSO frames, copy the payload, and do the
checksum in software.

This patch brings a ~200% performance increase.

We probably have to add hw_enc_features support
on other virtual devices.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
---
 drivers/net/veth.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b2d0347..b24db7a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -261,6 +261,8 @@ static const struct net_device_ops veth_netdev_ops = {
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |    \
 		       NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+		       NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL |	    \
+		       NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT | NETIF_F_UFO	|   \
 		       NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
 		       NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
 
@@ -279,6 +281,7 @@ static void veth_setup(struct net_device *dev)
 	dev->destructor = veth_dev_free;
 
 	dev->hw_features = VETH_FEATURES;
+	dev->hw_enc_features = VETH_FEATURES;
 }
 
 /*
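
A minimal sketch of why this matters (illustrative only, assuming the
3.12-era behaviour; this helper is not actual kernel code): for an
encapsulated skb, the transmit path effectively intersects the device
features with dev->hw_enc_features, so a zero hw_enc_features forces
software segmentation and checksumming of every vxlan segment.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative only: usable offloads for an encapsulated skb are
 * roughly the device features masked by hw_enc_features. With
 * hw_enc_features == 0 (veth before this patch) the result is empty
 * and the stack falls back to software GSO + checksum per segment.
 */
static netdev_features_t enc_tx_features(const struct sk_buff *skb,
					 const struct net_device *dev)
{
	netdev_features_t features = dev->features;

	if (skb->encapsulation)
		features &= dev->hw_enc_features;

	return features;
}

With the patch, VETH_FEATURES lands in hw_enc_features as well, so the
intersection sketched above no longer collapses to zero for tunneled
traffic.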

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-10-26  1:25             ` [PATCH net-next] veth: extend features to support tunneling Eric Dumazet
@ 2013-10-26  2:22               ` Alexei Starovoitov
  2013-10-26  4:13                 ` Eric Dumazet
  2013-10-28  4:58               ` David Miller
  1 sibling, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2013-10-26  2:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Eric Dumazet, stephen, netdev

On Fri, Oct 25, 2013 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating on a recent vxlan regression, I found veth
> was using a zero features set for vxlan tunnels.

oneliner can be better :)

> We have to segment GSO frames, copy the payload, and do the checksum.
>
> This patch brings a ~200% performance increase
>
> We probably have to add hw_enc_features support
> on other virtual devices.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexei Starovoitov <ast@plumgrid.com>
> ---

iperf over veth with gre/vxlan tunneling is now ~4Gbps. 20x gain as advertised.
Thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-10-26  2:22               ` Alexei Starovoitov
@ 2013-10-26  4:13                 ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2013-10-26  4:13 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Miller, Eric Dumazet, stephen, netdev

On Fri, 2013-10-25 at 19:22 -0700, Alexei Starovoitov wrote:
> On Fri, Oct 25, 2013 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > While investigating on a recent vxlan regression, I found veth
> > was using a zero features set for vxlan tunnels.
> 
> oneliner can be better :)

Yes, I'll post the gso fix as well, of course ;)

> 
> > We have to segment GSO frames, copy the payload, and do the checksum.
> >
> > This patch brings a ~200% performance increase
> >
> > We probably have to add hw_enc_features support
> > on other virtual devices.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Alexei Starovoitov <ast@plumgrid.com>
> > ---
> 
> iperf over veth with gre/vxlan tunneling is now ~4Gbps. 20x gain as advertised.
> Thanks!

Wow, such a difference might show another bug somewhere else...

Thanks !

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-10-26  1:25             ` [PATCH net-next] veth: extend features to support tunneling Eric Dumazet
  2013-10-26  2:22               ` Alexei Starovoitov
@ 2013-10-28  4:58               ` David Miller
  1 sibling, 0 replies; 16+ messages in thread
From: David Miller @ 2013-10-28  4:58 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ast, edumazet, stephen, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Oct 2013 18:25:03 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> While investigating on a recent vxlan regression, I found veth
> was using a zero features set for vxlan tunnels.
> 
> We have to segment GSO frames, copy the payload, and do the checksum.
> 
> This patch brings a ~200% performance increase
> 
> We probably have to add hw_enc_features support
> on other virtual devices.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
@ 2013-11-16 21:11 Or Gerlitz
  2013-11-16 21:40 ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Or Gerlitz @ 2013-11-16 21:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sat, Oct 26, 2013 at 6:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-10-25 at 19:22 -0700, Alexei Starovoitov wrote:
>> On Fri, Oct 25, 2013 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > While investigating on a recent vxlan regression, I found veth
>> > was using a zero features set for vxlan tunnels.
>>
>> oneliner can be better :)
>
> Yes, I'll post the gso fix as well, of course ;)
>
>>
>> > We have to segment GSO frames, copy the payload, and do the checksum.
>> >
>> > This patch brings a ~200% performance increase
>> >
>> > We probably have to add hw_enc_features support
>> > on other virtual devices.
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> > Cc: Alexei Starovoitov <ast@plumgrid.com>
>> > ---
>>
>> iperf over veth with gre/vxlan tunneling is now ~4Gbps. 20x gain as advertised.
>> Thanks!
>
> Wow, such a difference might show another bug somewhere else...

Guys (thanks Eric for the clarification over the other vxlan thread),
with the latest networking code (e.g. 3.12 or net-next) do you expect a
notable performance (throughput) difference between these two configs?

1. bridge --> vxlan --> NIC
2. veth --> bridge --> vxlan --> NIC

BTW #2 doesn't work when packets start to be large unless I manually
decrease the veth device pair MTU. E.g. if the NIC MTU is 1500, vxlan
advertises an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
inherits that, but not the veth device. Should someone/somewhere here
generate an ICMP packet which will cause the stack to decrease the
path mtu for the neighbour created on the veth device? What about
para-virtualized guests which are plugged into this (or any host-based
tunneling) scheme, e.g. as in this one:

3. guest virtio NIC --> vhost --> tap/macvtap --> bridge --> vxlan --> NIC

Who should make the guest NIC mtu/path mtu take the tunneling overhead
into account, and how?

Or.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-16 21:11 [PATCH net-next] veth: extend features to support tunneling Or Gerlitz
@ 2013-11-16 21:40 ` Eric Dumazet
  2013-11-17  0:09   ` Eric Dumazet
                     ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Eric Dumazet @ 2013-11-16 21:40 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sat, 2013-11-16 at 23:11 +0200, Or Gerlitz wrote:

> Guys (thanks Eric for the clarification over the other vxlan thread),
> with the latest networking code (e.g 3.12 or net-next)  do you expect
> notable performance (throughput) difference between these two configs?
> 
> 1. bridge --> vxlan --> NIC
> 2. veth --> bridge --> vxlan --> NIC
> 
> BTW #2 doesn't work when packets start to be large unless I manually
> decrease the veth device pair MTU. E.g if the NIC MTU is 1500, vxlan
> advertizes an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
> inherits that, but not the veth device. Should someone/somewhere here
> generate an ICMP packet which will cause the stack to decreate the
> path mtu for the neighbour created on the veth device? what about
> para-virtualized guests which are plugged into this (or any host based
> tunneling) scheme, e.g in this scheme
> 
> 3. guest virtio NIC --> vhost  --> tap/macvtap --> bridge --> vxlan --> NIC
> 
> Who/how do we want the guest NIC mtu/path mtu to take into account the
> tunneling over-head?

I mentioned this problem on another thread: gso packets escape the
normal mtu checks in ip forwarding.

vi +91 net/ipv4/ip_forward.c

gso_size contains the size of the segment minus all headers.

Please try the following :

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d68633452d9b..489b56935a56 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -388,4 +388,16 @@ static inline int fastopen_init_queue(struct sock *sk, int backlog)
 	return 0;
 }
 
+static inline unsigned int gso_size_with_headers(const struct sk_buff *skb)
+{
+	unsigned int hdrlen = skb_transport_header(skb) - skb_mac_header(skb);
+
+	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
+		hdrlen += tcp_hdrlen(skb);
+	else
+		hdrlen += 8; // sizeof(struct udphdr)
+
+	return skb_shinfo(skb)->gso_size + hdrlen;
+}
+
 #endif	/* _LINUX_TCP_H */
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 694de3b7aebf..3949cc1dd1ca 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -57,6 +57,7 @@ int ip_forward(struct sk_buff *skb)
 	struct iphdr *iph;	/* Our header */
 	struct rtable *rt;	/* Route we use */
 	struct ip_options *opt	= &(IPCB(skb)->opt);
+	unsigned int len;
 
 	if (skb_warn_if_lro(skb))
 		goto drop;
@@ -88,7 +89,11 @@ int ip_forward(struct sk_buff *skb)
 	if (opt->is_strictroute && rt->rt_uses_gateway)
 		goto sr_failed;
 
-	if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
+	len = skb->len;
+	if (skb_is_gso(skb))
+		len = gso_size_with_headers(skb);
+
+	if (unlikely(len > dst_mtu(&rt->dst) &&
 		     (ip_hdr(skb)->frag_off & htons(IP_DF))) && !skb->local_df) {
 		IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-16 21:40 ` Eric Dumazet
@ 2013-11-17  0:09   ` Eric Dumazet
  2013-11-17  6:57   ` Or Gerlitz
  2013-11-17  7:31   ` Alexei Starovoitov
  2 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2013-11-17  0:09 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sat, 2013-11-16 at 13:40 -0800, Eric Dumazet wrote:
> On Sat, 2013-11-16 at 23:11 +0200, Or Gerlitz wrote:
> 
> > Guys (thanks Eric for the clarification over the other vxlan thread),
> > with the latest networking code (e.g 3.12 or net-next)  do you expect
> > notable performance (throughput) difference between these two configs?
> > 
> > 1. bridge --> vxlan --> NIC
> > 2. veth --> bridge --> vxlan --> NIC
> > 
> > BTW #2 doesn't work when packets start to be large unless I manually
> > decrease the veth device pair MTU. E.g if the NIC MTU is 1500, vxlan
> > advertizes an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
> > inherits that, but not the veth device. Should someone/somewhere here
> > generate an ICMP packet which will cause the stack to decreate the
> > path mtu for the neighbour created on the veth device? what about
> > para-virtualized guests which are plugged into this (or any host based
> > tunneling) scheme, e.g in this scheme
> > 
> > 3. guest virtio NIC --> vhost  --> tap/macvtap --> bridge --> vxlan --> NIC
> > 
> > Who/how do we want the guest NIC mtu/path mtu to take into account the
> > tunneling over-head?
> 
> I mentioned this problem on another thread : gso packets escape the
> normal mtu checks in ip forwarding.
> 
> vi +91 net/ipv4/ip_forward.c
> 
> gso_size contains the size of the segment minus all headers.
> 
> Please try the following :
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index d68633452d9b..489b56935a56 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -388,4 +388,16 @@ static inline int fastopen_init_queue(struct sock *sk, int backlog)
>  	return 0;
>  }
>  
> +static inline unsigned int gso_size_with_headers(const struct sk_buff *skb)
> +{
> +	unsigned int hdrlen = skb_transport_header(skb) - skb_mac_header(skb);

or more exactly:

	unsigned int hdrlen = skb_transport_header(skb) - skb_network_header(skb);


> +
> +	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
> +		hdrlen += tcp_hdrlen(skb);
> +	else
> +		hdrlen += 8; // sizeof(struct udphdr)
> +
> +	return skb_shinfo(skb)->gso_size + hdrlen;
> +}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-16 21:40 ` Eric Dumazet
  2013-11-17  0:09   ` Eric Dumazet
@ 2013-11-17  6:57   ` Or Gerlitz
  2013-11-17  7:31   ` Alexei Starovoitov
  2 siblings, 0 replies; 16+ messages in thread
From: Or Gerlitz @ 2013-11-17  6:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sat, Nov 16, 2013 at 11:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2013-11-16 at 23:11 +0200, Or Gerlitz wrote:
>
>> Guys (thanks Eric for the clarification over the other vxlan thread),
>> with the latest networking code (e.g 3.12 or net-next)  do you expect
>> notable performance (throughput) difference between these two configs?
>>
>> 1. bridge --> vxlan --> NIC
>> 2. veth --> bridge --> vxlan --> NIC
>>
>> BTW #2 doesn't work when packets start to be large unless I manually
>> decrease the veth device pair MTU. E.g if the NIC MTU is 1500, vxlan
>> advertizes an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
>> inherits that, but not the veth device. Should someone/somewhere here
>> generate an ICMP packet which will cause the stack to decreate the
>> path mtu for the neighbour created on the veth device? what about
>> para-virtualized guests which are plugged into this (or any host based
>> tunneling) scheme, e.g in this scheme
>>
>> 3. guest virtio NIC --> vhost  --> tap/macvtap --> bridge --> vxlan --> NIC
>>
>> Who/how do we want the guest NIC mtu/path mtu to take into account the
>> tunneling over-head?
>
> I mentioned this problem on another thread : gso packets escape the
> normal mtu checks in ip forwarding.
>
> vi +91 net/ipv4/ip_forward.c
>
> gso_size contains the size of the segment minus all headers.
>
> Please try the following :

Will set up & try & let you know. However, this doesn't address non-TCP
traffic, e.g. large pings, correct?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-16 21:40 ` Eric Dumazet
  2013-11-17  0:09   ` Eric Dumazet
  2013-11-17  6:57   ` Or Gerlitz
@ 2013-11-17  7:31   ` Alexei Starovoitov
  2013-11-17 17:20     ` Eric Dumazet
  2013-11-18 17:55     ` David Stevens
  2 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2013-11-17  7:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sat, Nov 16, 2013 at 1:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2013-11-16 at 23:11 +0200, Or Gerlitz wrote:
>
>> Guys (thanks Eric for the clarification over the other vxlan thread),
>> with the latest networking code (e.g 3.12 or net-next)  do you expect
>> notable performance (throughput) difference between these two configs?
>>
>> 1. bridge --> vxlan --> NIC
>> 2. veth --> bridge --> vxlan --> NIC
>>
>> BTW #2 doesn't work when packets start to be large unless I manually
>> decrease the veth device pair MTU. E.g if the NIC MTU is 1500, vxlan
>> advertizes an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
>> inherits that, but not the veth device. Should someone/somewhere here
>> generate an ICMP packet which will cause the stack to decreate the
>> path mtu for the neighbour created on the veth device? what about
>> para-virtualized guests which are plugged into this (or any host based
>> tunneling) scheme, e.g in this scheme
>>
>> 3. guest virtio NIC --> vhost  --> tap/macvtap --> bridge --> vxlan --> NIC
>>
>> Who/how do we want the guest NIC mtu/path mtu to take into account the
>> tunneling over-head?
>
> I mentioned this problem on another thread : gso packets escape the
> normal mtu checks in ip forwarding.
>
> vi +91 net/ipv4/ip_forward.c
>
> gso_size contains the size of the segment minus all headers.

In the case of VMs sending gso packets over tap and a tunnel in the
host, ip_forward is not in the picture.

When the host mtu doesn't account for the overhead of the tunnel, the
neat trick we can do is to decrease gso_size while adding the tunnel
header. This way, when skb_gso_segment() kicks in during tx, the
packets will be segmented into host-mtu-sized packets.
The receiving vm on the other side will see packets of size
guest_mtu - tunnel_header_size, but imo that's much better than
sending ip fragments over the vxlan fabric.
It will work for guests sending tcp/udp, but there is no good solution
for icmp other than ip frags.
This trick should work for hw-offloaded vxlan, but we have yet to
experiment with it on such a nic.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-17  7:31   ` Alexei Starovoitov
@ 2013-11-17 17:20     ` Eric Dumazet
  2013-11-17 19:00       ` Alexei Starovoitov
  2013-11-18 17:55     ` David Stevens
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2013-11-17 17:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Or Gerlitz, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sat, 2013-11-16 at 23:31 -0800, Alexei Starovoitov wrote:

> In case of VMs sending gso packets over tap and tunnel in the host,
> ip_forward is not in the picture.
> 

I was specifically answering #2, which should use ip forwarding, of
course. Note that my patch was a POC: we have many other places where
the typical MTU check is simply disabled as soon as the skb is GSO.

> when host mtu doesn't account for overhead of tunnel, the neat trick
> we can do is to decrease gso_size while adding tunnel header.

Changing gso_size would be very broken; it breaks the DF flag
semantics. You need to send an ICMP, and the sender will take the
appropriate action.

GRO + GSO require that forwarded segments are the same as the incoming
ones. It's not like a proxy that can choose to aggregate as it wants.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-17 17:20     ` Eric Dumazet
@ 2013-11-17 19:00       ` Alexei Starovoitov
  2013-11-17 21:20         ` Or Gerlitz
  2013-11-17 22:38         ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2013-11-17 19:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sun, Nov 17, 2013 at 9:20 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2013-11-16 at 23:31 -0800, Alexei Starovoitov wrote:
>
>> In case of VMs sending gso packets over tap and tunnel in the host,
>> ip_forward is not in the picture.
>>
>
> I was specifically answering to #2 which should use ip forwarding, of
> course. Note that my patch was a POC : We have many other places where
> the typical MTU check is simply disabled as soon as skb is GSO.

I don't think #2 will do ip_forward either. veth goes into a bridge
and vxlan just adds encap.

>> when host mtu doesn't account for overhead of tunnel, the neat trick
>> we can do is to decrease gso_size while adding tunnel header.
>
> That would be very broken to change gso_size, this breaks DF flag
> semantic. You need to send an ICMP, and the sender will take appropriate
> action.
>
> GRO + GSO request that forwarded segments are the same than incoming
> ones. It's not like a proxy that can chose to aggregate as it wants.

In the case of ip_forwarding - yes, but in the case of a tunnel we may
look at gso differently.
I'm not saying that the 'gso_size -= tunnel_header' trick should be
unconditional. Obviously the DF flag needs to be respected for pmtu to
work. ip_tunnel_xmit() is already supposed to send icmp_frag_needed
back; I'm not sure it works for vxlan though.
What I'm proposing is that if the tunnel receives a normal gso packet
with the df bit not set, it can safely decrease that skb's gso_size.
This way a guest vm can have a 1500 mtu just like the host's 1500 mtu,
until the guest tries pmtu discovery; then vxlan sends an icmp into the
guest and the guest adjusts itself if it can.
Since the host cannot guarantee that the guest will do the right thing
with icmp_frag_needed, it may keep doing the 'gso_size -= tunnel_header'
trick without breaking anything.
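
A rough sketch of the trick being described (illustrative only, not a
tested patch; the helper name, the encap_len parameter and the call
site are hypothetical, and it assumes an IPv4 inner packet):

#include <linux/ip.h>
#include <linux/skbuff.h>

/* Sketch: when the tunnel adds encap_len bytes of headers and the
 * inner packet does not carry DF, shrink gso_size so that a later
 * skb_gso_segment() emits segments that still fit the underlay MTU
 * after encapsulation.
 */
static void tunnel_shrink_gso(struct sk_buff *skb, unsigned int encap_len)
{
	if (!skb_is_gso(skb))
		return;

	/* respect DF: pmtu discovery must get an ICMP instead */
	if (ip_hdr(skb)->frag_off & htons(IP_DF))
		return;

	if (skb_shinfo(skb)->gso_size > encap_len)
		skb_shinfo(skb)->gso_size -= encap_len;
}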

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-17 19:00       ` Alexei Starovoitov
@ 2013-11-17 21:20         ` Or Gerlitz
  2013-11-17 22:38         ` Eric Dumazet
  1 sibling, 0 replies; 16+ messages in thread
From: Or Gerlitz @ 2013-11-17 21:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sun, Nov 17, 2013 , Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Sun, Nov 17, 2013 , Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sat, 2013-11-16 , Alexei Starovoitov wrote:

>>> In case of VMs sending gso packets over tap and tunnel in the host,
>>> ip_forward is not in the picture.

>> I was specifically answering to #2 which should use ip forwarding, of
>> course. Note that my patch was a POC : We have many other places where
>> the typical MTU check is simply disabled as soon as skb is GSO.

> I don't think #2 will do ip_forward either. veth goes into a bridge
> and vxlan just adds encap.

Eric, do we have consensus here that this #2 of veth --> bridge -->
vxlan --> NIC will not go through ip_forward?!

Anyway, I tried your patch and didn't see a notable improvement in my
tests. The tests I did a few days ago were on 3.10.19 to have more
stable ground... moving to 3.12.0 and net-next today, the baseline
performance became worse: a somewhat simplified env of bridge -->
vxlan --> NIC with many iperf client threads used to yield throughput
similar to vxlan --> NIC or bridge --> NIC, but with net-next that is
no longer the case. If you have 10Gb/s or 40Gb/s NICs, even without HW
TCP offloads for VXLAN, you might be able to see that on your setups.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-17 19:00       ` Alexei Starovoitov
  2013-11-17 21:20         ` Or Gerlitz
@ 2013-11-17 22:38         ` Eric Dumazet
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2013-11-17 22:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Or Gerlitz, David Miller, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend

On Sun, 2013-11-17 at 11:00 -0800, Alexei Starovoitov wrote:
> On Sun, Nov 17, 2013 at 9:20 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Sat, 2013-11-16 at 23:31 -0800, Alexei Starovoitov wrote:
> >
> >> In case of VMs sending gso packets over tap and tunnel in the host,
> >> ip_forward is not in the picture.
> >>
> >
> > I was specifically answering to #2 which should use ip forwarding, of
> > course. Note that my patch was a POC : We have many other places where
> > the typical MTU check is simply disabled as soon as skb is GSO.
> 
> I don't think #2 will do ip_forward either. veth goes into a bridge
> and vxlan just adds encap.

The point is: nothing really checks the mtu for gso packets.

The same construct was copy/pasted everywhere. See
br_dev_queue_push_xmit() for another example.
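
The recurring construct is roughly the following (a sketch of the
common pattern, not an exact copy of any one call site; the helper
name is invented for illustration):

#include <linux/skbuff.h>

/* Sketch of the check copy/pasted across forwarding/bridging paths: a
 * GSO skb skips the length-vs-MTU comparison entirely, so an oversized
 * GSO packet sails through where a plain packet would be fragmented or
 * rejected with ICMP_FRAG_NEEDED.
 */
static bool skb_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
{
	return skb->len > mtu && !skb_is_gso(skb);
}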

In this particular case, I guess vxlan should generate the ICMP.

But really, it's Sunday and I currently do not care ;)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-17  7:31   ` Alexei Starovoitov
  2013-11-17 17:20     ` Eric Dumazet
@ 2013-11-18 17:55     ` David Stevens
  2013-11-18 22:57       ` Jesse Gross
  2013-11-19  3:51       ` Alexei Starovoitov
  1 sibling, 2 replies; 16+ messages in thread
From: David Stevens @ 2013-11-18 17:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Eric Dumazet, Eric Dumazet, John Fastabend,
	Michael S. Tsirkin, netdev@vger.kernel.org, netdev-owner,
	Or Gerlitz, Stephen Hemminger

netdev-owner@vger.kernel.org wrote on 11/17/2013 02:31:08 AM:

> From: Alexei Starovoitov <ast@plumgrid.com>
 
> when host mtu doesn't account for overhead of tunnel, the neat trick
> we can do is to decrease gso_size while adding tunnel header.

Won't this possibly result in ip_id collisions if you generate more
segments than the VM thinks you did? Not an issue if DF is set, I
suppose.

> This way when skb_gso_segment() kicks in during tx the packets will be
> segmented into host mtu sized packets.

        I've been looking at something like this, but going the other
way. In the VXLAN case, other VXLAN VMs are reachable via the virtual
L2 network, so we *ought* to use large packets in the whole path if the
underlay network can do that. PMTUD should not apply within the virtual
L2 network, but the code as-is will segment to the VM gso_size (say
1500) even if the underlay network can send jumbo frames (say 9K).
        I think what we want for best performance is to send GSO
packets from the VM and use the underlay network MTU minus the VXLAN
headers as the gso_size for those.
        One complication there is that we would actually want to use
the VM gso_size if the destination is a router taking us off of the
VXLAN network, but we can tell that from the fdb entry when
route-short-circuiting is enabled.

> Receiving vm on the other side will be seeing packets of size
> guest_mtu - tunnel_header_size,
> but imo that's much better than sending ip fragments over vxlan fabric.
> It will work for guests sending tcp/udp, but there is no good solution
> for icmp other than ip frags.

My idea for solving this ICMP issue is to forge an ICMP2BIG with the
source address of the destination and send it back to the originating VM
directly. It's complicated a bit because we don't want to use icmp_send()
on the host, which will go through host routing tables when we don't
necessarily have an IP address or routing for the bridged domain. So, to
do this, we really should just have a stand-alone icmp sender for use by
tunnel endpoints. But if we do this, the VM can get the correct gso_size
accounting for the tunnel overhead too, although it's abusing PMTUD a bit
since it doesn't ordinarily apply to hosts on the same L2 network.
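
For concreteness, the forged header could be filled in roughly as
follows (a sketch only; skb allocation, the MAC/IP address swap and the
checksum computation are elided, and the helper name is invented):

#include <linux/icmp.h>
#include <linux/ip.h>
#include <linux/skbuff.h>

/* Sketch: the ICMP header a tunnel endpoint would forge to tell the
 * originating VM to lower its path MTU. Type/code are the standard
 * "fragmentation needed" values; mtu is the inner MTU the VM should
 * switch to.
 */
static void fill_frag_needed(struct icmphdr *icmph, unsigned int mtu)
{
	icmph->type = ICMP_DEST_UNREACH;
	icmph->code = ICMP_FRAG_NEEDED;
	icmph->un.frag.mtu = htons(mtu);
	icmph->checksum = 0;	/* computed later over header + payload */
}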

I haven't gotten working patches for either of these extensions yet --
if your work overlaps before I do, I'd be happy to see these two things
incorporated in it:

1) ICMP2BIG reflected from the tunnel endpoint without host routing and
        using the destination IP as the forged source address, thus
        appropriate for bridge-only hosting.
2) Allowing larger-than-gso_size segmentation as well as smaller when
        the final destination is on the virtual L2 network.

                                                                +-DLS
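
As a worked example of the "underlay MTU minus VXLAN headers" sizing
described above (illustrative arithmetic only; the helper and the
VXLAN_HDR_LEN constant are invented for the example, and TCP options
are ignored):

#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>

#define VXLAN_HDR_LEN 8

/* Largest inner TCP gso_size (MSS) that still fits the underlay MTU
 * once VXLAN encapsulation is added: outer IP + outer UDP + VXLAN
 * header + inner Ethernet header, plus the inner IP and TCP headers.
 * A 1500-byte underlay gives 1500 - (20 + 8 + 8 + 14) - 20 - 20 = 1410;
 * a 9000-byte underlay gives 8910.
 */
static unsigned int vxlan_inner_gso_size(unsigned int underlay_mtu)
{
	unsigned int encap = sizeof(struct iphdr) + sizeof(struct udphdr) +
			     VXLAN_HDR_LEN + ETH_HLEN;

	return underlay_mtu - encap -
	       sizeof(struct iphdr) - sizeof(struct tcphdr);
}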

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-18 17:55     ` David Stevens
@ 2013-11-18 22:57       ` Jesse Gross
  2013-11-19  3:51       ` Alexei Starovoitov
  1 sibling, 0 replies; 16+ messages in thread
From: Jesse Gross @ 2013-11-18 22:57 UTC (permalink / raw)
  To: David Stevens
  Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Eric Dumazet,
	John Fastabend, Michael S. Tsirkin, netdev@vger.kernel.org,
	netdev-owner, Or Gerlitz, Stephen Hemminger

On Mon, Nov 18, 2013 at 9:55 AM, David Stevens <dlstevens@us.ibm.com> wrote:
> My idea for solving this ICMP issue is to forge an ICMP2BIG with the
> source address of the destination and send it back to the originating VM
> directly. It's complicated a bit because we don't want to use icmp_send()
> on the host, which will go through host routing tables when we don't
> necessarily have an IP address or routing for the bridged domain. So, to
> do this, we really should just have a stand-alone icmp sender for use by
> tunnel endpoints. But if we do this, the VM can get the correct gso_size
> accounting for the tunnel overhead too, although it's abusing PMTUD a bit
> since it doesn't ordinarily apply to hosts on the same L2 network.

OVS used to do exactly this in its tunnel implementation. It worked
fairly well although we ended up taking it out a while back since it
was a little offensive due to the need to forge addresses and have
essentially a parallel stub of an IP stack.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] veth: extend features to support tunneling
  2013-11-18 17:55     ` David Stevens
  2013-11-18 22:57       ` Jesse Gross
@ 2013-11-19  3:51       ` Alexei Starovoitov
  1 sibling, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2013-11-19  3:51 UTC (permalink / raw)
  To: David Stevens
  Cc: David Miller, Eric Dumazet, Eric Dumazet, John Fastabend,
	Michael S. Tsirkin, netdev@vger.kernel.org, netdev-owner,
	Or Gerlitz, Stephen Hemminger

On Mon, Nov 18, 2013 at 9:55 AM, David Stevens <dlstevens@us.ibm.com> wrote:
> 1) ICMP2BIG reflected from the tunnel endpoint without host routing and
>         using the destination IP as the forged source address, thus
> appropriate
>         for bridge-only hosting.

It doesn't look like the existing icmp_send() can be hacked this way.
It cannot do a route lookup and cannot do a neigh lookup.
The IPs and macs are valid within the VM and known to the virtualized
networking components, but the tunnel cannot possibly know what is
standing between the VM and the tunnel.
A VM may be sending over a tap into a bridge that forwards into a netns
that is running ip_forwarding between two veths, then goes into a 2nd
bridge, and only then is the skb sent into the vxlan device.
ovs can do many actions before an skb from the tap is delivered to
vport-vxlan.
The generic thing the vxlan driver can do is to take the mac and ip
headers from the inner skb, swap the macs, swap the ips, add an icmphdr
with code=frag_needed, and pretend that such a packet was received and
decapsulated by vxlan, then call into vxlan_sock->rcv().
It will go back into ovs or the bridge and proceed through the reverse
network topology path all the way back into the VM, which can adjust
its mtu accordingly.

> 2) Allowing larger-than-gso_size segmentation as well as smaller when the
>         final destination is on the virtual L2 network.

I think it should be ok for virtual L3 as well. From the VM's point of
view, it's sending 1500-byte packets, and virtual routers/bridges on
the path to the destination should check packets against their own mtu
values. But if the tunneling infra is smart enough to use large frames
between two hypervisors, it should do so. The DF bit and pmtu logic
apply within the virtual network, so sending icmp_frag_needed back into
the VM is exposing the physical infrastructure to the virtual network.
A true virtual distributed bridge with VMs should allow setting an 8k
mtu inside the VM and on the bridge and still function with a 1500 mtu
in the underlying physical network.

^ permalink raw reply	[flat|nested] 16+ messages in thread

Thread overview: 16+ messages
2013-11-16 21:11 [PATCH net-next] veth: extend features to support tunneling Or Gerlitz
2013-11-16 21:40 ` Eric Dumazet
2013-11-17  0:09   ` Eric Dumazet
2013-11-17  6:57   ` Or Gerlitz
2013-11-17  7:31   ` Alexei Starovoitov
2013-11-17 17:20     ` Eric Dumazet
2013-11-17 19:00       ` Alexei Starovoitov
2013-11-17 21:20         ` Or Gerlitz
2013-11-17 22:38         ` Eric Dumazet
2013-11-18 17:55     ` David Stevens
2013-11-18 22:57       ` Jesse Gross
2013-11-19  3:51       ` Alexei Starovoitov
  -- strict thread matches above, loose matches on Subject: below --
2013-10-25  1:59 vxlan gso is broken by stackable gso_segment() Alexei Starovoitov
2013-10-25  4:06 ` Eric Dumazet
2013-10-25  9:09   ` Eric Dumazet
2013-10-25 22:18     ` David Miller
2013-10-25 22:41       ` Alexei Starovoitov
2013-10-25 23:25         ` Eric Dumazet
2013-10-26  0:52           ` Eric Dumazet
2013-10-26  1:25             ` [PATCH net-next] veth: extend features to support tunneling Eric Dumazet
2013-10-26  2:22               ` Alexei Starovoitov
2013-10-26  4:13                 ` Eric Dumazet
2013-10-28  4:58               ` David Miller
