* vxlan gso is broken by stackable gso_segment()
@ 2013-10-24 23:37 Alexei Starovoitov
2013-10-25 0:41 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2013-10-24 23:37 UTC (permalink / raw)
To: Eric Dumazet, Stephen Hemminger; +Cc: David S. Miller, netdev
Hi Eric, Stephen,
it seems commit 3347c960 ("ipv4: gso: make inet_gso_segment() stackable")
broke vxlan GSO.
The way to reproduce:
start two lxc containers with veth pairs and a bridge between them
create a vxlan dev in both containers
run iperf
On net-next this setup does ~80 Mbps with a lot of TCP retransmits;
reverting 3347c960 and d3e5e006 gets performance back to ~230 Mbps.
I guess the vxlan driver is supposed to set encap_level? Or is there some
other way?
Thanks
Alexei
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: vxlan gso is broken by stackable gso_segment()
From: Eric Dumazet @ 2013-10-25 0:41 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eric Dumazet, Stephen Hemminger, David S. Miller, netdev
On Thu, 2013-10-24 at 16:37 -0700, Alexei Starovoitov wrote:
> Hi Eric, Stephen,
>
> it seems commit 3347c960 "ipv4: gso: make inet_gso_segment() stackable"
> broke vxlan gso
>
> the way to reproduce:
> start two lxc with veth and bridge between them
> create vxlan dev in both containers
> do iperf
>
> this setup on net-next does ~80 Mbps and a lot of tcp retransmits.
> reverting 3347c960 and d3e5e006 gets performance back to ~230 Mbps
>
> I guess vxlan driver suppose to set encap_level ? Some other way?
Hi Alexei
Are the GRE tunnels broken for you as well?
In my testing, GRE was working, and it looks like GRE and vxlan have
quite similar GSO implementations.
Maybe you can capture some of the broken frames with tcpdump?
* Re: vxlan gso is broken by stackable gso_segment()
From: Alexei Starovoitov @ 2013-10-25 1:59 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Eric Dumazet, Stephen Hemminger, David S. Miller, netdev
GRE seems to be fine.
Packets seem to be segmented with the wrong length and are being dropped.
A few seconds after the client iperf finishes, I see this warning:
[ 329.669685] WARNING: CPU: 3 PID: 3817 at net/core/skbuff.c:3474
skb_try_coalesce+0x3a0/0x3f0()
[ 329.669688] Modules linked in: vxlan ip_tunnel veth ip6table_filter
ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT xt_tcpudp
iptable_filter ip_tables x_tables bridge stp llc vhost_net macvtap
macvlan vhost kvm_intel kvm iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi dm_crypt hid_generic eeepc_wmi asus_wmi
sparse_keymap mxm_wmi dm_multipath psmouse serio_raw usbhid hid
parport_pc ppdev firewire_ohci e1000e firewire_core lpc_ich crc_itu_t
binfmt_misc igb dca ptp pps_core mac_hid wmi lp parport i2o_config
i2o_block video
[ 329.669746] CPU: 3 PID: 3817 Comm: iperf Not tainted 3.12.0-rc6+ #81
[ 329.669748] Hardware name: System manufacturer System Product
Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 329.669750] 0000000000000009 ffff88082fb839d8 ffffffff8175427a
0000000000000002
[ 329.669756] 0000000000000000 ffff88082fb83a18 ffffffff8105206c
ffff880808f926f8
[ 329.669760] ffff8807ef122b00 ffff8807ef122a00 0000000000000576
ffff88082fb83a94
[ 329.669765] Call Trace:
[ 329.669767] <IRQ> [<ffffffff8175427a>] dump_stack+0x55/0x76
[ 329.669779] [<ffffffff8105206c>] warn_slowpath_common+0x8c/0xc0
[ 329.669783] [<ffffffff810520ba>] warn_slowpath_null+0x1a/0x20
[ 329.669787] [<ffffffff816150f0>] skb_try_coalesce+0x3a0/0x3f0
[ 329.669793] [<ffffffff8167bce4>] tcp_try_coalesce.part.44+0x34/0xa0
[ 329.669797] [<ffffffff8167d168>] tcp_queue_rcv+0x108/0x150
[ 329.669801] [<ffffffff8167f129>] tcp_data_queue+0x299/0xd00
[ 329.669806] [<ffffffff816822f4>] tcp_rcv_established+0x2d4/0x8f0
[ 329.669809] [<ffffffff8168d8b5>] tcp_v4_do_rcv+0x295/0x520
[ 329.669813] [<ffffffff8168fb08>] tcp_v4_rcv+0x888/0xc30
[ 329.669818] [<ffffffff816651d3>] ? ip_local_deliver_finish+0x43/0x480
[ 329.669823] [<ffffffff810cae04>] ? __lock_is_held+0x54/0x80
[ 329.669827] [<ffffffff816652fb>] ip_local_deliver_finish+0x16b/0x480
[ 329.669831] [<ffffffff816651d3>] ? ip_local_deliver_finish+0x43/0x480
[ 329.669836] [<ffffffff81666018>] ip_local_deliver+0x48/0x80
[ 329.669840] [<ffffffff81665770>] ip_rcv_finish+0x160/0x770
[ 329.669845] [<ffffffff816662f8>] ip_rcv+0x2a8/0x3e0
[ 329.669849] [<ffffffff81623d13>] __netif_receive_skb_core+0xa63/0xdb0
[ 329.669853] [<ffffffff816233b8>] ? __netif_receive_skb_core+0x108/0xdb0
[ 329.669858] [<ffffffff8175d37f>] ? _raw_spin_unlock_irqrestore+0x3f/0x70
[ 329.669862] [<ffffffff8162417b>] ? process_backlog+0xab/0x180
[ 329.669866] [<ffffffff81624081>] __netif_receive_skb+0x21/0x70
[ 329.669869] [<ffffffff81624184>] process_backlog+0xb4/0x180
[ 329.669873] [<ffffffff81626d08>] ? net_rx_action+0x98/0x350
[ 329.669876] [<ffffffff81626dca>] net_rx_action+0x15a/0x350
[ 329.669882] [<ffffffff81057f97>] __do_softirq+0xf7/0x3f0
[ 329.669886] [<ffffffff8176820c>] call_softirq+0x1c/0x30
[ 329.669887] <EOI> [<ffffffff81004bed>] do_softirq+0x8d/0xc0
[ 329.669896] [<ffffffff8160de03>] ? release_sock+0x193/0x1f0
[ 329.669901] [<ffffffff81057a5b>] local_bh_enable_ip+0xdb/0xf0
[ 329.669906] [<ffffffff8175d2e4>] _raw_spin_unlock_bh+0x44/0x50
[ 329.669910] [<ffffffff8160de03>] release_sock+0x193/0x1f0
[ 329.669914] [<ffffffff81679237>] tcp_recvmsg+0x467/0x1030
[ 329.669919] [<ffffffff816ab424>] inet_recvmsg+0x134/0x230
[ 329.669923] [<ffffffff8160a17d>] sock_recvmsg+0xad/0xe0
to reproduce do:
$ sudo brctl addbr br0
$ sudo ifconfig br0 up
$ cat foo1.conf
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.ipv4 = 10.2.3.5/24
$ sudo lxc-start -n foo1 -f ./foo1.conf bash
# ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth0
# ip addr add 192.168.99.1/24 dev vxlan0
# ip link set up dev vxlan0
# iperf -s
Similarly for the other lxc, with a different IP:
$ sudo lxc-start -n foo2 -f ./foo2.conf bash
# ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth0
# ip addr add 192.168.99.2/24 dev vxlan0
# ip link set up dev vxlan0
# iperf -c 192.168.99.1
I keep hitting it all the time.
* Re: vxlan gso is broken by stackable gso_segment()
2013-10-25 1:59 ` Alexei Starovoitov
@ 2013-10-25 4:06 ` Eric Dumazet
2013-10-25 9:09 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-10-25 4:06 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eric Dumazet, Stephen Hemminger, David S. Miller, netdev
On Thu, 2013-10-24 at 18:59 -0700, Alexei Starovoitov wrote:
> gre seems to be fine.
> packets seem to be segmented with wrong length and being dropped.
> After client iperf is finished, in few seconds I see the warning:
>
> [...]
Thanks for all these details.
I am in Edinburgh for the Kernel Summit, I'll take a look at this as
soon as possible.
* Re: vxlan gso is broken by stackable gso_segment()
2013-10-25 4:06 ` Eric Dumazet
@ 2013-10-25 9:09 ` Eric Dumazet
2013-10-25 22:18 ` David Miller
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-10-25 9:09 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eric Dumazet, Stephen Hemminger, David S. Miller, netdev
On Thu, 2013-10-24 at 21:06 -0700, Eric Dumazet wrote:
> [...]
>
> Thanks for all these details.
>
> I am in Edinburgh for the Kernel Summit, I'll take a look at this as
> soon as possible.
Could you try the following fix?
Thanks !
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f4a159e..17dd8320 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1252,6 +1252,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
const struct net_offload *ops;
unsigned int offset = 0;
struct iphdr *iph;
+ bool udpfrag;
bool tunnel;
int proto;
int nhoff;
@@ -1306,10 +1307,11 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
if (IS_ERR_OR_NULL(segs))
goto out;
+ udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
skb = segs;
do {
iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
- if (!tunnel && proto == IPPROTO_UDP) {
+ if (udpfrag) {
iph->id = htons(id);
iph->frag_off = htons(offset >> 3);
if (skb->next != NULL)
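For readers following the diff: the loop it touches rewrites the IP header of each GSO segment in one of two ways, and the regression came from applying the fragment-style rewrite to segments that are independent packets. Here is a rough userland model of that branch (struct fake_iphdr, MF_FLAG and the fixed offset step are simplifications for illustration, not the kernel's code — the real loop advances offset by each segment's actual payload length):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fake_iphdr {
    uint16_t id;        /* IP identification */
    uint16_t frag_off;  /* fragment offset (8-byte units) + flags */
};

#define MF_FLAG 0x2000  /* "more fragments" bit, as in a real IP header */

/* Fix up the headers of 'n' segments produced by GSO.
 * udpfrag == true  -> segments are IP fragments of ONE UDP datagram:
 *                     keep the same id, advance frag_off, and set MF on
 *                     every fragment but the last.
 * udpfrag == false -> segments are independent packets (TCP segments, or
 *                     vxlan-encapsulated packets): give each its own id
 *                     and leave the fragment fields alone.
 */
static void fixup_segments(struct fake_iphdr *iph, size_t n,
                           bool udpfrag, uint16_t id, size_t mss)
{
    size_t offset = 0;
    for (size_t i = 0; i < n; i++) {
        if (udpfrag) {
            iph[i].id = id;
            iph[i].frag_off = (uint16_t)(offset >> 3);
            if (i + 1 < n)
                iph[i].frag_off |= MF_FLAG;
            offset += mss;
        } else {
            iph[i].id = id++;
        }
    }
}
```

A vxlan segment is a complete UDP packet, not a fragment, so writing frag_off into it corrupts the stream even though the outer protocol is UDP; the patch narrows the gating condition so only genuine UDP fragmentation takes the first path.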
* Re: vxlan gso is broken by stackable gso_segment()
From: David Miller @ 2013-10-25 22:18 UTC (permalink / raw)
To: eric.dumazet; +Cc: ast, edumazet, stephen, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Oct 2013 02:09:00 -0700
> @@ -1252,6 +1252,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> const struct net_offload *ops;
> unsigned int offset = 0;
> struct iphdr *iph;
> + bool udpfrag;
> bool tunnel;
> int proto;
> int nhoff;
> @@ -1306,10 +1307,11 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> if (IS_ERR_OR_NULL(segs))
> goto out;
>
> + udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
> skb = segs;
> do {
> iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
> - if (!tunnel && proto == IPPROTO_UDP) {
> + if (udpfrag) {
> iph->id = htons(id);
> iph->frag_off = htons(offset >> 3);
> if (skb->next != NULL)
>
The "tunnel" variable becomes unused once you do this, please remove it.
* Re: vxlan gso is broken by stackable gso_segment()
2013-10-25 22:18 ` David Miller
@ 2013-10-25 22:41 ` Alexei Starovoitov
2013-10-25 23:10 ` David Miller
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2013-10-25 22:41 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, Eric Dumazet, stephen, netdev
On Fri, Oct 25, 2013 at 3:18 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 25 Oct 2013 02:09:00 -0700
>
>> [...]
>
> The "tunnel" variable becomes unused once you do this, please remove it.
'bool tunnel' is actually still used, to indicate encap_level > 0.
Eric's fix brings vxlan performance back, and GRE keeps working. Thanks!
The net/core/skbuff.c:3474 skb_try_coalesce() warning I mentioned
before is unrelated: I still see it with this patch, running either GRE
or vxlan tunnels.
* Re: vxlan gso is broken by stackable gso_segment()
2013-10-25 22:41 ` Alexei Starovoitov
@ 2013-10-25 23:10 ` David Miller
2013-10-25 23:25 ` Eric Dumazet
2013-10-28 1:18 ` [PATCH net-next] inet: restore gso for vxlan Eric Dumazet
2 siblings, 0 replies; 41+ messages in thread
From: David Miller @ 2013-10-25 23:10 UTC (permalink / raw)
To: ast; +Cc: eric.dumazet, edumazet, stephen, netdev
From: Alexei Starovoitov <ast@plumgrid.com>
Date: Fri, 25 Oct 2013 15:41:47 -0700
> 'bool tunnel' actually still used to indicate encap_level > 0
Good catch, I misread the code.
* Re: vxlan gso is broken by stackable gso_segment()
From: Eric Dumazet @ 2013-10-25 23:25 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, Eric Dumazet, stephen, netdev
On Fri, 2013-10-25 at 15:41 -0700, Alexei Starovoitov wrote:
> 'bool tunnel' actually still used to indicate encap_level > 0
>
Yes, I am studying whether setting skb->encapsulation = 1 was really
needed in:
	if (tunnel) {
		skb_reset_inner_headers(skb);
		skb->encapsulation = 1;
	}
and I was planning to rename 'bool tunnel' to 'bool stacked' or
something similar.
> Eric's fix brings back performance for vxlan and gre keeps working. Thx!
Please note that the original performance is not that good: you
mentioned 230 Mbps on lxc, while I get more than 5 Gb/s on a 10G link.
This should be investigated ...
>
> net/core/skbuff.c:3474 skb_try_coalesce() warning, I mentioned before,
> is unrelated.
> I still see it with this patch. Running either gre or vxlan tunnels.
I think this might be related to commit 6ff50cd55545 ("tcp: gso: do not
generate out of order packets")
I'll investigate this as well, thanks.
* [PATCH net-next] tcp: gso: fix truesize tracking
From: Eric Dumazet @ 2013-10-26 0:26 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, Eric Dumazet, stephen, netdev
From: Eric Dumazet <edumazet@google.com>
commit 6ff50cd55545 ("tcp: gso: do not generate out of order packets")
had a heuristic that can trigger a warning in skb_try_coalesce(),
because skb->truesize of the gso segments was set to exactly mss.
This breaks the requirement that
skb->truesize >= skb->len + truesizeof(struct sk_buff);
It can trivially be reproduced by:
ifconfig lo mtu 1500
ethtool -K lo tso off
netperf
As the skbs are looped back into the TCP stack, skb_try_coalesce()
warns us about these skbs under-estimating their truesize.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <ast@plumgrid.com>
---
Based on net-next but applies as well on net tree with some fuzz.
David, we might backport this one to 3.12 and stable later, I let you
decide.
net/ipv4/tcp_offload.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index a7a5583e..a2b68a1 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -18,6 +18,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
netdev_features_t features)
{
struct sk_buff *segs = ERR_PTR(-EINVAL);
+ unsigned int sum_truesize = 0;
struct tcphdr *th;
unsigned int thlen;
unsigned int seq;
@@ -104,13 +105,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
if (copy_destructor) {
skb->destructor = gso_skb->destructor;
skb->sk = gso_skb->sk;
- /* {tcp|sock}_wfree() use exact truesize accounting :
- * sum(skb->truesize) MUST be exactly be gso_skb->truesize
- * So we account mss bytes of 'true size' for each segment.
- * The last segment will contain the remaining.
- */
- skb->truesize = mss;
- gso_skb->truesize -= mss;
+ sum_truesize += skb->truesize;
}
skb = skb->next;
th = tcp_hdr(skb);
@@ -127,7 +122,9 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
if (copy_destructor) {
swap(gso_skb->sk, skb->sk);
swap(gso_skb->destructor, skb->destructor);
- swap(gso_skb->truesize, skb->truesize);
+ sum_truesize += skb->truesize;
+ atomic_add(sum_truesize - gso_skb->truesize,
+ &skb->sk->sk_wmem_alloc);
}
delta = htonl(oldlen + (skb_tail_pointer(skb) -
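The accounting change above can be modeled in userland C (struct fake_seg, FAKE_SKB_OVERHEAD and all numbers are illustrative stand-ins, not kernel values): the old heuristic charged each segment exactly mss bytes of truesize, which is less than the segment's length plus per-skb overhead; the new scheme keeps each segment's real truesize and settles the difference against the socket's sk_wmem_alloc once, at the end.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_SKB_OVERHEAD 256   /* stand-in for "truesizeof(struct sk_buff)" */

struct fake_seg {
    size_t len;       /* bytes of data, including headers */
    size_t truesize;  /* bytes charged to the socket */
};

/* Old heuristic: every segment was charged exactly 'mss' bytes, so the
 * segment totals still matched gso_skb->truesize — but each segment then
 * violated truesize >= len + overhead. */
static void old_accounting(struct fake_seg *s, size_t n, size_t mss)
{
    for (size_t i = 0; i < n; i++)
        s[i].truesize = mss;
}

/* New scheme: leave each segment's real truesize alone, sum them, and
 * return the delta the caller must add to sk_wmem_alloc so the socket's
 * accounting stays exact. */
static long new_accounting(const struct fake_seg *s, size_t n,
                           long gso_truesize)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)s[i].truesize;
    return sum - gso_truesize;  /* delta for sk_wmem_alloc */
}
```

The invariant skb_try_coalesce() checks holds for the real truesizes but not for the mss-sized ones, which is exactly the warning Alexei hit.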
* Re: [PATCH net-next] tcp: gso: fix truesize tracking
From: Alexei Starovoitov @ 2013-10-26 1:46 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, Eric Dumazet, stephen, netdev
On Fri, Oct 25, 2013 at 5:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> [...]
It fixed it in my setup. Thanks!
* Re: [PATCH net-next] tcp: gso: fix truesize tracking
From: David Miller @ 2013-10-28 4:58 UTC (permalink / raw)
To: eric.dumazet; +Cc: ast, edumazet, stephen, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Oct 2013 17:26:17 -0700
> [...]
I'm still thinking about what to do with this one.
* Re: [PATCH net-next] tcp: gso: fix truesize tracking
From: David Miller @ 2013-10-29 4:05 UTC (permalink / raw)
To: eric.dumazet; +Cc: ast, edumazet, stephen, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Oct 2013 17:26:17 -0700
> [...]
I decided to apply this to 'net' and queue it up for -stable,
thanks Eric!
* Re: vxlan gso is broken by stackable gso_segment()
From: Eric Dumazet @ 2013-10-26 0:52 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, Eric Dumazet, stephen, netdev
On Fri, 2013-10-25 at 16:25 -0700, Eric Dumazet wrote:
> Please note the original performance is not that good, you mentioned 230
> Mbps on lxc, while I get more than 5Gb/s on a 10G link.
>
> This should be investigated ...
Increasing performance is probably trivial: veth currently does not
support any kind of tunneling TSO:
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
I'll submit a patch for net-next
* [PATCH net-next] veth: extend features to support tunneling
2013-10-26 0:52 ` vxlan gso is broken by stackable gso_segment() Eric Dumazet
@ 2013-10-26 1:25 ` Eric Dumazet
2013-10-26 2:22 ` Alexei Starovoitov
2013-10-28 4:58 ` David Miller
0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-10-26 1:25 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, Eric Dumazet, stephen, netdev
From: Eric Dumazet <edumazet@google.com>
While investigating on a recent vxlan regression, I found veth
was using a zero features set for vxlan tunnels.
We have to segment GSO frames, copy the payload, and do the checksum.
This patch brings a ~200% performance increase
We probably have to add hw_enc_features support
on other virtual devices.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
---
drivers/net/veth.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b2d0347..b24db7a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -261,6 +261,8 @@ static const struct net_device_ops veth_netdev_ops = {
#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+ NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL | \
+ NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT | NETIF_F_UFO | \
NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
@@ -279,6 +281,7 @@ static void veth_setup(struct net_device *dev)
dev->destructor = veth_dev_free;
dev->hw_features = VETH_FEATURES;
+ dev->hw_enc_features = VETH_FEATURES;
}
/*
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-10-26 1:25 ` [PATCH net-next] veth: extend features to support tunneling Eric Dumazet
@ 2013-10-26 2:22 ` Alexei Starovoitov
2013-10-26 4:13 ` Eric Dumazet
2013-10-28 4:58 ` David Miller
1 sibling, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2013-10-26 2:22 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, Eric Dumazet, stephen, netdev
On Fri, Oct 25, 2013 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating on a recent vxlan regression, I found veth
> was using a zero features set for vxlan tunnels.
oneliner can be better :)
> We have to segment GSO frames, copy the payload, and do the checksum.
>
> This patch brings a ~200% performance increase
>
> We probably have to add hw_enc_features support
> on other virtual devices.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexei Starovoitov <ast@plumgrid.com>
> ---
iperf over veth with gre/vxlan tunneling is now ~4Gbps. 20x gain as advertised.
Thanks!
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-10-26 2:22 ` Alexei Starovoitov
@ 2013-10-26 4:13 ` Eric Dumazet
0 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-10-26 4:13 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, Eric Dumazet, stephen, netdev
On Fri, 2013-10-25 at 19:22 -0700, Alexei Starovoitov wrote:
> On Fri, Oct 25, 2013 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > While investigating on a recent vxlan regression, I found veth
> > was using a zero features set for vxlan tunnels.
>
> oneliner can be better :)
Yes, I'll post the gso fix as well, of course ;)
>
> > We have to segment GSO frames, copy the payload, and do the checksum.
> >
> > This patch brings a ~200% performance increase
> >
> > We probably have to add hw_enc_features support
> > on other virtual devices.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Alexei Starovoitov <ast@plumgrid.com>
> > ---
>
> iperf over veth with gre/vxlan tunneling is now ~4Gbps. 20x gain as advertised.
> Thanks!
Wow, such a difference might show another bug somewhere else...
Thanks !
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-10-26 1:25 ` [PATCH net-next] veth: extend features to support tunneling Eric Dumazet
2013-10-26 2:22 ` Alexei Starovoitov
@ 2013-10-28 4:58 ` David Miller
1 sibling, 0 replies; 41+ messages in thread
From: David Miller @ 2013-10-28 4:58 UTC (permalink / raw)
To: eric.dumazet; +Cc: ast, edumazet, stephen, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Oct 2013 18:25:03 -0700
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating on a recent vxlan regression, I found veth
> was using a zero features set for vxlan tunnels.
>
> We have to segment GSO frames, copy the payload, and do the checksum.
>
> This patch brings a ~200% performance increase
>
> We probably have to add hw_enc_features support
> on other virtual devices.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied, thanks Eric.
* [PATCH net-next] inet: restore gso for vxlan
2013-10-25 22:41 ` Alexei Starovoitov
2013-10-25 23:10 ` David Miller
2013-10-25 23:25 ` Eric Dumazet
@ 2013-10-28 1:18 ` Eric Dumazet
2013-10-28 4:25 ` David Miller
2013-11-07 22:33 ` Eric Dumazet
2 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-10-28 1:18 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, netdev
From: Eric Dumazet <edumazet@google.com>
Alexei reported a performance regression on vxlan, caused
by commit 3347c9602955 "ipv4: gso: make inet_gso_segment() stackable"
GSO vxlan packets were not properly segmented, adding IP fragments
while they were not expected.
Rename 'bool tunnel' to 'bool encap', and add a new boolean
to express the fact that UDP should be fragmented.
This fragmentation is triggered by skb->encapsulation being set.
Remove a "skb->encapsulation = 1" added in above commit,
as its not needed, as frags inherit skb->frag from original
GSO skb.
Reported-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexei Starovoitov <ast@plumgrid.com>
---
net/ipv4/af_inet.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f4a159e705c0..09d78d4a3cff 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1251,8 +1251,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
struct sk_buff *segs = ERR_PTR(-EINVAL);
const struct net_offload *ops;
unsigned int offset = 0;
+ bool udpfrag, encap;
struct iphdr *iph;
- bool tunnel;
int proto;
int nhoff;
int ihl;
@@ -1290,8 +1290,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
goto out;
__skb_pull(skb, ihl);
- tunnel = SKB_GSO_CB(skb)->encap_level > 0;
- if (tunnel)
+ encap = SKB_GSO_CB(skb)->encap_level > 0;
+ if (encap)
features = skb->dev->hw_enc_features & netif_skb_features(skb);
SKB_GSO_CB(skb)->encap_level += ihl;
@@ -1306,24 +1306,23 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
if (IS_ERR_OR_NULL(segs))
goto out;
+ udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
skb = segs;
do {
iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
- if (!tunnel && proto == IPPROTO_UDP) {
+ if (udpfrag) {
iph->id = htons(id);
iph->frag_off = htons(offset >> 3);
if (skb->next != NULL)
iph->frag_off |= htons(IP_MF);
offset += skb->len - nhoff - ihl;
- } else {
+ } else {
iph->id = htons(id++);
}
iph->tot_len = htons(skb->len - nhoff);
ip_send_check(iph);
- if (tunnel) {
+ if (encap)
skb_reset_inner_headers(skb);
- skb->encapsulation = 1;
- }
skb->network_header = (u8 *)iph - skb->head;
} while ((skb = skb->next));
* Re: [PATCH net-next] inet: restore gso for vxlan
2013-10-28 1:18 ` [PATCH net-next] inet: restore gso for vxlan Eric Dumazet
@ 2013-10-28 4:25 ` David Miller
2013-11-07 22:33 ` Eric Dumazet
1 sibling, 0 replies; 41+ messages in thread
From: David Miller @ 2013-10-28 4:25 UTC (permalink / raw)
To: eric.dumazet; +Cc: ast, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 27 Oct 2013 18:18:16 -0700
> From: Eric Dumazet <edumazet@google.com>
>
> Alexei reported a performance regression on vxlan, caused
> by commit 3347c9602955 "ipv4: gso: make inet_gso_segment() stackable"
>
> GSO vxlan packets were not properly segmented, adding IP fragments
> while they were not expected.
>
> Rename 'bool tunnel' to 'bool encap', and add a new boolean
> to express the fact that UDP should be fragmented.
> This fragmentation is triggered by skb->encapsulation being set.
>
> Remove a "skb->encapsulation = 1" added in above commit,
> as its not needed, as frags inherit skb->frag from original
> GSO skb.
>
> Reported-by: Alexei Starovoitov <ast@plumgrid.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Alexei Starovoitov <ast@plumgrid.com>
Applied, thanks!
* Re: [PATCH net-next] inet: restore gso for vxlan
2013-10-28 1:18 ` [PATCH net-next] inet: restore gso for vxlan Eric Dumazet
2013-10-28 4:25 ` David Miller
@ 2013-11-07 22:33 ` Eric Dumazet
2013-11-07 23:27 ` Alexei Starovoitov
1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-11-07 22:33 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, netdev, Hannes Frederic Sowa
On Sun, 2013-10-27 at 18:18 -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Alexei reported a performance regression on vxlan, caused
> by commit 3347c9602955 "ipv4: gso: make inet_gso_segment() stackable"
>
> GSO vxlan packets were not properly segmented, adding IP fragments
> while they were not expected.
>
> Rename 'bool tunnel' to 'bool encap', and add a new boolean
> to express the fact that UDP should be fragmented.
> This fragmentation is triggered by skb->encapsulation being set.
>
> Remove a "skb->encapsulation = 1" added in above commit,
> as its not needed, as frags inherit skb->frag from original
> GSO skb.
>
> Reported-by: Alexei Starovoitov <ast@plumgrid.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
> net/ipv4/af_inet.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index f4a159e705c0..09d78d4a3cff 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1251,8 +1251,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> struct sk_buff *segs = ERR_PTR(-EINVAL);
> const struct net_offload *ops;
> unsigned int offset = 0;
> + bool udpfrag, encap;
> struct iphdr *iph;
> - bool tunnel;
> int proto;
> int nhoff;
> int ihl;
> @@ -1290,8 +1290,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> goto out;
> __skb_pull(skb, ihl);
>
> - tunnel = SKB_GSO_CB(skb)->encap_level > 0;
> - if (tunnel)
> + encap = SKB_GSO_CB(skb)->encap_level > 0;
> + if (encap)
> features = skb->dev->hw_enc_features & netif_skb_features(skb);
> SKB_GSO_CB(skb)->encap_level += ihl;
>
> @@ -1306,24 +1306,23 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> if (IS_ERR_OR_NULL(segs))
> goto out;
>
> + udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
It seems I did an error here, shouldn't it be
udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
instead ?
I am not sure UFO and vxlan are compatible...
> skb = segs;
> do {
> iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
> - if (!tunnel && proto == IPPROTO_UDP) {
> + if (udpfrag) {
> iph->id = htons(id);
> iph->frag_off = htons(offset >> 3);
> if (skb->next != NULL)
> iph->frag_off |= htons(IP_MF);
> offset += skb->len - nhoff - ihl;
> - } else {
> + } else {
> iph->id = htons(id++);
> }
> iph->tot_len = htons(skb->len - nhoff);
> ip_send_check(iph);
> - if (tunnel) {
> + if (encap)
> skb_reset_inner_headers(skb);
> - skb->encapsulation = 1;
> - }
> skb->network_header = (u8 *)iph - skb->head;
> } while ((skb = skb->next));
>
* Re: [PATCH net-next] inet: restore gso for vxlan
2013-11-07 22:33 ` Eric Dumazet
@ 2013-11-07 23:27 ` Alexei Starovoitov
2013-11-08 0:44 ` [PATCH net-next] inet: fix a UFO regression Eric Dumazet
2013-11-16 20:23 ` [PATCH net-next] inet: restore gso for vxlan Or Gerlitz
0 siblings, 2 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2013-11-07 23:27 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Hannes Frederic Sowa
On Thu, Nov 7, 2013 at 2:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2013-10-27 at 18:18 -0700, Eric Dumazet wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> Alexei reported a performance regression on vxlan, caused
>> by commit 3347c9602955 "ipv4: gso: make inet_gso_segment() stackable"
>>
>> GSO vxlan packets were not properly segmented, adding IP fragments
>> while they were not expected.
>>
>> Rename 'bool tunnel' to 'bool encap', and add a new boolean
>> to express the fact that UDP should be fragmented.
>> This fragmentation is triggered by skb->encapsulation being set.
>>
>> Remove a "skb->encapsulation = 1" added in above commit,
>> as its not needed, as frags inherit skb->frag from original
>> GSO skb.
>>
>> Reported-by: Alexei Starovoitov <ast@plumgrid.com>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Tested-by: Alexei Starovoitov <ast@plumgrid.com>
>> ---
>> net/ipv4/af_inet.c | 15 +++++++--------
>> 1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>> index f4a159e705c0..09d78d4a3cff 100644
>> --- a/net/ipv4/af_inet.c
>> +++ b/net/ipv4/af_inet.c
>> @@ -1251,8 +1251,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>> struct sk_buff *segs = ERR_PTR(-EINVAL);
>> const struct net_offload *ops;
>> unsigned int offset = 0;
>> + bool udpfrag, encap;
>> struct iphdr *iph;
>> - bool tunnel;
>> int proto;
>> int nhoff;
>> int ihl;
>> @@ -1290,8 +1290,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>> goto out;
>> __skb_pull(skb, ihl);
>>
>> - tunnel = SKB_GSO_CB(skb)->encap_level > 0;
>> - if (tunnel)
>> + encap = SKB_GSO_CB(skb)->encap_level > 0;
>> + if (encap)
>> features = skb->dev->hw_enc_features & netif_skb_features(skb);
>> SKB_GSO_CB(skb)->encap_level += ihl;
>>
>> @@ -1306,24 +1306,23 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>> if (IS_ERR_OR_NULL(segs))
>> goto out;
>>
>> + udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
>
> It seems I did an error here, shouldn't it be
>
> udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
>
> instead ?
oops. yes.
the segs coming out of udp4_ufo_fragment() in case of udp_tunnel should not
be marked as ip frags.
ip frags are only for UFO.
Thank you for spotting it!
Alexei
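The distinction Alexei describes can be sketched as a plain predicate (a hypothetical model, not kernel code; in the kernel the flag additionally has to be read before the inner gso_segment() call, because skb_udp_tunnel_segment() clears skb->encapsulation as a side effect):

```c
#include <stdbool.h>

/* UFO: plain UDP gso, no encapsulation -> segments become IP fragments.
 * vxlan: skb->encapsulation is set -> segments are full UDP datagrams
 * and must never be marked as IP fragments. */
static bool udpfrag(bool skb_encapsulation, bool proto_is_udp)
{
    return !skb_encapsulation && proto_is_udp;
}
```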
* [PATCH net-next] inet: fix a UFO regression
2013-11-07 23:27 ` Alexei Starovoitov
@ 2013-11-08 0:44 ` Eric Dumazet
2013-11-08 1:39 ` Alexei Starovoitov
2013-11-08 2:32 ` [PATCH v2 " Eric Dumazet
2013-11-16 20:23 ` [PATCH net-next] inet: restore gso for vxlan Or Gerlitz
1 sibling, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-11-08 0:44 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, netdev, Hannes Frederic Sowa
From: Eric Dumazet <edumazet@google.com>
While testing virtio_net and skb_segment() changes, Hannes reported
that UFO was sending wrong frames.
It appears this was introduced by a recent commit :
8c3a897bfab1 ("inet: restore gso for vxlan")
The old condition to perform IP frag was :
tunnel = !!skb->encapsulation;
...
if (!tunnel && proto == IPPROTO_UDP) {
So the new one should be :
udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
...
if (udpfrag) {
With help from Alexei Starovoitov
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
---
net/ipv4/af_inet.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 09d78d4a3cff..f17941486ab9 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1306,7 +1306,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
if (IS_ERR_OR_NULL(segs))
goto out;
- udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
+ udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
skb = segs;
do {
iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
* Re: [PATCH net-next] inet: fix a UFO regression
2013-11-08 0:44 ` [PATCH net-next] inet: fix a UFO regression Eric Dumazet
@ 2013-11-08 1:39 ` Alexei Starovoitov
2013-11-08 2:08 ` Eric Dumazet
2013-11-08 2:32 ` [PATCH v2 " Eric Dumazet
1 sibling, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2013-11-08 1:39 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Hannes Frederic Sowa
On Thu, Nov 7, 2013 at 4:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While testing virtio_net and skb_segment() changes, Hannes reported
> that UFO was sending wrong frames.
>
> It appears this was introduced by a recent commit :
> 8c3a897bfab1 ("inet: restore gso for vxlan")
>
> The old condition to perform IP frag was :
>
> tunnel = !!skb->encapsulation;
> ...
> if (!tunnel && proto == IPPROTO_UDP) {
>
> So the new one should be :
>
> udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
> ...
> if (udpfrag) {
>
> With help from Alexei Starovoitov
>
> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexei Starovoitov <ast@plumgrid.com>
> ---
> net/ipv4/af_inet.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 09d78d4a3cff..f17941486ab9 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1306,7 +1306,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> if (IS_ERR_OR_NULL(segs))
> goto out;
>
> - udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
> + udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
as we discussed this will break vxlan :)
Please move udpfrag assignment before line:
segs = ops->callbacks.gso_segment(skb, features);
since skb_udp_tunnel_segment() does skb->encapsulation = 0
then both ufo and vxlan should be fine
Thanks
Alexei
* Re: [PATCH net-next] inet: fix a UFO regression
2013-11-08 1:39 ` Alexei Starovoitov
@ 2013-11-08 2:08 ` Eric Dumazet
0 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-11-08 2:08 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, netdev, Hannes Frederic Sowa
On Thu, 2013-11-07 at 17:39 -0800, Alexei Starovoitov wrote:
> as we discussed this will break vxlan :)
> Please move udpfrag assignment before line:
> segs = ops->callbacks.gso_segment(skb, features);
> since skb_udp_tunnel_segment() does skb->encapsulation = 0
> then both ufo and vxlan should be fine
Oh that's right, thanks again, I'll send a V2
* [PATCH v2 net-next] inet: fix a UFO regression
2013-11-08 0:44 ` [PATCH net-next] inet: fix a UFO regression Eric Dumazet
2013-11-08 1:39 ` Alexei Starovoitov
@ 2013-11-08 2:32 ` Eric Dumazet
2013-11-08 7:08 ` David Miller
1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-11-08 2:32 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: David Miller, netdev, Hannes Frederic Sowa
From: Eric Dumazet <edumazet@google.com>
While testing virtio_net and skb_segment() changes, Hannes reported
that UFO was sending wrong frames.
It appears this was introduced by a recent commit :
8c3a897bfab1 ("inet: restore gso for vxlan")
The old condition to perform IP frag was :
tunnel = !!skb->encapsulation;
...
if (!tunnel && proto == IPPROTO_UDP) {
So the new one should be :
udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
...
if (udpfrag) {
Initialization of udpfrag must be done before call
to ops->callbacks.gso_segment(skb, features), as
skb_udp_tunnel_segment() clears skb->encapsulation
(We want udpfrag to be true for UFO, false for VXLAN)
With help from Alexei Starovoitov
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
---
v2: move the udpfrag init, as spotted by Alexei.
net/ipv4/af_inet.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 09d78d4a3cff..68af9aac91d0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1299,6 +1299,9 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
segs = ERR_PTR(-EPROTONOSUPPORT);
+ /* Note : following gso_segment() might change skb->encapsulation */
+ udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
+
ops = rcu_dereference(inet_offloads[proto]);
if (likely(ops && ops->callbacks.gso_segment))
segs = ops->callbacks.gso_segment(skb, features);
@@ -1306,7 +1309,6 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
if (IS_ERR_OR_NULL(segs))
goto out;
- udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
skb = segs;
do {
iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
* Re: [PATCH v2 net-next] inet: fix a UFO regression
2013-11-08 2:32 ` [PATCH v2 " Eric Dumazet
@ 2013-11-08 7:08 ` David Miller
0 siblings, 0 replies; 41+ messages in thread
From: David Miller @ 2013-11-08 7:08 UTC (permalink / raw)
To: eric.dumazet; +Cc: ast, netdev, hannes
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 07 Nov 2013 18:32:06 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> While testing virtio_net and skb_segment() changes, Hannes reported
> that UFO was sending wrong frames.
>
> It appears this was introduced by a recent commit :
> 8c3a897bfab1 ("inet: restore gso for vxlan")
>
> The old condition to perform IP frag was :
>
> tunnel = !!skb->encapsulation;
> ...
> if (!tunnel && proto == IPPROTO_UDP) {
>
> So the new one should be :
>
> udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
> ...
> if (udpfrag) {
>
> Initialization of udpfrag must be done before call
> to ops->callbacks.gso_segment(skb, features), as
> skb_udp_tunnel_segment() clears skb->encapsulation
>
> (We want udpfrag to be true for UFO, false for VXLAN)
>
> With help from Alexei Starovoitov
>
> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied, thanks a lot Eric.
* Re: [PATCH net-next] inet: restore gso for vxlan
2013-11-07 23:27 ` Alexei Starovoitov
2013-11-08 0:44 ` [PATCH net-next] inet: fix a UFO regression Eric Dumazet
@ 2013-11-16 20:23 ` Or Gerlitz
2013-11-16 20:50 ` Eric Dumazet
1 sibling, 1 reply; 41+ messages in thread
From: Or Gerlitz @ 2013-11-16 20:23 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eric Dumazet, David Miller, netdev@vger.kernel.org,
Hannes Frederic Sowa
On Fri, Nov 8, 2013, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Thu, Nov 7, 2013 , Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sun, 2013-10-27 , Eric Dumazet wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>> Alexei reported a performance regression on vxlan, caused
>>> by commit 3347c9602955 "ipv4: gso: make inet_gso_segment() stackable"
>>> GSO vxlan packets were not properly segmented, adding IP fragments
>>> while they were not expected.
>>> Rename 'bool tunnel' to 'bool encap', and add a new boolean
>>> to express the fact that UDP should be fragmented.
>>> This fragmentation is triggered by skb->encapsulation being set.
>>>
>>> Remove a "skb->encapsulation = 1" added in above commit,
>>> as its not needed, as frags inherit skb->frag from original
>>> GSO skb.
>>>
>>> Reported-by: Alexei Starovoitov <ast@plumgrid.com>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> Tested-by: Alexei Starovoitov <ast@plumgrid.com>
>>> ---
>>> net/ipv4/af_inet.c | 15 +++++++--------
>>> 1 file changed, 7 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>>> index f4a159e705c0..09d78d4a3cff 100644
>>> --- a/net/ipv4/af_inet.c
>>> +++ b/net/ipv4/af_inet.c
>>> @@ -1251,8 +1251,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>>> struct sk_buff *segs = ERR_PTR(-EINVAL);
>>> const struct net_offload *ops;
>>> unsigned int offset = 0;
>>> + bool udpfrag, encap;
>>> struct iphdr *iph;
>>> - bool tunnel;
>>> int proto;
>>> int nhoff;
>>> int ihl;
>>> @@ -1290,8 +1290,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>>> goto out;
>>> __skb_pull(skb, ihl);
>>>
>>> - tunnel = SKB_GSO_CB(skb)->encap_level > 0;
>>> - if (tunnel)
>>> + encap = SKB_GSO_CB(skb)->encap_level > 0;
>>> + if (encap)
>>> features = skb->dev->hw_enc_features & netif_skb_features(skb);
>>> SKB_GSO_CB(skb)->encap_level += ihl;
>>>
>>> @@ -1306,24 +1306,23 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>>> if (IS_ERR_OR_NULL(segs))
>>> goto out;
>>>
>>> + udpfrag = !!skb->encapsulation && proto == IPPROTO_UDP;
>>
>> It seems I did an error here, shouldn't it be
>>
>> udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
>>
>> instead ?
>
> oops. yes.
> the segs coming out of udp4_ufo_fragment() in case of udp_tunnel should not
> be marked as ip frags. ip frags are only for UFO.
>
> Thank you for spotting it!
Guys, just to clarify, is this fixed by now? where, net-next?
* Re: [PATCH net-next] veth: extend features to support tunneling
@ 2013-11-16 21:11 Or Gerlitz
2013-11-16 21:40 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Or Gerlitz @ 2013-11-16 21:11 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sat, Oct 26, 2013 at 6:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-10-25 at 19:22 -0700, Alexei Starovoitov wrote:
>> On Fri, Oct 25, 2013 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > While investigating on a recent vxlan regression, I found veth
>> > was using a zero features set for vxlan tunnels.
>>
>> oneliner can be better :)
>
> Yes, I'll post the gso fix as well, of course ;)
>
>>
>> > We have to segment GSO frames, copy the payload, and do the checksum.
>> >
>> > This patch brings a ~200% performance increase
>> >
>> > We probably have to add hw_enc_features support
>> > on other virtual devices.
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> > Cc: Alexei Starovoitov <ast@plumgrid.com>
>> > ---
>>
>> iperf over veth with gre/vxlan tunneling is now ~4Gbps. 20x gain as advertised.
>> Thanks!
>
> Wow, such a difference might show another bug somewhere else...
Guys (thanks Eric for the clarification over the other vxlan thread),
with the latest networking code (e.g 3.12 or net-next) do you expect
notable performance (throughput) difference between these two configs?
1. bridge --> vxlan --> NIC
2. veth --> bridge --> vxlan --> NIC
BTW #2 doesn't work when packets start to be large unless I manually
decrease the veth device pair MTU. E.g. if the NIC MTU is 1500, vxlan
advertises an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
inherits that, but not the veth device. Should someone/somewhere here
generate an ICMP packet which will cause the stack to decrease the
path mtu for the neighbour created on the veth device? what about
para-virtualized guests which are plugged into this (or any host based
tunneling) scheme, e.g in this scheme
3. guest virtio NIC --> vhost --> tap/macvtap --> bridge --> vxlan --> NIC
Who/how do we want the guest NIC mtu/path mtu to take into account the
tunneling over-head?
Or.
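The 1450 figure above is just the NIC MTU minus the vxlan encapsulation overhead. A quick sketch of the arithmetic (plain C with a hypothetical helper name):

```c
/* IPv4 vxlan overhead per packet: inner Ethernet header (14)
 * + outer IPv4 header (20) + outer UDP header (8)
 * + vxlan header (8) = 50 bytes. */
static int vxlan_ipv4_mtu(int nic_mtu)
{
    return nic_mtu - (14 + 20 + 8 + 8);
}
```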
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-16 21:11 [PATCH net-next] veth: extend features to support tunneling Or Gerlitz
@ 2013-11-16 21:40 ` Eric Dumazet
2013-11-17 0:09 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-11-16 21:40 UTC (permalink / raw)
To: Or Gerlitz
Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sat, 2013-11-16 at 23:11 +0200, Or Gerlitz wrote:
> Guys (thanks Eric for the clarification over the other vxlan thread),
> with the latest networking code (e.g 3.12 or net-next) do you expect
> notable performance (throughput) difference between these two configs?
>
> 1. bridge --> vxlan --> NIC
> 2. veth --> bridge --> vxlan --> NIC
>
> BTW #2 doesn't work when packets start to be large unless I manually
> decrease the veth device pair MTU. E.g if the NIC MTU is 1500, vxlan
> advertizes an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
> inherits that, but not the veth device. Should someone/somewhere here
> generate an ICMP packet which will cause the stack to decreate the
> path mtu for the neighbour created on the veth device? what about
> para-virtualized guests which are plugged into this (or any host based
> tunneling) scheme, e.g in this scheme
>
> 3. guest virtio NIC --> vhost --> tap/macvtap --> bridge --> vxlan --> NIC
>
> Who/how do we want the guest NIC mtu/path mtu to take into account the
> tunneling over-head?
I mentioned this problem on another thread : gso packets escape the
normal mtu checks in ip forwarding.
vi +91 net/ipv4/ip_forward.c
gso_size contains the size of the segment minus all headers.
Please try the following :
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d68633452d9b..489b56935a56 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -388,4 +388,16 @@ static inline int fastopen_init_queue(struct sock *sk, int backlog)
return 0;
}
+static inline unsigned int gso_size_with_headers(const struct sk_buff *skb)
+{
+ unsigned int hdrlen = skb_transport_header(skb) - skb_mac_header(skb);
+
+ if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
+ hdrlen += tcp_hdrlen(skb);
+ else
+ hdrlen += 8; // sizeof(struct udphdr)
+
+ return skb_shinfo(skb)->gso_size + hdrlen;
+}
+
#endif /* _LINUX_TCP_H */
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 694de3b7aebf..3949cc1dd1ca 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -57,6 +57,7 @@ int ip_forward(struct sk_buff *skb)
struct iphdr *iph; /* Our header */
struct rtable *rt; /* Route we use */
struct ip_options *opt = &(IPCB(skb)->opt);
+ unsigned int len;
if (skb_warn_if_lro(skb))
goto drop;
@@ -88,7 +89,11 @@ int ip_forward(struct sk_buff *skb)
if (opt->is_strictroute && rt->rt_uses_gateway)
goto sr_failed;
- if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
+ len = skb->len;
+ if (skb_is_gso(skb))
+ len = gso_size_with_headers(skb);
+
+ if (unlikely(len > dst_mtu(&rt->dst) &&
(ip_hdr(skb)->frag_off & htons(IP_DF))) && !skb->local_df) {
IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
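The idea behind the proposed gso_size_with_headers() can be modeled outside the kernel: for a gso skb the on-wire size of each emitted segment is gso_size plus the network and transport headers, not skb->len, which covers the whole super-packet. A hypothetical sketch:

```c
/* Comparing skb->len of a gso super-packet against the mtu is
 * meaningless; each segment that will actually hit the wire carries
 * gso_size bytes of payload plus its own headers. */
static unsigned int seg_wire_len(unsigned int gso_size,
                                 unsigned int net_hdrlen,
                                 unsigned int transport_hdrlen)
{
    return gso_size + net_hdrlen + transport_hdrlen;
}
```

For example, a TCP flow with timestamps (mss 1448, 20-byte IPv4 header, 32-byte TCP header) and a UDP flow (1472-byte payload, 8-byte UDP header) both produce 1500-byte segments.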
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-16 21:40 ` Eric Dumazet
@ 2013-11-17 0:09 ` Eric Dumazet
2013-11-17 6:57 ` Or Gerlitz
2013-11-17 7:31 ` Alexei Starovoitov
2 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-11-17 0:09 UTC (permalink / raw)
To: Or Gerlitz
Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sat, 2013-11-16 at 13:40 -0800, Eric Dumazet wrote:
> On Sat, 2013-11-16 at 23:11 +0200, Or Gerlitz wrote:
>
> > Guys (thanks Eric for the clarification over the other vxlan thread),
> > with the latest networking code (e.g 3.12 or net-next) do you expect
> > notable performance (throughput) difference between these two configs?
> >
> > 1. bridge --> vxlan --> NIC
> > 2. veth --> bridge --> vxlan --> NIC
> >
> > BTW #2 doesn't work when packets start to be large unless I manually
> > decrease the veth device pair MTU. E.g if the NIC MTU is 1500, vxlan
> > advertises an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
> > inherits that, but not the veth device. Should someone/somewhere here
> > generate an ICMP packet which will cause the stack to decrease the
> > path mtu for the neighbour created on the veth device? what about
> > para-virtualized guests which are plugged into this (or any host based
> > tunneling) scheme, e.g in this scheme
> >
> > 3. guest virtio NIC --> vhost --> tap/macvtap --> bridge --> vxlan --> NIC
> >
> > Who/how do we want the guest NIC mtu/path mtu to take into account the
> > tunneling over-head?
>
> I mentioned this problem on another thread : gso packets escape the
> normal mtu checks in ip forwarding.
>
> vi +91 net/ipv4/ip_forward.c
>
> gso_size contains the size of the segment minus all headers.
>
> Please try the following :
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index d68633452d9b..489b56935a56 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -388,4 +388,16 @@ static inline int fastopen_init_queue(struct sock *sk, int backlog)
> return 0;
> }
>
> +static inline unsigned int gso_size_with_headers(const struct sk_buff *skb)
> +{
> + unsigned int hdrlen = skb_transport_header(skb) - skb_mac_header(skb);
or more exactly :
unsigned int hdrlen = skb_transport_header(skb) - skb_network_header(skb);
> +
> + if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
> + hdrlen += tcp_hdrlen(skb);
> + else
> + hdrlen += 8; // sizeof(struct udphdr)
> +
> + return skb_shinfo(skb)->gso_size + hdrlen;
> +}
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-16 21:40 ` Eric Dumazet
2013-11-17 0:09 ` Eric Dumazet
@ 2013-11-17 6:57 ` Or Gerlitz
2013-11-17 7:31 ` Alexei Starovoitov
2 siblings, 0 replies; 41+ messages in thread
From: Or Gerlitz @ 2013-11-17 6:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sat, Nov 16, 2013 at 11:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2013-11-16 at 23:11 +0200, Or Gerlitz wrote:
>
>> Guys (thanks Eric for the clarification over the other vxlan thread),
>> with the latest networking code (e.g 3.12 or net-next) do you expect
>> notable performance (throughput) difference between these two configs?
>>
>> 1. bridge --> vxlan --> NIC
>> 2. veth --> bridge --> vxlan --> NIC
>>
>> BTW #2 doesn't work when packets start to be large unless I manually
>> decrease the veth device pair MTU. E.g if the NIC MTU is 1500, vxlan
>> advertises an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
>> inherits that, but not the veth device. Should someone/somewhere here
>> generate an ICMP packet which will cause the stack to decrease the
>> path mtu for the neighbour created on the veth device? what about
>> para-virtualized guests which are plugged into this (or any host based
>> tunneling) scheme, e.g in this scheme
>>
>> 3. guest virtio NIC --> vhost --> tap/macvtap --> bridge --> vxlan --> NIC
>>
>> Who/how do we want the guest NIC mtu/path mtu to take into account the
>> tunneling over-head?
>
> I mentioned this problem on another thread : gso packets escape the
> normal mtu checks in ip forwarding.
>
> vi +91 net/ipv4/ip_forward.c
>
> gso_size contains the size of the segment minus all headers.
>
> Please try the following :
Will set up & try & let you know. However, this doesn't address non-TCP
traffic, e.g. pings of large size, correct?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-16 21:40 ` Eric Dumazet
2013-11-17 0:09 ` Eric Dumazet
2013-11-17 6:57 ` Or Gerlitz
@ 2013-11-17 7:31 ` Alexei Starovoitov
2013-11-17 17:20 ` Eric Dumazet
2013-11-18 17:55 ` David Stevens
2 siblings, 2 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2013-11-17 7:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Or Gerlitz, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sat, Nov 16, 2013 at 1:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2013-11-16 at 23:11 +0200, Or Gerlitz wrote:
>
>> Guys (thanks Eric for the clarification over the other vxlan thread),
>> with the latest networking code (e.g 3.12 or net-next) do you expect
>> notable performance (throughput) difference between these two configs?
>>
>> 1. bridge --> vxlan --> NIC
>> 2. veth --> bridge --> vxlan --> NIC
>>
>> BTW #2 doesn't work when packets start to be large unless I manually
>> decrease the veth device pair MTU. E.g if the NIC MTU is 1500, vxlan
>> advertises an MTU of 1450 (= 1500 - (14 + 20 + 8 + 8)) and the bridge
>> inherits that, but not the veth device. Should someone/somewhere here
>> generate an ICMP packet which will cause the stack to decrease the
>> path mtu for the neighbour created on the veth device? what about
>> para-virtualized guests which are plugged into this (or any host based
>> tunneling) scheme, e.g in this scheme
>>
>> 3. guest virtio NIC --> vhost --> tap/macvtap --> bridge --> vxlan --> NIC
>>
>> Who/how do we want the guest NIC mtu/path mtu to take into account the
>> tunneling over-head?
>
> I mentioned this problem on another thread : gso packets escape the
> normal mtu checks in ip forwarding.
>
> vi +91 net/ipv4/ip_forward.c
>
> gso_size contains the size of the segment minus all headers.
In case of VMs sending gso packets over tap and a tunnel in the host,
ip_forward is not in the picture.
When the host mtu doesn't account for the overhead of the tunnel, the
neat trick we can do is to decrease gso_size while adding the tunnel header.
This way, when skb_gso_segment() kicks in during tx, the packets will be
segmented into host-mtu-sized packets.
The receiving vm on the other side will see packets of size
guest_mtu - tunnel_header_size,
but imo that's much better than sending ip fragments over the vxlan fabric.
It will work for guests sending tcp/udp, but there is no good solution
for icmp other than ip frags.
This trick should work for hw-offloaded vxlan too, but we have yet to
experiment with it on such a nic.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-17 7:31 ` Alexei Starovoitov
@ 2013-11-17 17:20 ` Eric Dumazet
2013-11-17 19:00 ` Alexei Starovoitov
2013-11-18 17:55 ` David Stevens
1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-11-17 17:20 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Or Gerlitz, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sat, 2013-11-16 at 23:31 -0800, Alexei Starovoitov wrote:
> In case of VMs sending gso packets over tap and tunnel in the host,
> ip_forward is not in the picture.
>
I was specifically answering to #2 which should use ip forwarding, of
course. Note that my patch was a POC : We have many other places where
the typical MTU check is simply disabled as soon as skb is GSO.
> when host mtu doesn't account for overhead of tunnel, the neat trick
> we can do is to decrease gso_size while adding tunnel header.
Changing gso_size would be very broken; it breaks the DF flag
semantics. You need to send an ICMP, and the sender will take appropriate
action.
GRO + GSO require that forwarded segments are the same as incoming
ones. It's not like a proxy that can choose to aggregate as it wants.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-17 17:20 ` Eric Dumazet
@ 2013-11-17 19:00 ` Alexei Starovoitov
2013-11-17 21:20 ` Or Gerlitz
2013-11-17 22:38 ` Eric Dumazet
0 siblings, 2 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2013-11-17 19:00 UTC (permalink / raw)
To: Eric Dumazet
Cc: Or Gerlitz, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sun, Nov 17, 2013 at 9:20 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2013-11-16 at 23:31 -0800, Alexei Starovoitov wrote:
>
>> In case of VMs sending gso packets over tap and tunnel in the host,
>> ip_forward is not in the picture.
>>
>
> I was specifically answering to #2 which should use ip forwarding, of
> course. Note that my patch was a POC : We have many other places where
> the typical MTU check is simply disabled as soon as skb is GSO.
I don't think #2 will do ip_forward either. veth goes into a bridge
and vxlan just adds encap.
>> when host mtu doesn't account for overhead of tunnel, the neat trick
>> we can do is to decrease gso_size while adding tunnel header.
>
> That would be very broken to change gso_size, this breaks DF flag
> semantic. You need to send an ICMP, and the sender will take appropriate
> action.
>
> > GRO + GSO require that forwarded segments are the same as incoming
> > ones. It's not like a proxy that can choose to aggregate as it wants.
In case of ip_forwarding - yes, but in case of a tunnel we may look at
gso differently.
I'm not saying the 'gso_size -= tunnel_header' trick should be unconditional.
Obviously the DF flag needs to be respected for pmtu to work.
ip_tunnel_xmit() is already supposed to send icmp_frag_needed back; I'm
not sure it works for vxlan though.
What I'm proposing is that if a tunnel receives a normal gso packet with
the df bit not set, it can safely decrease that skb's gso_size. This way
a guest vm can have mtu 1500 just like the host mtu 1500, until the
guest tries pmtu; then vxlan sends icmp into the guest and the guest
adjusts itself if it can.
Since the host cannot guarantee that the guest will do the right thing
with icmp_frag_needed, it may continue doing the 'gso_size -= tunnel_header'
trick without breaking anything.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-17 19:00 ` Alexei Starovoitov
@ 2013-11-17 21:20 ` Or Gerlitz
2013-11-17 22:38 ` Eric Dumazet
1 sibling, 0 replies; 41+ messages in thread
From: Or Gerlitz @ 2013-11-17 21:20 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eric Dumazet, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sun, Nov 17, 2013, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Sun, Nov 17, 2013, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sat, 2013-11-16, Alexei Starovoitov wrote:
>>> In case of VMs sending gso packets over tap and tunnel in the host,
>>> ip_forward is not in the picture.
>> I was specifically answering to #2 which should use ip forwarding, of
>> course. Note that my patch was a POC : We have many other places where
>> the typical MTU check is simply disabled as soon as skb is GSO.
> I don't think #2 will do ip_forward either. veth goes into a bridge
> and vxlan just adds encap.
Eric, do we have consensus here that this #2 of veth --> bridge -->
vxlan --> NIC will not go through ip_forward?!
Anyway, I tried your patch and didn't see a notable improvement in my
tests. The tests I did a few days ago were over 3.10.19 to have more
stable ground... moving to 3.12.0 and net-next today, the baseline
performance became worse: a somewhat simplified env of
bridge --> vxlan --> NIC with many iperf client threads used to yield
throughput similar to vxlan --> NIC or bridge --> NIC, but with net-next
that's not the case. If you have 10Gbs or 40Gbs NICs, even without HW TCP
offloads for VXLAN, you might be able to see that on your setups.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-17 19:00 ` Alexei Starovoitov
2013-11-17 21:20 ` Or Gerlitz
@ 2013-11-17 22:38 ` Eric Dumazet
1 sibling, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-11-17 22:38 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Or Gerlitz, David Miller, Eric Dumazet, Stephen Hemminger,
netdev@vger.kernel.org, Michael S. Tsirkin, John Fastabend
On Sun, 2013-11-17 at 11:00 -0800, Alexei Starovoitov wrote:
> On Sun, Nov 17, 2013 at 9:20 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Sat, 2013-11-16 at 23:31 -0800, Alexei Starovoitov wrote:
> >
> >> In case of VMs sending gso packets over tap and tunnel in the host,
> >> ip_forward is not in the picture.
> >>
> >
> > I was specifically answering to #2 which should use ip forwarding, of
> > course. Note that my patch was a POC : We have many other places where
> > the typical MTU check is simply disabled as soon as skb is GSO.
>
> I don't think #2 will do ip_forward either. veth goes into a bridge
> and vxlan just adds encap.
The point is : nothing really checks mtu for gso packets.
The same construct was copy/pasted everywhere. See br_dev_queue_push_xmit()
for another example.
In this particular case, I guess vxlan should generate the ICMP.
But really, it's Sunday and I currently do not care ;)
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-17 7:31 ` Alexei Starovoitov
2013-11-17 17:20 ` Eric Dumazet
@ 2013-11-18 17:55 ` David Stevens
2013-11-18 22:57 ` Jesse Gross
2013-11-19 3:51 ` Alexei Starovoitov
1 sibling, 2 replies; 41+ messages in thread
From: David Stevens @ 2013-11-18 17:55 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David Miller, Eric Dumazet, Eric Dumazet, John Fastabend,
Michael S. Tsirkin, netdev@vger.kernel.org, netdev-owner,
Or Gerlitz, Stephen Hemminger
netdev-owner@vger.kernel.org wrote on 11/17/2013 02:31:08 AM:
> From: Alexei Starovoitov <ast@plumgrid.com>
> when host mtu doesn't account for overhead of tunnel, the neat trick
> we can do is to decrease gso_size while adding tunnel header.
Won't this possibly result in ip_id collisions if you generate more
segments than the VM thinks you did? Not an issue if DF is set, I suppose.
> This way when skb_gso_segment() kicks in during tx the packets will be
> segmented into host mtu sized packets.
I've been looking at something like this, but going the other way.
In the VXLAN case, other VXLAN VMs are reachable via the virtual L2
network, so we *ought* to use large packets in the whole path if the
underlay network can do that. PMTUD should not apply within the virtual
L2 network, but the code as-is will segment to the VM gso_size (say 1500)
even if the underlay network can send jumbo frames (say 9K).
I think what we want for best performance is to send GSO packets from
the VM and use the underlay network MTU minus VXLAN headers as the
gso_size for those.
One complication there is that we would actually want to use the VM
gso_size if the destination is a router taking us off of the VXLAN
network, but we can tell that from the fdb entry when
route-short-circuiting is enabled.
> Receiving vm on the other side will be seeing packets of size
> guest_mtu - tunnel_header_size,
> but imo that's much better than sending ip fragments over vxlan fabric.
> It will work for guests sending tcp/udp, but there is no good solution
> for icmp other than ip frags.
My idea for solving this ICMP issue is to forge an ICMP2BIG with the
source address of the destination and send it back to the originating VM
directly. It's complicated a bit because we don't want to use icmp_send()
on the host, which will go through host routing tables when we don't
necessarily have an IP address or routing for the bridged domain. So, to
do this, we really should just have a stand-alone icmp sender for use by
tunnel endpoints. But if we do this, the VM can get the correct gso_size
accounting for the tunnel overhead too, although it's abusing PMTUD a bit
since it doesn't ordinarily apply to hosts on the same L2 network.
I haven't gotten working patches for either of these extensions yet--
if your work overlaps before I do, I'd be happy to see these two things
incorporated in it:
1) ICMP2BIG reflected from the tunnel endpoint without host routing and
   using the destination IP as the forged source address, thus appropriate
   for bridge-only hosting.
2) Allowing larger-than-gso_size segmentation as well as smaller when the
   final destination is on the virtual L2 network.
+-DLS
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-18 17:55 ` David Stevens
@ 2013-11-18 22:57 ` Jesse Gross
2013-11-19 3:51 ` Alexei Starovoitov
1 sibling, 0 replies; 41+ messages in thread
From: Jesse Gross @ 2013-11-18 22:57 UTC (permalink / raw)
To: David Stevens
Cc: Alexei Starovoitov, David Miller, Eric Dumazet, Eric Dumazet,
John Fastabend, Michael S. Tsirkin, netdev@vger.kernel.org,
netdev-owner, Or Gerlitz, Stephen Hemminger
On Mon, Nov 18, 2013 at 9:55 AM, David Stevens <dlstevens@us.ibm.com> wrote:
> My idea for solving this ICMP issue is to forge an ICMP2BIG with the
> source address of the destination and send it back to the originating VM
> directly. It's complicated a bit because we don't want to use icmp_send()
> on the host, which will go through host routing tables when we don't
> necessarily have an IP address or routing for the bridged domain. So, to
> do this, we really should just have a stand-alone icmp sender for use by
> tunnel endpoints. But if we do this, the VM can get the correct gso_size
> accounting for the tunnel overhead too, although it's abusing PMTUD a bit
> since it doesn't ordinarily apply to hosts on the same L2 network.
OVS used to do exactly this in its tunnel implementation. It worked
fairly well although we ended up taking it out a while back since it
was a little offensive due to the need to forge addresses and have
essentially a parallel stub of an IP stack.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH net-next] veth: extend features to support tunneling
2013-11-18 17:55 ` David Stevens
2013-11-18 22:57 ` Jesse Gross
@ 2013-11-19 3:51 ` Alexei Starovoitov
1 sibling, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2013-11-19 3:51 UTC (permalink / raw)
To: David Stevens
Cc: David Miller, Eric Dumazet, Eric Dumazet, John Fastabend,
Michael S. Tsirkin, netdev@vger.kernel.org, netdev-owner,
Or Gerlitz, Stephen Hemminger
On Mon, Nov 18, 2013 at 9:55 AM, David Stevens <dlstevens@us.ibm.com> wrote:
> 1) ICMP2BIG reflected from the tunnel endpoint without host routing and
> using the destination IP as the forged source address, thus
> appropriate
> for bridge-only hosting.
It doesn't look like the existing icmp_send() can be hacked this way.
It cannot do route lookup and cannot do neigh lookup.
IPs and macs are valid within the VM and known to the virtualized
networking components, but the tunnel cannot possibly know what is
standing between the VM and the tunnel.
A VM may be sending over tap into a bridge that forwards into a netns
that is running ip_forwarding between two veths, then goes into a 2nd
bridge and only then is sent into the vxlan device.
ovs can do many actions before the skb from tap is delivered to vport-vxlan.
The generic thing the vxlan driver can do is to take the mac and ip
headers from the inner skb, swap macs, swap ips, add an icmphdr with
code=frag_needed, pretend that such a packet was received and
decapsulated by vxlan, and call into vxlan_sock->rcv().
It will go back into ovs or the bridge and will proceed through the
reverse network topology path all the way back into the VM, which can
adjust its mtu accordingly.
> 2) Allowing larger-than-gso_size segmentation as well as smaller when the
> final destination is on the virtual L2 network.
I think it should be ok for virtual L3 as well. From the VM's point of
view, it's sending 1500-byte packets, and virtual routers/bridges on the
path to the destination should check packets against their own mtu
values. But if the tunneling infra is smart enough to use large frames
between two hypervisors, it should do so. The DF bit and pmtu logic
apply within the virtual network, so sending icmp_frag_needed back into
the VM exposes the physical infrastructure to the virtual network.
A true virtual distributed bridge with VMs should allow setting an 8k
mtu inside the VM and on the bridge and still function with a 1500 mtu
in the underlying physical network.
^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2013-11-19 3:51 UTC | newest]
Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-24 23:37 vxlan gso is broken by stackable gso_segment() Alexei Starovoitov
2013-10-25 0:41 ` Eric Dumazet
2013-10-25 1:59 ` Alexei Starovoitov
2013-10-25 4:06 ` Eric Dumazet
2013-10-25 9:09 ` Eric Dumazet
2013-10-25 22:18 ` David Miller
2013-10-25 22:41 ` Alexei Starovoitov
2013-10-25 23:10 ` David Miller
2013-10-25 23:25 ` Eric Dumazet
2013-10-26 0:26 ` [PATCH net-next] tcp: gso: fix truesize tracking Eric Dumazet
2013-10-26 1:46 ` Alexei Starovoitov
2013-10-28 4:58 ` David Miller
2013-10-29 4:05 ` David Miller
2013-10-26 0:52 ` vxlan gso is broken by stackable gso_segment() Eric Dumazet
2013-10-26 1:25 ` [PATCH net-next] veth: extend features to support tunneling Eric Dumazet
2013-10-26 2:22 ` Alexei Starovoitov
2013-10-26 4:13 ` Eric Dumazet
2013-10-28 4:58 ` David Miller
2013-10-28 1:18 ` [PATCH net-next] inet: restore gso for vxlan Eric Dumazet
2013-10-28 4:25 ` David Miller
2013-11-07 22:33 ` Eric Dumazet
2013-11-07 23:27 ` Alexei Starovoitov
2013-11-08 0:44 ` [PATCH net-next] inet: fix a UFO regression Eric Dumazet
2013-11-08 1:39 ` Alexei Starovoitov
2013-11-08 2:08 ` Eric Dumazet
2013-11-08 2:32 ` [PATCH v2 " Eric Dumazet
2013-11-08 7:08 ` David Miller
2013-11-16 20:23 ` [PATCH net-next] inet: restore gso for vxlan Or Gerlitz
2013-11-16 20:50 ` Eric Dumazet
-- strict thread matches above, loose matches on Subject: below --
2013-11-16 21:11 [PATCH net-next] veth: extend features to support tunneling Or Gerlitz
2013-11-16 21:40 ` Eric Dumazet
2013-11-17 0:09 ` Eric Dumazet
2013-11-17 6:57 ` Or Gerlitz
2013-11-17 7:31 ` Alexei Starovoitov
2013-11-17 17:20 ` Eric Dumazet
2013-11-17 19:00 ` Alexei Starovoitov
2013-11-17 21:20 ` Or Gerlitz
2013-11-17 22:38 ` Eric Dumazet
2013-11-18 17:55 ` David Stevens
2013-11-18 22:57 ` Jesse Gross
2013-11-19 3:51 ` Alexei Starovoitov