netdev.vger.kernel.org archive mirror
* vxlan/veth performance issues on net.git + latest kernels
@ 2013-12-03 15:05 Or Gerlitz
  2013-12-03 15:30 ` Eric Dumazet
  2013-12-03 17:12 ` Eric Dumazet
  0 siblings, 2 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 15:05 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov, Pravin B Shelar; +Cc: David Miller, netdev

I've been chasing lately a performance issue which comes into play when 
combining veth and vxlan over a fast Ethernet NIC.

I came across it while working to enable TCP stateless offloads for 
vxlan encapsulated traffic in the mlx4 driver, but I can clearly see the 
issue without any HW offloads involved, so it would be easier to discuss 
it that way (no offloads involved).

The setup involves a stacked {veth --> bridge --> vxlan --> IP stack --> 
NIC} or {veth --> ovs+vxlan --> IP stack --> NIC} chain.

Basically, in my testbed which uses iperf over 40Gbs Mellanox NICs, 
vxlan traffic goes up to 5-7Gbs for a single session and up to 14Gbs for 
multiple sessions, as long as veth isn't involved. Once veth is used I 
can't get to > 7-8Gbs, no matter how many sessions are used. For the 
time being, I manually took the tunneling overhead into account and 
reduced the veth pair MTU by 50 bytes.
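
For reference, the 50 bytes shaved off correspond to the VXLAN-over-IPv4 
encapsulation overhead; a quick sanity check of the arithmetic:

```shell
# VXLAN over IPv4 adds: inner Ethernet (14) + outer IP (20) +
# outer UDP (8) + VXLAN header (8) = 50 bytes of overhead.
overhead=$((14 + 20 + 8 + 8))
echo "$overhead"            # 50
echo $((1500 - overhead))   # 1450, the MTU set on the veth pair below
```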

Looking at the kernel TCP counters in a {veth --> bridge --> vxlan --> 
NIC} configuration, on the client side I see lots of hits for the 
following TCP counters (the numbers are just a single sample; I look at 
the output of iterative sampling every second, e.g. using "watch -d -n 1 
netstat -st"):
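
As an alternative to eyeballing diffs of netstat output, nstat from 
iproute2 prints per-interval counter deltas directly; a sketch (the 
exact counter names it reports may vary by kernel):

```shell
# nstat prints the delta of each SNMP/netstat counter since its
# previous run, so looping it gives per-second rates directly.
nstat > /dev/null   # prime nstat's history file
while sleep 1; do
    nstat | grep -E 'RetransSegs|Reorder|FastRetrans'
done
```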

67092 segments retransmitted

31461 times recovered from packet loss due to SACK data
Detected reordering 1045142 times using SACK
436215 fast retransmits
59966 forward retransmits

Also, on the passive side I see hits for the "Quick ack mode was 
activated N times" counter; see below for a full snapshot of the 
counters from both sides.

Without using veth, e.g. when running in a {vxlan --> NIC} or {bridge --> 
vxlan --> NIC} configuration, I see hits only for the "recovered from 
packet loss due to SACK data" and fast retransmits counters, but not for 
the forward retransmits or "Detected reordering N times using SACK" 
ones. Also, the quick ack mode counter isn't active on the passive side.

I tested net.git (3.13-rc2+), 3.12.2 and 3.11.9, and I see the same 
problems on all of them. At this point I don't really see a known-good 
past point from which to bisect. So I hope this counter report can help 
shed some light on the nature of the problem and a possible solution; 
ideas welcome!!

Without vxlan, these are the Gbs results for 1/2/4 streams over 3.12.2; 
the results for net.git are pretty much the same.

18/32/38  NIC
17/30/35  bridge --> NIC
14/23/35  veth --> bridge --> NIC

With vxlan, these are the Gbs results for 1/2/4 streams:

6/12/14  vxlan --> IP --> NIC
5/10/14  bridge --> vxlan --> IP --> NIC
6/7/7    veth --> bridge --> vxlan --> IP --> NIC

Also, the 3.12.2 numbers don't get any better when adding a ported 
version of 82d8189826d5 "veth: extend features to support tunneling" on 
top of 3.12.2.

See at the end the sequence of commands I use to set up the environment.

Or.


--> TCP counters from active side

# netstat -ts
IcmpMsg:
     InType0: 2
     InType8: 1
     OutType0: 1
     OutType3: 4
     OutType8: 2
Tcp:
     189 active connections openings
     4 passive connection openings
     0 failed connection attempts
     0 connection resets received
     4 connections established
     22403193 segments received
     541234150 segments send out
     14248 segments retransmited
     0 bad segments received.
     5 resets sent
UdpLite:
TcpExt:
     2 invalid SYN cookies received
     178 TCP sockets finished time wait in fast timer
     10 delayed acks sent
     Quick ack mode was activated 1 times
     4 packets directly queued to recvmsg prequeue.
     3728 packets directly received from backlog
     2 packets directly received from prequeue
     2524 packets header predicted
     4 packets header predicted and directly queued to user
     19793310 acknowledgments not containing data received
     1216966 predicted acknowledgments
     2130 times recovered from packet loss due to SACK data
     Detected reordering 73 times using FACK
     Detected reordering 11424 times using SACK
     55 congestion windows partially recovered using Hoe heuristic
     TCPDSACKUndo: 457
     2 congestion windows recovered after partial ack
     11498 fast retransmits
     2748 forward retransmits
     2 other TCP timeouts
     TCPLossProbes: 4
     3 DSACKs sent for old packets
     TCPSackShifted: 1037782
     TCPSackMerged: 332827
     TCPSackShiftFallback: 598055
     TCPRcvCoalesce: 380
     TCPOFOQueue: 463
     TCPSpuriousRtxHostQueues: 192
IpExt:
     InNoRoutes: 1
     InMcastPkts: 191
     OutMcastPkts: 28
     InBcastPkts: 25
     InOctets: 1789360097
     OutOctets: 893757758988
     InMcastOctets: 8152
     OutMcastOctets: 3044
     InBcastOctets: 4259
     InNoECTPkts: 30117553



--> TCP counters from passive side

netstat -ts
IcmpMsg:
     InType0: 1
     InType8: 2
     OutType0: 2
     OutType3: 5
     OutType8: 1
Tcp:
     75 active connections openings
     140 passive connection openings
     0 failed connection attempts
     0 connection resets received
     4 connections established
     146888643 segments received
     27430160 segments send out
     0 segments retransmited
     0 bad segments received.
     6 resets sent
UdpLite:
TcpExt:
     3 invalid SYN cookies received
     72 TCP sockets finished time wait in fast timer
     10 delayed acks sent
     3 delayed acks further delayed because of locked socket
     Quick ack mode was activated 13548 times
     4 packets directly queued to recvmsg prequeue.
     2 packets directly received from prequeue
     139384763 packets header predicted
     2 packets header predicted and directly queued to user
     671 acknowledgments not containing data received
     938 predicted acknowledgments
     TCPLossProbes: 2
     TCPLossProbeRecovery: 1
     14 DSACKs sent for old packets
     TCPBacklogDrop: 848
     TCPRcvCoalesce: 118368414
     TCPOFOQueue: 3167879
IpExt:
     InNoRoutes: 1
     InMcastPkts: 184
     OutMcastPkts: 26
     InBcastPkts: 26
     InOctets: 1007364296775
     OutOctets: 2433872888
     InMcastOctets: 6202
     OutMcastOctets: 2888
     InBcastOctets: 4597
     InNoECTPkts: 702313233


client side (node 144)
----------------------

ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
ifconfig vxlan42 192.168.42.144/24 up

brctl addbr br-vx
ip link set br-vx up

ifconfig br-vx 192.168.52.144/24 up
brctl addif br-vx vxlan42

ip link add type veth
brctl addif br-vx veth1
ifconfig veth0 192.168.62.144/24 up
ip link set veth1 up

ifconfig veth0 mtu 1450
ifconfig veth1 mtu 1450


server side (node 147)
----------------------

ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
ifconfig vxlan42 192.168.42.147/24 up

brctl addbr br-vx
ip link set br-vx up

ifconfig br-vx 192.168.52.147/24 up
brctl addif br-vx vxlan42


ip link add type veth
brctl addif br-vx veth1
ifconfig veth0 192.168.62.147/24 up
ip link set veth1 up

ifconfig veth0 mtu 1450
ifconfig veth1 mtu 1450
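

In case it is useful, a sketch of the same client-side topology built 
with iproute2 only (no brctl or ifconfig; assumes an ip binary with 
bridge netlink support; ethN stays a placeholder as above):

```shell
# vxlan device over the multicast group, bound to the uplink NIC
ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
ip addr add 192.168.42.144/24 dev vxlan42
ip link set vxlan42 up

# bridge with the vxlan device enslaved
ip link add br-vx type bridge
ip addr add 192.168.52.144/24 dev br-vx
ip link set br-vx up
ip link set vxlan42 master br-vx

# veth pair: veth1 enslaved to the bridge, veth0 carries the test address
ip link add type veth
ip link set veth1 master br-vx
ip addr add 192.168.62.144/24 dev veth0
ip link set veth0 mtu 1450 up
ip link set veth1 mtu 1450 up
```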

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 15:05 vxlan/veth performance issues on net.git + latest kernels Or Gerlitz
@ 2013-12-03 15:30 ` Eric Dumazet
  2013-12-03 19:55   ` Or Gerlitz
  2013-12-03 17:12 ` Eric Dumazet
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2013-12-03 15:30 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller,
	netdev

On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:
> I've been chasing lately a performance issues which come into play when 
> combining veth and vxlan over fast Ethernet NIC.
> 
> I came across it while working to enable TCP stateless offloads for 
> vxlan encapsulated traffic in the mlx4 driver, but I can clearly see the 
> issue without any HWoffloads involved, so it would be easier to discuss 
> like that (no offloads involved).
> 
> The setup involves a stacked {veth --> bridge --> vlxan --> IP stack --> 
> NIC} or {veth --> ovs+vxlan -->  IP stack --> NIC} chain.
> 
> Basically, in my testbed which uses iperf over 40Gbs Mellanox NICs, 
> vxlan traffic goes up to 5-7Gbs for single session and up to 14Gbs for 
> multiple sessions, as long as veth isn't involved. Once veth is used I 
> can't get to > 7-8Gbs, no matter how many sessions are used. For the 
> time being, I manually took into account the tunneling overhead and 
> reduced the veth pair MTU by 50 bytes.
> 
> Looking on the kernel TCP counters in a {veth --> bridge --> vxlan --> 
> NIC} configuration, on the client side I see lots of hits for the 
> following TCP counters (the numbers are just single sample, I look on 
> the output of iterative sampling every seconds, e.g using "watch -d -n 1 
> netstat -st"):
> 
> 67092 segments retransmited
> 
> 31461 times recovered from packet loss due to SACK data
> Detected reordering 1045142 times using SACK
> 436215 fast retransmits
> 59966 forward retransmits
> 
> Also on the passive side I see hits for the "Quick ack mode was 
> activated N times" counter, see below full snapshot of the counters from 
> both sides.
> 
> Without using veth, e.g when running in a {vxlan -> NIC} or {bridge --> 
> vxlan --> NIC},I see hits only for the "recovered from packet loss due 
> to SACK data" counter and fastretransmits counter,  but not for the 
> forward retransmits or "Detected reordering N timesusing SACK". Also, 
> the quick ack mode counter isn't active on the passive side.
> 
> I tested net.git (3.13-rc2+), 3.12.2 and 3.11.9, I see the same problems 
> on all. At this point I don't really see a past point to go and apply 
> bisection. So I hope this counter report can help to shed some light on 
> the nature of the problem and possible solution, so ideas welcome!!
> 
> without vxlan, these are the Gbs results for 1/2/4 streams over 3.12.2, 
> the results
> for the net.git are pretty much the same.
> 
> 18/32/38  NIC
> 17/30/35  bridge --> NIC
> 14/23/35  veth --> bridge --> NIC
> 
> with vxlan, these are the Gbs results for 1/2/4 streams
> 
> 6/12/14  vxlan --> IP --> NIC
> 5/10/14  bridge --> vxlan --> IP --> NIC
> 6/7/7    veth --> bridge --> vxlan --> IP --> NIC
> 
> Also, the 3.12.2 number do get any better also when adding a ported 
> version of 82d8189826d5 "veth: extend features to support tunneling" on 
> top of 3.12.2
> 
> See @ the end the sequence of commands I use for the environment
> 
> Or.
> 
> 
> --> TCP counters from active side
> 
> # netstat -ts
> IcmpMsg:
>      InType0: 2
>      InType8: 1
>      OutType0: 1
>      OutType3: 4
>      OutType8: 2
> Tcp:
>      189 active connections openings
>      4 passive connection openings
>      0 failed connection attempts
>      0 connection resets received
>      4 connections established
>      22403193 segments received
>      541234150 segments send out
>      14248 segments retransmited
>      0 bad segments received.
>      5 resets sent
> UdpLite:
> TcpExt:
>      2 invalid SYN cookies received
>      178 TCP sockets finished time wait in fast timer
>      10 delayed acks sent
>      Quick ack mode was activated 1 times
>      4 packets directly queued to recvmsg prequeue.
>      3728 packets directly received from backlog
>      2 packets directly received from prequeue
>      2524 packets header predicted
>      4 packets header predicted and directly queued to user
>      19793310 acknowledgments not containing data received
>      1216966 predicted acknowledgments
>      2130 times recovered from packet loss due to SACK data
>      Detected reordering 73 times using FACK
>      Detected reordering 11424 times using SACK
>      55 congestion windows partially recovered using Hoe heuristic
>      TCPDSACKUndo: 457
>      2 congestion windows recovered after partial ack
>      11498 fast retransmits
>      2748 forward retransmits
>      2 other TCP timeouts
>      TCPLossProbes: 4
>      3 DSACKs sent for old packets
>      TCPSackShifted: 1037782
>      TCPSackMerged: 332827
>      TCPSackShiftFallback: 598055
>      TCPRcvCoalesce: 380
>      TCPOFOQueue: 463
>      TCPSpuriousRtxHostQueues: 192
> IpExt:
>      InNoRoutes: 1
>      InMcastPkts: 191
>      OutMcastPkts: 28
>      InBcastPkts: 25
>      InOctets: 1789360097
>      OutOctets: 893757758988
>      InMcastOctets: 8152
>      OutMcastOctets: 3044
>      InBcastOctets: 4259
>      InNoECTPkts: 30117553
> 
> 
> 
> --> TCP counters from passiveside
> 
> netstat -ts
> IcmpMsg:
>      InType0: 1
>      InType8: 2
>      OutType0: 2
>      OutType3: 5
>      OutType8: 1
> Tcp:
>      75 active connections openings
>      140 passive connection openings
>      0 failed connection attempts
>      0 connection resets received
>      4 connections established
>      146888643 segments received
>      27430160 segments send out
>      0 segments retransmited
>      0 bad segments received.
>      6 resets sent
> UdpLite:
> TcpExt:
>      3 invalid SYN cookies received
>      72 TCP sockets finished time wait in fast timer
>      10 delayed acks sent
>      3 delayed acks further delayed because of locked socket
>      Quick ack mode was activated 13548 times
>      4 packets directly queued to recvmsg prequeue.
>      2 packets directly received from prequeue
>      139384763 packets header predicted
>      2 packets header predicted and directly queued to user
>      671 acknowledgments not containing data received
>      938 predicted acknowledgments
>      TCPLossProbes: 2
>      TCPLossProbeRecovery: 1
>      14 DSACKs sent for old packets
>      TCPBacklogDrop: 848

That's bad: packets are being dropped on the receiver.

Check also "ifconfig -a" to see if the rx drop counter is increasing as well.
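
A couple of ways to watch that from the shell (a sketch; the NIC name is 
an assumption):

```shell
# RX drop counter of the NIC (the "dropped" column):
watch -d -n 1 'ip -s link show dev eth0'
# Per-CPU backlog drops (2nd hex column of each row):
watch -d -n 1 'cat /proc/net/softnet_stat'
```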

>      TCPRcvCoalesce: 118368414

Lack of GRO: the receiver seems unable to receive as fast as you
want.

>      TCPOFOQueue: 3167879

So many packets are received out of order (because of losses).

> IpExt:
>      InNoRoutes: 1
>      InMcastPkts: 184
>      OutMcastPkts: 26
>      InBcastPkts: 26
>      InOctets: 1007364296775
>      OutOctets: 2433872888
>      InMcastOctets: 6202
>      OutMcastOctets: 2888
>      InBcastOctets: 4597
>      InNoECTPkts: 702313233
> 
> 
> client side (node 144)
> ----------------------
> 
> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
> ifconfig vxlan42 192.168.42.144/24 up
> 
> brctl addbr br-vx
> ip link set br-vx up
> 
> ifconfig br-vx 192.168.52.144/24 up
> brctl addif br-vx vxlan42
> 
> ip link add type veth
> brctl addif br-vx veth1
> ifconfig veth0 192.168.62.144/24 up
> ip link set veth1 up
> 
> ifconfig veth0 mtu 1450
> ifconfig veth1 mtu 1450
> 
> 
> server side (node 147)
> ----------------------
> 
> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
> ifconfig vxlan42 192.168.42.147/24 up
> 
> brctl addbr br-vx
> ip link set br-vx up
> 
> ifconfig br-vx 192.168.52.147/24 up
> brctl addif br-vx vxlan42
> 
> 
> ip link add type veth
> brctl addif br-vx veth1
> ifconfig veth0 192.168.62.147/24 up
> ip link set veth1 up
> 
> ifconfig veth0 mtu 1450
> ifconfig veth1 mtu 1450
> 
> 


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 15:05 vxlan/veth performance issues on net.git + latest kernels Or Gerlitz
  2013-12-03 15:30 ` Eric Dumazet
@ 2013-12-03 17:12 ` Eric Dumazet
  2013-12-03 19:50   ` Or Gerlitz
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2013-12-03 17:12 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller,
	netdev

On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:

> ----------------------
> 
> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
> ifconfig vxlan42 192.168.42.144/24 up
> 
> brctl addbr br-vx
> ip link set br-vx up
> 
> ifconfig br-vx 192.168.52.144/24 up
> brctl addif br-vx vxlan42
> 
> ip link add type veth

> brctl addif br-vx veth1

Aside question : If I do not have brctl on my host,
how can I use "ip" command to perform this "brctl addif" action ?


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 17:12 ` Eric Dumazet
@ 2013-12-03 19:50   ` Or Gerlitz
  2013-12-03 20:19     ` John Fastabend
  2013-12-03 21:12     ` Eric Dumazet
  0 siblings, 2 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 19:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Or Gerlitz, netdev

On Tue, Dec 3, 2013 at 7:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

>> brctl addbr br-vx
>> ip link set br-vx up
>> ifconfig br-vx 192.168.52.144/24 up
>> brctl addif br-vx vxlan42
>> ip link add type veth
>> brctl addif br-vx veth1

> Aside question : If I do not have brctl on my host,
> how can I use "ip" command to perform this "brctl addif" action ?

You can get the sources from
git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
or from http://sourceforge.net/projects/bridge/files/bridge/bridge-utils-1.5.tar.gz/download;
it should be trivial to configure and build, if I remember correctly.


>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 15:30 ` Eric Dumazet
@ 2013-12-03 19:55   ` Or Gerlitz
  2013-12-03 21:11     ` Joseph Gasparakis
  0 siblings, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 19:55 UTC (permalink / raw)
  To: Eric Dumazet, Jerry Chu
  Cc: Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	David Miller, netdev

On Tue, Dec 3, 2013 at 5:30 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:
>> I've been chasing lately a performance issues which come into play when
>> combining veth and vxlan over fast Ethernet NIC.
>>
>> I came across it while working to enable TCP stateless offloads for
>> vxlan encapsulated traffic in the mlx4 driver, but I can clearly see the
>> issue without any HWoffloads involved, so it would be easier to discuss
>> like that (no offloads involved).
>>
>> The setup involves a stacked {veth --> bridge --> vlxan --> IP stack -->
>> NIC} or {veth --> ovs+vxlan -->  IP stack --> NIC} chain.
>>
>> Basically, in my testbed which uses iperf over 40Gbs Mellanox NICs,
>> vxlan traffic goes up to 5-7Gbs for single session and up to 14Gbs for
>> multiple sessions, as long as veth isn't involved. Once veth is used I
>> can't get to > 7-8Gbs, no matter how many sessions are used. For the
>> time being, I manually took into account the tunneling overhead and
>> reduced the veth pair MTU by 50 bytes.
>>
>> Looking on the kernel TCP counters in a {veth --> bridge --> vxlan -->
>> NIC} configuration, on the client side I see lots of hits for the
>> following TCP counters (the numbers are just single sample, I look on
>> the output of iterative sampling every seconds, e.g using "watch -d -n 1
>> netstat -st"):
>>
>> 67092 segments retransmited
>>
>> 31461 times recovered from packet loss due to SACK data
>> Detected reordering 1045142 times using SACK
>> 436215 fast retransmits
>> 59966 forward retransmits
>>
>> Also on the passive side I see hits for the "Quick ack mode was
>> activated N times" counter, see below full snapshot of the counters from
>> both sides.
>>
>> Without using veth, e.g when running in a {vxlan -> NIC} or {bridge -->
>> vxlan --> NIC},I see hits only for the "recovered from packet loss due
>> to SACK data" counter and fastretransmits counter,  but not for the
>> forward retransmits or "Detected reordering N timesusing SACK". Also,
>> the quick ack mode counter isn't active on the passive side.
>>
>> I tested net.git (3.13-rc2+), 3.12.2 and 3.11.9, I see the same problems
>> on all. At this point I don't really see a past point to go and apply
>> bisection. So I hope this counter report can help to shed some light on
>> the nature of the problem and possible solution, so ideas welcome!!
>>
>> without vxlan, these are the Gbs results for 1/2/4 streams over 3.12.2,
>> the results
>> for the net.git are pretty much the same.
>>
>> 18/32/38  NIC
>> 17/30/35  bridge --> NIC
>> 14/23/35  veth --> bridge --> NIC
>>
>> with vxlan, these are the Gbs results for 1/2/4 streams
>>
>> 6/12/14  vxlan --> IP --> NIC
>> 5/10/14  bridge --> vxlan --> IP --> NIC
>> 6/7/7    veth --> bridge --> vxlan --> IP --> NIC
>>
>> Also, the 3.12.2 number do get any better also when adding a ported
>> version of 82d8189826d5 "veth: extend features to support tunneling" on
>> top of 3.12.2
>>
>> See @ the end the sequence of commands I use for the environment
>>
>> Or.
>>
>>
>> --> TCP counters from active side
>>
>> # netstat -ts
>> IcmpMsg:
>>      InType0: 2
>>      InType8: 1
>>      OutType0: 1
>>      OutType3: 4
>>      OutType8: 2
>> Tcp:
>>      189 active connections openings
>>      4 passive connection openings
>>      0 failed connection attempts
>>      0 connection resets received
>>      4 connections established
>>      22403193 segments received
>>      541234150 segments send out
>>      14248 segments retransmited
>>      0 bad segments received.
>>      5 resets sent
>> UdpLite:
>> TcpExt:
>>      2 invalid SYN cookies received
>>      178 TCP sockets finished time wait in fast timer
>>      10 delayed acks sent
>>      Quick ack mode was activated 1 times
>>      4 packets directly queued to recvmsg prequeue.
>>      3728 packets directly received from backlog
>>      2 packets directly received from prequeue
>>      2524 packets header predicted
>>      4 packets header predicted and directly queued to user
>>      19793310 acknowledgments not containing data received
>>      1216966 predicted acknowledgments
>>      2130 times recovered from packet loss due to SACK data
>>      Detected reordering 73 times using FACK
>>      Detected reordering 11424 times using SACK
>>      55 congestion windows partially recovered using Hoe heuristic
>>      TCPDSACKUndo: 457
>>      2 congestion windows recovered after partial ack
>>      11498 fast retransmits
>>      2748 forward retransmits
>>      2 other TCP timeouts
>>      TCPLossProbes: 4
>>      3 DSACKs sent for old packets
>>      TCPSackShifted: 1037782
>>      TCPSackMerged: 332827
>>      TCPSackShiftFallback: 598055
>>      TCPRcvCoalesce: 380
>>      TCPOFOQueue: 463
>>      TCPSpuriousRtxHostQueues: 192
>> IpExt:
>>      InNoRoutes: 1
>>      InMcastPkts: 191
>>      OutMcastPkts: 28
>>      InBcastPkts: 25
>>      InOctets: 1789360097
>>      OutOctets: 893757758988
>>      InMcastOctets: 8152
>>      OutMcastOctets: 3044
>>      InBcastOctets: 4259
>>      InNoECTPkts: 30117553
>>
>>
>>
>> --> TCP counters from passiveside
>>
>> netstat -ts
>> IcmpMsg:
>>      InType0: 1
>>      InType8: 2
>>      OutType0: 2
>>      OutType3: 5
>>      OutType8: 1
>> Tcp:
>>      75 active connections openings
>>      140 passive connection openings
>>      0 failed connection attempts
>>      0 connection resets received
>>      4 connections established
>>      146888643 segments received
>>      27430160 segments send out
>>      0 segments retransmited
>>      0 bad segments received.
>>      6 resets sent
>> UdpLite:
>> TcpExt:
>>      3 invalid SYN cookies received
>>      72 TCP sockets finished time wait in fast timer
>>      10 delayed acks sent
>>      3 delayed acks further delayed because of locked socket
>>      Quick ack mode was activated 13548 times
>>      4 packets directly queued to recvmsg prequeue.
>>      2 packets directly received from prequeue
>>      139384763 packets header predicted
>>      2 packets header predicted and directly queued to user
>>      671 acknowledgments not containing data received
>>      938 predicted acknowledgments
>>      TCPLossProbes: 2
>>      TCPLossProbeRecovery: 1
>>      14 DSACKs sent for old packets
>>      TCPBacklogDrop: 848
>
> Thats bad : Dropping packets on receiver.
>
> Check also "ifconfig -a" to see if rxdrop is increasing as well.
>
>>      TCPRcvCoalesce: 118368414
>
> lack of GRO : receiver seems to not be able to receive as fast as you want.
>
>>      TCPOFOQueue: 3167879
>
> So many packets are received out of order (because of losses)

I see that there's no GRO also in the non-veth tests which involve
vxlan, and there the receiving side is capable of consuming the
packets. Do you have a rough explanation why adding veth to the chain
is such a game changer that things start falling apart?



>
>> IpExt:
>>      InNoRoutes: 1
>>      InMcastPkts: 184
>>      OutMcastPkts: 26
>>      InBcastPkts: 26
>>      InOctets: 1007364296775
>>      OutOctets: 2433872888
>>      InMcastOctets: 6202
>>      OutMcastOctets: 2888
>>      InBcastOctets: 4597
>>      InNoECTPkts: 702313233
>>
>>
>> client side (node 144)
>> ----------------------
>>
>> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
>> ifconfig vxlan42 192.168.42.144/24 up
>>
>> brctl addbr br-vx
>> ip link set br-vx up
>>
>> ifconfig br-vx 192.168.52.144/24 up
>> brctl addif br-vx vxlan42
>>
>> ip link add type veth
>> brctl addif br-vx veth1
>> ifconfig veth0 192.168.62.144/24 up
>> ip link set veth1 up
>>
>> ifconfig veth0 mtu 1450
>> ifconfig veth1 mtu 1450
>>
>>
>> server side (node 147)
>> ----------------------
>>
>> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
>> ifconfig vxlan42 192.168.42.147/24 up
>>
>> brctl addbr br-vx
>> ip link set br-vx up
>>
>> ifconfig br-vx 192.168.52.147/24 up
>> brctl addif br-vx vxlan42
>>
>>
>> ip link add type veth
>> brctl addif br-vx veth1
>> ifconfig veth0 192.168.62.147/24 up
>> ip link set veth1 up
>>
>> ifconfig veth0 mtu 1450
>> ifconfig veth1 mtu 1450
>>
>>
>
>


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 19:50   ` Or Gerlitz
@ 2013-12-03 20:19     ` John Fastabend
  2013-12-03 21:12     ` Eric Dumazet
  1 sibling, 0 replies; 63+ messages in thread
From: John Fastabend @ 2013-12-03 20:19 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Eric Dumazet, Or Gerlitz, netdev

On 12/3/2013 11:50 AM, Or Gerlitz wrote:
> On Tue, Dec 3, 2013 at 7:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>>> brctl addbr br-vx
>>> ip link set br-vx up
>>> ifconfig br-vx 192.168.52.144/24 up
>>> brctl addif br-vx vxlan42
>>> ip link add type veth
>>> brctl addif br-vx veth1
>
>> Aside question : If I do not have brctl on my host,
>> how can I use "ip" command to perform this "brctl addif" action ?
>
> you can get the sources from
> git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
> or from http://sourceforge.net/projects/bridge/files/bridge/bridge-utils-1.5.tar.gz/download,
> should be trivial to configure and build, if I remember correct
>
>


The 'brctl addif' should be the same as setting the master:

#ip link set dev veth1 master br-vx

and then use nomaster to detach the interface from the bridge.
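
For completeness, membership can then be verified without brctl as well 
(a sketch):

```shell
ip link set dev veth1 master br-vx   # enslave veth1 to the bridge
bridge link show                     # lists bridge ports and their masters
ip link set dev veth1 nomaster       # detach it again
```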

.John


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:11     ` Joseph Gasparakis
@ 2013-12-03 21:09       ` Or Gerlitz
  2013-12-03 21:24         ` Eric Dumazet
  2013-12-03 23:13         ` vxlan/veth performance issues on net.git + latest kernels Joseph Gasparakis
  0 siblings, 2 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 21:09 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis
<joseph.gasparakis@intel.com> wrote:

>>> lack of GRO : receiver seems to not be able to receive as fast as you want.
>>>>      TCPOFOQueue: 3167879
>>> So many packets are received out of order (because of losses)

>> I see that there's no GRO also for the non-veth tests which involve
>> vxlan, and over there the receiving side is capable to consume the
>> packets, do you have rough explaination why adding veth to the chain
>> is such game changer which makes things to start falling out?

> I have seen this before. Here are my findings:
>
> The gso_type is different depending on whether the skb comes from veth
> or not. From veth, you will see SKB_GSO_DODGY set. This breaks things:
> when an skb with DODGY set moves from vxlan to the driver through
> dev_hard_start_xmit, the stack drops it silently. I never got the time
> to find the root cause for this, but I know it causes re-transmissions
> and big performance degradation.
>
> I went as far as just quickly hacking a one-liner that unsets the DODGY
> bit in vxlan.c, and that bypassed the issue and recovered the
> performance, but obviously this is not a real fix.

thanks for the heads up, a few quick questions/clarifications --

-- you are talking about drops done at the sender side, correct? Eric
was saying we have evidence that the drops happen on the receiver.

-- without the hack you did, packets are still sent/received, so what
makes the stack drop only some of them?

-- why would packets coming from veth have the SKB_GSO_DODGY bit set?

-- so where is this one line you commented out now (say in net.git or
3.12.x)? I don't see an explicit setting of SKB_GSO_DODGY in vxlan.c or
in ip_tunnel_core.c / ip_tunnel.c.

Also, I am pretty sure the problem also exists when sending/receiving
guest traffic through tap/macvtap <--> vhost/virtio-net and friends; I
just stuck to the veth flavour b/c it's one (== the hypervisor)
network stack to debug and not two (+ the guest one).
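
One way to confirm where the stack silently frees those skbs is the 
kfree_skb tracepoint (a sketch; assumes a perf build with tracepoint 
support; dropwatch is an alternative if installed):

```shell
# Record every freed skb system-wide for 10 seconds, with call stacks:
perf record -a -g -e skb:kfree_skb -- sleep 10
perf report --stdio | head -40

# Or a live per-symbol drop view with dropwatch:
# dropwatch -l kas
```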


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 19:55   ` Or Gerlitz
@ 2013-12-03 21:11     ` Joseph Gasparakis
  2013-12-03 21:09       ` Or Gerlitz
  0 siblings, 1 reply; 63+ messages in thread
From: Joseph Gasparakis @ 2013-12-03 21:11 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev



On Tue, 3 Dec 2013, Or Gerlitz wrote:

> On Tue, Dec 3, 2013 at 5:30 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:
> >> I've been chasing lately a performance issues which come into play when
> >> combining veth and vxlan over fast Ethernet NIC.
> >>
> >> I came across it while working to enable TCP stateless offloads for
> >> vxlan encapsulated traffic in the mlx4 driver, but I can clearly see the
> >> issue without any HWoffloads involved, so it would be easier to discuss
> >> like that (no offloads involved).
> >>
> >> The setup involves a stacked {veth --> bridge --> vlxan --> IP stack -->
> >> NIC} or {veth --> ovs+vxlan -->  IP stack --> NIC} chain.
> >>
> >> Basically, in my testbed which uses iperf over 40Gbs Mellanox NICs,
> >> vxlan traffic goes up to 5-7Gbs for single session and up to 14Gbs for
> >> multiple sessions, as long as veth isn't involved. Once veth is used I
> >> can't get to > 7-8Gbs, no matter how many sessions are used. For the
> >> time being, I manually took into account the tunneling overhead and
> >> reduced the veth pair MTU by 50 bytes.
> >>
> >> Looking on the kernel TCP counters in a {veth --> bridge --> vxlan -->
> >> NIC} configuration, on the client side I see lots of hits for the
> >> following TCP counters (the numbers are just single sample, I look on
> >> the output of iterative sampling every seconds, e.g using "watch -d -n 1
> >> netstat -st"):
> >>
> >> 67092 segments retransmited
> >>
> >> 31461 times recovered from packet loss due to SACK data
> >> Detected reordering 1045142 times using SACK
> >> 436215 fast retransmits
> >> 59966 forward retransmits
> >>
> >> Also on the passive side I see hits for the "Quick ack mode was
> >> activated N times" counter, see below full snapshot of the counters from
> >> both sides.
> >>
> >> Without using veth, e.g when running in a {vxlan -> NIC} or {bridge -->
> >> vxlan --> NIC},I see hits only for the "recovered from packet loss due
> >> to SACK data" counter and fastretransmits counter,  but not for the
> >> forward retransmits or "Detected reordering N timesusing SACK". Also,
> >> the quick ack mode counter isn't active on the passive side.
> >>
> >> I tested net.git (3.13-rc2+), 3.12.2 and 3.11.9, I see the same problems
> >> on all. At this point I don't really see a past point to go and apply
> >> bisection. So I hope this counter report can help to shed some light on
> >> the nature of the problem and possible solution, so ideas welcome!!
> >>
> >> without vxlan, these are the Gbs results for 1/2/4 streams over 3.12.2,
> >> the results
> >> for the net.git are pretty much the same.
> >>
> >> 18/32/38  NIC
> >> 17/30/35  bridge --> NIC
> >> 14/23/35  veth --> bridge --> NIC
> >>
> >> with vxlan, these are the Gbs results for 1/2/4 streams
> >>
> >> 6/12/14  vxlan --> IP --> NIC
> >> 5/10/14  bridge --> vxlan --> IP --> NIC
> >> 6/7/7    veth --> bridge --> vxlan --> IP --> NIC
> >>
> >> Also, the 3.12.2 number do get any better also when adding a ported
> >> version of 82d8189826d5 "veth: extend features to support tunneling" on
> >> top of 3.12.2
> >>
> >> See @ the end the sequence of commands I use for the environment
> >>
> >> Or.
> >>
> >>
> >> --> TCP counters from active side
> >>
> >> # netstat -ts
> >> IcmpMsg:
> >>      InType0: 2
> >>      InType8: 1
> >>      OutType0: 1
> >>      OutType3: 4
> >>      OutType8: 2
> >> Tcp:
> >>      189 active connections openings
> >>      4 passive connection openings
> >>      0 failed connection attempts
> >>      0 connection resets received
> >>      4 connections established
> >>      22403193 segments received
> >>      541234150 segments send out
> >>      14248 segments retransmited
> >>      0 bad segments received.
> >>      5 resets sent
> >> UdpLite:
> >> TcpExt:
> >>      2 invalid SYN cookies received
> >>      178 TCP sockets finished time wait in fast timer
> >>      10 delayed acks sent
> >>      Quick ack mode was activated 1 times
> >>      4 packets directly queued to recvmsg prequeue.
> >>      3728 packets directly received from backlog
> >>      2 packets directly received from prequeue
> >>      2524 packets header predicted
> >>      4 packets header predicted and directly queued to user
> >>      19793310 acknowledgments not containing data received
> >>      1216966 predicted acknowledgments
> >>      2130 times recovered from packet loss due to SACK data
> >>      Detected reordering 73 times using FACK
> >>      Detected reordering 11424 times using SACK
> >>      55 congestion windows partially recovered using Hoe heuristic
> >>      TCPDSACKUndo: 457
> >>      2 congestion windows recovered after partial ack
> >>      11498 fast retransmits
> >>      2748 forward retransmits
> >>      2 other TCP timeouts
> >>      TCPLossProbes: 4
> >>      3 DSACKs sent for old packets
> >>      TCPSackShifted: 1037782
> >>      TCPSackMerged: 332827
> >>      TCPSackShiftFallback: 598055
> >>      TCPRcvCoalesce: 380
> >>      TCPOFOQueue: 463
> >>      TCPSpuriousRtxHostQueues: 192
> >> IpExt:
> >>      InNoRoutes: 1
> >>      InMcastPkts: 191
> >>      OutMcastPkts: 28
> >>      InBcastPkts: 25
> >>      InOctets: 1789360097
> >>      OutOctets: 893757758988
> >>      InMcastOctets: 8152
> >>      OutMcastOctets: 3044
> >>      InBcastOctets: 4259
> >>      InNoECTPkts: 30117553
> >>
> >>
> >>
> >> --> TCP counters from passiveside
> >>
> >> netstat -ts
> >> IcmpMsg:
> >>      InType0: 1
> >>      InType8: 2
> >>      OutType0: 2
> >>      OutType3: 5
> >>      OutType8: 1
> >> Tcp:
> >>      75 active connections openings
> >>      140 passive connection openings
> >>      0 failed connection attempts
> >>      0 connection resets received
> >>      4 connections established
> >>      146888643 segments received
> >>      27430160 segments send out
> >>      0 segments retransmited
> >>      0 bad segments received.
> >>      6 resets sent
> >> UdpLite:
> >> TcpExt:
> >>      3 invalid SYN cookies received
> >>      72 TCP sockets finished time wait in fast timer
> >>      10 delayed acks sent
> >>      3 delayed acks further delayed because of locked socket
> >>      Quick ack mode was activated 13548 times
> >>      4 packets directly queued to recvmsg prequeue.
> >>      2 packets directly received from prequeue
> >>      139384763 packets header predicted
> >>      2 packets header predicted and directly queued to user
> >>      671 acknowledgments not containing data received
> >>      938 predicted acknowledgments
> >>      TCPLossProbes: 2
> >>      TCPLossProbeRecovery: 1
> >>      14 DSACKs sent for old packets
> >>      TCPBacklogDrop: 848
> >
> > Thats bad : Dropping packets on receiver.
> >
> > Check also "ifconfig -a" to see if rxdrop is increasing as well.
> >
> >>      TCPRcvCoalesce: 118368414
> >
> > lack of GRO : receiver seems to not be able to receive as fast as you want.
> >
> >>      TCPOFOQueue: 3167879
> >
> > So many packets are received out of order (because of losses)
> 
> I see that there's no GRO also for the non-veth tests which involve
> vxlan, and there the receiving side is able to consume the packets.
> Do you have a rough explanation why adding veth to the chain is such
> a game changer that makes things start falling apart?
> 

I have seen this before. Here are my findings:

The gso_type is different depending on whether the skb comes through
veth or not. From veth, you will see SKB_GSO_DODGY set. This breaks
things: when an skb with DODGY set moves from vxlan to the driver
through dev_hard_start_xmit(), the stack drops it silently. I never got
the time to find the root cause for this, but I know it causes
re-transmissions and big performance degradation.

I went as far as quickly hacking a one-liner that unsets the DODGY bit
in vxlan.c, and that bypassed the issue and recovered the performance,
but obviously this is not a real fix.

> 
> 
> >
> >> IpExt:
> >>      InNoRoutes: 1
> >>      InMcastPkts: 184
> >>      OutMcastPkts: 26
> >>      InBcastPkts: 26
> >>      InOctets: 1007364296775
> >>      OutOctets: 2433872888
> >>      InMcastOctets: 6202
> >>      OutMcastOctets: 2888
> >>      InBcastOctets: 4597
> >>      InNoECTPkts: 702313233
> >>
> >>
> >> client side (node 144)
> >> ----------------------
> >>
> >> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
> >> ifconfig vxlan42 192.168.42.144/24 up
> >>
> >> brctl addbr br-vx
> >> ip link set br-vx up
> >>
> >> ifconfig br-vx 192.168.52.144/24 up
> >> brctl addif br-vx vxlan42
> >>
> >> ip link add type veth
> >> brctl addif br-vx veth1
> >> ifconfig veth0 192.168.62.144/24 up
> >> ip link set veth1 up
> >>
> >> ifconfig veth0 mtu 1450
> >> ifconfig veth1 mtu 1450
> >>
> >>
> >> server side (node 147)
> >> ----------------------
> >>
> >> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
> >> ifconfig vxlan42 192.168.42.147/24 up
> >>
> >> brctl addbr br-vx
> >> ip link set br-vx up
> >>
> >> ifconfig br-vx 192.168.52.147/24 up
> >> brctl addif br-vx vxlan42
> >>
> >>
> >> ip link add type veth
> >> brctl addif br-vx veth1
> >> ifconfig veth0 192.168.62.147/24 up
> >> ip link set veth1 up
> >>
> >> ifconfig veth0 mtu 1450
> >> ifconfig veth1 mtu 1450
> >>
> >>
> >
> >
> 


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 19:50   ` Or Gerlitz
  2013-12-03 20:19     ` John Fastabend
@ 2013-12-03 21:12     ` Eric Dumazet
  1 sibling, 0 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-03 21:12 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Or Gerlitz, netdev

On Tue, 2013-12-03 at 21:50 +0200, Or Gerlitz wrote:

> you can get the sources from
> git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
> or from http://sourceforge.net/projects/bridge/files/bridge/bridge-utils-1.5.tar.gz/download,
> should be trivial to configure and build, if I remember correct

Well, I did not elaborate.

Let's say I have some hosts where I cannot install a new binary, for
whatever reason ;)

Thanks, John Fastabend replied to the question.


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:09       ` Or Gerlitz
@ 2013-12-03 21:24         ` Eric Dumazet
  2013-12-03 21:36           ` Or Gerlitz
  2013-12-03 23:13         ` vxlan/veth performance issues on net.git + latest kernels Joseph Gasparakis
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2013-12-03 21:24 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Tue, 2013-12-03 at 23:09 +0200, Or Gerlitz wrote:
> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
> 
> >>> lack of GRO : receiver seems to not be able to receive as fast as you want.
> >>>>      TCPOFOQueue: 3167879
> >>> So many packets are received out of order (because of losses)
> 
> >> I see that there's no GRO also for the non-veth tests which involve
> >> vxlan, and over there the receiving side is capable to consume the
> >> packets, do you have rough explaination why adding veth to the chain
> >> is such game changer which makes things to start falling out?
> 
> > I have seen this before. Here are my findings:
> >
> > The gso_type is different if the skb comes from veth or not. From veth,
> > you will see the SKB_GSO_DODGY set. This breaks things as when the
> > skb with DODGY set moves from vxlan to the driver through dev_xmit_hard,
> > the stack drops it silently. I never got the time to find the root cause
> > for this, but I know it causes re-transmissions and big performance
> > degregation.
> >
> > I went as far as just quickly hacking a one liner unsetting the DODGY bit
> > in vxlan.c and that bypassed the issue and recovered the performance
> > problem, but obviously this is not a real fix.
> 
> thanks for the heads up, a few quick questions/clarifications --
> 
> -- you are talking about drops done @ the sender side, correct? Eric
> was saying we have evidence that the drops happen on the receiver.

I suggested you take a look at the receiver, like "ifconfig -a"

I suspect one cpu is 100% in softirq mode, draining packets from the
NIC and feeding the IP / TCP stack.

Because of the vxlan encap, all the packets are delivered to a single RX
queue (I don't think mlx4 is able to look at the inner header to get L4
info)

mpstat -P ALL 10 10


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:24         ` Eric Dumazet
@ 2013-12-03 21:36           ` Or Gerlitz
  2013-12-03 21:50             ` David Miller
  0 siblings, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 21:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Joseph Gasparakis, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Tue, Dec 3, 2013 at 11:24 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 23:09 +0200, Or Gerlitz wrote:
>> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis
>> <joseph.gasparakis@intel.com> wrote:
>>
>> >>> lack of GRO : receiver seems to not be able to receive as fast as you want.
>> >>>>      TCPOFOQueue: 3167879
>> >>> So many packets are received out of order (because of losses)
>>
>> >> I see that there's no GRO also for the non-veth tests which involve
>> >> vxlan, and over there the receiving side is capable to consume the
>> >> packets, do you have rough explaination why adding veth to the chain
>> >> is such game changer which makes things to start falling out?
>>
>> > I have seen this before. Here are my findings:
>> >
>> > The gso_type is different if the skb comes from veth or not. From veth,
>> > you will see the SKB_GSO_DODGY set. This breaks things as when the
>> > skb with DODGY set moves from vxlan to the driver through dev_xmit_hard,
>> > the stack drops it silently. I never got the time to find the root cause
>> > for this, but I know it causes re-transmissions and big performance
>> > degregation.
>> >
>> > I went as far as just quickly hacking a one liner unsetting the DODGY bit
>> > in vxlan.c and that bypassed the issue and recovered the performance
>> > problem, but obviously this is not a real fix.
>>
>> thanks for the heads up, few quick questions/clafications --
>>
>> -- you are talking on drops done @ the sender side, correct? Eric was
>> saying we have evidences that the drops happen on the receiver.
>
> I suggested you take a look at the receiver, like "ifconfig -a"

Eric, sorry, I am away from the system now; I will try to get some
access and report back now, and if not, tomorrow. But


> I suspect one cpu is 100% in sofirq mode draining packets from the NIC
> and feeding IP / TCP stack.

> Because of vxlan encap, all the packets are delivered to a single RX
> queue (I dont think mlx4 is able to look at inner header to get L4 info)

With the new card, ConnectX3-pro, we are able to look at the inner
headers and do RX/TX checksum and LSO for the encapsulated traffic;
this is how I initially got into this problem. But as I wrote earlier,
I was able to see the problem without activating the offloads for the
inner packets. Sorry if I didn't mention that, but from the mlx4_en NIC
driver's point of view, different streams do map to different RX
queues, because the HW does RSS on the outer (UDP) header and the
sender vxlan code uses a few sockets to send multiple streams, each
having a different source UDP port. For the "outer RSS" you don't need
the -pro card, just make sure the udp_rss module param of mlx4 is set.

I also thought that under veth there's a contention point which could
explain why packets are dropped, but I haven't found it.


> mpstat -P ALL 10 10


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:36           ` Or Gerlitz
@ 2013-12-03 21:50             ` David Miller
  2013-12-03 21:55               ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: David Miller @ 2013-12-03 21:50 UTC (permalink / raw)
  To: or.gerlitz
  Cc: eric.dumazet, joseph.gasparakis, hkchu, ogerlitz, edumazet, ast,
	pshelar, netdev

From: Or Gerlitz <or.gerlitz@gmail.com>
Date: Tue, 3 Dec 2013 23:36:50 +0200

> I also thought that under veth there's contention point which could
> explain why packets are dropped, but haven't found it.

At this point I would use drop monitor to figure out in what context
packets are being dropped on the floor.  There are scripts provided
with the perf tool to utilize it.

I suspect what you will find is that either the cpu is at 100%, or
sporadic events where GRO is not performed are killing the stream.


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:50             ` David Miller
@ 2013-12-03 21:55               ` Eric Dumazet
  2013-12-03 22:15                 ` Or Gerlitz
                                   ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-03 21:55 UTC (permalink / raw)
  To: David Miller
  Cc: or.gerlitz, joseph.gasparakis, hkchu, ogerlitz, edumazet, ast,
	pshelar, netdev

On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:

> At this point I would use drop monitor to figure out in what context
> packets are being dropped on the floor.  There are scripts provided
> with the perf tool to utilize it.

Most easy way is to do :

perf record -e skb:kfree_skb -a -g sleep 10

perf report


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:55               ` Eric Dumazet
@ 2013-12-03 22:15                 ` Or Gerlitz
  2013-12-03 22:22                 ` Or Gerlitz
  2013-12-03 23:10                 ` Or Gerlitz
  2 siblings, 0 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 22:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	netdev@vger.kernel.org

On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:
>
>> At this point I would use drop monitor to figure out in what context
>> packets are being dropped on the floor.  There are scripts provided
>> with the perf tool to utilize it.
>
> Most easy way is to do :
>
> perf record -e skb:kfree_skb -a -g sleep 10
>
> perf report

The version of perf I have on these nodes fails to run the 1st
command; anyway, here's some data which Eric asked for.

passive side top + plain perf for two streams

top - 00:08:09 up  7:53,  3 users,  load average: 0.59, 0.43, 0.32
Tasks: 134 total,   1 running, 133 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  5.7%sy,  0.0%ni, 92.2%id,  0.0%wa,  0.0%hi,  2.0%si,  0.0%st
Cpu1  :  0.6%us, 17.2%sy,  0.0%ni, 30.6%id,  0.0%wa,  0.0%hi, 51.7%si,  0.0%st
Cpu2  :  0.7%us,  4.3%sy,  0.0%ni, 93.6%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
Cpu3  :  0.0%us,  1.7%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us, 14.4%sy,  0.0%ni, 50.0%id,  0.0%wa,  0.0%hi, 35.6%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8183220k total,  1383012k used,  6800208k free,    10724k buffers
Swap:  2097148k total,        0k used,  2097148k free,   127520k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3857 root      20   0  381m  640  468 S 69.2  0.0   6:45.22 iperf
   15 root      20   0     0    0    0 S  7.3  0.0   0:09.92 ksoftirqd/1
   40 root      20   0     0    0    0 S  7.3  0.0   0:12.10 ksoftirqd/6
   20 root      20   0     0    0    0 S  0.3  0.0   0:20.95 ksoftirqd/2
 9229 root      20   0  220m  16m 6168 S  0.3  0.2   0:00.78 perf
 9578 root      20   0 15084 1156  856 R  0.3  0.0   0:00.01 top
    1 root      20   0 23648 1644 1320 S  0.0  0.0   0:00.34 init


Samples: 1K of event 'cpu-clock', Event count (approx.): 283500000
  5.17%  [kernel]            [k] fib_table_lookup
  5.02%  [kernel]            [k] __do_softirq
  3.63%  [kernel]            [k] copy_user_generic_unrolled
  3.55%  perf                [.] 0x000000000004c808
  3.16%  [kernel]            [k] __netif_receive_skb_core
  3.09%  [kernel]            [k] _raw_spin_unlock_irqrestore
  2.93%  [kernel]            [k] enqueue_to_backlog
  2.85%  [kernel]            [k] _raw_spin_lock
  2.08%  [kernel]            [k] __pskb_pull_tail
  1.93%  [kernel]            [k] __udp4_lib_lookup
  1.85%  [kernel]            [k] ip_rcv
  1.77%  [kernel]            [k] __slab_free
  1.62%  [kernel]            [k] _raw_spin_unlock_irq
  1.54%  [kernel]            [k] check_leaf
  1.47%  [kernel]            [k] pvclock_clocksource_read
  1.47%  [kernel]            [k] skb_copy_bits
  1.31%  [kernel]            [k] __rcu_read_unlock
  1.31%  [kernel]            [k] tcp_v4_rcv
  1.23%  [mlx4_en]           [k] mlx4_en_process_rx_cq
  1.00%  [kernel]            [k] skb_try_coalesce
  1.00%  [kernel]            [k] napi_gro_frags
  1.00%  [kernel]            [k] __inet_lookup_established
  0.93%  [mlx4_en]           [k] mlx4_en_xmit


ifconfig -a listing (eth2 is the NIC over which we run, br1 is the
bridge) - no recorded drops

[root@r-dcs47-005 ~]# ifconfig -a
br1       Link encap:Ethernet  HWaddr 1A:10:63:AD:55:4C
          inet addr:192.168.52.147  Bcast:192.168.52.255  Mask:255.255.255.0
          inet6 addr: fe80::f88c:60ff:fe19:6d33/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:32120731 errors:0 dropped:0 overruns:0 frame:0
          TX packets:987550 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:46552789260 (43.3 GiB)  TX bytes:53462960 (50.9 MiB)

eth0      Link encap:Ethernet  HWaddr 00:50:56:25:4A:05
          inet addr:10.212.74.5  Bcast:10.212.255.255  Mask:255.255.0.0
          inet6 addr: fe80::250:56ff:fe25:4a05/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:265478 errors:0 dropped:16 overruns:0 frame:0
          TX packets:14391 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28861836 (27.5 MiB)  TX bytes:2561303 (2.4 MiB)

eth2      Link encap:Ethernet  HWaddr 00:02:C9:E9:C0:82
          inet addr:192.168.30.147  Bcast:192.168.30.255  Mask:255.255.255.0
          inet6 addr: fe80::2:c900:1e9:c082/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:853813090 errors:0 dropped:0 overruns:0 frame:0
          TX packets:76377493 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1270293325678 (1.1 TiB)  TX bytes:7858334980 (7.3 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:48 errors:0 dropped:0 overruns:0 frame:0
          TX packets:48 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3236 (3.1 KiB)  TX bytes:3236 (3.1 KiB)

veth0     Link encap:Ethernet  HWaddr EA:4F:C9:1C:5D:EE
          inet addr:192.168.62.147  Bcast:192.168.62.255  Mask:255.255.255.0
          inet6 addr: fe80::e84f:c9ff:fe1c:5dee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:372217768 errors:0 dropped:0 overruns:0 frame:0
          TX packets:60732630 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:544449586684 (507.0 GiB)  TX bytes:3840742148 (3.5 GiB)

veth1     Link encap:Ethernet  HWaddr 1A:10:63:AD:55:4C
          inet6 addr: fe80::1810:63ff:fead:554c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:60732693 errors:0 dropped:0 overruns:0 frame:0
          TX packets:372217836 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3840746210 (3.5 GiB)  TX bytes:544449686236 (507.0 GiB)

vxlan42   Link encap:Ethernet  HWaddr FE:EF:4E:C7:0F:06
          inet addr:192.168.42.147  Bcast:192.168.42.255  Mask:255.255.255.0
          inet6 addr: fe80::fcef:4eff:fec7:f06/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:404338687 errors:0 dropped:0 overruns:0 frame:0
          TX packets:61720259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:585791638636 (545.5 GiB)  TX bytes:4881734478 (4.5 GiB)



active side perf top

Samples: 74K of event 'cpu-clock', Event count (approx.): 12653268255
 14.74%  [kernel]            [k] __copy_user_nocache
  7.91%  [kernel]            [k] csum_partial
  7.51%  [kernel]            [k] _raw_spin_lock
  6.38%  [kernel]            [k] _raw_spin_unlock_irqrestore
  5.57%  [kernel]            [k] __do_softirq
  4.24%  [mlx4_en]           [k] mlx4_en_xmit
  2.40%  [kernel]            [k] __copy_skb_header
  2.10%  [kernel]            [k] _raw_spin_unlock_irq
  2.04%  [kernel]            [k] memcpy
  1.92%  [kernel]            [k] fib_table_lookup
  1.73%  [kernel]            [k] tcp_sendmsg
  1.64%  [kernel]            [k] skb_segment
  1.52%  [kernel]            [k] __slab_free
  1.09%  [kernel]            [k] __alloc_skb
  0.89%  [kernel]            [k] __slab_alloc
  0.85%  [kernel]            [k] tcp_ack
  0.83%  [kernel]            [k] __netif_receive_skb_core
  0.81%  [kernel]            [k] __kmalloc_node_track_caller
  0.75%  [kernel]            [k] ip_send_check
  0.70%  [kernel]            [k] put_compound_page
  0.70%  [kernel]            [k] ksize
  0.67%  [kernel]            [k] pvclock_clocksource_read
  0.65%  [kernel]            [k] dev_hard_start_xmit
  0.61%  [kernel]            [k] kmem_cache_alloc_node
  0.58%  [kernel]            [k] dev_queue_xmit
  0.55%  [kernel]            [k] enqueue_to_backlog
  0.52%  [kernel]            [k] __pskb_pull_tail
  0.49%  [kernel]            [k] __iowrite64_copy
  0.45%  [kernel]            [k] dev_queue_xmit_nit
  0.45%  [kernel]            [k] skb_copy_bits
  0.44%  [kernel]            [k] check_leaf
  0.44%  [kernel]            [k] skb_release_data
  0.44%  [kernel]            [k] get_page_from_freelist
  0.43%  [kernel]            [k] inet_gso_segment
  0.42%  [mlx4_en]           [k] mlx4_en_process_rx_cq
  0.39%  [kernel]            [k] process_backlog
  0.38%  [kernel]            [k] pskb_expand_head

and top


[root@r-dcs44-005 ~]# top
top - 00:13:27 up  7:59,  3 users,  load average: 1.11, 0.76, 0.44
Tasks: 129 total,   1 running, 128 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us, 10.3%sy,  0.0%ni, 87.0%id,  0.0%wa,  0.0%hi,  2.7%si,  0.0%st
Cpu1  :  0.3%us, 11.7%sy,  0.0%ni, 84.3%id,  0.0%wa,  0.0%hi,  3.7%si,  0.0%st
Cpu2  :  0.5%us, 18.4%sy,  0.0%ni, 43.4%id,  0.0%wa,  0.0%hi, 37.8%si,  0.0%st
Cpu3  :  0.0%us,  5.1%sy,  0.0%ni, 94.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.3%us,  6.0%sy,  0.0%ni, 93.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu6  :  0.3%us,  7.4%sy,  0.0%ni, 92.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us, 24.4%sy,  0.0%ni, 42.1%id,  0.0%wa,  0.0%hi, 33.5%si,  0.0%st
Mem:   8183236k total,  1378928k used,  6804308k free,    10436k buffers
Swap:  2097148k total,        0k used,  2097148k free,   128068k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
17220 root      20   0  165m  564  456 S 122.4  0.0   6:14.34 iperf
   20 root      20   0     0    0    0 S  1.3  0.0   0:07.85 ksoftirqd/2
   45 root      20   0     0    0    0 S  1.3  0.0   0:05.65 ksoftirqd/7
18404 root      20   0  220m  16m 6384 S  0.7  0.2   0:00.62 perf
   35 root      20   0     0    0    0 S  0.3  0.0   0:07.73 ksoftirqd/5
    1 root      20   0 23520 1624 1316 S  0.0  0.0   0:00.33 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd



and ifconfig -a (eth6 is the NIC over which we run)

root@r-dcs44-005 ~]# ifconfig -a
br1       Link encap:Ethernet  HWaddr 12:C0:46:32:46:6A
          inet addr:192.168.52.144  Bcast:192.168.52.255  Mask:255.255.255.0
          inet6 addr: fe80::2815:30ff:fe06:b5f9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:987575 errors:0 dropped:0 overruns:0 frame:0
          TX packets:842756 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:39638896 (37.8 MiB)  TX bytes:45313678898 (42.2 GiB)

eth0      Link encap:Ethernet  HWaddr 00:50:56:25:4B:05
          inet addr:10.212.75.5  Bcast:10.212.255.255  Mask:255.255.0.0
          inet6 addr: fe80::250:56ff:fe25:4b05/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:193220 errors:0 dropped:55 overruns:0 frame:0
          TX packets:17721 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17006297 (16.2 MiB)  TX bytes:2871741 (2.7 MiB)

eth6      Link encap:Ethernet  HWaddr 00:02:C9:E9:BB:B2
          inet addr:192.168.30.144  Bcast:192.168.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:101512678 errors:0 dropped:0 overruns:0 frame:0
          TX packets:995876432 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:10707213944 (9.9 GiB)  TX bytes:1485292875970 (1.3 TiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:36 errors:0 dropped:0 overruns:0 frame:0
          TX packets:36 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2456 (2.3 KiB)  TX bytes:2456 (2.3 KiB)

veth0     Link encap:Ethernet  HWaddr 32:7D:34:FA:3A:A1
          inet addr:192.168.62.144  Bcast:192.168.62.255  Mask:255.255.255.0
          inet6 addr: fe80::307d:34ff:fefa:3aa1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:85847361 errors:0 dropped:0 overruns:0 frame:0
          TX packets:72371900 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5431514098 (5.0 GiB)  TX bytes:728410479216 (678.3 GiB)

veth1     Link encap:Ethernet  HWaddr 12:C0:46:32:46:6A
          inet6 addr: fe80::10c0:46ff:fe32:466a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:72371915 errors:0 dropped:0 overruns:0 frame:0
          TX packets:85847368 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:728410821246 (678.3 GiB)  TX bytes:5431514476 (5.0 GiB)

vxlan42   Link encap:Ethernet  HWaddr B2:F9:D2:68:A3:11
          inet addr:192.168.42.144  Bcast:192.168.42.255  Mask:255.255.255.0
          inet6 addr: fe80::b0f9:d2ff:fe68:a311/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:86834927 errors:0 dropped:0 overruns:0 frame:0
          TX packets:73214688 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4269288656 (3.9 GiB)  TX bytes:774896089976 (721.6 GiB)

no drops


* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:55               ` Eric Dumazet
  2013-12-03 22:15                 ` Or Gerlitz
@ 2013-12-03 22:22                 ` Or Gerlitz
  2013-12-03 22:30                   ` Hannes Frederic Sowa
  2013-12-03 23:10                 ` Or Gerlitz
  2 siblings, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 22:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	netdev@vger.kernel.org

On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:
>
>> At this point I would use drop monitor to figure out in what context
>> packets are being dropped on the floor.  There are scripts provided
>> with the perf tool to utilize it.
>
> Most easy way is to do :
>
> perf record -e skb:kfree_skb -a -g sleep 10

some typo here? I tried the perf tool that comes with the net git

./perf record -e skb:kfree_skb -a -g sleep 10
invalid or unsupported event: 'skb:kfree_skb'
Run 'perf list' for a list of valid events

 usage: perf record [<options>] [<command>]
    or: perf record [<options>] -- <command> [<options>]

    -e, --event <event>   event selector. use 'perf list' to list
available events

>
> perf report
>
>
>


anything critical missing here?

perf]# make
  BUILD:   Doing 'make -j16' parallel build

Auto-detecting system features:
...                     backtrace: [ on  ]
...                         dwarf: [ on  ]
...                fortify-source: [ on  ]
...                         glibc: [ on  ]
...                          gtk2: [ OFF ]
...                  gtk2-infobar: [ OFF ]
...                      libaudit: [ OFF ]
...                        libbfd: [ on  ]
...                        libelf: [ on  ]
...             libelf-getphdrnum: [ on  ]
...                   libelf-mmap: [ on  ]
...                       libnuma: [ OFF ]
...                       libperl: [ on  ]
...                     libpython: [ on  ]
...             libpython-version: [ on  ]
...                      libslang: [ on  ]
...                     libunwind: [ OFF ]
...                       on-exit: [ on  ]
...                stackprotector: [ on  ]
...            stackprotector-all: [ on  ]
...                       timerfd: [ on  ]

config/Makefile:329: No libunwind found, disabling post unwind
support. Please install libunwind-dev[el] >= 1.1
config/Makefile:354: No libaudit.h found, disables 'trace' tool,
please install audit-libs-devel or libaudit-dev
config/Makefile:381: GTK2 not found, disables GTK2 support. Please
install gtk2-devel or libgtk2.0-dev
config/Makefile:536: No numa.h found, disables 'perf bench numa mem'
benchmark, please install numa-libs-devel or libnuma-dev

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 22:22                 ` Or Gerlitz
@ 2013-12-03 22:30                   ` Hannes Frederic Sowa
  2013-12-03 22:35                     ` Or Gerlitz
  0 siblings, 1 reply; 63+ messages in thread
From: Hannes Frederic Sowa @ 2013-12-03 22:30 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric Dumazet, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	netdev@vger.kernel.org

On Wed, Dec 04, 2013 at 12:22:08AM +0200, Or Gerlitz wrote:
> On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:
> >
> >> At this point I would use drop monitor to figure out in what context
> >> packets are being dropped on the floor.  There are scripts provided
> >> with the perf tool to utilize it.
> >
> > Most easy way is to do :
> >
> > perf record -e skb:kfree_skb -a -g sleep 10
> 
> some typo here? I tried the perf tool that comes with the net git
> 
> ./perf record -e skb:kfree_skb -a -g sleep 10
> invalid or unsupported event: 'skb:kfree_skb'
> Run 'perf list' for a list of valid events
> 
>  usage: perf record [<options>] [<command>]
>     or: perf record [<options>] -- <command> [<options>]
> 
>     -e, --event <event>   event selector. use 'perf list' to list
> available events

-g takes an optional argument. Reorder the arguments:

perf record -e skb:kfree_skb -g -a sleep 10

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 22:30                   ` Hannes Frederic Sowa
@ 2013-12-03 22:35                     ` Or Gerlitz
  2013-12-03 22:39                       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 22:35 UTC (permalink / raw)
  To: Or Gerlitz, Eric Dumazet, David Miller, Joseph Gasparakis,
	Jerry Chu, Or Gerlitz, Eric Dumazet, Alexei Starovoitov,
	Pravin B Shelar, netdev@vger.kernel.org

On Wed, Dec 4, 2013 at 12:30 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Wed, Dec 04, 2013 at 12:22:08AM +0200, Or Gerlitz wrote:
>> On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:
>> >
>> >> At this point I would use drop monitor to figure out in what context
>> >> packets are being dropped on the floor.  There are scripts provided
>> >> with the perf tool to utilize it.
>> >
>> > Most easy way is to do :
>> >
>> > perf record -e skb:kfree_skb -a -g sleep 10
>>
>> some typo here? I tried the perf tool that comes with the net git
>>
>> ./perf record -e skb:kfree_skb -a -g sleep 10
>> invalid or unsupported event: 'skb:kfree_skb'
>> Run 'perf list' for a list of valid events
>>
>>  usage: perf record [<options>] [<command>]
>>     or: perf record [<options>] -- <command> [<options>]
>>
>>     -e, --event <event>   event selector. use 'perf list' to list
>> available events
>
> -g takes an optional argument. Reorder the arguments:
>
> perf record -e skb:kfree_skb -g -a sleep 10


Sorry, it doesn't help, I get the same error even with the perf
kernel.org tree which has these events (no sign of skb tracepoints)


List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 22:35                     ` Or Gerlitz
@ 2013-12-03 22:39                       ` Hannes Frederic Sowa
  0 siblings, 0 replies; 63+ messages in thread
From: Hannes Frederic Sowa @ 2013-12-03 22:39 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric Dumazet, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	netdev@vger.kernel.org

On Wed, Dec 04, 2013 at 12:35:37AM +0200, Or Gerlitz wrote:
> Sorry, it doesn't help, I get the same error even with the perf
> kernel.org tree which has these events (no sign of skb tracepoints)
> 
> 
> List of pre-defined events (to be used in -e):
>   cpu-cycles OR cycles                               [Hardware event]
>   instructions                                       [Hardware event]
>   cache-references                                   [Hardware event]
>   cache-misses                                       [Hardware event]
>   branch-instructions OR branches                    [Hardware event]
>   branch-misses                                      [Hardware event]
>   bus-cycles                                         [Hardware event]
>   stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
>   stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
>   ref-cycles                                         [Hardware event]
> 
>   cpu-clock                                          [Software event]
>   task-clock                                         [Software event]
>   page-faults OR faults                              [Software event]
>   context-switches OR cs                             [Software event]
>   cpu-migrations OR migrations                       [Software event]
>   minor-faults                                       [Software event]
>   major-faults                                       [Software event]
>   alignment-faults                                   [Software event]
>   emulation-faults                                   [Software event]
> 
>   L1-dcache-loads                                    [Hardware cache event]
>   L1-dcache-load-misses                              [Hardware cache event]
>   L1-dcache-stores                                   [Hardware cache event]
>   L1-dcache-store-misses                             [Hardware cache event]
>   L1-dcache-prefetches                               [Hardware cache event]
>   L1-dcache-prefetch-misses                          [Hardware cache event]
>   L1-icache-loads                                    [Hardware cache event]
>   L1-icache-load-misses                              [Hardware cache event]
>   L1-icache-prefetches                               [Hardware cache event]
>   L1-icache-prefetch-misses                          [Hardware cache event]
>   LLC-loads                                          [Hardware cache event]
>   LLC-load-misses                                    [Hardware cache event]
>   LLC-stores                                         [Hardware cache event]
>   LLC-store-misses                                   [Hardware cache event]
>   LLC-prefetches                                     [Hardware cache event]
>   LLC-prefetch-misses                                [Hardware cache event]
>   dTLB-loads                                         [Hardware cache event]
>   dTLB-load-misses                                   [Hardware cache event]
>   dTLB-stores                                        [Hardware cache event]
>   dTLB-store-misses                                  [Hardware cache event]
>   dTLB-prefetches                                    [Hardware cache event]

Is this the whole output of perf list? Then you seem to be missing
tracepoint support, e.g. CONFIG_TRACEPOINTS?

I can confirm it works for me on a current net build.

Greetings,

  Hannes

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 23:13         ` vxlan/veth performance issues on net.git + latest kernels Joseph Gasparakis
@ 2013-12-03 23:09           ` Or Gerlitz
  2013-12-04  0:35             ` Joseph Gasparakis
  0 siblings, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 23:09 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Wed, Dec 4, 2013 at 1:13 AM, Joseph Gasparakis
<joseph.gasparakis@intel.com> wrote:
>
>
> On Tue, 3 Dec 2013, Or Gerlitz wrote:
>
>> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis
>> <joseph.gasparakis@intel.com> wrote:
>>
>> >>> lack of GRO : receiver seems to not be able to receive as fast as you want.
>> >>>>      TCPOFOQueue: 3167879
>> >>> So many packets are received out of order (because of losses)
>>
>> I see that there's no GRO also for the non-veth tests which involve
>> vxlan, and there the receiving side is able to consume the
>> packets. Do you have a rough explanation why adding veth to the chain
>> is such a game changer that makes things start falling apart?
>>
>> > I have seen this before. Here are my findings:
>> >
>> > The gso_type is different if the skb comes from veth or not. From veth,
>> > you will see the SKB_GSO_DODGY set. This breaks things as when the
>> > skb with DODGY set moves from vxlan to the driver through dev_hard_start_xmit,
>> > the stack drops it silently. I never got the time to find the root cause
>> > for this, but I know it causes re-transmissions and big performance
>> > degradation.
>> >
>> > I went as far as just quickly hacking a one-liner unsetting the DODGY bit
>> > in vxlan.c, and that bypassed the issue and restored the performance,
>> > but obviously this is not a real fix.
>>
>> thanks for the heads up, a few quick questions/clarifications --
>>
>> -- you are talking about drops at the sender side, correct? Eric was
>> saying we have evidence that the drops happen on the receiver.
>
> I am *guessing* drops on the Rx are due to the drops at the Tx. See my
> answer to your next question for more info.
>
>>
>> -- without the hack you did, packets are still sent/received, so what
>> makes the stack drop only some of them?
>>
>
> What I had seen is GSOs getting dropped on the Tx side. Basically the GSOs
> never made it to the driver; they were broken into non-GSO smaller skbs by
> the stack. I think the stack is not handling GSO skbs with the DODGY bit
> set well, and that maybe causes the packet to be only partially emitted,
> causing the re-transmits (and maybe the drops on your Rx end)? Of course
> all this is speculation; the fact that I know is that as soon as I was
> forcing the gso_type I saw offloaded VXLAN encapsulated traffic at decent speeds.
>
>> -- why would packets coming from veth have the SKB_GSO_DODGY bit set?
>
> That is something I would love to know too. I am guessing this is a way
> for the VM to say it is a non-trusted packet? And maybe all this can be
> fixed by setting something on the VM through a userspace tool that
> will stop veth from setting the DODGY bit?
>
>>
>> -- so where is this one line you commented out now (say in net.git or
>> 3.12.x)? I don't see explicit setting of SKB_GSO_DODGY in vxlan.c or
>> in ip_tunnel_core.c / ip_tunnel.c
>
> I did not commit it, as this was just a workaround to prove to myself that
> the problem I was seeing was due to the gso_type; it would actually
> just hide the problem and not give a proper solution to it.
>
>>
>> Also, I am pretty sure the problem exists also when sending/receiving
>> guest traffic through tap/macvtap <--> vhost/virtio-net and friends; I
>> just stuck to the veth flavour because it's one (== the hypervisor)
>> network stack to debug and not two (+ the guest one).

Understood. Can you point to the line/area you hacked? I'd like to try it
too and see the impact.

>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:55               ` Eric Dumazet
  2013-12-03 22:15                 ` Or Gerlitz
  2013-12-03 22:22                 ` Or Gerlitz
@ 2013-12-03 23:10                 ` Or Gerlitz
  2013-12-03 23:30                   ` Or Gerlitz
  2013-12-03 23:59                   ` Eric Dumazet
  2 siblings, 2 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 23:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	netdev@vger.kernel.org

On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:
>
>> At this point I would use drop monitor to figure out in what context
>> packets are being dropped on the floor.  There are scripts provided
>> with the perf tool to utilize it.
>
> Most easy way is to do :
>
> perf record -e skb:kfree_skb -a -g sleep 10
> perf report

$ ./perf record -e skb:kfree_skb -g -a sleep 10
$ ./perf report -i perf.data


Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406
+  97.13%          swapper  [kernel.kallsyms]  [k] net_tx_action
+   1.53%            iperf  [kernel.kallsyms]  [k] net_tx_action
+   1.03%             perf  [kernel.kallsyms]  [k] net_tx_action
+   0.27%      ksoftirqd/7  [kernel.kallsyms]  [k] net_tx_action
+   0.03%      kworker/7:1  [kernel.kallsyms]  [k] net_tx_action
+   0.00%          rpcbind  [kernel.kallsyms]  [k] net_tx_action
+   0.00%          swapper  [kernel.kallsyms]  [k] kfree_skb
+   0.00%            sleep  [kernel.kallsyms]  [k] net_tx_action
+   0.00%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
+   0.00%            iperf  [kernel.kallsyms]  [k] kfree_skb
+   0.00%             perf  [kernel.kallsyms]  [k] kfree_skb



>
>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 21:09       ` Or Gerlitz
  2013-12-03 21:24         ` Eric Dumazet
@ 2013-12-03 23:13         ` Joseph Gasparakis
  2013-12-03 23:09           ` Or Gerlitz
  1 sibling, 1 reply; 63+ messages in thread
From: Joseph Gasparakis @ 2013-12-03 23:13 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Gasparakis, Joseph, Eric Dumazet, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller,
	netdev



On Tue, 3 Dec 2013, Or Gerlitz wrote:

> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
> 
> >>> lack of GRO : receiver seems to not be able to receive as fast as you want.
> >>>>      TCPOFOQueue: 3167879
> >>> So many packets are received out of order (because of losses)
> 
> >> I see that there's no GRO also for the non-veth tests which involve
> >> vxlan, and there the receiving side is able to consume the
> >> packets. Do you have a rough explanation why adding veth to the chain
> >> is such a game changer that makes things start falling apart?
> 
> > I have seen this before. Here are my findings:
> >
> > The gso_type is different if the skb comes from veth or not. From veth,
> > you will see the SKB_GSO_DODGY set. This breaks things as when the
> > skb with DODGY set moves from vxlan to the driver through dev_hard_start_xmit,
> > the stack drops it silently. I never got the time to find the root cause
> > for this, but I know it causes re-transmissions and big performance
> > degradation.
> >
> > I went as far as just quickly hacking a one-liner unsetting the DODGY bit
> > in vxlan.c, and that bypassed the issue and restored the performance,
> > but obviously this is not a real fix.
> 
> thanks for the heads up, a few quick questions/clarifications --
> 
> -- you are talking about drops at the sender side, correct? Eric was
> saying we have evidence that the drops happen on the receiver.

I am *guessing* drops on the Rx are due to the drops at the Tx. See my 
answer to your next question for more info.

> 
> -- without the hack you did, packets are still sent/received, so what
> makes the stack drop only some of them?
> 

What I had seen is GSOs getting dropped on the Tx side. Basically the GSOs
never made it to the driver; they were broken into non-GSO smaller skbs by
the stack. I think the stack is not handling GSO skbs with the DODGY bit
set well, and that maybe causes the packet to be only partially emitted,
causing the re-transmits (and maybe the drops on your Rx end)? Of course
all this is speculation; the fact that I know is that as soon as I was
forcing the gso_type I saw offloaded VXLAN encapsulated traffic at decent speeds.

> -- why would packets coming from veth have the SKB_GSO_DODGY bit set?

That is something I would love to know too. I am guessing this is a way
for the VM to say it is a non-trusted packet? And maybe all this can be
fixed by setting something on the VM through a userspace tool that
will stop veth from setting the DODGY bit?

> 
> -- so where is this one line you commented out now (say in net.git or
> 3.12.x)? I don't see explicit setting of SKB_GSO_DODGY in vxlan.c or
> in ip_tunnel_core.c / ip_tunnel.c

I did not commit it, as this was just a workaround to prove to myself that
the problem I was seeing was due to the gso_type; it would actually
just hide the problem and not give a proper solution to it.

> 
> Also, I am pretty sure the problem exists also when sending/receiving
> guest traffic through tap/macvtap <--> vhost/virtio-net and friends; I
> just stuck to the veth flavour because it's one (== the hypervisor)
> network stack to debug and not two (+ the guest one).

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 23:10                 ` Or Gerlitz
@ 2013-12-03 23:30                   ` Or Gerlitz
  2013-12-03 23:49                     ` Hannes Frederic Sowa
  2013-12-03 23:59                   ` Eric Dumazet
  1 sibling, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-03 23:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	netdev@vger.kernel.org

On Wed, Dec 4, 2013 at 1:10 AM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:
>>
>>> At this point I would use drop monitor to figure out in what context
>>> packets are being dropped on the floor.  There are scripts provided
>>> with the perf tool to utilize it.
>>
>> Most easy way is to do :
>>
>> perf record -e skb:kfree_skb -a -g sleep 10
>> perf report
>
> $ ./perf record -e skb:kfree_skb -g -a sleep 10
> $ ./perf report -i perf.data
>
>
> Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406
> +  97.13%          swapper  [kernel.kallsyms]  [k] net_tx_action
> +   1.53%            iperf  [kernel.kallsyms]  [k] net_tx_action
> +   1.03%             perf  [kernel.kallsyms]  [k] net_tx_action
> +   0.27%      ksoftirqd/7  [kernel.kallsyms]  [k] net_tx_action
> +   0.03%      kworker/7:1  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%          rpcbind  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%          swapper  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%            sleep  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%            iperf  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%             perf  [kernel.kallsyms]  [k] kfree_skb

I added proper sorting (thanks Rick), here's the passive side report

-  99.99%  [kernel.kallsyms]  [k] net_tx_action
   - net_tx_action
      - 59.20% __libc_recv
           100.00% 0
      - 35.61% __write_nocancel
         - 100.00% run_builtin
              main
              __libc_start_main
      - 2.60% __poll
         - 92.62% run_builtin
              main
              __libc_start_main
         - 7.38% 0x7f2d4c4adfd4
              0x626370523a32333a
      - 1.86% cmd_record
           run_builtin
           main
           __libc_start_main
-   0.01%  [kernel.kallsyms]  [k] kfree_skb
   - kfree_skb
      - 50.00% __connect_nocancel
           0x64697063612f6e75
      - 33.33% __libc_recv
           0
      - 16.67% __write_nocancel
           run_builtin
           main
           __libc_start_main

and the active side report


 100.00%  [kernel.kallsyms]  [k] net_tx_action
   - net_tx_action
      - 76.91% __write_nocancel
         - 100.00% run_builtin
              main
              __libc_start_main
        15.92% 0x37cc60e4ed
      - 2.69% __poll
           run_builtin
           main
           __libc_start_main
      - 1.92% cmd_record
           run_builtin
           main
           __libc_start_main
      - 1.66% pthread_cond_signal@@GLIBC_2.3.2
           0x10000
      - 0.52% perf_header__has_feat
         - 73.28% run_builtin
              main
              __libc_start_main
         - 26.72% cmd_record
              run_builtin
              main
              __libc_start_main
-   0.00%  [kernel.kallsyms]  [k] kfree_skb
   - kfree_skb
      - 100.00% __GI___connect_internal
         - 50.00% get_mapping
              __nscd_get_map_ref
           50.00% __nscd_open_socket



>
>
>
>>
>>
>>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 23:30                   ` Or Gerlitz
@ 2013-12-03 23:49                     ` Hannes Frederic Sowa
  0 siblings, 0 replies; 63+ messages in thread
From: Hannes Frederic Sowa @ 2013-12-03 23:49 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric Dumazet, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	netdev@vger.kernel.org

On Wed, Dec 04, 2013 at 01:30:09AM +0200, Or Gerlitz wrote:
> On Wed, Dec 4, 2013 at 1:10 AM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> > On Tue, Dec 3, 2013 at 11:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> On Tue, 2013-12-03 at 16:50 -0500, David Miller wrote:
> >>
> >>> At this point I would use drop monitor to figure out in what context
> >>> packets are being dropped on the floor.  There are scripts provided
> >>> with the perf tool to utilize it.
> >>
> >> Most easy way is to do :
> >>
> >> perf record -e skb:kfree_skb -a -g sleep 10
> >> perf report
> >
> > $ ./perf record -e skb:kfree_skb -g -a sleep 10
> > $ ./perf report -i perf.data
> >
> >
> > Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406
> > +  97.13%          swapper  [kernel.kallsyms]  [k] net_tx_action
> > +   1.53%            iperf  [kernel.kallsyms]  [k] net_tx_action
> > +   1.03%             perf  [kernel.kallsyms]  [k] net_tx_action
> > +   0.27%      ksoftirqd/7  [kernel.kallsyms]  [k] net_tx_action
> > +   0.03%      kworker/7:1  [kernel.kallsyms]  [k] net_tx_action
> > +   0.00%          rpcbind  [kernel.kallsyms]  [k] net_tx_action
> > +   0.00%          swapper  [kernel.kallsyms]  [k] kfree_skb
> > +   0.00%            sleep  [kernel.kallsyms]  [k] net_tx_action
> > +   0.00%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
> > +   0.00%            iperf  [kernel.kallsyms]  [k] kfree_skb
> > +   0.00%             perf  [kernel.kallsyms]  [k] kfree_skb
> 
> I added proper sorting (thanks Rick), here's the passive side report

Btw, the nice helper for drop monitoring is "perf script net_dropmonitor".

Greetings,

  Hannes

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 23:10                 ` Or Gerlitz
  2013-12-03 23:30                   ` Or Gerlitz
@ 2013-12-03 23:59                   ` Eric Dumazet
  2013-12-04  0:26                     ` Alexei Starovoitov
                                       ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-03 23:59 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Joseph Gasparakis, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Alexei Starovoitov, Pravin B Shelar,
	netdev@vger.kernel.org

On Wed, 2013-12-04 at 01:10 +0200, Or Gerlitz wrote:

> Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406
> +  97.13%          swapper  [kernel.kallsyms]  [k] net_tx_action
> +   1.53%            iperf  [kernel.kallsyms]  [k] net_tx_action
> +   1.03%             perf  [kernel.kallsyms]  [k] net_tx_action
> +   0.27%      ksoftirqd/7  [kernel.kallsyms]  [k] net_tx_action
> +   0.03%      kworker/7:1  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%          rpcbind  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%          swapper  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%            sleep  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%            iperf  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%             perf  [kernel.kallsyms]  [k] kfree_skb
> 

Right, I actually have a patch for that, but was waiting for net-next
to be re-opened:

commit 9a731d750dd8bf0b8c20fb1ca53c42317fb4dd37
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 25 13:09:20 2013 -0800

    net-fixes: introduce dev_consume_skb_any()
    
    Some NIC drivers use dev_kfree_skb_any() generic helper to free skbs,
    both for dropped packets and TX completed ones.
    
    To have "perf record -e skb:kfree_skb" give good hints on where
    packets are dropped (if any), we need to separate the two causes.
    
    dev_consume_skb_any() is a helper to free skbs that were properly sent
    to the wire.
    

Signed-off-by: Eric Dumazet <edumazet@google.com>
Google-Bug-Id: 11634401
---
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index ec96130533cc..8b8c2171b187 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -209,9 +209,9 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata,
 	if (likely(skb)) {
 		(*pkts_compl)++;
 		(*bytes_compl) += skb->len;
+		dev_consume_skb_any(skb);
 	}
 
-	dev_kfree_skb_any(skb);
 	tx_buf->first_bd = 0;
 	tx_buf->skb = NULL;
 
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 8d3945ab7334..03081932e519 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1058,7 +1058,7 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring,
 		buffer_info->dma = 0;
 	}
 	if (buffer_info->skb) {
-		dev_kfree_skb_any(buffer_info->skb);
+		dev_consume_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
 	buffer_info->time_stamp = 0;
diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index 43aa7acd84a6..294825efb248 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -2037,7 +2037,7 @@ static void sky2_tx_complete(struct sky2_port *sky2, u16 done)
 			bytes_compl += skb->len;
 
 			re->skb = NULL;
-			dev_kfree_skb_any(skb);
+			dev_consume_skb_any(skb);
 
 			sky2->tx_next = RING_NEXT(idx, sky2->tx_ring_size);
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index f54ebd5a1702..653484bfae98 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -317,7 +317,7 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			}
 		}
 	}
-	dev_kfree_skb_any(skb);
+	dev_consume_skb_any(skb);
 	return tx_info->nr_txbb;
 }
 
diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 2d045be4b5cf..d44f7b69a6a0 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -2557,7 +2557,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
 					u64_stats_update_end(&np->swstats_tx_syncp);
 				}
 				bytes_compl += np->get_tx_ctx->skb->len;
-				dev_kfree_skb_any(np->get_tx_ctx->skb);
+				dev_consume_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
 				tx_work++;
 			}
@@ -2574,7 +2574,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
 					u64_stats_update_end(&np->swstats_tx_syncp);
 				}
 				bytes_compl += np->get_tx_ctx->skb->len;
-				dev_kfree_skb_any(np->get_tx_ctx->skb);
+				dev_consume_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
 				tx_work++;
 			}
@@ -2625,7 +2625,7 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit)
 			}
 
 			bytes_cleaned += np->get_tx_ctx->skb->len;
-			dev_kfree_skb_any(np->get_tx_ctx->skb);
+			dev_consume_skb_any(np->get_tx_ctx->skb);
 			np->get_tx_ctx->skb = NULL;
 			tx_work++;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 916241d16c67..ab8970693ff3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -755,7 +755,7 @@ static void free_old_xmit_skbs(struct send_queue *sq)
 		stats->tx_packets++;
 		u64_stats_update_end(&stats->tx_syncp);
 
-		dev_kfree_skb_any(skb);
+		dev_consume_skb_any(skb);
 	}
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f0ed423a360..8a7482fa2656 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2380,6 +2380,13 @@ void dev_kfree_skb_irq(struct sk_buff *skb);
  */
 void dev_kfree_skb_any(struct sk_buff *skb);
 
+#define SKB_CONSUMED_MAGIC ((void *)0xDEAD0001)
+static inline void dev_consume_skb_any(struct sk_buff *skb)
+{
+	skb->dev = SKB_CONSUMED_MAGIC;
+	dev_kfree_skb_any(skb);
+}
+
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index ba3b7ea5ebb3..b2b0e5776ce9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3306,7 +3306,10 @@ static void net_tx_action(struct softirq_action *h)
 			clist = clist->next;
 
 			WARN_ON(atomic_read(&skb->users));
-			trace_kfree_skb(skb, net_tx_action);
+			if (likely(skb->dev == SKB_CONSUMED_MAGIC))
+				trace_consume_skb(skb);
+			else
+				trace_kfree_skb(skb, net_tx_action);
 			__kfree_skb(skb);
 		}
 	}

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 23:59                   ` Eric Dumazet
@ 2013-12-04  0:26                     ` Alexei Starovoitov
  2013-12-04  0:36                       ` Eric Dumazet
  2013-12-04  6:39                     ` David Miller
  2013-12-05 12:45                     ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet
  2 siblings, 1 reply; 63+ messages in thread
From: Alexei Starovoitov @ 2013-12-04  0:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Tue, Dec 3, 2013 at 3:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> +#define SKB_CONSUMED_MAGIC ((void *)0xDEAD0001)
> +static inline void dev_consume_skb_any(struct sk_buff *skb)
> +{
> +       skb->dev = SKB_CONSUMED_MAGIC;
> +       dev_kfree_skb_any(skb);
> +}
> +
>  int netif_rx(struct sk_buff *skb);
>  int netif_rx_ni(struct sk_buff *skb);
>  int netif_receive_skb(struct sk_buff *skb);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ba3b7ea5ebb3..b2b0e5776ce9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3306,7 +3306,10 @@ static void net_tx_action(struct softirq_action *h)
>                         clist = clist->next;
>
>                         WARN_ON(atomic_read(&skb->users));
> -                       trace_kfree_skb(skb, net_tx_action);
> +                       if (likely(skb->dev == SKB_CONSUMED_MAGIC))
> +                               trace_consume_skb(skb);
> +                       else
> +                               trace_kfree_skb(skb, net_tx_action);

Could you use some other way to mark the skb?
In tracing we might want to examine the skb more carefully, and not being
able to see the device will limit the usability of this tracepoint.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  0:35             ` Joseph Gasparakis
@ 2013-12-04  0:34               ` Alexei Starovoitov
  2013-12-04  1:29                 ` Joseph Gasparakis
  2013-12-04  0:44               ` Joseph Gasparakis
  2013-12-04  8:35               ` Or Gerlitz
  2 siblings, 1 reply; 63+ messages in thread
From: Alexei Starovoitov @ 2013-12-04  0:34 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Or Gerlitz, Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Pravin B Shelar, David Miller, netdev

On Tue, Dec 3, 2013 at 4:35 PM, Joseph Gasparakis
<joseph.gasparakis@intel.com> wrote:
>
> I was printing the gso_type in vxlan_xmit_skb(), right before
> iptunnel_xmit() gets called (I was focused on UDPv4 encap only). Then I saw the
> gso_type was different when a VM was involved and when it was not
> (although I was transmitting exactly the same packet), and then I replaced
> my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had
> for the non-VM skb> and it all worked.
>
> Then I looked into what was different between the two gso_types and the
> only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM.
> I am sure I could have been more delicate with the approach, but hey, it
> worked for me.

hmm. dodgy should be a normal path from a vm.
kvm is supposed to negotiate vnet_hdr for tap/macvtap, and the corresponding
driver will remap the virtio_net_gso* flags into skb_gso_* flags
plus gso_dodgy.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 23:09           ` Or Gerlitz
@ 2013-12-04  0:35             ` Joseph Gasparakis
  2013-12-04  0:34               ` Alexei Starovoitov
                                 ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Joseph Gasparakis @ 2013-12-04  0:35 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Joseph Gasparakis, Eric Dumazet, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller,
	netdev



On Tue, 3 Dec 2013, Or Gerlitz wrote:

> On Wed, Dec 4, 2013 at 1:13 AM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
> >
> >
> > On Tue, 3 Dec 2013, Or Gerlitz wrote:
> >
> >> On Tue, Dec 3, 2013 at 11:11 PM, Joseph Gasparakis
> >> <joseph.gasparakis@intel.com> wrote:
> >>
> >> >>> lack of GRO : receiver seems to not be able to receive as fast as you want.
> >> >>>>      TCPOFOQueue: 3167879
> >> >>> So many packets are received out of order (because of losses)
> >>
> >> I see that there's no GRO also for the non-veth tests which involve
> >> vxlan, and over there the receiving side is capable of consuming the
> >> packets. Do you have a rough explanation of why adding veth to the chain
> >> is such a game changer that things start falling apart?
> >>
> >> > I have seen this before. Here are my findings:
> >> >
> >> > The gso_type is different if the skb comes from veth or not. From veth,
> >> > you will see the SKB_GSO_DODGY bit set. This breaks things: when an
> >> > skb with DODGY set moves from vxlan to the driver through dev_hard_start_xmit(),
> >> > the stack drops it silently. I never got the time to find the root cause
> >> > for this, but I know it causes re-transmissions and big performance
> >> > degradation.
> >> >
> >> > I went as far as just quickly hacking a one-liner unsetting the DODGY bit
> >> > in vxlan.c, and that bypassed the issue and recovered the performance,
> >> > but obviously this is not a real fix.
> >>
> >> thanks for the heads up, a few quick questions/clarifications --
> >>
> >> -- you are talking about drops at the sender side, correct? Eric was
> >> saying we have evidence that the drops happen on the receiver.
> >
> > I am *guessing* drops on the Rx are due to the drops at the Tx. See my
> > answer to your next question for more info.
> >
> >>
> >> -- without your hack, packets are still sent/received, so what
> >> makes the stack drop only some of them?
> >>
> >
> > What I had seen is GSOs getting dropped on the Tx side. Basically the GSOs
> > never made it to the driver; they were broken into smaller non-GSO skbs by
> > the stack. I think the stack is not handling GSO skbs with the DODGY
> > bit set well, and that maybe causes the packet to be only partially emitted,
> > causing the re-transmits (and maybe the drops on your Rx end)? Of course
> > all this is speculation; the one fact I know is that as soon as I
> > forced the gso_type, I saw offloaded VXLAN encapsulated traffic at decent speeds.
> >
> >> -- why packets coming from veth would have the SKB_GSO_DODGY bit set?
> >
> > That is something I would love to know too. I am guessing this is a way
> > for the VM to say it is a non-trusted packet? And maybe all this can be
> > fixed by setting something on the VM through a userspace tool that
> > will stop veth from setting the DODGY bit?
> >
> >>
> >> -- so where is this one line you commented out now (say in net.git or
> >> 3.12.x)? I don't see explicit setting of SKB_GSO_DODGY in vxlan.c or in
> >> ip_tunnel_core.c / ip_tunnel.c
> >
> > I did not commit it, as this was just a workaround to prove to myself that
> > the problem I was seeing was due to the gso_type, and it would actually
> > just hide the problem rather than give a proper solution.
> >
> >>
> >> Also, I am pretty sure the problem also exists when sending/receiving
> >> guest traffic through tap/macvtap <--> vhost/virtio-net and friends; I
> >> just stuck to the veth flavour b/c it's one network stack (== the
> >> hypervisor's) to debug and not two (+ the guest one).
> 
> understood, can you point to the line/area you hacked? I'd like to try it
> too and see the impact.

I was printing the gso_type in vxlan_xmit_skb(), right before
iptunnel_xmit() gets called (I was focused on UDPv4 encap only). Then I saw
the gso_type was different when a VM was involved and when it was not
(although I was transmitting exactly the same packet), and then I replaced
my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had
for the non-VM skb> and it all worked.

Then I looked into what was different between the two gso_types and the 
only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM.
I am sure I could have been more delicate with the approach, but hey, it
worked for me.

I would be curious to see if this is the same issue as mine. It seems like 
it is.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  0:26                     ` Alexei Starovoitov
@ 2013-12-04  0:36                       ` Eric Dumazet
  2013-12-04  0:55                         ` Alexei Starovoitov
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2013-12-04  0:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote:

> 
> Could you use some other way to mark skb ?

I could ;)

> In tracing we might want to examine skb more carefully and not being
> able to see the device
> will limit the usability of this tracepoint.

Unfortunately, using skb->dev as a pointer to the device would be buggy or
expensive (you would need to take a reference on the device in order not
to let it disappear, as we escape RCU protection).

Current TRACE_EVENT for trace_consume_skb() does not use skb->dev.

Anyway, this magic is pretty easy to change, I am open to suggestions.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  0:35             ` Joseph Gasparakis
  2013-12-04  0:34               ` Alexei Starovoitov
@ 2013-12-04  0:44               ` Joseph Gasparakis
  2013-12-04  8:35               ` Or Gerlitz
  2 siblings, 0 replies; 63+ messages in thread
From: Joseph Gasparakis @ 2013-12-04  0:44 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Or Gerlitz, Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev



On Tue, 3 Dec 2013, Joseph Gasparakis wrote:

> 
> 
> > [snip]
>
> Then I looked into what was different between the two gso_types and the
> only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM.
> I am sure I could have been more delicate with the approach, but hey, it
> worked for me.
>
> I would be curious to see if this is the same issue as mine. It seems like
> it is.
>

Oh, and if I remember correctly, gso_type without VMs involved was 129 
(SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) and with VM it was 133 
(SKB_GSO_UDP_TUNNEL | SKB_GSO_DODGY | SKB_GSO_TCPV4).

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  0:36                       ` Eric Dumazet
@ 2013-12-04  0:55                         ` Alexei Starovoitov
  2013-12-04  1:23                           ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Alexei Starovoitov @ 2013-12-04  0:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Tue, Dec 3, 2013 at 4:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote:
>
>>
>> Could you use some other way to mark skb ?
>
> I could ;)
>
>> In tracing we might want to examine skb more carefully and not being
>> able to see the device
>> will limit the usability of this tracepoint.
>
> Unfortunately, using skb->dev as a pointer to the device would be buggy or
> expensive (you would need to take a reference on the device in order not
> to let it disappear, as we escape RCU protection)

well, yes, you might have an skb around when the device is already freed
(e.g. with skb_dst_noref), but I'm not suggesting anything expensive.
Tracing definitely should not add overhead by doing rcu_lock() or
dev_hold(). Instead it can walk skb, skb->dev, skb->dev->xxx via
probe_kernel_read(). If the device is gone, it's still safe.

> Anyway, this magic is pretty easy to change, I am open to suggestions.

you're the expert :) use the skb->mark field, since it's unused during
the freeing path... ?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  1:29                 ` Joseph Gasparakis
@ 2013-12-04  1:18                   ` Eric Dumazet
  0 siblings, 0 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-04  1:18 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Alexei Starovoitov, Or Gerlitz, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Pravin B Shelar, David Miller, netdev

On Tue, 2013-12-03 at 17:29 -0800, Joseph Gasparakis wrote:
> 

> Then would the fix be as simple as just unsetting the bit in vxlan?
> Because I am guessing that there is a bug somewhere in how the combination
> of GSO_UDP_TUNNEL and GSO_DODGY is handled. That would solve my problem,
> hopefully Or's too if it is the same.

Not really ;)

We need to handle this bit properly, not pretending it is of no use ;)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  0:55                         ` Alexei Starovoitov
@ 2013-12-04  1:23                           ` Eric Dumazet
  2013-12-04  1:59                             ` Alexei Starovoitov
  2013-12-06  9:06                             ` Or Gerlitz
  0 siblings, 2 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-04  1:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Tue, 2013-12-03 at 16:55 -0800, Alexei Starovoitov wrote:
> On Tue, Dec 3, 2013 at 4:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote:
> >
> >>
> >> Could you use some other way to mark skb ?
> >
> > I could ;)
> >
> >> In tracing we might want to examine skb more carefully and not being
> >> able to see the device
> >> will limit the usability of this tracepoint.
> >
> > Unfortunately, using skb->dev as a pointer to device would be buggy or
> > expensive (you would need to take a reference on device in order not
> > letting it disappear, as we escape RCU protection)
> 
> well, yes, you might have an skb around when device is already freed
> when skb_dst_noref.
> but I'm not suggesting anything expensive. Tracing definitely should
> not add overhead by doing rcu_lock() or dev_hold(). Instead it can go
> through skb, skb->dev, skb->dev->xxx via probe_kernel_read(). If dev
> is gone, it's still safe.

It's certainly not safe to 'probe'.

It's not about faulting on nonexistent memory; that is the least of the
problems.

Any kind of information fetched from skb->dev might have been
overwritten.

You could for example fetch security-sensitive data and expose it.


> 
> > Anyway, this magic is pretty easy to change, I am open to suggestions.
> 
> you're the expert :) use skb->mark field, since it's unused during
> freeing path... ?

cache line miss ;)

skb->dev is in the first cache line, where we access skb->next anyway.

I could use skb->cb[] like the following patch:

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index ec96130533cc..8b8c2171b187 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -209,9 +209,9 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata,
 	if (likely(skb)) {
 		(*pkts_compl)++;
 		(*bytes_compl) += skb->len;
+		dev_consume_skb_any(skb);
 	}
 
-	dev_kfree_skb_any(skb);
 	tx_buf->first_bd = 0;
 	tx_buf->skb = NULL;
 
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 8d3945ab7334..03081932e519 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1058,7 +1058,7 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring,
 		buffer_info->dma = 0;
 	}
 	if (buffer_info->skb) {
-		dev_kfree_skb_any(buffer_info->skb);
+		dev_consume_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
 	buffer_info->time_stamp = 0;
diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index 43aa7acd84a6..294825efb248 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -2037,7 +2037,7 @@ static void sky2_tx_complete(struct sky2_port *sky2, u16 done)
 			bytes_compl += skb->len;
 
 			re->skb = NULL;
-			dev_kfree_skb_any(skb);
+			dev_consume_skb_any(skb);
 
 			sky2->tx_next = RING_NEXT(idx, sky2->tx_ring_size);
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index f54ebd5a1702..653484bfae98 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -317,7 +317,7 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			}
 		}
 	}
-	dev_kfree_skb_any(skb);
+	dev_consume_skb_any(skb);
 	return tx_info->nr_txbb;
 }
 
diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 2d045be4b5cf..d44f7b69a6a0 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -2557,7 +2557,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
 					u64_stats_update_end(&np->swstats_tx_syncp);
 				}
 				bytes_compl += np->get_tx_ctx->skb->len;
-				dev_kfree_skb_any(np->get_tx_ctx->skb);
+				dev_consume_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
 				tx_work++;
 			}
@@ -2574,7 +2574,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
 					u64_stats_update_end(&np->swstats_tx_syncp);
 				}
 				bytes_compl += np->get_tx_ctx->skb->len;
-				dev_kfree_skb_any(np->get_tx_ctx->skb);
+				dev_consume_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
 				tx_work++;
 			}
@@ -2625,7 +2625,7 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit)
 			}
 
 			bytes_cleaned += np->get_tx_ctx->skb->len;
-			dev_kfree_skb_any(np->get_tx_ctx->skb);
+			dev_consume_skb_any(np->get_tx_ctx->skb);
 			np->get_tx_ctx->skb = NULL;
 			tx_work++;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 916241d16c67..ab8970693ff3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -755,7 +755,7 @@ static void free_old_xmit_skbs(struct send_queue *sq)
 		stats->tx_packets++;
 		u64_stats_update_end(&stats->tx_syncp);
 
-		dev_kfree_skb_any(skb);
+		dev_consume_skb_any(skb);
 	}
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f0ed423a360..8b80a58ec1ac 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2374,11 +2374,38 @@ int netif_get_num_default_rss_queues(void);
  */
 void dev_kfree_skb_irq(struct sk_buff *skb);
 
+void __dev_kfree_skb_any(struct sk_buff *skb);
+
+struct __dev_kfree_skb_cb {
+	unsigned int reason;
+};
+
+static inline struct __dev_kfree_skb_cb *get_kfree_skb_cb(const struct sk_buff *skb)
+{
+	return (struct __dev_kfree_skb_cb *)skb->cb;
+}
+
+enum {
+	SKB_REASON_CONSUMED,
+	SKB_REASON_DROPPED,
+};
+
 /* Use this variant in places where it could be invoked
  * from either hardware interrupt or other context, with hardware interrupts
  * either disabled or enabled.
+ * Note that TX completion should use dev_consume_skb_any()
  */
-void dev_kfree_skb_any(struct sk_buff *skb);
+static inline void dev_kfree_skb_any(struct sk_buff *skb)
+{
+	get_kfree_skb_cb(skb)->reason = SKB_REASON_DROPPED;
+	__dev_kfree_skb_any(skb);
+}
+
+static inline void dev_consume_skb_any(struct sk_buff *skb)
+{
+	get_kfree_skb_cb(skb)->reason = SKB_REASON_CONSUMED;
+	__dev_kfree_skb_any(skb);
+}
 
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index ba3b7ea5ebb3..3170776e53da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2161,14 +2161,14 @@ void dev_kfree_skb_irq(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(dev_kfree_skb_irq);
 
-void dev_kfree_skb_any(struct sk_buff *skb)
+void __dev_kfree_skb_any(struct sk_buff *skb)
 {
 	if (in_irq() || irqs_disabled())
 		dev_kfree_skb_irq(skb);
 	else
 		dev_kfree_skb(skb);
 }
-EXPORT_SYMBOL(dev_kfree_skb_any);
+EXPORT_SYMBOL(__dev_kfree_skb_any);
 
 
 /**
@@ -3306,7 +3306,10 @@ static void net_tx_action(struct softirq_action *h)
 			clist = clist->next;
 
 			WARN_ON(atomic_read(&skb->users));
-			trace_kfree_skb(skb, net_tx_action);
+			if (likely(get_kfree_skb_cb(skb)->reason == SKB_REASON_CONSUMED))
+				trace_consume_skb(skb);
+			else
+				trace_kfree_skb(skb, net_tx_action);
 			__kfree_skb(skb);
 		}
 	}

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  0:34               ` Alexei Starovoitov
@ 2013-12-04  1:29                 ` Joseph Gasparakis
  2013-12-04  1:18                   ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Joseph Gasparakis @ 2013-12-04  1:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joseph Gasparakis, Or Gerlitz, Eric Dumazet, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, David Miller, netdev



On Tue, 3 Dec 2013, Alexei Starovoitov wrote:

> On Tue, Dec 3, 2013 at 4:35 PM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
> >
> > I was printing the gso_type in vxlan_xmit_skb(), right before
> > iptunnel_xmit() gets called (I was focus UDPv4 encap only). Then I saw the
> > gso_type was different when a VM was involved and when it was not
> > (although I was transmitting exactly the same packet), and then I replaced
> > my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had
> > for non-VM skb> and it all worked.
> >
> > Then I looked into what was different between the two gso_types and the
> > only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM.
> > I am sure I could have been more delicate with the aproach, but hey, it
> > worked for me.
> 
> hmm. dodgy should be a normal path from a vm.
> kvm is supposed to negotiate vnet_hdr for tap/macvtap, and the corresponding
> driver will remap the virtio_net_gso* flags into skb_gso_* flags
> plus gso_dodgy.
> 

Then would the fix be as simple as just unsetting the bit in vxlan? 
I am guessing there is a bug somewhere in the handling of the combination of 
GSO_UDP_TUNNEL and GSO_DODGY. That would solve my problem, and hopefully 
Or's too if it is the same.
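For reference, the workaround described above amounts to masking SKB_GSO_DODGY off skb_shinfo(skb)->gso_type before iptunnel_xmit() runs. A minimal userspace sketch, using the gso_type bit values from the net tree of that period (the enum values here are copied by hand and should be checked against include/linux/skbuff.h):

```c
#include <assert.h>

/* gso_type bit values as in the net tree discussed in this thread
 * (post commit 61c1db7fae, which moved SKB_GSO_UDP_TUNNEL to bit 9). */
enum {
	SKB_GSO_TCPV4      = 1 << 0,
	SKB_GSO_DODGY      = 1 << 2,
	SKB_GSO_UDP_TUNNEL = 1 << 9,
};

/* The hack under discussion: drop the DODGY bit from an skb's gso_type
 * before the encapsulated packet is handed to iptunnel_xmit(). */
static unsigned int vxlan_clear_dodgy(unsigned int gso_type)
{
	return gso_type & ~SKB_GSO_DODGY;
}
```

With a VM-originated skb carrying (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4 | SKB_GSO_DODGY), this yields the 0x201 value seen for non-VM traffic.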

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  1:23                           ` Eric Dumazet
@ 2013-12-04  1:59                             ` Alexei Starovoitov
  2013-12-06  9:06                             ` Or Gerlitz
  1 sibling, 0 replies; 63+ messages in thread
From: Alexei Starovoitov @ 2013-12-04  1:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Tue, Dec 3, 2013 at 5:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 16:55 -0800, Alexei Starovoitov wrote:
>> On Tue, Dec 3, 2013 at 4:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote:
>> >
>> >>
>> >> Could you use some other way to mark skb ?
>> >
>> > I could ;)
>> >
>> >> In tracing we might want to examine skb more carefully and not being
>> >> able to see the device
>> >> will limit the usability of this tracepoint.
>> >
>> > Unfortunately, using skb->dev as a pointer to device would be buggy or
> > expensive (you would need to take a reference on the device in order
> > not to let it disappear, as we escape RCU protection)
>>
>> well, yes, you might have an skb around when device is already freed
>> when skb_dst_noref.
>> but I'm not suggesting anything expensive. Tracing definitely should
>> not add overhead by doing rcu_lock() or dev_hold(). Instead it can go
>> through skb, skb->dev, skb->dev->xxx via probe_kernel_read(). If dev
>> is gone, it's still safe.
>
> It's certainly not safe to 'probe'.
>
> It's not about faulting on nonexistent memory, that is the least of the
> problems.
>
> Any kind of information fetched from skb->dev might have been
> overwritten.
>
> You could for example fetch security sensitive data and expose it.

Of course.
Even without walking pointer chains with probe() you could infer all
sorts of info from tracepoints.
That's why tracing filters are for root only.

>> > Anyway, this magic is pretty easy to change, I am open to suggestions.
>>
>> you're the expert :) use skb->mark field, since it's unused during
>> freeing path... ?
>
> cache line miss ;)
>
> skb->dev is in the first cache line, where we access skb->next anyway.
>
> I could use skb->cb[] like the following patch :
>
> +struct __dev_kfree_skb_cb {
> +       unsigned int reason;
> +};
> +
> +static inline struct __dev_kfree_skb_cb *get_kfree_skb_cb(const struct sk_buff *skb)
> +{
> +       return (struct __dev_kfree_skb_cb *)skb->cb;
> +}
> +
> +enum {
> +       SKB_REASON_CONSUMED,
> +       SKB_REASON_DROPPED,
> +};
> +
>  /* Use this variant in places where it could be invoked
>   * from either hardware interrupt or other context, with hardware interrupts
>   * either disabled or enabled.
> + * Note that TX completion should use dev_consume_skb_any()
>   */
> -void dev_kfree_skb_any(struct sk_buff *skb);
> +static inline void dev_kfree_skb_any(struct sk_buff *skb)
> +{
> +       get_kfree_skb_cb(skb)->reason = SKB_REASON_DROPPED;
> +       __dev_kfree_skb_any(skb);
> +}
> +
> +static inline void dev_consume_skb_any(struct sk_buff *skb)
> +{
> +       get_kfree_skb_cb(skb)->reason = SKB_REASON_CONSUMED;
> +       __dev_kfree_skb_any(skb);
> +}

thanks. I think that is much cleaner. Ack.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-03 23:59                   ` Eric Dumazet
  2013-12-04  0:26                     ` Alexei Starovoitov
@ 2013-12-04  6:39                     ` David Miller
  2013-12-04 17:40                       ` Eric Dumazet
  2013-12-05 12:45                     ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet
  2 siblings, 1 reply; 63+ messages in thread
From: David Miller @ 2013-12-04  6:39 UTC (permalink / raw)
  To: eric.dumazet
  Cc: or.gerlitz, joseph.gasparakis, hkchu, ogerlitz, edumazet, ast,
	pshelar, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Dec 2013 15:59:59 -0800

> On Wed, 2013-12-04 at 01:10 +0200, Or Gerlitz wrote:
> 
>> Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406
>> +  97.13%          swapper  [kernel.kallsyms]  [k] net_tx_action
>> +   1.53%            iperf  [kernel.kallsyms]  [k] net_tx_action
>> +   1.03%             perf  [kernel.kallsyms]  [k] net_tx_action
>> +   0.27%      ksoftirqd/7  [kernel.kallsyms]  [k] net_tx_action
>> +   0.03%      kworker/7:1  [kernel.kallsyms]  [k] net_tx_action
>> +   0.00%          rpcbind  [kernel.kallsyms]  [k] net_tx_action
>> +   0.00%          swapper  [kernel.kallsyms]  [k] kfree_skb
>> +   0.00%            sleep  [kernel.kallsyms]  [k] net_tx_action
>> +   0.00%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
>> +   0.00%            iperf  [kernel.kallsyms]  [k] kfree_skb
>> +   0.00%             perf  [kernel.kallsyms]  [k] kfree_skb
>> 
> 
> Right, I actually have a patch for that, but was waiting for net-next
> being re-opened :
> 
> commit 9a731d750dd8bf0b8c20fb1ca53c42317fb4dd37
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Mon Nov 25 13:09:20 2013 -0800
> 
>     net-fixes: introduce dev_consume_skb_any()

I definitely prefer the control block approach to this.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  0:35             ` Joseph Gasparakis
  2013-12-04  0:34               ` Alexei Starovoitov
  2013-12-04  0:44               ` Joseph Gasparakis
@ 2013-12-04  8:35               ` Or Gerlitz
  2013-12-04  9:24                 ` Joseph Gasparakis
  2 siblings, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-04  8:35 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Wed, Dec 4, 2013 at 2:35 AM, Joseph Gasparakis
<joseph.gasparakis@intel.com> wrote:

> I was printing the gso_type in vxlan_xmit_skb(), right before
> iptunnel_xmit() gets called (I was focusing on UDPv4 encap only). Then I saw the
> gso_type was different when a VM was involved and when it was not
> (although I was transmitting exactly the same packet), and then I replaced
> my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had
> for non-VM skb> and it all worked.
>
> Then I looked into what was different between the two gso_types and the
> only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM.
> I am sure I could have been more delicate with the approach, but hey, it
> worked for me.
>
> I would be curious to see if this is the same issue as mine. It seems like it is.

nope! With the latest net tree, after handle_offloads is called in
vxlan_xmit_skb and before iptunnel_xmit is invoked,
skb_shinfo(skb)->gso_type is either 0 or 0x201, which is
(SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4), and it's the same value whether
the session runs over a veth device or directly over the bridge, where
over veth with > 1 stream we see drops, bad perf, etc.

I am very interested in the VM case too, so will check it out and let you know.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  8:35               ` Or Gerlitz
@ 2013-12-04  9:24                 ` Joseph Gasparakis
  2013-12-04  9:41                   ` Or Gerlitz
  0 siblings, 1 reply; 63+ messages in thread
From: Joseph Gasparakis @ 2013-12-04  9:24 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Joseph Gasparakis, Eric Dumazet, Jerry Chu, Or Gerlitz,
	Eric Dumazet, Alexei Starovoitov, Pravin B Shelar, David Miller,
	netdev



On Wed, 4 Dec 2013, Or Gerlitz wrote:

> On Wed, Dec 4, 2013 at 2:35 AM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
> 
> > I was printing the gso_type in vxlan_xmit_skb(), right before
> > iptunnel_xmit() gets called (I was focusing on UDPv4 encap only). Then I saw the
> > gso_type was different when a VM was involved and when it was not
> > (although I was transmitting exactly the same packet), and then I replaced
> > my printk with something like skb_shinfo(skb)->gso_type = <the gso type I had
> > for non-VM skb> and it all worked.
> >
> > Then I looked into what was different between the two gso_types and the
> > only difference was that SKB_GSO_DODGY was set when Tx'ing from the VM.
> > I am sure I could have been more delicate with the approach, but hey, it
> > worked for me.
> >
> > I would be curious to see if this is the same issue as mine. It seems like it is.
> 
> nope! with the latest net tree, after handle_offloads is called in
> vxlan_xmit_skb and before iptunnel_xmit is invoked,
> skb_shinfo(skb)->gso_type is either 0 or 0x201 which is
> (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) and its the same value whether
> the session runs over veth device or directly over the bridge, where
> over veth and > 1 stream we see drops, bad perf, etc.
> 
> I am very interested in the VM case too, so will check it out and let you know.
> 

Ok, I was really hoping that would be the same... And just for the record, 
you are seeing (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was 
seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO 
support" pushed SKB_GSO_UDP_TUNNEL two bits left, and I had done my tests 
before it.
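The 0x81 vs 0x201 discrepancy follows directly from the bit positions: commit 61c1db7fae inserted SKB_GSO_IPIP and SKB_GSO_SIT below SKB_GSO_UDP_TUNNEL, moving it from bit 7 to bit 9. A small sketch of the arithmetic (the _OLD/_NEW names are ours, introduced only for the comparison):

```c
#include <assert.h>

/* SKB_GSO_UDP_TUNNEL before and after commit 61c1db7fae
 * ("ipv6: sit: add GSO/TSO support"). */
#define SKB_GSO_TCPV4          (1 << 0)
#define SKB_GSO_UDP_TUNNEL_OLD (1 << 7) /* pre-commit: gives 0x81  */
#define SKB_GSO_UDP_TUNNEL_NEW (1 << 9) /* post-commit: gives 0x201 */
```

So both testers were looking at the same (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) combination, just on kernels on either side of that commit.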

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  9:24                 ` Joseph Gasparakis
@ 2013-12-04  9:41                   ` Or Gerlitz
  2013-12-04 15:20                     ` Or Gerlitz
       [not found]                     ` <52A197DF.5010806@mellanox.com>
  0 siblings, 2 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-04  9:41 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Wed, Dec 4, 2013 at 11:24 AM, Joseph Gasparakis
<joseph.gasparakis@intel.com> wrote:
> On Wed, 4 Dec 2013, Or Gerlitz wrote:

>> nope! with the latest net tree, after handle_offloads is called in
>> vxlan_xmit_skb and before iptunnel_xmit is invoked,
>> skb_shinfo(skb)->gso_type is either 0 or 0x201 which is
>> (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) and its the same value whether
>> the session runs over veth device or directly over the bridge, where
>> over veth and > 1 stream we see drops, bad perf, etc.
>> I am very interested in the VM case too, so will check it out and let you know.

> Ok, I was really hoping that would be the same...

So when running traffic from a VM I do see the SKB_GSO_DODGY bit being set!
My environment for running VMs with sane performance was screwed up a
bit; I will bring it up later today and see if unsetting the bit helps.


> And just for the record,
> you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was
> seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO
> support" pushed the SKB_UDP_TUNNEL two bits left, and I had done my tests
> before it.

Indeed. Also, on what kernel did you conduct the tests in which you
managed to work around the problem by unsetting that bit?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  9:41                   ` Or Gerlitz
@ 2013-12-04 15:20                     ` Or Gerlitz
       [not found]                     ` <52A197DF.5010806@mellanox.com>
  1 sibling, 0 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-04 15:20 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Eric Dumazet, Jerry Chu, Or Gerlitz, Eric Dumazet,
	Alexei Starovoitov, Pravin B Shelar, David Miller, netdev

On Wed, Dec 4, 2013 at 11:41 AM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> On Wed, Dec 4, 2013 at 11:24 AM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
>> On Wed, 4 Dec 2013, Or Gerlitz wrote:
>
>>> nope! with the latest net tree, after handle_offloads is called in
>>> vxlan_xmit_skb and before iptunnel_xmit is invoked,
>>> skb_shinfo(skb)->gso_type is either 0 or 0x201 which is
>>> (SKB_GSO_UDP_TUNNEL | SKB_GSO_TCPV4) and its the same value whether
>>> the session runs over veth device or directly over the bridge, where
>>> over veth and > 1 stream we see drops, bad perf, etc.
>>> I am very interested in the VM case too, so will check it out and let you know.
>
>> Ok, I was really hoping that would be the same...
>
> So when running traffic from VM I do see SKB_GSO_DODGY bit being set!
> my environment for running VMs with sane performance was screwed up a bit, I will bring it
> up later today and see if unsetting the bit helps.

So if the DODGY bit is left set, it simply doesn't work! It seems that
guest GSO packets are just dropped.

As for the performance when the bit is unset in the vxlan driver as
you suggested, I need to tune the VMs a bit more and will report back
later or tomorrow.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  6:39                     ` David Miller
@ 2013-12-04 17:40                       ` Eric Dumazet
  0 siblings, 0 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-04 17:40 UTC (permalink / raw)
  To: David Miller
  Cc: or.gerlitz, joseph.gasparakis, hkchu, ogerlitz, edumazet, ast,
	pshelar, netdev

On Wed, 2013-12-04 at 01:39 -0500, David Miller wrote:

> I definitely prefer the control block approach to this.

I polished the patch to keep this knowledge in net/core/dev.c

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH net-next] net: introduce dev_consume_skb_any()
  2013-12-03 23:59                   ` Eric Dumazet
  2013-12-04  0:26                     ` Alexei Starovoitov
  2013-12-04  6:39                     ` David Miller
@ 2013-12-05 12:45                     ` Eric Dumazet
  2013-12-05 14:13                       ` Hannes Frederic Sowa
  2013-12-06 20:24                       ` David Miller
  2 siblings, 2 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-05 12:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq()
helpers to free skbs, both for dropped packets and TX completed ones.

We need to separate the two causes to get better diagnostics
given by dropwatch or "perf record -e skb:kfree_skb"

This patch provides two new helpers, dev_consume_skb_any() and
dev_consume_skb_irq() to be used for consumed skbs.

__dev_kfree_skb_irq() is slightly optimized to remove one
atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |   53 +++++++++++++++++++++++++++++-------
 net/core/dev.c            |   45 ++++++++++++++++++++----------
 2 files changed, 74 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f0ed423a360..c6d64d20050c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2368,17 +2368,52 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
 int netif_get_num_default_rss_queues(void);
 
-/* Use this variant when it is known for sure that it
- * is executing from hardware interrupt context or with hardware interrupts
- * disabled.
- */
-void dev_kfree_skb_irq(struct sk_buff *skb);
+enum skb_free_reason {
+	SKB_REASON_CONSUMED,
+	SKB_REASON_DROPPED,
+};
+
+void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason);
+void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason reason);
 
-/* Use this variant in places where it could be invoked
- * from either hardware interrupt or other context, with hardware interrupts
- * either disabled or enabled.
+/*
+ * It is not allowed to call kfree_skb() or consume_skb() from hardware
+ * interrupt context or with hardware interrupts being disabled.
+ * (in_irq() || irqs_disabled())
+ *
+ * We provide four helpers that can be used in following contexts :
+ *
+ * dev_kfree_skb_irq(skb) when caller drops a packet from irq context,
+ *  replacing kfree_skb(skb)
+ *
+ * dev_consume_skb_irq(skb) when caller consumes a packet from irq context.
+ *  Typically used in place of consume_skb(skb) in TX completion path
+ *
+ * dev_kfree_skb_any(skb) when caller doesn't know its current irq context,
+ *  replacing kfree_skb(skb)
+ *
+ * dev_consume_skb_any(skb) when caller doesn't know its current irq context,
+ *  and consumed a packet. Used in place of consume_skb(skb)
  */
-void dev_kfree_skb_any(struct sk_buff *skb);
+static inline void dev_kfree_skb_irq(struct sk_buff *skb)
+{
+	__dev_kfree_skb_irq(skb, SKB_REASON_DROPPED);
+}
+
+static inline void dev_consume_skb_irq(struct sk_buff *skb)
+{
+	__dev_kfree_skb_irq(skb, SKB_REASON_CONSUMED);
+}
+
+static inline void dev_kfree_skb_any(struct sk_buff *skb)
+{
+	__dev_kfree_skb_any(skb, SKB_REASON_DROPPED);
+}
+
+static inline void dev_consume_skb_any(struct sk_buff *skb)
+{
+	__dev_kfree_skb_any(skb, SKB_REASON_CONSUMED);
+}
 
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index ba3b7ea5ebb3..aa54a742f392 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2145,30 +2145,42 @@ void __netif_schedule(struct Qdisc *q)
 }
 EXPORT_SYMBOL(__netif_schedule);
 
-void dev_kfree_skb_irq(struct sk_buff *skb)
+struct dev_kfree_skb_cb {
+	enum skb_free_reason reason;
+};
+
+static struct dev_kfree_skb_cb *get_kfree_skb_cb(const struct sk_buff *skb)
+{
+	return (struct dev_kfree_skb_cb *)skb->cb;
+}
+
+void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason)
 {
-	if (atomic_dec_and_test(&skb->users)) {
-		struct softnet_data *sd;
-		unsigned long flags;
+	unsigned long flags;
 
-		local_irq_save(flags);
-		sd = &__get_cpu_var(softnet_data);
-		skb->next = sd->completion_queue;
-		sd->completion_queue = skb;
-		raise_softirq_irqoff(NET_TX_SOFTIRQ);
-		local_irq_restore(flags);
+	if (likely(atomic_read(&skb->users) == 1)) {
+		smp_rmb();
+		atomic_set(&skb->users, 0);
+	} else if (likely(!atomic_dec_and_test(&skb->users))) {
+		return;
 	}
+	get_kfree_skb_cb(skb)->reason = reason;
+	local_irq_save(flags);
+	skb->next = __this_cpu_read(softnet_data.completion_queue);
+	__this_cpu_write(softnet_data.completion_queue, skb);
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
 }
-EXPORT_SYMBOL(dev_kfree_skb_irq);
+EXPORT_SYMBOL(__dev_kfree_skb_irq);
 
-void dev_kfree_skb_any(struct sk_buff *skb)
+void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason reason)
 {
 	if (in_irq() || irqs_disabled())
-		dev_kfree_skb_irq(skb);
+		__dev_kfree_skb_irq(skb, reason);
 	else
 		dev_kfree_skb(skb);
 }
-EXPORT_SYMBOL(dev_kfree_skb_any);
+EXPORT_SYMBOL(__dev_kfree_skb_any);
 
 
 /**
@@ -3306,7 +3318,10 @@ static void net_tx_action(struct softirq_action *h)
 			clist = clist->next;
 
 			WARN_ON(atomic_read(&skb->users));
-			trace_kfree_skb(skb, net_tx_action);
+			if (likely(get_kfree_skb_cb(skb)->reason == SKB_REASON_CONSUMED))
+				trace_consume_skb(skb);
+			else
+				trace_kfree_skb(skb, net_tx_action);
 			__kfree_skb(skb);
 		}
 	}
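The core trick in the patch above is overlaying a private struct on the skb's 48-byte cb[] scratch area, so the free reason travels with the skb to net_tx_action() without touching a new cache line. A userspace model of just that pattern (struct sk_buff here is a toy stand-in, not the kernel definition):

```c
#include <assert.h>

enum skb_free_reason {
	SKB_REASON_CONSUMED,
	SKB_REASON_DROPPED,
};

/* Toy skb: only the control-block scratch area matters here. The real
 * skb->cb is 48 bytes and is owned by whichever layer holds the skb. */
struct sk_buff {
	char cb[48];
};

struct dev_kfree_skb_cb {
	enum skb_free_reason reason;
};

/* Overlay the reason struct on the scratch area, as the patch does. */
static struct dev_kfree_skb_cb *get_kfree_skb_cb(struct sk_buff *skb)
{
	return (struct dev_kfree_skb_cb *)skb->cb;
}

/* Round-trip helper: what dev_kfree_skb_any()/dev_consume_skb_any()
 * store is what net_tx_action() later reads back. */
static enum skb_free_reason cb_roundtrip(enum skb_free_reason reason)
{
	struct sk_buff skb;

	get_kfree_skb_cb(&skb)->reason = reason;
	return get_kfree_skb_cb(&skb)->reason;
}
```

This is safe in the completion path because nothing else uses cb[] between the mark and the softirq that frees the skb.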

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH net-next] net: introduce dev_consume_skb_any()
  2013-12-05 12:45                     ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet
@ 2013-12-05 14:13                       ` Hannes Frederic Sowa
  2013-12-05 14:45                         ` Eric Dumazet
  2013-12-06 20:24                       ` David Miller
  1 sibling, 1 reply; 63+ messages in thread
From: Hannes Frederic Sowa @ 2013-12-05 14:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Thu, Dec 05, 2013 at 04:45:08AM -0800, Eric Dumazet wrote:
> -		local_irq_save(flags);
> -		sd = &__get_cpu_var(softnet_data);
> -		skb->next = sd->completion_queue;
> -		sd->completion_queue = skb;
> -		raise_softirq_irqoff(NET_TX_SOFTIRQ);
> -		local_irq_restore(flags);
> +	if (likely(atomic_read(&skb->users) == 1)) {
> +		smp_rmb();

Could you give me a hint why this barrier is needed? IMHO the volatile
access in atomic_read should get rid of the control dependency so I
don't see a need for this barrier. Without the volatile access a
compiler-barrier would still suffice, I guess?


> +		atomic_set(&skb->users, 0);
> +	} else if (likely(!atomic_dec_and_test(&skb->users))) {
> +		return;

Or does this memory barrier deal with the part below this return?

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH net-next] net: introduce dev_consume_skb_any()
  2013-12-05 14:13                       ` Hannes Frederic Sowa
@ 2013-12-05 14:45                         ` Eric Dumazet
  2013-12-05 15:05                           ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2013-12-05 14:45 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: David Miller, netdev

On Thu, 2013-12-05 at 15:13 +0100, Hannes Frederic Sowa wrote:
> On Thu, Dec 05, 2013 at 04:45:08AM -0800, Eric Dumazet wrote:
> > -		local_irq_save(flags);
> > -		sd = &__get_cpu_var(softnet_data);
> > -		skb->next = sd->completion_queue;
> > -		sd->completion_queue = skb;
> > -		raise_softirq_irqoff(NET_TX_SOFTIRQ);
> > -		local_irq_restore(flags);
> > +	if (likely(atomic_read(&skb->users) == 1)) {
> > +		smp_rmb();
> 
> Could you give me a hint why this barrier is needed? IMHO the volatile
> access in atomic_read should get rid of the control dependency so I
> don't see a need for this barrier. Without the volatile access a
> compiler-barrier would still suffice, I guess?

Please take a look at kfree_skb() implementation.

If you think a comment is needed there, please feel free to add it.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH net-next] net: introduce dev_consume_skb_any()
  2013-12-05 14:45                         ` Eric Dumazet
@ 2013-12-05 15:05                           ` Eric Dumazet
  2013-12-05 15:44                             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2013-12-05 15:05 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: David Miller, netdev

On Thu, 2013-12-05 at 06:45 -0800, Eric Dumazet wrote:
> On Thu, 2013-12-05 at 15:13 +0100, Hannes Frederic Sowa wrote:
> > On Thu, Dec 05, 2013 at 04:45:08AM -0800, Eric Dumazet wrote:
> > > -		local_irq_save(flags);
> > > -		sd = &__get_cpu_var(softnet_data);
> > > -		skb->next = sd->completion_queue;
> > > -		sd->completion_queue = skb;
> > > -		raise_softirq_irqoff(NET_TX_SOFTIRQ);
> > > -		local_irq_restore(flags);
> > > +	if (likely(atomic_read(&skb->users) == 1)) {
> > > +		smp_rmb();
> > 
> > Could you give me a hint why this barrier is needed? IMHO the volatile
> > access in atomic_read should get rid of the control dependency so I
> > don't see a need for this barrier. Without the volatile access a
> > compiler-barrier would still suffice, I guess?
> 
> Please take a look at kfree_skb() implementation.
> 
> If you think a comment is needed there, please feel free to add it.
> 

My understanding of this (old) barrier here is an implicit wmb in
skb_get()

This probably needs something like :

static inline struct sk_buff *skb_get(struct sk_buff *skb)
{
	smp_mb__before_atomic_inc(); /* check {consume|kfree}_skb() */
	atomic_inc(&skb->users);
	return skb;
}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH net-next] net: introduce dev_consume_skb_any()
  2013-12-05 15:05                           ` Eric Dumazet
@ 2013-12-05 15:44                             ` Hannes Frederic Sowa
  2013-12-05 16:38                               ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Hannes Frederic Sowa @ 2013-12-05 15:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Thu, Dec 05, 2013 at 07:05:52AM -0800, Eric Dumazet wrote:
> On Thu, 2013-12-05 at 06:45 -0800, Eric Dumazet wrote:
> > On Thu, 2013-12-05 at 15:13 +0100, Hannes Frederic Sowa wrote:
> > > On Thu, Dec 05, 2013 at 04:45:08AM -0800, Eric Dumazet wrote:
> > > > -		local_irq_save(flags);
> > > > -		sd = &__get_cpu_var(softnet_data);
> > > > -		skb->next = sd->completion_queue;
> > > > -		sd->completion_queue = skb;
> > > > -		raise_softirq_irqoff(NET_TX_SOFTIRQ);
> > > > -		local_irq_restore(flags);
> > > > +	if (likely(atomic_read(&skb->users) == 1)) {
> > > > +		smp_rmb();
> > > 
> > > Could you give me a hint why this barrier is needed? IMHO the volatile
> > > access in atomic_read should get rid of the control dependency so I
> > > don't see a need for this barrier. Without the volatile access a
> > > compiler-barrier would still suffice, I guess?
> > 
> > Please take a look at kfree_skb() implementation.
> > 
> > If you think a comment is needed there, please feel free to add it.
> > 
> 
> My understanding of this (old) barrier here is an implicit wmb in
> skb_get()
> 
> This probably needs something like :
> 
> static inline struct sk_buff *skb_get(struct sk_buff *skb)
> {
> 	smp_mb__before_atomic_inc(); /* check {consume|kfree}_skb() */
> 	atomic_inc(&skb->users);
> }

Thanks for the pointer to kfree_skb. I found this commit which added the
barrier in kfree_skb (from history.git):

commit 09d3e84de438f217510b604a980befd07b0c8262
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Sat Feb 5 03:23:27 2005 -0800

    [NET]: Add missing memory barrier to kfree_skb().
    
    Also kill kfree_skb_fast(), that is a relic from fast switching
    which was killed off years ago.
    
    The bug is that in the case where we do the atomic_read()
    optimization, we need to make sure that reads of skb state
    later in __kfree_skb() processing (particularly the skb->list
    BUG check) are not reordered to occur before the counter
    read by the cpu.
    
    Thanks to Olaf Kirch and Anton Blanchard for discovering
    and helping fix this bug.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: David S. Miller <davem@davemloft.net>

It makes some sense but I did not grasp the whole ->users dependency
picture, yet. I guess the barrier is only needed when refcount drops
down to 0 and we don't necessarily need one when incrementing ->users.

Thank you,

  Hannes

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH net-next] net: introduce dev_consume_skb_any()
  2013-12-05 15:44                             ` Hannes Frederic Sowa
@ 2013-12-05 16:38                               ` Eric Dumazet
  2013-12-05 16:54                                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2013-12-05 16:38 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: David Miller, netdev

On Thu, 2013-12-05 at 16:44 +0100, Hannes Frederic Sowa wrote:

> It makes some sense but I did not grasp the whole ->users dependency
> picture, yet. I guess the barrier is only needed when refcount drops
> down to 0 and we don't necessarily need one when incrementing ->users.

If you are the only user of this skb, really no smp barrier is needed at
all.

The problem comes when another cpu is working on the skb, and finally
releases its reference on it.

Before releasing its reference, it must commit all changes it might have
done onto skb. Otherwise another cpu might read stale data.

The smp_wmb() is done by the atomic_dec_and_test(), as it contains a
full barrier.

So the smp_rmb() pairs with the barrier done in atomic_dec_and_test()

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH net-next] net: introduce dev_consume_skb_any()
  2013-12-05 16:38                               ` Eric Dumazet
@ 2013-12-05 16:54                                 ` Hannes Frederic Sowa
  0 siblings, 0 replies; 63+ messages in thread
From: Hannes Frederic Sowa @ 2013-12-05 16:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Thu, Dec 05, 2013 at 08:38:05AM -0800, Eric Dumazet wrote:
> On Thu, 2013-12-05 at 16:44 +0100, Hannes Frederic Sowa wrote:
> 
> > It makes some sense but I did not grasp the whole ->users dependency
> > picture, yet. I guess the barrier is only needed when refcount drops
> > down to 0 and we don't necessarily need one when incrementing ->users.
> 
> If you are the only user of this skb, really no smp barrier is needed at
> all.
> 
> The problem comes when another cpu is working on the skb, and finally
> releases its reference on it.
> 
> Before releasing its reference, it must commit all changes it might have
> done onto skb. Otherwise another cpu might read stale data.
> 
> The smp_wmb() is done by the atomic_dec_and_test(), as it contains a
> full barrier.
> 
> So the smp_rmb() pairs with the barrier done in atomic_dec_and_test()

Ha, it all makes sense now! Thanks, Eric!

(Sorry for the noise but I find this kind of problems very interesting)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-04  1:23                           ` Eric Dumazet
  2013-12-04  1:59                             ` Alexei Starovoitov
@ 2013-12-06  9:06                             ` Or Gerlitz
  2013-12-06 13:36                               ` Eric Dumazet
  1 sibling, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-06  9:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Wed, Dec 4, 2013 at 3:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> skb->dev is in the first cache line, where we access skb->next anyway.
> I could use skb->cb[] like the following patch :

Hi Eric, I applied to the net tree the patch you posted yesterday,
"net: introduce dev_consume_skb_any()", along with the network-drivers
part of this patch; unless I got it wrong, I assume both pieces are
needed?

So I re-ran the vxlan/veth test that we suspect suffers packet drops on TX.

With the patches applied I have almost no samples of that event:

$ ./perf report -i perf.data

Samples: 89  of event 'skb:kfree_skb', Event count (approx.): 89
+  39.33%  ksoftirqd/2  [kernel.kallsyms]  [k] net_tx_action
+  28.09%      swapper  [kernel.kallsyms]  [k] net_tx_action
+  28.09%         sshd  [kernel.kallsyms]  [k] net_tx_action
+   2.25%      swapper  [kernel.kallsyms]  [k] kfree_skb
+   1.12%  kworker/2:2  [kernel.kallsyms]  [k] net_tx_action
+   1.12%        iperf  [kernel.kallsyms]  [k] net_tx_action

 ./perf report -i perf.data --sort dso,symbol
Samples: 89  of event 'skb:kfree_skb', Event count (approx.): 89
+  97.75%  [kernel.kallsyms]  [k] net_tx_action
+   2.25%  [kernel.kallsyms]  [k] kfree_skb

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
       [not found]                     ` <52A197DF.5010806@mellanox.com>
@ 2013-12-06  9:30                       ` Or Gerlitz
  2013-12-08 12:43                         ` Mike Rapoport
  2013-12-06 10:30                       ` Joseph Gasparakis
  1 sibling, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-06  9:30 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Joseph Gasparakis, Pravin B Shelar, Eric Dumazet, Jerry Chu,
	Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend

> On 04/12/2013 11:41, Or Gerlitz wrote:
> On Wed, Dec 4, 2013 at 11:24 AM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
>
>> And just for the record,
>> you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was
>> seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO
>> support" pushed the SKB_UDP_TUNNEL two bits left, and I had done my tests
>> before it.
>
> indeed, also, on what kernel did you conducted your tests which you managed
> to WA the problem with unsetting that bit?


Hi Joseph,

Really need your response here --

1. On which kernel did you manage to get acceptable vxlan performance
with this hack?

2. Did the hack help for veth host traffic as well, or only for PV VM traffic?

Currently this doesn't converge with 3.12.x or net.git: with veth/vxlan
the DODGY bit isn't set when looking at the skb at vxlan xmit
time, so there's nothing for me to hack there. For VMs, things don't
really work without unsetting the bit, but unsetting it by itself
so far didn't get me far performance wise.

BTW guys, I saw the issues with both the bridge and openvswitch
configurations - it seems we might have a fairly large breakage of the
system w.r.t vxlan traffic at rates that go over a few Gb/s -- so I
would love to get feedback of any kind from the people that were
involved with vxlan over the last months/year.

Or.

net.git]# grep -rn  SKB_GSO_DODGY drivers/net/ net/ipv4 net/core
drivers/net/macvtap.c:585:              skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
drivers/net/tun.c:1135:         skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
drivers/net/virtio_net.c:497:           skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
drivers/net/xen-netback/netback.c:1146: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
drivers/net/xen-netfront.c:823: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
net/ipv4/af_inet.c:1264:                       SKB_GSO_DODGY |
net/ipv4/tcp_offload.c:56:                             SKB_GSO_DODGY |
net/ipv4/gre_offload.c:40:                                SKB_GSO_DODGY |
net/ipv4/udp_offload.c:53:              if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY |
net/core/dev.c:2694:            if (shinfo->gso_type & SKB_GSO_DODGY)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
       [not found]                     ` <52A197DF.5010806@mellanox.com>
  2013-12-06  9:30                       ` Or Gerlitz
@ 2013-12-06 10:30                       ` Joseph Gasparakis
  2013-12-07 21:27                         ` Or Gerlitz
  2013-12-08 15:21                         ` Or Gerlitz
  1 sibling, 2 replies; 63+ messages in thread
From: Joseph Gasparakis @ 2013-12-06 10:30 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Joseph Gasparakis, Pravin B Shelar, Or Gerlitz, Eric Dumazet,
	Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	jeffrey.t.kirsher, John Fastabend



On Fri, 6 Dec 2013, Or Gerlitz wrote:

> On 04/12/2013 11:41, Or Gerlitz wrote:
> > On Wed, Dec 4, 2013 at 11:24 AM, Joseph
> > Gasparakis<joseph.gasparakis@intel.com>  wrote:
> > > >And just for the record,
> > > >you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was
> > > >seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO
> > > >support" pushed the SKB_UDP_TUNNEL two bits left, and I had done my tests
> > > before it.
> > indeed, also, on what kernel did you conducted your tests which you managed
> > to WA the problem with unsetting that bit?
> 
> 
> Hi Joseph,
> 
> Really need your response here --

I'm sorry Or, I managed to miss your original request...

> 
> 1. on which kernel did you manage to get along fine vxlan performance wise
> with this hack?
> 

I was running 3.10.6.

> 2. did the hack helped for both veth host traffic or only on PV VM traffic?
> 

No, just VM. I haven't tried veth.

If you leave the DODGY bit set, does your traffic get dropped on Tx, after
it leaves vxlan and before it hits your driver? That is what I had seen --
is that right?

If you unset it, do you recover?

What is the output of your ethtool -k on the interface you are 
transmitting from?

> Currently it doesn't converge with 3.12.x or net.git, with veth/vxlan the
> DODGE bit isn't set when looking on the skb in the vxlan xmit time, so there's
> nothing for me to hack there. For VMs without unsetting the bit things don't
> really work, but unsetting it for itself so far didn't get me far performance
> wise.
> 
> BTW guys, I saw the issues with both bridge/openvswitch configuration - seems
> that we might have here somehow large breakage of the system w.r.t vxlan
> traffic for rates that go over few Gbs -- so would love to get feedback of any
> kind from the people that were involved with vxlan over the last months/year.
> 
> Or.
> 
> net.git]# grep -rn  SKB_GSO_DODGY drivers/net/ net/ipv4 net/core
> drivers/net/macvtap.c:585: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> drivers/net/tun.c:1135:         skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> drivers/net/virtio_net.c:497: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> drivers/net/xen-netback/netback.c:1146: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> drivers/net/xen-netfront.c:823: skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> net/ipv4/af_inet.c:1264:                       SKB_GSO_DODGY |
> net/ipv4/tcp_offload.c:56: SKB_GSO_DODGY |
> net/ipv4/gre_offload.c:40: SKB_GSO_DODGY |
> net/ipv4/udp_offload.c:53:              if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY |
> net/core/dev.c:2694:            if (shinfo->gso_type & SKB_GSO_DODGY)
> 
> 
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-06  9:06                             ` Or Gerlitz
@ 2013-12-06 13:36                               ` Eric Dumazet
  2013-12-07 21:20                                 ` Or Gerlitz
  2013-12-08 12:09                                 ` Or Gerlitz
  0 siblings, 2 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-06 13:36 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Fri, 2013-12-06 at 11:06 +0200, Or Gerlitz wrote:
> On Wed, Dec 4, 2013 at 3:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > skb->dev is in the first cache line, where we access skb->next anyway.
> > I could use skb->cb[] like the following patch :
> 
> Hi Eric, I applied on the net tree the patch you posted yesterday
> "net: introduce dev_consume_skb_any()" along with the network drivers
> part of this patch, unless I got it wrong, I assume both pieces are
> needed?
> 
> So I re-run the vxlan/veth test that we suspect goes through packet drops on TX.
> 
> With the patches applied I have almost no samples of that event
> 
> $ ./perf report -i perf.data

How did you get this perf.data file ? There are a few drops.

> 
> Samples: 89  of event 'skb:kfree_skb', Event count (approx.): 89
> +  39.33%  ksoftirqd/2  [kernel.kallsyms]  [k] net_tx_action
> +  28.09%      swapper  [kernel.kallsyms]  [k] net_tx_action
> +  28.09%         sshd  [kernel.kallsyms]  [k] net_tx_action
> +   2.25%      swapper  [kernel.kallsyms]  [k] kfree_skb
> +   1.12%  kworker/2:2  [kernel.kallsyms]  [k] net_tx_action
> +   1.12%        iperf  [kernel.kallsyms]  [k] net_tx_action
> 
>  ./perf report -i perf.data --sort dso,symbol
> Samples: 89  of event 'skb:kfree_skb', Event count (approx.): 89
> +  97.75%  [kernel.kallsyms]  [k] net_tx_action
> +   2.25%  [kernel.kallsyms]  [k] kfree_skb
> --

OK, this means your driver drops a few packets in its ndo_start_xmit()
handler.

Could you give us "ifconfig -a" reports as I already asked?

You could temporarily change the dev_kfree_skb_any() in mlx4_en_xmit()
to call kfree_skb(skb) instead, to get a stack trace (perf record -a -g
-e skb:kfree_skb sleep 20 ; perf report)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index f54ebd5a1702..53130f27dec0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -873,7 +873,7 @@ tx_drop_unmap:
 	}
 
 tx_drop:
-	dev_kfree_skb_any(skb);
+	kfree_skb(skb);
 	priv->stats.tx_dropped++;
 	return NETDEV_TX_OK;
 }

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH net-next] net: introduce dev_consume_skb_any()
  2013-12-05 12:45                     ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet
  2013-12-05 14:13                       ` Hannes Frederic Sowa
@ 2013-12-06 20:24                       ` David Miller
  1 sibling, 0 replies; 63+ messages in thread
From: David Miller @ 2013-12-06 20:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 05 Dec 2013 04:45:08 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq()
> helpers to free skbs, both for dropped packets and TX completed ones.
> 
> We need to separate the two causes to get better diagnostics
> given by dropwatch or "perf record -e skb:kfree_skb"
> 
> This patch provides two new helpers, dev_consume_skb_any() and
> dev_consume_skb_irq() to be used for consumed skbs.
> 
> __dev_kfree_skb_irq() is slightly optimized to remove one
> atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-06 13:36                               ` Eric Dumazet
@ 2013-12-07 21:20                                 ` Or Gerlitz
  2013-12-08 12:09                                 ` Or Gerlitz
  1 sibling, 0 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-07 21:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David Miller, Joseph Gasparakis, Jerry Chu,
	Or Gerlitz, Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On Fri, Dec 6, 2013 at 3:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-12-06 at 11:06 +0200, Or Gerlitz wrote:
>> On Wed, Dec 4, 2013 at 3:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > skb->dev is in the first cache line, where we access skb->next anyway.
>> > I could use skb->cb[] like the following patch :
>>
>> Hi Eric, I applied on the net tree the patch you posted yesterday
>> "net: introduce dev_consume_skb_any()" along with the network drivers
>> part of this patch, unless I got it wrong, I assume both pieces are
>> needed?
>>
>> So I re-run the vxlan/veth test that we suspect goes through packet drops on TX.
>>
>> With the patches applied I have almost no samples of that event
>>
>> $ ./perf report -i perf.data
>
> How did you get this perf.data file ? There are a few drops.

Using the command you suggested on the active side while running
traffic (iperf tcp)

$ ./perf record -e skb:kfree_skb -g -a sleep 10

>> Samples: 89  of event 'skb:kfree_skb', Event count (approx.): 89
>> +  39.33%  ksoftirqd/2  [kernel.kallsyms]  [k] net_tx_action
>> +  28.09%      swapper  [kernel.kallsyms]  [k] net_tx_action
>> +  28.09%         sshd  [kernel.kallsyms]  [k] net_tx_action
>> +   2.25%      swapper  [kernel.kallsyms]  [k] kfree_skb
>> +   1.12%  kworker/2:2  [kernel.kallsyms]  [k] net_tx_action
>> +   1.12%        iperf  [kernel.kallsyms]  [k] net_tx_action

>>  ./perf report -i perf.data --sort dso,symbol
>> Samples: 89  of event 'skb:kfree_skb', Event count (approx.): 89
>> +  97.75%  [kernel.kallsyms]  [k] net_tx_action
>> +   2.25%  [kernel.kallsyms]  [k] kfree_skb
>> --

> OK, this means your driver drops few packets in its ndo_start_xmit()
> handler.
>
> Could you give us "ifconfig -a" reports as I already asked  ?

Will do tomorrow while in front of the setup.

When I provided the info last time
(http://marc.info/?l=linux-netdev&m=138610891121531&w=2) it included
the ifconfig -a output, and no drops were seen there.


>
> You could temporary change the dev_kfree_skb_any() in mlx4_en_xmit()
> to call kfree_skb(skb) instead, to get  a stack trace (perf record -a -g
> -e skb:kfree_skb sleep 20 ; perf report)

yes, tomorrow

>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index f54ebd5a1702..53130f27dec0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -873,7 +873,7 @@ tx_drop_unmap:
>         }
>
>  tx_drop:
> -       dev_kfree_skb_any(skb);
> +       kfree_skb(skb);
>         priv->stats.tx_dropped++;
>         return NETDEV_TX_OK;
>  }
>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-06 10:30                       ` Joseph Gasparakis
@ 2013-12-07 21:27                         ` Or Gerlitz
  2013-12-08 18:08                           ` Joseph Gasparakis
  2013-12-08 15:21                         ` Or Gerlitz
  1 sibling, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-07 21:27 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Or Gerlitz, Pravin B Shelar, Eric Dumazet, Jerry Chu,
	Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend

On Fri, Dec 6, 2013, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote:
> On Fri, 6 Dec 2013, Or Gerlitz wrote:
>
>> On 04/12/2013 11:41, Or Gerlitz wrote:
>> > On Wed, Dec 4, 2013 at 11:24 AM, Joseph
>> > Gasparakis<joseph.gasparakis@intel.com>  wrote:
>> > > >And just for the record,
>> > > >you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was
>> > > >seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO
>> > > >support" pushed the SKB_UDP_TUNNEL two bits left, and I had done my tests
>> > > before it.
>> > indeed, also, on what kernel did you conducted your tests which you managed
>> > to WA the problem with unsetting that bit?
>>
>>
>> Hi Joseph,
>>
>> Really need your response here --
>
> I'm sorry Or, I managed to miss your original request...

sure.. it happens.

>> 1. on which kernel did you manage to get along fine vxlan performance wise
>> with this hack?

> I was running 3.10.6.

I see, I will try it out. Just to get closer to your env: what kernel
were the guests running? Was a bridge/OVS instance involved in the VM
PV connectivity?


>> 2. did the hack helped for both veth host traffic or only on PV VM traffic?

> No, just VM. I haven't tried veth.

I see, earlier I was somehow under the impression you noted the
problem for veth too.

> If you leave the DODGY bit, does your traffic get droped on Tx, after it
> leaves vxlan and before it hits your driver, which is what I had seen. Is
> that right?

What I saw is that if I leave the DODGY bit set, practically nothing
works at all; it's not that just some packets are dropped. Is that
what you saw?

Or was it that in your env only **some** or **few** packets were
dropped each time, but this killed the TCP session performance?

Also, did you hack/modify the VM NIC MTU to take into account
the encapsulation overhead?

> If you unset it, do you recover?

Let me redo this with your setting and see. Please make sure to tell
me what kernel the VM was running too (thanks!)

> What is the output of your ethtool -k on the interface you are
> transmitting from?

Will send it tomorrow, but note that this happens without offloads for
encapsulated traffic.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-06 13:36                               ` Eric Dumazet
  2013-12-07 21:20                                 ` Or Gerlitz
@ 2013-12-08 12:09                                 ` Or Gerlitz
  1 sibling, 0 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-08 12:09 UTC (permalink / raw)
  To: Eric Dumazet, Or Gerlitz
  Cc: Alexei Starovoitov, David Miller, Joseph Gasparakis, Jerry Chu,
	Eric Dumazet, Pravin B Shelar, netdev@vger.kernel.org

On 06/12/2013 15:36, Eric Dumazet wrote:
> On Fri, 2013-12-06 at 11:06 +0200, Or Gerlitz wrote:
>> On Wed, Dec 4, 2013 at 3:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> skb->dev is in the first cache line, where we access skb->next anyway.
>>> I could use skb->cb[] like the following patch :
>> Hi Eric, I applied on the net tree the patch you posted yesterday
>> "net: introduce dev_consume_skb_any()" along with the network drivers
>> part of this patch, unless I got it wrong, I assume both pieces are
>> needed?
>>
>> So I re-run the vxlan/veth test that we suspect goes through packet drops on TX.
>>
>> With the patches applied I have almost no samples of that event
>>
>> $ ./perf report -i perf.data
> How did you get this perf.data file ? There are a few drops.
>
>> Samples: 89  of event 'skb:kfree_skb', Event count (approx.): 89
>> +  39.33%  ksoftirqd/2  [kernel.kallsyms]  [k] net_tx_action
>> +  28.09%      swapper  [kernel.kallsyms]  [k] net_tx_action
>> +  28.09%         sshd  [kernel.kallsyms]  [k] net_tx_action
>> +   2.25%      swapper  [kernel.kallsyms]  [k] kfree_skb
>> +   1.12%  kworker/2:2  [kernel.kallsyms]  [k] net_tx_action
>> +   1.12%        iperf  [kernel.kallsyms]  [k] net_tx_action
>>
>>   ./perf report -i perf.data --sort dso,symbol
>> Samples: 89  of event 'skb:kfree_skb', Event count (approx.): 89
>> +  97.75%  [kernel.kallsyms]  [k] net_tx_action
>> +   2.25%  [kernel.kallsyms]  [k] kfree_skb
>> --
> OK, this means your driver drops few packets in its ndo_start_xmit() handler.

I wasn't sure I followed how the above led you to think the drops
occur in the driver -- but, anyway, I applied your other patches plus
the one below, which makes the mlx4 driver call kfree_skb(skb), and I
don't see mlx4 hits on either the client or the server side


client side:

Samples: 133  of event 'skb:kfree_skb', Event count (approx.): 133
+  40.60%  ksoftirqd/2  [kernel.kallsyms]  [k] net_tx_action
+  25.56%        iperf  [kernel.kallsyms]  [k] net_tx_action
+  24.06%      swapper  [kernel.kallsyms]  [k] net_tx_action
+   3.01%        iperf  [kernel.kallsyms]  [k] kfree_skb
+   2.26%      swapper  [kernel.kallsyms]  [k] kfree_skb
+   2.26%  kworker/2:1  [kernel.kallsyms]  [k] net_tx_action
+   0.75%       rcuc/2  [kernel.kallsyms]  [k] net_tx_action
+   0.75%  kworker/2:2  [kernel.kallsyms]  [k] net_tx_action
+   0.75%       ypbind  [kernel.kallsyms]  [k] net_tx_action

server side:

Samples: 57  of event 'skb:kfree_skb', Event count (approx.): 57
+  47.37%          swapper  [kernel.kallsyms]  [k] kfree_skb
+  22.81%            iperf  [kernel.kallsyms]  [k] kfree_skb
+   8.77%      ksoftirqd/2  [kernel.kallsyms]  [k] kfree_skb
+   7.02%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
+   7.02%               ls  [kernel.kallsyms]  [k] kfree_skb
+   3.51%       umount.nfs  [kernel.kallsyms]  [k] kfree_skb
+   1.75%            udevd  [kernel.kallsyms]  [k] kfree_skb
+   1.75%       rpc.idmapd  [kernel.kallsyms]  [k] kfree_skb


I will provide the full perf report files OOB, but I have scrolled
through the hits and didn't see one in mlx4... any idea where/how to
take this from here?


>
> Could you give us "ifconfig -a" reports as I already asked  ?

Sure. I see that on both sides there are some drops on a 1Gb/s NIC
which is not part of the test.


client side (mlx4 NIC is eth6)

r-dcs44-005 perf]# ifconfig -a
br1       Link encap:Ethernet  HWaddr 0E:08:CC:54:78:44
           inet addr:192.168.52.144  Bcast:192.168.52.255 Mask:255.255.255.0
           inet6 addr: fe80::c0c0:45ff:feff:bfed/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1450 Metric:1
           RX packets:11 errors:0 dropped:0 overruns:0 frame:0
           TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:732 (732.0 b)  TX bytes:648 (648.0 b)

eth0      Link encap:Ethernet  HWaddr 00:50:56:25:4B:05
           inet addr:10.212.75.5  Bcast:10.212.255.255 Mask:255.255.0.0
           inet6 addr: fe80::250:56ff:fe25:4b05/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
           RX packets:70485 errors:0 dropped:467 overruns:0 frame:0
           TX packets:35821 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:58838076 (56.1 MiB)  TX bytes:20632115 (19.6 MiB)

eth6      Link encap:Ethernet  HWaddr 00:02:C9:E9:BB:B2
           inet addr:192.168.30.144  Bcast:192.168.30.255 Mask:255.255.255.0
           inet6 addr: fe80::2:c900:1e9:bbb2/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
           RX packets:58503788 errors:0 dropped:0 overruns:0 frame:0
           TX packets:389819285 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:6604815254 (6.1 GiB)  TX bytes:589822872168 (549.3 GiB)

eth7      Link encap:Ethernet  HWaddr 52:54:00:86:B6:48
           BROADCAST MULTICAST  MTU:1500  Metric:1
           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback
           inet addr:127.0.0.1  Mask:255.0.0.0
           inet6 addr: ::1/128 Scope:Host
           UP LOOPBACK RUNNING  MTU:65536  Metric:1
           RX packets:137 errors:0 dropped:0 overruns:0 frame:0
           TX packets:137 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:13860 (13.5 KiB)  TX bytes:13860 (13.5 KiB)

veth0     Link encap:Ethernet  HWaddr E6:95:68:49:A6:3D
           inet addr:192.168.62.144  Bcast:192.168.62.255 Mask:255.255.255.0
           inet6 addr: fe80::e495:68ff:fe49:a63d/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1450 Metric:1
           RX packets:58510472 errors:0 dropped:0 overruns:0 frame:0
           TX packets:55440581 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:3680050172 (3.4 GiB)  TX bytes:552349060836 (514.4 GiB)

veth1     Link encap:Ethernet  HWaddr 5A:4D:A3:4B:B1:97
           inet6 addr: fe80::584d:a3ff:fe4b:b197/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1450 Metric:1
           RX packets:55440581 errors:0 dropped:0 overruns:0 frame:0
           TX packets:58510475 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:552349060836 (514.4 GiB)  TX bytes:3680050334 (3.4 GiB)

vxlan42   Link encap:Ethernet  HWaddr 0E:08:CC:54:78:44
           inet addr:192.168.42.144  Bcast:192.168.42.255 Mask:255.255.255.0
           inet6 addr: fe80::c08:ccff:fe54:7844/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1450 Metric:1
           RX packets:58510461 errors:0 dropped:0 overruns:0 frame:0
           TX packets:55440599 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:2860902728 (2.6 GiB)  TX bytes:553236159764 (515.2 GiB)



server side (mlx4 NIC is eth2)

r-dcs47-005 perf]# ifconfig -a
br1       Link encap:Ethernet  HWaddr 2A:9B:C5:5F:FA:AB
           inet addr:192.168.52.147  Bcast:192.168.52.255 Mask:255.255.255.0
           inet6 addr: fe80::cca:f9ff:fead:4210/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1450 Metric:1
           RX packets:33 errors:0 dropped:0 overruns:0 frame:0
           TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:2172 (2.1 KiB)  TX bytes:648 (648.0 b)

eth0      Link encap:Ethernet  HWaddr 00:50:56:25:4A:05
           inet addr:10.212.74.5  Bcast:10.212.255.255 Mask:255.255.0.0
           inet6 addr: fe80::250:56ff:fe25:4a05/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
           RX packets:15274 errors:0 dropped:18 overruns:0 frame:0
           TX packets:9190 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:36265311 (34.5 MiB)  TX bytes:5736966 (5.4 MiB)

eth2      Link encap:Ethernet  HWaddr 00:02:C9:E9:C0:82
           inet addr:192.168.30.147  Bcast:192.168.30.255 Mask:255.255.255.0
           inet6 addr: fe80::2:c900:1e9:c082/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
           RX packets:442423939 errors:0 dropped:0 overruns:0 frame:0
           TX packets:66524628 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:669425541728 (623.4 GiB)  TX bytes:7511309282 (6.9 GiB)

eth3      Link encap:Ethernet  HWaddr 52:54:00:5D:70:D9
           BROADCAST MULTICAST  MTU:1500  Metric:1
           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback
           inet addr:127.0.0.1  Mask:255.0.0.0
           inet6 addr: ::1/128 Scope:Host
           UP LOOPBACK RUNNING  MTU:65536  Metric:1
           RX packets:32 errors:0 dropped:0 overruns:0 frame:0
           TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:2196 (2.1 KiB)  TX bytes:2196 (2.1 KiB)

veth0     Link encap:Ethernet  HWaddr 56:3D:34:30:86:68
           inet addr:192.168.62.147  Bcast:192.168.62.255 Mask:255.255.255.0
           inet6 addr: fe80::543d:34ff:fe30:8668/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1450 Metric:1
           RX packets:442491685 errors:0 dropped:0 overruns:0 frame:0
           TX packets:66538057 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:647403574534 (602.9 GiB)  TX bytes:4185942998 (3.8 GiB)

veth1     Link encap:Ethernet  HWaddr 2A:9B:C5:5F:FA:AB
           inet6 addr: fe80::289b:c5ff:fe5f:faab/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1450 Metric:1
           RX packets:66538066 errors:0 dropped:0 overruns:0 frame:0
           TX packets:442491738 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:4185943484 (3.8 GiB)  TX bytes:647403652126 (602.9 GiB)

vxlan42   Link encap:Ethernet  HWaddr F6:E2:99:BD:D6:58
           inet addr:192.168.42.147  Bcast:192.168.42.255 Mask:255.255.255.0
           inet6 addr: fe80::f4e2:99ff:febd:d658/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1450 Metric:1
           RX packets:442491767 errors:0 dropped:0 overruns:0 frame:0
           TX packets:66538082 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:641208829864 (597.1 GiB)  TX bytes:5250554092 (4.8 GiB)





>
> You could temporary change the dev_kfree_skb_any() in mlx4_en_xmit()
> to call kfree_skb(skb) instead, to get  a stack trace (perf record -a -g
> -e skb:kfree_skb sleep 20 ; perf report)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index f54ebd5a1702..53130f27dec0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -873,7 +873,7 @@ tx_drop_unmap:
>   	}
>   
>   tx_drop:
> -	dev_kfree_skb_any(skb);
> +	kfree_skb(skb);
>   	priv->stats.tx_dropped++;
>   	return NETDEV_TX_OK;
>   }
>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-06  9:30                       ` Or Gerlitz
@ 2013-12-08 12:43                         ` Mike Rapoport
  2013-12-08 13:07                           ` Or Gerlitz
  0 siblings, 1 reply; 63+ messages in thread
From: Mike Rapoport @ 2013-12-08 12:43 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Joseph Gasparakis, Pravin B Shelar, Eric Dumazet,
	Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend

On Fri, Dec 06, 2013 at 11:30:37AM +0200, Or Gerlitz wrote:
> > On 04/12/2013 11:41, Or Gerlitz wrote:
> 
> BTW guys, I saw the issues with both bridge/openvswitch configuration
> - seems that we might have here somehow large breakage of the system
> w.r.t vxlan traffic for rates that go over few Gbs -- so would love to
> get feedback of any kind from the people that were involved with vxlan
> over the last months/year.

I've seen similar problems with vxlan traffic. In our scenario I had two VMs
running on the same host and both VMs having the { veth --> bridge -->
vxlan --> IP stack --> NIC } chain.
Running iperf on veth showed rate ~6 times slower than direct NIC <-> NIC.
With a hack that forces large gso_size in vxlan's handle_offloads, I've
got veth performing only slightly slower than NICs ...
The explanation I thought of is that performing the split of the packet
as late as possible reduces processing overhead and allows more data to
be processed.

My $0.02

> 
> Or.
> 
--
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-08 12:43                         ` Mike Rapoport
@ 2013-12-08 13:07                           ` Or Gerlitz
  2013-12-08 14:30                             ` Mike Rapoport
  0 siblings, 1 reply; 63+ messages in thread
From: Or Gerlitz @ 2013-12-08 13:07 UTC (permalink / raw)
  To: Mike Rapoport, Or Gerlitz
  Cc: Joseph Gasparakis, Pravin B Shelar, Eric Dumazet, Jerry Chu,
	Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend

On 08/12/2013 14:43, Mike Rapoport wrote:
> On Fri, Dec 06, 2013 at 11:30:37AM +0200, Or Gerlitz wrote:
>>> On 04/12/2013 11:41, Or Gerlitz wrote:
>> BTW guys, I saw the issues with both bridge/openvswitch configuration
>> - seems that we might have here somehow large breakage of the system
>> w.r.t vxlan traffic for rates that go over few Gbs -- so would love to
>> get feedback of any kind from the people that were involved with vxlan
>> over the last months/year.
> I've seen similar problems with vxlan traffic. In our scenario I had two VMs running on the same host and both VMs having the { veth --> bridge --> vlxan --> IP stack --> NIC } chain.

How were the VMs connected to the veth NICs? What kernel were you using?


> Running iperf on veth showed rate ~6 times slower than direct NIC <-> NIC. With a hack that forces large gso_size in vxlan's handle_offloads, I've got veth performing only slightly slower than NICs ... The explanation I thought of is that performing the split of the packet as late as possible reduces processing overhead and allows more data to be processed.

thanks for the tip! A few quick clarifications -- so you artificially
enlarged the gso_size of the skb? Can you provide the line you added here:

static int handle_offloads(struct sk_buff *skb)
{
         if (skb_is_gso(skb)) {
                 int err = skb_unclone(skb, GFP_ATOMIC);
                 if (unlikely(err))
                         return err;

                 skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
         } else if (skb->ip_summed != CHECKSUM_PARTIAL)
                 skb->ip_summed = CHECKSUM_NONE;

         return 0;
}

also, why does enlarging the gso_size of the skb cause the actual
segmentation to come into play lower in the stack?

Or.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-08 13:07                           ` Or Gerlitz
@ 2013-12-08 14:30                             ` Mike Rapoport
  2013-12-08 20:50                               ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Mike Rapoport @ 2013-12-08 14:30 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Joseph Gasparakis, Pravin B Shelar, Eric Dumazet,
	Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend

On Sun, Dec 08, 2013 at 03:07:54PM +0200, Or Gerlitz wrote:
> On 08/12/2013 14:43, Mike Rapoport wrote:
> > On Fri, Dec 06, 2013 at 11:30:37AM +0200, Or Gerlitz wrote:
> >>> On 04/12/2013 11:41, Or Gerlitz wrote:
> >> BTW guys, I saw the issues with both bridge and openvswitch configurations
> >> - it seems we might have a fairly large breakage of the system
> >> w.r.t vxlan traffic at rates that go over a few Gbs -- so I would love to
> >> get feedback of any kind from the people that were involved with vxlan
> >> over the last months/year.
> > I've seen similar problems with vxlan traffic. In our scenario I had two VMs running on the same host, with both VMs using the { veth --> bridge --> vxlan --> IP stack --> NIC } chain.
> 
> How were the VMs connected to the veth NICs? What kernel were you using?
> 
> 
> > Running iperf on veth showed rate ~6 times slower than direct NIC <-> NIC. With a hack that forces large gso_size in vxlan's handle_offloads, I've got veth performing only slightly slower than NICs ... The explanation I thought of is that performing the split of the packet as late as possible reduces processing overhead and allows more data to be processed.
> 
> Thanks for the tip! A few quick clarifications -- so you artificially 
> enlarged the gso_size of the skb? Can you provide the line you added here:
 
It was something *very* hacky:

static int handle_offloads(struct sk_buff *skb)
{
	if (skb_is_gso(skb)) {
		int err = skb_unclone(skb, GFP_ATOMIC);
		if (unlikely(err))
			return err;
 
		skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;

		if (skb->len < 64000)
			skb_shinfo(skb)->gso_size = skb->len;
		else
			skb_shinfo(skb)->gso_size = 64000;

	} else if (skb->ip_summed != CHECKSUM_PARTIAL)
		skb->ip_summed = CHECKSUM_NONE;
 
	return 0;
}
 
> Also, why does enlarging the gso_size of the skbs cause the actual segmentation 
> to come into play lower in the stack?
> 
> Or.
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-06 10:30                       ` Joseph Gasparakis
  2013-12-07 21:27                         ` Or Gerlitz
@ 2013-12-08 15:21                         ` Or Gerlitz
  1 sibling, 0 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-08 15:21 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Pravin B Shelar, Or Gerlitz, Eric Dumazet, Jerry Chu,
	Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	jeffrey.t.kirsher, John Fastabend, Jerry Chu

On 06/12/2013 12:30, Joseph Gasparakis wrote:
> On Fri, 6 Dec 2013, Or Gerlitz wrote:
>
>
>> 1. on which kernel did you manage to get along fine vxlan performance wise
>> with this hack?
>>
> I was running 3.10.6.
>
>> 2. did the hack helped for both veth host traffic or only on PV VM traffic?
>>
> No, just VM. I haven't tried veth.
>
> If you leave the DODGY bit set, does your traffic get dropped on Tx, after it
> leaves vxlan and before it hits your driver, which is what I had seen. Is
> that right?
>
> If you unset it, do you recover?
>
> What is the output of your ethtool -k on the interface you are
> transmitting from?
>
>> Currently this doesn't apply to 3.12.x or net.git: with veth/vxlan the
>> DODGY bit isn't set on the skb at vxlan xmit time, so there's
>> nothing for me to hack there. For VMs, without unsetting the bit things don't
>> really work, but unsetting it by itself so far didn't get me far performance
>> wise.
>>
>> BTW guys, I saw the issues with both bridge and openvswitch configurations - it
>> seems we might have a fairly large breakage of the system w.r.t vxlan
>> traffic at rates that go over a few Gbs -- so I would love to get feedback of any
>> kind from the people that were involved with vxlan over the last months/year.
>>
>>

OK!! So finally I managed to get some hacked but stable ground to stand 
on... indeed with 3.10.X (I tried 3.10.19) if you

1. reduce the VM PV NIC MTU to account for the vxlan tunneling overhead 
(e.g to 1450 vs 1500)
2. unset the DODGY bit for GSO packets in the vxlan driver 
handle_offloads function

--> You get sane vxlan performance when the VM xmits, without HW 
offloads I got up to 4-5 Gbs for single VM and with HW offloads > 30Gbs 
for single VM when the VM is sending to a peer hypervisor.

On the VM RX side it doesn't go much higher, e.g. it stays in the order of 
3-4Gbs for a single receiving VM; I am pretty sure this relates to the lack 
of GRO for vxlan, which is pretty much terrible for VM traffic.

So it seems the TODO here is the following:

1. manage to get the hack for vm-vxlan traffic to work on the net tree
2. fix the bug that makes the hack necessary
3. find the problem with veth-vxlan traffic on the net tree
4. add GRO support for encapsulated/vxlan traffic


Or.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-07 21:27                         ` Or Gerlitz
@ 2013-12-08 18:08                           ` Joseph Gasparakis
  2013-12-08 20:12                             ` Or Gerlitz
  0 siblings, 1 reply; 63+ messages in thread
From: Joseph Gasparakis @ 2013-12-08 18:08 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Joseph Gasparakis, Or Gerlitz, Pravin B Shelar, Eric Dumazet,
	Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend



On Sat, 7 Dec 2013, Or Gerlitz wrote:

> On Fri, Dec 6, 2013, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote:
> > On Fri, 6 Dec 2013, Or Gerlitz wrote:
> >
> >> On 04/12/2013 11:41, Or Gerlitz wrote:
> >> > On Wed, Dec 4, 2013 at 11:24 AM, Joseph
> >> > Gasparakis<joseph.gasparakis@intel.com>  wrote:
> >> > > >And just for the record,
> >> > > >you are seeing (SKB_UDP_TUNNEL | SKB_GSO_TCPV4) as 0x201 while I was
> >> > > >seeing it as 0x81 because commit 61c1db7fae "ipv6: sit: add GSO/TSO
> >> > > >support" pushed the SKB_UDP_TUNNEL two bits left, and I had done my tests
> >> > > before it.
> >> > indeed, also, on what kernel did you conduct the tests in which you managed
> >> > to work around the problem by unsetting that bit?
> >>
> >>
> >> Hi Joseph,
> >>
> >> Really need your response here --
> >
> > I'm sorry Or, I managed to miss your original request...
> 
> sure.. it happens.
> 
> >> 1. on which kernel did you manage to get along fine vxlan performance wise
> >> with this hack?
> 
> > I was running 3.10.6.
> 
> I see, will try it out, and just to get closer to your env, what
> kernel were the guests running? Was a bridge / ovs instance involved
> in the VM PV connectivity?
>

The VMs were running an old 2.6.32 kernel, although if I remember correctly I 
also tried a 3.x - sorry, can't remember which one. The VMs were attached 
to a bridge, but I haven't noticed any packet loss there.
 
> 
> >> 2. did the hack helped for both veth host traffic or only on PV VM traffic?
> 
> > No, just VM. I haven't tried veth.
> 
> I see, earlier I was somehow under the impression you noted the
> problem for veth too.
> 
> > If you leave the DODGY bit set, does your traffic get dropped on Tx, after it
> > leaves vxlan and before it hits your driver, which is what I had seen. Is
> > that right?
>
> What I saw is that if I leave the DODGY bit set, practically things
> don't work at all, its not that some packets are dropped, was that
> what you saw?
>
What I saw was gso packets badly segmented, causing many re-transmissions 
and dropping the performance to a few MB/s.
 
> Or on your env only **some** or **few** packets were dropped each time
> but this killed the tcp session performance?
> 
> Also, did you hack/modify the VM NIC MTU to take into account
> the encapsulation overhead?
> 
The virtio interfaces I used had MTU 1500, but the MTU of the physical NIC 
was increased to 1600.

> > If you unset it, do you recover?
> 
> let me redo this with your setting and see, please make sure to tell
> me what kernel the VM was running too (thanks!)
> 
> > What is the output of your ethtool -k on the interface you are
> > transmitting from?
> 
> will send you tomorrow, but this happens without offloads for
> encapsulated traffic.
> 
I have only noticed this with the offloads on. Turning encapsulation 
TSO off would simply make the GSO packets get segmented in 
dev_hard_start_xmit(), as expected.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-08 18:08                           ` Joseph Gasparakis
@ 2013-12-08 20:12                             ` Or Gerlitz
  0 siblings, 0 replies; 63+ messages in thread
From: Or Gerlitz @ 2013-12-08 20:12 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: Or Gerlitz, Pravin B Shelar, Eric Dumazet, Jerry Chu,
	Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend

On Sun, Joseph Gasparakis <joseph.gasparakis@intel.com> wrote:

>> What I saw is that if I leave the DODGY bit set, practically things
>> don't work at all, its not that some packets are dropped, was that
>> what you saw?

> What I saw was gso packets badly segmented, causing many re-transmissions
> and dropping the performance to a few MB/s.

Yes, in my testbed up to about 400Mbs (b not B..., yes!)


>> Also, did you hack/modify the VM NIC MTU to take into account
>> the encapsulation overhead?

> The virtio interfaces I used had MTU 1500, but the MTU of the physical NIC
> was increased to 1600.

mmm, that's sort of equivalent, but zero touch VM wise, nice!


> I have only noticed this with the offloads on. Turning encapsulation
> TSO off would simply make the GSO packets get segmented in
> dev_hard_start_xmit(), as expected.

mmm, I am not sure this is the case with kernels > 3.10.x, but I'd
like to double check that; basically, it's possible that I didn't make
sure to always have a "proper" MTU at the VM at all times.

Also, did you see the TX/RX asymmetry which I reported
earlier today? That is, accelerated TX from a single VM can go beyond
30Gbs while RX to a single VM or even multiple VMs doesn't go beyond
5-6Gbs, probably due to the lack of GRO?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-08 14:30                             ` Mike Rapoport
@ 2013-12-08 20:50                               ` Eric Dumazet
  2013-12-08 21:36                                 ` Eric Dumazet
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Dumazet @ 2013-12-08 20:50 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Or Gerlitz, Or Gerlitz, Joseph Gasparakis, Pravin B Shelar,
	Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend

On Sun, 2013-12-08 at 16:30 +0200, Mike Rapoport wrote:
>  
> It was something *very* hacky:

> 		if (skb->len < 64000)
> 			skb_shinfo(skb)->gso_size = skb->len;
> 		else
> 			skb_shinfo(skb)->gso_size = 64000;

This sounds like a 16-bit overflow somewhere.

This is reminiscent of the issue we fixed in commit 50bceae9bd356
("tcp: Reallocate headroom if it would overflow csum_start").

You might try to reduce the 0xFFFF to something smaller.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 993da005e087..8364bcfe1e08 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2373,7 +2373,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	 * beyond what csum_start can cover.
 	 */
 	if (unlikely((NET_IP_ALIGN && ((unsigned long)skb->data & 3)) ||
-		     skb_headroom(skb) >= 0xFFFF)) {
+		     skb_headroom(skb) >= 0xFF00)) {
 		struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
 						   GFP_ATOMIC);
 		return nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: vxlan/veth performance issues on net.git + latest kernels
  2013-12-08 20:50                               ` Eric Dumazet
@ 2013-12-08 21:36                                 ` Eric Dumazet
  0 siblings, 0 replies; 63+ messages in thread
From: Eric Dumazet @ 2013-12-08 21:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Or Gerlitz, Or Gerlitz, Joseph Gasparakis, Pravin B Shelar,
	Jerry Chu, Eric Dumazet, Alexei Starovoitov, David Miller, netdev,
	Kirsher, Jeffrey T, John Fastabend

On Sun, 2013-12-08 at 12:50 -0800, Eric Dumazet wrote:
> On Sun, 2013-12-08 at 16:30 +0200, Mike Rapoport wrote:
> >  
> > It was something *very* hacky:
> 
> > 		if (skb->len < 64000)
> > 			skb_shinfo(skb)->gso_size = skb->len;
> > 		else
> > 			skb_shinfo(skb)->gso_size = 64000;
> 
> This sounds like a 16-bit overflow somewhere.
> 
> This is reminiscent of the issue we fixed in commit 50bceae9bd356
> ("tcp: Reallocate headroom if it would overflow csum_start").
> 
> You might try to reduce the 0xFFFF to something smaller.

Also try following debugging patch :

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2718fed53d8c..d6fcb6272d37 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -913,8 +913,12 @@ EXPORT_SYMBOL(skb_clone);
 static void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
 	/* Only adjust this if it actually is csum_start rather than csum */
-	if (skb->ip_summed == CHECKSUM_PARTIAL)
-		skb->csum_start += off;
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		u32 val = (u32)skb->csum_start + off;
+
+		WARN_ON_ONCE(val > 0xFFFF);
+		skb->csum_start = val;
+	}
 	/* {transport,network,mac}_header and tail are relative to skb->head
*/
 	skb->transport_header += off;
 	skb->network_header   += off;

^ permalink raw reply related	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2013-12-08 21:36 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-03 15:05 vxlan/veth performance issues on net.git + latest kernels Or Gerlitz
2013-12-03 15:30 ` Eric Dumazet
2013-12-03 19:55   ` Or Gerlitz
2013-12-03 21:11     ` Joseph Gasparakis
2013-12-03 21:09       ` Or Gerlitz
2013-12-03 21:24         ` Eric Dumazet
2013-12-03 21:36           ` Or Gerlitz
2013-12-03 21:50             ` David Miller
2013-12-03 21:55               ` Eric Dumazet
2013-12-03 22:15                 ` Or Gerlitz
2013-12-03 22:22                 ` Or Gerlitz
2013-12-03 22:30                   ` Hannes Frederic Sowa
2013-12-03 22:35                     ` Or Gerlitz
2013-12-03 22:39                       ` Hannes Frederic Sowa
2013-12-03 23:10                 ` Or Gerlitz
2013-12-03 23:30                   ` Or Gerlitz
2013-12-03 23:49                     ` Hannes Frederic Sowa
2013-12-03 23:59                   ` Eric Dumazet
2013-12-04  0:26                     ` Alexei Starovoitov
2013-12-04  0:36                       ` Eric Dumazet
2013-12-04  0:55                         ` Alexei Starovoitov
2013-12-04  1:23                           ` Eric Dumazet
2013-12-04  1:59                             ` Alexei Starovoitov
2013-12-06  9:06                             ` Or Gerlitz
2013-12-06 13:36                               ` Eric Dumazet
2013-12-07 21:20                                 ` Or Gerlitz
2013-12-08 12:09                                 ` Or Gerlitz
2013-12-04  6:39                     ` David Miller
2013-12-04 17:40                       ` Eric Dumazet
2013-12-05 12:45                     ` [PATCH net-next] net: introduce dev_consume_skb_any() Eric Dumazet
2013-12-05 14:13                       ` Hannes Frederic Sowa
2013-12-05 14:45                         ` Eric Dumazet
2013-12-05 15:05                           ` Eric Dumazet
2013-12-05 15:44                             ` Hannes Frederic Sowa
2013-12-05 16:38                               ` Eric Dumazet
2013-12-05 16:54                                 ` Hannes Frederic Sowa
2013-12-06 20:24                       ` David Miller
2013-12-03 23:13         ` vxlan/veth performance issues on net.git + latest kernels Joseph Gasparakis
2013-12-03 23:09           ` Or Gerlitz
2013-12-04  0:35             ` Joseph Gasparakis
2013-12-04  0:34               ` Alexei Starovoitov
2013-12-04  1:29                 ` Joseph Gasparakis
2013-12-04  1:18                   ` Eric Dumazet
2013-12-04  0:44               ` Joseph Gasparakis
2013-12-04  8:35               ` Or Gerlitz
2013-12-04  9:24                 ` Joseph Gasparakis
2013-12-04  9:41                   ` Or Gerlitz
2013-12-04 15:20                     ` Or Gerlitz
     [not found]                     ` <52A197DF.5010806@mellanox.com>
2013-12-06  9:30                       ` Or Gerlitz
2013-12-08 12:43                         ` Mike Rapoport
2013-12-08 13:07                           ` Or Gerlitz
2013-12-08 14:30                             ` Mike Rapoport
2013-12-08 20:50                               ` Eric Dumazet
2013-12-08 21:36                                 ` Eric Dumazet
2013-12-06 10:30                       ` Joseph Gasparakis
2013-12-07 21:27                         ` Or Gerlitz
2013-12-08 18:08                           ` Joseph Gasparakis
2013-12-08 20:12                             ` Or Gerlitz
2013-12-08 15:21                         ` Or Gerlitz
2013-12-03 17:12 ` Eric Dumazet
2013-12-03 19:50   ` Or Gerlitz
2013-12-03 20:19     ` John Fastabend
2013-12-03 21:12     ` Eric Dumazet
